The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005


FPGA vs ASIC Flows

[Figure: a circuit design passing through the ASIC flow and through the FPGA flow]

Reduced cost for low-volume

Reduced time-to-market

Programmability affords customization

Designers use FPGAs!

Processors and FPGAs

□ Option 1: Off-chip processor
[Figure: custom logic on the FPGA, processor on a separate chip]
Increased board area, cost, and latency

□ Option 2: On-chip “hard” processor
Specialized part, lack of flexibility

□ Option 3: On-chip “soft” processor
Can implement any number of processors
Tune each one to meet design constraints


Tuning Processors

Application, design constraints
• $3, 4 MHz, 800 mW, 2-stage pipeline
• $300, 3.8 GHz, 80 W, 31-stage pipeline

Application, design constraints
• 500 LEs, 40 MHz, 2-stage pipeline
• 1700 LEs, 160 MHz, 6-stage pipeline

Tuning Soft Processors

Application, design constraints
• 500 LEs, 40 MHz, 2-stage pipeline
• 1700 LEs, 160 MHz, 6-stage pipeline
• your area, speed, power tradeoff

Automatically Tuning Soft Processors

Understanding Soft Processors

Tuning requires understanding of the soft processor design space

We implement many processors and study the design space

[Figure: an Architecture Description becomes a Synthesized Processor, which is measured for Area, Performance, and Power]

Don’t we already understand architecture?

Not completely

We can evaluate area, power, performance
• Not accurately (rules of thumb): FPGA CAD tools are very accurate
• Not in the FPGA domain: LUTs vs transistors, relative speed of RAM & multipliers

Goals

1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)

Measurement Methodology

Require a set of metrics
• Area
• Performance
• Power

[Figure: Circuit Design (RTL) passes through the FPGA flow, which reports Resource Usage, Clock Frequency, and a Power estimate]

Area

• Logic Elements (LEs – LUT & flip-flop)
• Multipliers
• Big RAM
• Medium RAM
• Little RAM

Measure physical area in Equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area)
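The Equivalent LE metric can be sketched as a weighted sum over resource types. This is a minimal illustration, not the deck's actual tooling: only the 23-LE cost of a 9-bit multiplier comes from the slide, while the RAM weights, the resource names, and the function name are hypothetical placeholders.

```python
# Equivalent-LE cost of one unit of each FPGA resource type.
# Only the 9-bit multiplier figure (23 LEs) is taken from the slide;
# the three RAM weights are hypothetical placeholders for illustration.
EQUIVALENT_LES = {
    "le": 1.0,
    "mult9": 23.0,
    "little_ram": 10.0,    # hypothetical
    "medium_ram": 100.0,   # hypothetical
    "big_ram": 1000.0,     # hypothetical
}


def equivalent_area(usage: dict[str, int]) -> float:
    """Return total area in Equivalent LEs for a resource-usage report."""
    unknown = set(usage) - set(EQUIVALENT_LES)
    if unknown:
        raise KeyError(f"no Equivalent-LE weight for: {sorted(unknown)}")
    return sum(EQUIVALENT_LES[name] * count for name, count in usage.items())
```

For example, a design reporting 1000 LEs and four 9-bit multipliers would count as 1000 + 4 * 23 = 1092 Equivalent LEs under these weights.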

Performance

Wall Clock Time = #Cycles * Clock Period

Clock Period: from the CAD Tool
#Cycles: from RTL Simulation, averaged over 20 benchmarks:

Source      Benchmark
MiBench     bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale   Dhrystone 2.1
Xirisc      bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATEs       dct, gol
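The wall clock time formula above can be sketched in a few lines. The function names are hypothetical, and the use of an arithmetic mean across benchmarks is an assumption, since the slide only says "averaged".

```python
def wall_clock_time_us(cycles: int, clock_mhz: float) -> float:
    """Wall clock time = #cycles * clock period.

    With the clock in MHz the period is 1/clock_mhz microseconds,
    so the result is in microseconds.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


def average_wall_clock_time_us(cycle_counts: dict[str, int], clock_mhz: float) -> float:
    """Arithmetic mean of per-benchmark wall clock times (assumed averaging)."""
    times = [wall_clock_time_us(c, clock_mhz) for c in cycle_counts.values()]
    return sum(times) / len(times)
```

Here the cycle counts would come from the RTL simulator and the clock frequency from the CAD tool, matching the split described on the slide.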

Power

CAD tool can estimate power from an assumed toggle ratio (derived experimentally)

Dynamic Energy per cycle, excluding I/O (nJ/cycle) = Total Dynamic Power (mW) ÷ Clock Frequency (MHz)
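As a quick check of the units in the energy formula, a minimal sketch follows. The function name is hypothetical; the arithmetic simply restates the slide's division.

```python
def energy_per_cycle_nj(dynamic_power_mw: float, clock_mhz: float) -> float:
    """Dynamic energy per cycle in nJ.

    mW / MHz = (1e-3 J/s) / (1e6 cycles/s) = 1e-9 J/cycle = nJ/cycle,
    so no extra scaling factor is needed.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / clock_mhz
```

A design drawing 20 mW of dynamic power at 80 MHz would, under this formula, spend 0.25 nJ per cycle.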

Metrics summary

Require the following information
1. Resource Usage (area – CAD Tool)
2. Clock Frequency (wall clock time – CAD Tool)
3. Power Estimate (energy/cycle – CAD Tool)
4. Cycle Count (wall clock time – RTL Simulator)

RTL-based Design Space Exploration

Complete and accurate understanding of design space

[Figure: Circuit Design (RTL) feeds the RTL Simulator, which runs the Benchmarks and reports 1. Correctness and 2. Cycle Count, and the CAD Tool, which reports 3. Area, 4. Clock Frequency, and 5. Power]

Goals

1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)

Microarchitectural Design Space Exploration

Need fast route to RTL from architectural idea

[Figure: Circuit Design (RTL) feeding the CAD Tool and the RTL Simulator with its Benchmarks]

SPREE (Soft Processor Rapid Exploration Environment)



[Figure: the SPREE RTL Generator produces the RTL that feeds the CAD Tool and the RTL Simulator with its Benchmarks]

Goals

1. Develop measurement methodology
2. Populate the design space
   1. Rapidly
   2. With interesting designs
   3. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)

SPREE

Related Work

Parametrized Cores
• Narrow design space, laborious changes to control

Architecture Description Languages (ADLs)
• Too robust, inaccurate (simulator based, or behavioural RTL)

PEAS-III/ASIPMeister [Itoh2000]
• Non-FPGA specific, ISA design focus

SPREE RTL Generator Overview

[Figure: the SPREE RTL Generator takes an ISA Description, a Datapath Description, and a Component Library, and produces efficiently synthesizable RTL]

• Interesting: allows for interesting architectures
• Rapidly: simple descriptions
• Accurately: efficient component implementations

Some current limitations
• No caches (use fast on-chip RAM)
• Simple in-order issue pipelines
• No dynamic branch prediction
• No OS or exceptions support

No ISA changes!
• Need compiler generation to support
• Use subset of MIPS-I

Architecture Input

[Figure: animation of the SPREE RTL Generator flow. The Datapath Description connects components from the Component Library (Ifetch, Reg File, ALU, Mul, Data Mem, WriteBack); together with the ISA Description, the generator produces the datapath and its Decode logic]

• Control generation saves time and is non-critical

Architecture Input: ISA Description

Generic Operations (GENOPs)
MIPS instructions are made of GENOPs

Example GENOPs: FETCH, RFREAD, ADD, RFWRITE

MIPS ADD – add rd, rs, rt
[Figure: GENOP graph for ADD, built from FETCH, two RFREADs, ADD, and RFWRITE]
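One way to picture an instruction "made of GENOPs" is as a small dependency graph. The sketch below is a hypothetical encoding, not SPREE's actual input format: the node names, the edges between them, and the helper functions are assumptions; the slide only lists the GENOPs involved in `add rd, rs, rt`.

```python
from graphlib import TopologicalSorter

# Hypothetical encoding of MIPS "add rd, rs, rt" as a graph of GENOPs.
# Each key is a GENOP node; its value is the set of nodes it depends on.
# Node names and edges are illustrative assumptions.
ADD_GENOPS = {
    "FETCH": set(),
    "RFREAD_rs": {"FETCH"},
    "RFREAD_rt": {"FETCH"},
    "ADD": {"RFREAD_rs", "RFREAD_rt"},
    "RFWRITE_rd": {"ADD"},
}


def schedule(genops: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(genops).static_order())


def genop_kinds(genops: dict[str, set[str]]) -> list[str]:
    """Return the distinct GENOP types used, ignoring the operand suffix."""
    return sorted({name.split("_")[0] for name in genops})
```

Calling `genop_kinds(ADD_GENOPS)` recovers the four GENOP types named on the slide, and `schedule(ADD_GENOPS)` always places FETCH first and the register write last.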

Complete Experimental Framework Using SPREE



[Figure: the SPREE RTL Generator, with inputs Component Library, ISA Description, and Datapath Description (one input is labelled FIXED), feeding the CAD Tool and the RTL Simulator with its Benchmarks]

Goals

1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)

SPREE

Area

Performance

Power

Altera’s NiosII

Second generation soft processor
Has three variations:
• NiosIIe – unpipelined, no hardware multiply
• NiosIIs – 5 stages, no branch prediction
• NiosIIf – 6 stages, dynamic branch prediction

Caveats
• Supports exceptions, OS, and caches
• Very similar but tweaked ISA

Design Space vs NiosII Variations

[Figure: scatter plot of Average Wall Clock Time (us) against Area (Equivalent LEs) for the Generated Designs and for Altera NiosIIe, NiosIIs, and NiosIIf]

Summary

1. We span the design space
2. Remain competitive
   Achieved 9% faster and 11% smaller than NiosIIs
   => don’t suffer from prohibitive overhead

Let’s explore some architecture!

Architectural Axes

1. Hardware vs Software Multiplication
2. Shifter implementation
3. Pipeline
   • Depth
   • Organization
   • Forwarding

Hardware vs Software Multiplication

Hardware multiplication
• Increases area & power consumption
• Speeds up execution

BUT …
• Not all applications care about speed
• Not all applications use multiplication (significantly)

Cycle Count Speedup of Hardware Multiplication

[Figure: bar chart of Cycle Count Speedup per benchmark; values and benchmark labels listed in the order they appear on the chart]

Benchmark   Cycle Count Speedup
dijkstra    1.01
dhry        1.03
qsort       1.04
fir         1.39
FFT         2.72
dct         3.00
quant       4.53
fft         6.94
iquant      7.87

Must understand its cost/benefit to decide when to use

Cost of Hardware Multiply

[Figure: scatter plot of Average Wall Clock Time (us) for designs with Multiply Full Hardware Support and with a Multiply Software Routine, alongside Altera NiosIIe, NiosIIs, and NiosIIf]

Cost: ~250 LEs (20%), 35% more Energy/cycle

Shifter Implementations

Shifters (multiplexers) are big in FPGAs
Consider 3 implementations:
• Serial shifter
• LUT-based barrel shifter
• Multiplier-based barrel shifter
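The three shifter styles can be contrasted with a small behavioural model of a 32-bit left shift. This is an illustrative sketch only: the function names are hypothetical, the one-cycle-per-bit cost of the serial shifter is a simplifying assumption, and the real designs are hardware (multiplexer stages or an on-chip multiplier), not software loops.

```python
MASK32 = 0xFFFFFFFF  # keep results to 32 bits


def serial_shift_left(value: int, amount: int) -> tuple[int, int]:
    """Shift one bit position per step; returns (result, steps taken).

    Models a small shifter that trades time for area (assumed 1 step per bit).
    """
    steps = 0
    for _ in range(amount):
        value = (value << 1) & MASK32
        steps += 1
    return value, steps


def barrel_shift_left(value: int, amount: int) -> int:
    """Log-depth multiplexer network: stage k shifts by 2**k if bit k is set."""
    for k in range(5):  # 5 stages cover shift amounts 0..31
        if (amount >> k) & 1:
            value = (value << (1 << k)) & MASK32
    return value


def multiplier_shift_left(value: int, amount: int) -> int:
    """Left shift by n is multiplication by 2**n, keeping the low 32 bits."""
    return (value * (1 << amount)) & MASK32
```

All three produce the same result for shift amounts 0 to 31; they differ in how much hardware and how many cycles a real implementation would need, which is the tradeoff the deck measures.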

Impact of Shifter Implementation

[Figure: scatter plot of Average Wall Clock Time (us) against area for the Serial, Multiplier-based, and LUT-based shifters at pipeline depths of 2, 3, 4, 5, and 7 stages]

Consistent across different pipe depths

Shifter Implementation Tradeoffs

Shifter                   Area (LEs)   Wall Clock Time (us)   Energy per Cycle (nJ/cycle)
Serial                    1035         3458                   0.2114
Multiplier-based barrel   1102         1945                   0.2174
LUT-based barrel          1297         1916                   0.2409

Averaged over all pipeline depths
• Smallest: Serial
• Fastest: LUT-based barrel
• Energy efficient: Serial

The multiplier-based barrel shifter is a very nice sweet spot
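The "sweet spot" claim can be checked against the table with a little arithmetic. The sketch below uses the table's numbers as given; the area-delay product is an assumed figure of merit chosen for illustration, not a metric the deck itself reports, and the function names are hypothetical.

```python
# (area in LEs, wall clock time in us, energy per cycle in nJ) from the table.
SHIFTERS = {
    "Serial": (1035, 3458, 0.2114),
    "Multiplier-based barrel": (1102, 1945, 0.2174),
    "LUT-based barrel": (1297, 1916, 0.2409),
}


def relative_to(baseline: str) -> dict[str, tuple[float, float]]:
    """Return (area ratio, speedup) of every shifter against a baseline."""
    base_area, base_time, _ = SHIFTERS[baseline]
    return {
        name: (area / base_area, base_time / time)
        for name, (area, time, _) in SHIFTERS.items()
    }


def area_delay_product(name: str) -> int:
    """Assumed figure of merit: lower means a better area/speed balance."""
    area, time, _ = SHIFTERS[name]
    return area * time


best_balance = min(SHIFTERS, key=area_delay_product)
```

Against the serial shifter, the multiplier-based barrel shifter is about 6% larger and about 1.78x faster, and under this assumed metric it has the lowest area-delay product of the three.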

Pipelines - Depth

Study different pipeline depths
• Over 3 shifters
• Arrows = possible forwarding lines (not used)
• All use predict not-taken branches

Pipelining & clock frequency

[Figure: bar chart of clock Frequency (MHz) for the Serial, Mul-based, and LUT-based shifters and their AVERAGE, at pipeline depths of 2, 3, 4, 5, and 7 stages]

Impact of Pipelining

[Figure: scatter plot of Average Wall Clock Time (us) against area for the Serial, Multiplier-based, and LUT-based shifters at pipeline depths of 2, 3, 4, 5, and 7 stages]

Adds area, can increase speed (2 to 3 stage?)

FPGA Nuance: Synchronous RAMs

2-stage Pipeline
[Figure: 2-stage pipeline built from Ifetch, Regfile, ALU, Mul, Data Mem, and WriteBack]
Stall on all loads, and any operand fetches

3-stage Pipeline
[Figure: 3-stage pipeline built from the same components]
Fewer stalls, increased frequency => Big speedup (1.7x)

3, 4 and 5 stage pipelines

[Figure: scatter plot of Average Wall Clock Time (us) against area for the Serial, Multiplier-based, and LUT-based shifters at pipeline depths of 2, 3, 4, 5, and 7 stages]

Increased area, small change in performance

=> Deeper pipelines have potential for better speedups

The 7-stage Pipeline

Where Branch Delay Slots break down

The ideal case:
BEQ  OR  JR  ADD  X  X  …   (never squash this stage)

Problem: Separation of Branch and Branch Delay Slot
BEQ  ADD  JR  …   (stalls on RAW hazard)

Problem: Separation of Branch and Branch Delay Slot
BEQ  ADD  JR  NOP  X  …   (must track and protect delay slots)

Multiple Delay Slots

Must detect separation of branch from delay slot
OR prevent multiple delay slots
• Stall branch if a delay slot exists in the pipe
• We did this one (+30 LEs, -15% clock frequency)

BEQ  OR  JR  ADD  …   (can’t guard all delay slots)

Better off eliminating delay slots – currently researching
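The rule "stall branch if a delay slot exists in the pipe" can be written as a one-line predicate. The sketch below is a behavioural illustration under assumed names: the `Slot` record, the `must_stall` function, and the opcode set are hypothetical, and the real design implements this check in generated control logic rather than software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One in-flight instruction (hypothetical record for illustration)."""
    opcode: str
    is_delay_slot: bool = False


# Illustrative subset of MIPS-I control-flow opcodes.
BRANCHES = {"BEQ", "BNE", "J", "JR", "JAL", "JALR"}


def must_stall(incoming: str, in_flight: list[Slot]) -> bool:
    """Hold back an incoming branch while any delay slot is still in the pipe.

    This keeps at most one delay slot in flight, so the pipeline never has
    to track and protect several of them at once.
    """
    return incoming in BRANCHES and any(s.is_delay_slot for s in in_flight)
```

In the slide's BEQ, OR, JR, ADD sequence, OR is the delay slot of BEQ, so under this rule JR would wait until OR has left the pipeline before issuing.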

Pipeline organization

Where stages are placed is important
Pipe stage placement can
• Result in all around “win/loss”
• Present a tradeoff

[Figure: scatter plot of Wall Clock Time (us) against Area (LEs) for two 4-stage organizations, 4-Stage (H) and 4-Stage (B), with the LUT-based, Mul-based, and Serial shifters]

Forwarding

SPREE supports stage to stage forwarding

[Figure: pipeline of Ifetch, Reg File, ALU, Mul, Data Mem, and WriteBack with two forwarding lines, one for rs and one for rt]

Effect of Forwarding

[Figure: scatter plot of Average Wall Clock Time (us) against area for 3-stage, 4-stage, and 5-stage pipelines with no forwarding, forward rt, forward rs, and forward rs&rt]

20% speed increase
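To see why forwarding rs, rt, or both changes the cycle count, a deliberately simplified stall model helps. Everything in this sketch is an assumption made for illustration: it only considers a dependence on the immediately preceding instruction, charges a fixed one-cycle penalty, and uses hypothetical names; it is not the model behind the measured 20% figure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instr:
    """Register numbers written (dest) and read (rs, rt); None if unused."""
    dest: Optional[int]
    rs: Optional[int]
    rt: Optional[int]


def count_stalls(program: Sequence[Instr], forward_rs: bool,
                 forward_rt: bool, penalty: int = 1) -> int:
    """Count stall cycles caused by back-to-back RAW hazards.

    An instruction stalls when it reads, through an operand that has no
    forwarding line, the register written by the instruction just before it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.dest is None:
            continue
        hazard_rs = cur.rs == prev.dest and not forward_rs
        hazard_rt = cur.rt == prev.dest and not forward_rt
        if hazard_rs or hazard_rt:
            stalls += penalty
    return stalls
```

With a three-instruction program where the second reads the first's result through rs and the third reads the second's result through rt, this model reports 2 stalls without forwarding, 1 with either line alone, and 0 with both.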

An Aside: ISA Subsetting

Applications don’t generally use all instructions

[Figure: ISA Usage In Each Benchmark, a bar chart from 0% to 100% for bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry, and the AVERAGE]

Processor reduction

Can strip away unused components/control
Generator supports instruction disabling
• Automatically strips away unused components
• Create an Application Specific processor
• Do this for each benchmark

FPGAs are a good platform for this!
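The idea of stripping components once instructions are disabled can be sketched as simple set arithmetic. The instruction-to-component map below is hypothetical and far smaller than a real MIPS-I subset; the function names are also assumptions, and the actual generator works on its own ISA and datapath descriptions.

```python
# Hypothetical map from each instruction to the datapath components it needs.
REQUIRES = {
    "add": {"alu"},
    "sll": {"shifter"},
    "mult": {"multiplier"},
    "lw": {"alu", "data_mem"},
    "beq": {"alu", "branch_unit"},
}


def components_to_keep(used_instructions: set[str]) -> set[str]:
    """Union of the components required by the instructions a program uses."""
    unknown = used_instructions - set(REQUIRES)
    if unknown:
        raise KeyError(f"unknown instructions: {sorted(unknown)}")
    keep: set[str] = set()
    for instr in used_instructions:
        keep |= REQUIRES[instr]
    return keep


def components_to_strip(used_instructions: set[str]) -> set[str]:
    """Components no used instruction needs, so they can be removed."""
    all_components = set().union(*REQUIRES.values())
    return all_components - components_to_keep(used_instructions)
```

For a benchmark that only uses add, lw, and beq, this sketch would mark the shifter and the multiplier as removable, which is the kind of per-application saving the subsetting study measures.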

Area of a Subsetted Processor

[Figure: Area Measurements for a Processor Subsetted Over Benchmark Set. Bar chart of Processor Area (LEs), from 0 to 1400, for the ORIGINAL processor, each of the 20 benchmarks, and the AVERAGE]

Speed of a Subsetted Processor

[Figure: Fmax Measurements for a Processor Subsetted Over Benchmark Set. Bar chart of Processor Fmax (MHz), from 50 to 70, for a first bar labelled “cycles”, each of the 20 benchmarks, and the AVERAGE]

Conclusion

Understanding architectural trade-offs => Maximize efficiency
Developed SPREE & measurement methodology
Performed preliminary architectural study
• Quantified cost of hardware multiplication
• Explored shift unit implementations
• Explored pipelines: depth, organization, forwarding
