The Microarchitecture of FPGA-Based Soft Processors
Peter Yiannacouras
CARG - June 14, 2005
FPGA vs ASIC Flows
[Figure: a circuit design is implemented through either the ASIC flow or the FPGA flow]
• Reduced cost for low-volume
• Reduced time-to-market
• Programmability affords customization
Designers use FPGAs!
Processors and FPGAs
[Figure: custom logic on an FPGA paired with a processor, drawn once per option]
□ Option 1: Off-chip processor
  • Increased board area, cost, and latency
□ Option 2: On-chip “hard” processor
  • Specialized part, lack of flexibility
□ Option 3: On-chip “soft” processor
  • Can implement any number of processors
  • Tune each one to meet design constraints
Tuning Processors
Application, design constraints
• $3, 4 MHz, 800 mW, 2-stage pipeline
• $300, 3.8 GHz, 80 W, 31-stage pipeline
Application, design constraints
• 500 LEs, 40 MHz, 2-stage pipeline
• 1700 LEs, 160 MHz, 6-stage pipeline
Tuning Soft Processors
Application, design constraints
• 500 LEs, 40 MHz, 2-stage pipeline
• 1700 LEs, 160 MHz, 6-stage pipeline
• your area, speed, power tradeoff
Automatically Tuning Soft Processors
Understanding Soft Processors
• Tuning requires understanding of the soft processor design space
• We implement many processors and study the design space
[Figure: Architecture Description -> Synthesized Processor -> Area, Performance, Power]
Don’t we already understand architecture? Not completely
• We can evaluate area, power, performance
  • Not accurately (rules of thumb)
  • FPGA CAD tools are very accurate
• Not in the FPGA domain
  • LUTs vs transistors
  • Relative speed of RAM & multipliers
Goals
1. Develop measurement methodology
2. Populate the design space
3. Compare against industrial soft processor(s)
Measurement Methodology
Require a set of metrics:
• Area
• Performance
• Power
[Figure: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power estimate]
Area
• Logic Elements (LEs: LUT & flip-flop)
• Multipliers
• RAM blocks (big, medium, little)
Measure physical area in Equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area)
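The Equivalent LE metric can be sketched as a weighted sum over resource counts. This is a minimal sketch, not the measurement scripts used in the study: the function name and argument layout are hypothetical, and only the 9-bit multiplier cost (23 LEs) comes from the slide. RAM costs are not given here, so the caller must supply them.

```python
# Equivalent-LE cost of one 9-bit multiplier block, as stated on the slide.
MULT9_EQUIV_LES = 23.0


def equivalent_les(logic_elements, mult9_blocks=0, ram_blocks=None, ram_costs=None):
    """Return total area in equivalent LEs.

    ram_blocks maps a RAM type name to a block count; ram_costs maps the
    same names to an equivalent-LE cost per block. Both are hypothetical
    inputs because the slide gives no RAM costs.
    """
    ram_blocks = ram_blocks or {}
    ram_costs = ram_costs or {}
    total = float(logic_elements) + mult9_blocks * MULT9_EQUIV_LES
    for kind, count in ram_blocks.items():
        if kind not in ram_costs:
            raise KeyError(f"no equivalent-LE cost supplied for RAM type {kind!r}")
        total += count * ram_costs[kind]
    return total
```

For example, a design with 1000 LEs and four 9-bit multipliers would be counted as 1000 + 4 * 23 equivalent LEs.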
Performance
Wall Clock Time = #Cycles * Clock Period
• Clock period: from the CAD tool
• #Cycles: from RTL simulation, averaged over 20 benchmarks:

Source       Benchmarks
MiBench      bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
Freescale    Dhrystone 2.1
XiRisc       bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
RATES        dct, gol
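The wall clock time formula can be sketched as follows. The function names and the dictionary of per-benchmark cycle counts are hypothetical; the only thing taken from the slide is the relation Wall Clock Time = #Cycles * Clock Period. Dividing a cycle count by a frequency in MHz gives microseconds directly.

```python
def wall_clock_time_us(cycle_count, clock_mhz):
    """Wall clock time in microseconds: cycles * period, with period = 1 / f."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycle_count / clock_mhz


def average_wall_clock_time_us(cycle_counts, clock_mhz):
    """Arithmetic mean of per-benchmark wall clock times.

    cycle_counts maps a benchmark name to its simulated cycle count.
    """
    times = [wall_clock_time_us(c, clock_mhz) for c in cycle_counts.values()]
    return sum(times) / len(times)
```

The cycle counts would come from RTL simulation and the clock frequency from the CAD tool, as listed above.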
Power
CAD tool can estimate power from an assumed toggle ratio (derived experimentally)
Dynamic Energy per cycle, excluding I/O (nJ/cycle) = Total Dynamic Power (mW) ÷ Clock Frequency (MHz)
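The unit conversion works out because 1 mW / 1 MHz = 1e-3 W / 1e6 Hz = 1e-9 J, which is 1 nJ per cycle. A minimal sketch, with hypothetical function names, might look like this:

```python
def energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """Dynamic energy per cycle in nJ: power (mW) divided by frequency (MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / clock_mhz


def benchmark_energy_nj(dynamic_power_mw, clock_mhz, cycle_count):
    """Total dynamic energy for one run: energy per cycle times cycle count."""
    return energy_per_cycle_nj(dynamic_power_mw, clock_mhz) * cycle_count
```

The second function is an extension not shown on the slide; it assumes the per-cycle energy is constant over the run.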
Metrics summary
Require the following information:
1. Resource Usage (area, from CAD tool)
2. Clock Frequency (wall clock time, from CAD tool)
3. Power Estimate (energy/cycle, from CAD tool)
4. Cycle Count (wall clock time, from RTL simulator)
RTL-based Design Space Exploration
Complete and accurate understanding of design space
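[Figure: the Circuit Design (RTL) and Benchmarks go to the RTL Simulator, which reports 1. Correctness and 2. Cycle Count; the Circuit Design (RTL) also goes to the CAD Tool, which reports 3. Area, 4. Clock Frequency, and 5. Power]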
Microarchitectural Design Space Exploration
Need fast route to RTL from architectural idea
[Figure: a Circuit Design (RTL) is the required input to the RTL Simulator (with Benchmarks; 1. Correctness, 2. Cycle Count) and to the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
SPREE (Soft Processor Rapid Exploration Environment)
[Figure: the SPREE RTL Generator produces the RTL that feeds the RTL Simulator (with Benchmarks; 1. Correctness, 2. Cycle Count) and the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
Goals
1. Develop measurement methodology
2. Populate the design space (SPREE)
   1. Rapidly
   2. With interesting designs
   3. Accurately (minimize overhead)
3. Compare against industrial soft processor(s)
Related Work
• Parametrized cores
  • Narrow design space, laborious changes to control
• Architecture Description Languages (ADLs)
  • Too robust, inaccurate (simulator based, or behavioural RTL)
• PEAS-III/ASIPMeister [Itoh2000]
  • Non-FPGA specific, ISA design focus
SPREE RTL Generator Overview
[Figure: the ISA Description and Datapath Description, together with the Component Library, enter the SPREE RTL Generator, which emits Efficiently Synthesizable RTL]
• Interesting: allows for interesting architectures
• Rapidly: simple descriptions
• Accurately: efficient component implementations
Some current limitations
• No caches (use fast on-chip RAM)
• Simple in-order issue pipelines
• No dynamic branch prediction
• No OS or exceptions support
• No ISA changes!
  • Need compiler generation to support
  • Use subset of MIPS-I
[Figure, built up over several slides: the Architecture Input consists of the ISA Description and the Datapath Description. The Datapath Description connects Component Library blocks (Ifetch, Reg File, ALU, Mul, Data Mem, WriteBack). The SPREE RTL Generator takes these inputs and adds Decode logic to the datapath.]
• Control generation saves time and is non-critical
Architecture Input: ISA Description
• Generic Operations (GENOPs)
• MIPS instructions made of GENOPs
[Figure: example GENOPs FETCH, RFREAD, ADD, RFWRITE]
GENOPs for MIPS ADD (add rd, rs, rt)
[Figure: graph of FETCH, two RFREADs, ADD, and RFWRITE]
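One way to picture an ISA description built from GENOPs is a table from each instruction to its GENOP steps. The encoding below is a hypothetical illustration, not SPREE's actual input format; only the GENOP names and the ADD example come from the slide, and the operand pairing (read rs and rt, write rd) follows the MIPS definition of add.

```python
from collections import Counter

# Hypothetical encoding: instruction -> ordered (GENOP, register field) steps.
ISA_DESCRIPTION = {
    "add": [
        ("FETCH", None),
        ("RFREAD", "rs"),   # first source operand
        ("RFREAD", "rt"),   # second source operand
        ("ADD", None),
        ("RFWRITE", "rd"),  # destination register
    ],
}


def genop_counts(isa, mnemonic):
    """Count how many times each GENOP appears in one instruction."""
    return Counter(name for name, _field in isa[mnemonic])
```

Counting the GENOPs for add shows two RFREAD operations, matching the two register reads in the figure.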
Complete Experimental Framework Using SPREE
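[Figure: the Component Library, ISA Description, and Datapath Description enter the SPREE RTL Generator (an input is labelled FIXED); the generated RTL and Benchmarks go to the RTL Simulator (1. Correctness, 2. Cycle Count) and the RTL goes to the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]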
Goals
1. Develop measurement methodology (Area, Performance, Power)
2. Populate the design space (SPREE)
3. Compare against industrial soft processor(s)
Altera’s NiosII
• Second generation soft processor
• Has three variations:
  • NiosIIe: unpipelined, no hardware multiply
  • NiosIIs: 5 stages, no branch prediction
  • NiosIIf: 6 stages, dynamic branch prediction
Caveats
• Supports exceptions, OS, and caches
• Very similar but tweaked ISA
Design Space vs NiosII Variations
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs). Series: Generated Designs, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf]
Summary
1. We span the design space
2. Remain competitive
   • Achieved a design 9% faster and 11% smaller than NiosIIs
   => don’t suffer from prohibitive overhead
Let’s explore some architecture!
Architectural Axes
1. Hardware vs Software Multiplication
2. Shifter implementation
3. Pipeline
   • Depth
   • Organization
   • Forwarding
Hardware vs Software Multiplication
• Hardware multiplication
  • Increases area & power consumption
  • Speeds up execution
• BUT …
  • Not all applications care about speed
  • Not all applications use multiplication (significantly)
Cycle Count Speedup of Hardware Multiplication
[Figure: bar chart of Cycle Count Speedup per benchmark. Benchmarks in axis order: dijkstra, dhry, qsort, fir, FFT, dct, quant, fft, iquant. Bar labels in ascending order: 1.01, 1.03, 1.04, 1.39, 2.72, 3.00, 4.53, 6.94, 7.87]
Must understand its cost/benefit to decide when to use
Cost of Hardware Multiply
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs). Series: Multiply Full Hardware Support, Multiply Software Routine, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf]
Hardware multiply costs ~250 LEs (20%) and 35% more energy per cycle
Shifter Implementations
• Shifters (multiplexers) are big in FPGAs
• Consider 3 implementations:
  • Serial shifter
  • LUT-based barrel shifter
  • Multiplier-based barrel shifter
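The slides do not describe the shifter internals, so the following is a behavioural sketch under common assumptions: a serial shifter moves one bit position per cycle, and a multiplier-based shifter treats a left shift by n as multiplication by 2^n, with a logical right shift taken from the high word of a multiply by 2^(32-n). The function names and the 32-bit width are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(value, amount):
    """Shift one bit per cycle; return (result, cycles spent)."""
    result = value & MASK32
    cycles = 0
    for _ in range(amount):
        result = (result << 1) & MASK32
        cycles += 1
    return result, cycles


def multiplier_shift_left(value, amount):
    """Left shift as multiply by 2**amount, keeping the low 32 bits."""
    return ((value & MASK32) * (1 << amount)) & MASK32


def multiplier_shift_right(value, amount):
    """Logical right shift: multiply by 2**(32 - amount), keep the high 32 bits."""
    product = (value & MASK32) * (1 << (32 - amount))
    return (product >> 32) & MASK32
```

Both multiplier-based functions are valid for shift amounts 0 to 31. The serial version makes the cycle cost explicit, which is the usual reason a serial shifter is small but slow.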
Impact of Shifter Implementation
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters at 2-, 3-, 4-, 5-, and 7-stage pipeline depths]
Consistent across different pipe depths
Shifter Implementation Tradeoffs

Shifter                    Area (LEs)   Wall Clock Time (us)   Energy per Cycle (nJ/cycle)
Serial                     1035         3458                   0.2114
Multiplier-based barrel    1102         1945                   0.2174
LUT-based barrel           1297         1916                   0.2409

Averaged over all pipeline depths
• Smallest: Serial
• Fastest: LUT-based barrel
• Energy efficient: Serial
The multiplier-based shifter is a very nice sweet spot
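The "sweet spot" observation can be checked with a little arithmetic on the table values. The sketch below uses the three rows from the table; the helper names are hypothetical, and the area-delay product is a figure of merit introduced here for illustration, not one used on the slides.

```python
# (area in LEs, wall clock time in us, energy in nJ/cycle), from the table.
SHIFTERS = {
    "Serial": (1035, 3458, 0.2114),
    "Multiplier-based barrel": (1102, 1945, 0.2174),
    "LUT-based barrel": (1297, 1916, 0.2409),
}


def relative_to(baseline, shifters=SHIFTERS):
    """Return {name: (area ratio, speedup, energy ratio)} against a baseline."""
    base_area, base_time, base_energy = shifters[baseline]
    return {
        name: (area / base_area, base_time / time, energy / base_energy)
        for name, (area, time, energy) in shifters.items()
    }


def best_area_delay(shifters=SHIFTERS):
    """Name of the shifter with the smallest area * wall-clock-time product."""
    return min(shifters, key=lambda name: shifters[name][0] * shifters[name][1])
```

Relative to the serial shifter, the multiplier-based barrel shifter is about 6% larger and about 1.78x faster, while the LUT-based barrel shifter is about 25% larger for a similar speedup.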
Pipelines - Depth
• Study different pipeline depths
  • Over 3 shifters
• Arrows in the pipeline diagrams = possible forwarding lines (not used)
• All use predict-not-taken branches
Pipelining & clock frequency
[Figure: Clock Frequency (MHz) of the 2-, 3-, 4-, 5-, and 7-stage pipelines for the Serial, Mul-based, and LUT-based shifters, plus the AVERAGE]
Impact of Pipelining
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters at 2-, 3-, 4-, 5-, and 7-stage pipeline depths]
Adds area, can increase speed (2 to 3 stage?)
FPGA Nuance: Synchronous RAMs
2-stage Pipeline
[Figure: Ifetch, Regfile, ALU, Mul, Data Mem, WriteBack arranged in 2 stages]
• Stall on all loads, and any operand fetches
3-stage Pipeline
[Figure: Ifetch, Regfile, ALU, Mul, Data Mem, WriteBack arranged in 3 stages]
• Fewer stalls, increased frequency => big speedup (1.7x)
3, 4 and 5 stage pipelines
• Increased area, small change in performance
=> Deeper pipelines have potential for better speedups
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs) for the Serial, Multiplier-based, and LUT-based shifters at 2-, 3-, 4-, 5-, and 7-stage pipeline depths]
The 7-stage Pipeline
Where Branch Delay Slots break down
The ideal case:
• Pipeline contents: BEQ, OR, JR, ADD, X, X, …
• Never squash this stage
Problem: Separation of Branch and Branch Delay Slot
• Pipeline contents: BEQ, ADD, JR, …
• Stalls on RAW hazard
Problem: Separation of Branch and Branch Delay Slot
• Pipeline contents: BEQ, ADD, JR, NOP, X, …
• Must track and protect delay slots
Multiple Delay Slots
• Must detect separation of branch from delay slot
• OR prevent multiple delay slots
  • Stall branch if a delay slot exists in the pipe
  • We did this one (+30 LEs, -15% clock frequency)
• Pipeline contents: BEQ, OR, JR, ADD, …
• Can’t guard all delay slots
• Better off eliminating delay slots (currently researching)
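The chosen fix, "stall branch if a delay slot exists in the pipe", can be pictured with a small behavioural model. This is a sketch under stated assumptions, not the generated control logic: the InFlight record, the is_delay_slot flag, and the set of branch opcodes are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical set of MIPS control-transfer opcodes that own a delay slot.
BRANCH_OPCODES = frozenset({"beq", "bne", "j", "jal", "jr", "jalr"})


@dataclass
class InFlight:
    """One instruction currently occupying a pipeline stage."""
    opcode: str
    is_delay_slot: bool = False


def must_stall_branch(incoming_opcode, pipeline):
    """Return True if an incoming branch must wait.

    pipeline is a list of InFlight entries (or None for empty stages).
    A branch is held back while any in-flight instruction is a delay slot,
    so at most one delay slot is ever tracked at a time.
    """
    if incoming_opcode not in BRANCH_OPCODES:
        return False
    return any(entry is not None and entry.is_delay_slot for entry in pipeline)
```

In this model a JR that follows a BEQ's delay slot waits until the delay slot instruction has left the pipe, which prevents multiple delay slots from being in flight together.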
Pipeline organization
• Where stages are placed is important
• Pipe stage placement can
  • Result in all around “win/loss”
  • Present a tradeoff
[Figure: Wall Clock Time (us) vs Area (LEs) for two 4-stage organizations, 4-Stage (H) and 4-Stage (B), with the LUT-based, Mul-based, and Serial shifters]
Forwarding
• SPREE supports stage to stage forwarding
[Figure: pipeline of Ifetch, RegFile, ALU, Mul, Data Mem, Write Back with two forwarding paths: forward line rs and forward line rt]
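The effect of the two forwarding lines on stalls can be sketched as a simple hazard check. The function below is a hypothetical model for a single producer/consumer pair, not SPREE's generated logic; it assumes MIPS register numbering, where register 0 always reads as zero and so never causes a hazard.

```python
def needs_stall(producer_dest, consumer_rs, consumer_rt,
                forward_rs=False, forward_rt=False):
    """Return True if the consumer must stall behind the producer.

    producer_dest is the register the older instruction writes (None if it
    writes nothing). A read-after-write hazard on rs or rt forces a stall
    unless the matching forwarding line is present.
    """
    if producer_dest is None or producer_dest == 0:
        return False
    rs_hazard = consumer_rs == producer_dest and not forward_rs
    rt_hazard = consumer_rt == producer_dest and not forward_rt
    return rs_hazard or rt_hazard
```

With neither line, any dependence stalls; with only the rs line, a dependence through rt still stalls; with both lines, this producer/consumer pair never stalls.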
Effect of Forwarding
[Figure: Average Wall Clock Time (us) vs Area (Equivalent LEs) for 3-, 4-, and 5-stage pipelines with no forwarding, forward rt, forward rs, and forward rs&rt]
20% speed increase
An Aside: ISA Subsetting
• Applications don’t generally use all instructions
[Figure: ISA Usage In Each Benchmark, 0% to 100%, for bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry, and the AVERAGE]
Processor reduction
• Can strip away unused components/control
• Generator supports instruction disabling
  • Automatically strips away unused components
• Create an Application Specific processor
  • Do this for each benchmark
FPGAs are a good platform for this!
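Instruction disabling can be pictured as a set computation: keep only the components that some enabled instruction still needs. The sketch below is illustrative only. The component names follow the slides, but the instruction-to-component table and the function name are hypothetical assumptions, not the generator's real data.

```python
# Hypothetical map from instruction to the datapath components it uses.
COMPONENTS_USED = {
    "add": {"Ifetch", "Reg File", "ALU", "WriteBack"},
    "mult": {"Ifetch", "Reg File", "Mul"},
    "lw": {"Ifetch", "Reg File", "ALU", "Data Mem", "WriteBack"},
    "sw": {"Ifetch", "Reg File", "ALU", "Data Mem"},
}


def subset_processor(components_used, benchmark_instructions):
    """Return (disabled instructions, kept components, stripped components)."""
    enabled = set(benchmark_instructions)
    known = set(components_used)
    unknown = enabled - known
    if unknown:
        raise KeyError(f"instructions not in the ISA table: {sorted(unknown)}")
    disabled = known - enabled
    all_components = set().union(*components_used.values())
    kept = set().union(*(components_used[i] for i in enabled)) if enabled else set()
    return sorted(disabled), kept, all_components - kept
```

For a benchmark that only executes add, lw, and sw, this model disables mult and strips the Mul component, which is the kind of area saving the per-benchmark processors aim for.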
Area of a Subsetted Processor
[Figure: Area Measurements for a Processor Subsetted Over Benchmark Set. Processor Area (LEs) for the ORIGINAL processor, the processor subsetted for each benchmark, and the AVERAGE]
Speed of a Subsetted Processor
[Figure: Fmax Measurements for a Processor Subsetted Over Benchmark Set. Processor Fmax (MHz), on an axis from 50 to 70 MHz, for the processor subsetted for each benchmark and the AVERAGE]
Conclusion
• Understanding architectural trade-offs => maximize efficiency
• Developed SPREE & measurement methodology
• Performed preliminary architectural study
  • Quantified cost of hardware multiplication
  • Explored shift unit implementations
  • Explored pipelines: depth, organization, forwarding