Improving Pipelined Soft Processors with Multithreading. Martin Labrecque, Gregory Steffan. ECE Dept., University of Toronto. Presented at RAAW 2006, Orlando, FL.
Improving Pipelined Soft Processors with Multithreading
Martin Labrecque, Gregory Steffan
ECE Dept. University of Toronto
Presented at RAAW 2006, Orlando, FL
2
Processors and FPGAs
FPGAs increasingly implement SoCs, with CPUs
Soft processors: processors in the FPGA fabric
[Figure: an FPGA containing custom logic and a processor; the processor datapath shows the PC, instruction memory, register array, ALU and data memory]
Soft processors are:
• Easier to program than HDL
• Customizable
3
Soft processors in Embedded Systems
What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?
We trade off 4 criteria (soft proc. power is related to area)
Area efficiency: a combined metric
Area efficiency = Performance / Area = (Instr. Count × Frequency) / (Cycle Count × Area)
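The combined metric can be sketched in a few lines of Python. This is only an illustration: the function name, the choice of MIPS per 1000 LEs as the unit, and the workload numbers are assumptions made for the example, not measurements from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Return area efficiency in MIPS per 1000 LEs (assumed unit)."""
    # Performance in MIPS: instructions executed per microsecond.
    # cycle_count / freq_mhz is the execution time in microseconds.
    mips = instr_count * freq_mhz / cycle_count
    # Normalize by area, expressed in thousands of logic elements.
    return mips / (area_les / 1000.0)


# Hypothetical workload: 1,000,000 instructions in 2,000,000 cycles
# on a 100 MHz processor occupying 1000 LEs.
print(area_efficiency(1_000_000, 2_000_000, 100.0, 1000))  # 50.0
```

Lowering the cycle count or the area raises the metric, and so does a higher clock frequency.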
4
Multithreading
Area efficiency = (Million Instr. × Frequency) / (# Cycles × Area)
Replace processor stalls: fill them with instructions from other threads
When to switch thread?
• Every instruction (e.g. Sun’s Niagara)
• Convenient technique for in-order processors
Fine-grained multithreading: 1 instr. per thread in round-robin
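A minimal sketch of the round-robin issue order used by fine-grained multithreading. The function name and the use of integer thread IDs are assumptions for illustration; the slides do not describe the selection logic in this form.

```python
from itertools import cycle, islice


def round_robin_issue(num_threads, num_cycles):
    """Return the ID of the thread that fetches on each cycle."""
    # One instruction per thread, in a fixed rotating order.
    return list(islice(cycle(range(num_threads)), num_cycles))


print(round_robin_issue(3, 7))  # [0, 1, 2, 0, 1, 2, 0]
```

With three threads, two instructions from other threads always sit between consecutive instructions of the same thread.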
5
Avoiding processor stall cycles
Data and control hazards create stall cycles
[Figure, before: traditional execution on a 3-stage pipeline (F, E, W) over time, with stall cycles between instructions]
Multithreading: execute streams of independent instructions
[Figure, after: the same 3-stage pipeline over time with instructions from Thread1, Thread2 and Thread3 interleaved]
Ideally, eliminates all stalls
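The effect on stall cycles can be illustrated with a toy cycle-count model. Its assumptions are not from the slides: every thread runs the same instruction stream, each instruction carries a fixed number of stall cycles that the next instruction of the same thread must wait out, and pipeline fill time is ignored.

```python
def cycles_single_threaded(stalls):
    """One issue cycle per instruction plus that instruction's stall cycles."""
    return sum(1 + s for s in stalls)


def cycles_multithreaded(stalls, num_threads):
    """Cycles for num_threads identical threads issued round-robin.

    Each round issues one instruction per thread (num_threads cycles). A hazard
    that needs s stall cycles is covered by the num_threads - 1 instructions
    from other threads; only the uncovered remainder stalls the pipeline.
    """
    return sum(num_threads + max(0, s - (num_threads - 1)) for s in stalls)


# Hypothetical stall cycles following each of four instructions.
stalls = [0, 2, 0, 1]
for n in (1, 2, 3):
    cycles = cycles_multithreaded(stalls, n)
    ipc = n * len(stalls) / cycles
    print(f"{n} thread(s): {cycles} cycles, IPC = {ipc:.2f}")
```

In this model three threads hide every stall of up to two cycles, so the pipeline reaches the ideal IPC of 1.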
6
How useful is multithreading?
Commercial SPs: single-threaded (Nios II, MicroBlaze)
Fort et al. [FCCM’06] have shown: a multithreaded SP is smaller than multiple SPs, with some performance degradation
We go further by showing that: the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP
Not straightforward; here is how we did it
7
Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
8
Single-Threaded Processor (simplified)
[Figure: single-threaded pipeline with PC, +4 incrementer, instruction memory, register array, ALU, data memory, hazard detection logic and forwarding lines]
9
2-Threaded Processor (simplified)
Replicate state for each thread
Simplify control logic
[Figure: 2-threaded pipeline with two PCs, +4 incrementer, instruction memory, register array, ALU, data memory, a control block and hazard detection logic]
10
Additional storage for multiple threads
N × (program counters, registers, data mem.)
More efficiently done in FPGA than in ASIC
Increase memory size while preserving frequency
Multithreading builds on the strengths of FPGAs
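One common way to replicate per-thread state is to enlarge a single memory and select each thread's slice with its thread ID. The sketch below assumes that scheme and a 32-register file per thread; the slides do not state how the authors laid out the storage, and all names here are hypothetical.

```python
REGS_PER_THREAD = 32  # assumption: 32 architectural registers per thread


def physical_reg(thread_id, reg):
    """Map (thread, architectural register) to an index in one shared RAM."""
    if not 0 <= reg < REGS_PER_THREAD:
        raise ValueError("register index out of range")
    return thread_id * REGS_PER_THREAD + reg


class ThreadState:
    """Replicated state for N threads: one PC each, one register slice each."""

    def __init__(self, num_threads):
        self.pc = [0] * num_threads
        self.regs = [0] * (num_threads * REGS_PER_THREAD)

    def write_reg(self, thread_id, reg, value):
        self.regs[physical_reg(thread_id, reg)] = value

    def read_reg(self, thread_id, reg):
        return self.regs[physical_reg(thread_id, reg)]


state = ThreadState(num_threads=2)
state.write_reg(1, 5, 42)
print(physical_reg(1, 5), state.read_reg(1, 5), state.read_reg(0, 5))  # 37 42 0
```

The memory grows N times deeper while the read and write ports stay the same.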
11
Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
12
Measurement Infrastructure
Single-thread processors: SPREE System [FPGA’06], producing RTL
Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
Modelsim RTL simulator: 1. Cycle count
Quartus II 5.0 CAD software, targeting Stratix 1S40C5: 2. Resource usage, 3. Clock frequency, 4. Power
We can measure area/performance/energy accurately
13
Evaluation methodology
Same benchmark running on all threads
– Some mixed-benchmark results in the paper
Run until completion of the last thread
Same instruction space
We present results with fixed-latency on-chip RAM
– We are implementing a solution for off-chip RAM
14
Processors: 3, 5 and 7 stages
F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
Pipe3: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)
Pipe5: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)
Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)
Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages
15
Area efficiency results
[Figure: bar chart of area efficiency (MIPS / 1000 LEs) for single-threaded vs. multithreaded (MT) 3-stage, 5-stage and 7-stage processors; MT gains are 33%, 77% and 106%]
Area efficiency is most improved with deeper pipelines
3- and 7-stages have similar area efficiency
16
IPC results for 3, 5 and 7 stages
[Figure: IPC (instructions/cycle) of pipe3_mt, pipe5_mt and pipe7_mt on bubble_sort, crc, des, fft, fir, quant, iquant, vlc, bitcnts and gol, plus the mean; ideal IPC = 1]
[Figure: mean IPC normalized to the single-threaded processor (IPC versus single-threaded proc.)]
24%, 45% and 104% more instructions per cycle, respectively
17
Improvements to the Baseline Multithreaded Soft Processors
Optimize away unpipelined multi-cycle paths
Selection of architectural features
1) Multiplier implementation
2) Number of registers
3) Number of threads
Combination of techniques optimizing area efficiency
18
1- Changing multiplication support
[Figure: register file, multiplier, Hi/Lo registers and MUX]
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (Nios II and MicroBlaze)
– Two instructions compute high and low parts
– Avoids replicating Hi and Lo register support
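A small sketch of how two 3-operand instructions can stand in for the Hi/Lo pair on 32-bit operands. The function names are illustrative, not real mnemonics, and only the unsigned case is shown.

```python
MASK32 = 0xFFFFFFFF


def mul_lo(a, b):
    """Low 32 bits of the 64-bit product (written to an ordinary register)."""
    return (a * b) & MASK32


def mul_hi_unsigned(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32


a, b = 0xFFFFFFFF, 2
print(hex(mul_hi_unsigned(a, b)), hex(mul_lo(a, b)))  # 0x1 0xfffffffe
```

Because each result goes to a general-purpose destination register, no per-thread Hi/Lo state is needed.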
19
2- Reducing the register file
Not all registers are utilized [RAAW’06]
Many threads can combine the savings
Results in saved memory blocks
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements
[Figure: two threads with registers 1..N each occupy 2N entries; trimming each to 1..N-k leaves 2N-2k entries]
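The combined saving can be estimated with simple arithmetic. The thread count, register counts and 32-bit word width below are hypothetical example values; whether the saving frees a whole memory block depends on the device's block size, which is not assumed here.

```python
def regfile_bits(num_threads, regs_per_thread, word_bits=32):
    """Total register-file storage when every thread has its own registers."""
    return num_threads * regs_per_thread * word_bits


# Hypothetical example: 5 threads, N = 32 registers, k = 8 registers removed.
full = regfile_bits(5, 32)
trimmed = regfile_bits(5, 32 - 8)
print(full, trimmed, full - trimmed)  # 5120 3840 1280
```

The saving grows linearly with the number of threads, since each thread drops the same k registers.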
20
Reducing the Number of Threads
• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
[Figure: instructions from Thread1, Thread2 and Thread3 in a 3-stage pipeline (F, E, W) over time]
Positive effect on the 5- and 7-stage processors
Helps meet processing latency deadline (shorter round-robin)
Gives designers more flexibility
21
Conclusions
Multithreaded SPs outperform single-threaded SPs
– Assumes independent threads
– Assumes use of on-chip memory
33%, 77% and 106% increase in area efficiency
Demonstrated that benefits increase with pipeline depth
Techniques to optimize away unpipelined multi-cycle paths
Selection and combination of architectural features
– Multiplier support
– Number of threads
– Number of registers
Commercial FPGA makers should have a multithreaded SP
22
Long term goals
Multiple multithreaded soft processors
Research using off-chip memory hierarchy
Study of synchronization mechanisms
Make it easy to target and scale up for non-HW people
Experimental Testbed: NetFPGA
Stanford/Xilinx platform
Collaboration with network researchers
Perform real high-bandwidth experiments
–Virtex-II Pro
–4 x 1 Gbps Ethernet
–PCI board
–64 MB DDR2 DRAM
23
Thank you
Martin Labrecque (martinl@eecg.utoronto.ca), Gregory Steffan
ECE Dept. University of Toronto
24
Where do threads come from?
Event processing, e.g. multiple sources of interrupts
Packet processing, e.g. CAN, RS-485, Ethernet, etc.
Systems handling requests, e.g. bus controllers
For now, we consider independent threads
25
SPREE vs Nios II [IEEE TCAD’07]
[Figure: scatter plot of geomean wall clock time (us) against area (equivalent LEs) for SPREE processors and Altera Nios II/e, Nios II/s and Nios II/f; arrows mark the smaller and faster directions]
26
Architectural Parameters Used in SPREE
Multiplication support: hardware FU or software routine
Shifter implementation: flip-flops, multiplier, or LUTs
Pipelining: depth (2-7 stages), forwarding lines
We focus on core microarchitecture (for now)
27
Contributions on Multithreaded Soft Processors
Multithreaded SPs dominate single-threaded processors in area and IPC
Demonstrated that these benefits increase with the # of pipeline stages
Explained techniques to optimize away unpipelined multi-cycle paths
Selection of architectural features: number of threads, number of registers, multiplier support
Combination of techniques that optimize area efficiency
28
Unpipelined Multicycle Paths
[Figure: ST and MT versions of a 3-stage pipeline; stage labels shown are F/D, R/EX, EX, M and WB]
Example of 3-stage pipeline with multicycle on load, store, shift and multiplies
Important source of IPC improvement
Not practical in ST because of hazard detection
29
Changing multiplication support
[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for Hi/Lo vs. 3op multiplies on the 3-stage, 5-stage and 7-stage processors]
For multithreaded SPs, 3op-multiplies always win
30
Reducing the Number of Threads
[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T]
Positive effect on the 5- and 7-stage processors
31
SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA, datapath)
■ Made of hand-coded components
■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation
■ Output: Synthesizable Verilog (RTL)
32
Multithreading
Area efficiency = (Million Instr. × Frequency) / (# Cycles × Area)
Replace processor stalls: fill them with instructions from other threads
When to switch thread?
• Multiple techniques
• Most common: every instruction (e.g. Sun’s Niagara)
Fine-grained multithreading: 1 instr. per thread in round-robin
[Figure: interleaved instructions in pipeline, T1 T2 T3 T1 T2 T3 over time]
34
Removed load and branch delay slots in the code