Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan

ECE Dept. University of Toronto

Presented at RAAW 2006, Orlando, FL

Processors and FPGAs

FPGAs increasingly implement SoCs, with CPUs

Soft processors: processors in the FPGA fabric

[Figure: FPGA containing custom logic and a soft processor datapath (instruction memory, register array, data memory)]

Soft processors are:
• Easier to program than HDL
• Customizable

Soft processors in Embedded Systems

What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?

We trade off 4 criteria (soft proc. power is related to area)

Area efficiency: a combined metric

Area efficiency = Performance / Area = (Instr. Count × Frequency) / (Cycle Count × Area)
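As a rough illustration of the combined metric, the following Python sketch computes area efficiency from the four measured quantities. The function name, the units (MHz, LEs) and the sample numbers are assumptions made for this example, not measurements from the talk.

```python
def area_efficiency(instr_count, cycle_count, frequency_mhz, area_les):
    """Return instructions per second per LE: (instr x freq) / (cycles x area)."""
    if cycle_count <= 0 or area_les <= 0:
        raise ValueError("cycle_count and area_les must be positive")
    # Performance in instructions per second: IPC times clock frequency.
    performance = instr_count / cycle_count * frequency_mhz * 1e6
    return performance / area_les


# Hypothetical baseline: 1M instructions in 2M cycles at 100 MHz on 1000 LEs.
baseline = area_efficiency(1_000_000, 2_000_000, 100.0, 1000)
# Hypothetical variant: fewer stall cycles, slightly larger area.
variant = area_efficiency(1_000_000, 1_250_000, 100.0, 1100)
print(f"relative gain: {variant / baseline - 1:.1%}")
```

A design that removes stall cycles can come out ahead on this metric even if it costs some extra area, as long as the cycle count drops faster than the area grows.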

Multithreading

Replace processor stalls: fill them with instructions from other threads

When to switch thread?
• Every instruction (e.g. Sun’s Niagara)
• Convenient technique for in-order processors

Fine-grained multithreading: 1 instr. per thread in round-robin

Area efficiency = (Million Instr. × Frequency) / (# Cycles × Area)

Avoiding processor stall cycles

Data and control hazards create stall cycles

Multithreading: execute streams of independent instructions

Ideally, eliminates all stalls

[Figure: pipeline diagrams (stages F to W over time) comparing traditional execution with multithreaded execution. Legend: Thread 1, Thread 2, Thread 3]
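The effect can be sketched with a toy issue model in Python. It is a simplification under stated assumptions, not the processors measured in the talk: every instruction is assumed to depend on the previous instruction of its own thread, a result becomes usable `depth` cycles after issue (no forwarding), and threads are selected in strict round-robin order. The function name `simulate` is hypothetical.

```python
def simulate(num_threads, depth, instrs_per_thread):
    """Return (issue_cycles, stall_cycles) for a strict round-robin pipeline."""
    if num_threads < 1 or depth < 1:
        raise ValueError("num_threads and depth must be at least 1")
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    cycle = 0
    stalls = 0
    while any(remaining):
        thread = cycle % num_threads  # fine-grained: switch every cycle
        if remaining[thread] and ready_at[thread] <= cycle:
            remaining[thread] -= 1
            ready_at[thread] = cycle + depth  # dependent instr. must wait
        else:
            stalls += 1  # bubble: nothing useful issued this cycle
        cycle += 1
    return cycle, stalls


print(simulate(1, 3, 4))  # one thread on a 3-stage pipeline
print(simulate(3, 3, 4))  # three threads on the same pipeline
```

In this model, with as many threads as pipeline stages, consecutive instructions of one thread are always far enough apart that the dependence never causes a bubble.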

How useful is multithreading?

Commercial SPs: single-threaded (Nios II, MicroBlaze)

Fort et al. [FCCM’06] have shown: a multithreaded SP is smaller than multiple SPs, with some performance degradation

We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP

Not straightforward; here is how we did it

Outline

• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Architectural Support for Multiple Threads

Single-Threaded Processor (simplified)

[Figure: single-threaded pipeline with instruction memory, register array, data memory and hazard detection logic]

2-Threaded Processor (simplified)

Replicate state for each thread

[Figure: 2-threaded pipeline with instruction memory, register array, data memory and hazard detection logic]

Simplify control logic

Additional storage for multiple threads

More efficiently done in FPGA than in ASIC

Increase memory size while preserving frequency

Program counters, registers, data memory

Multithreading builds on the strengths of FPGAs

Outline

• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Measurement Infrastructure

Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)

Single-threaded processors: SPREE System [FPGA’06]

Modelsim RTL simulator: 1. Cycle count

Quartus II 5.0 CAD software (Stratix 1S40C5): 2. Resource usage, 3. Clock frequency, 4. Power

We can measure area/performance/energy accurately

Evaluation methodology

• Same benchmark running on all threads
• Some mixed-benchmark results are in the paper
• Run until completion of the last thread
• Same instruction space
• We present results with fixed-latency on-chip RAM
• We are implementing a solution for off-chip RAM

Processors: 3, 5 and 7 stages

F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback

3-stage: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)

5-stage: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)

7-stage: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)

Best of each pipeline depth generated by SPREE

By default: thread count = number of pipeline stages

Area efficiency results

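[Figure: area efficiency of single-threaded vs. multithreaded (MT) processors for the 3-stage, 5-stage and 7-stage pipelines; labeled improvements include 33% and 77%]

Area efficiency is most improved with deeper pipelines

3- and 7-stage processors have similar area efficiency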

IPC results for 3, 5 and 7 stages

[Figure: IPC of pipe3_mt, pipe5_mt and pipe7_mt per benchmark (e.g. des, fft, vlc) and mean, versus the single-threaded processor]

Ideal IPC = 1

24%, 45% and 104% more instructions per cycle, respectively

Improvements to the Baseline Multithreaded Soft Processors

Optimize away unpipelined multi-cycle paths

Selection of architectural features
1) Multiplier implementation
2) Number of registers
3) Number of threads

Combination of techniques optimizing area efficiency

Optimize away unpipelined multi-cycle paths

1- Changing multiplication support

Multiplier

• Default MIPS has Hi/Lo registers
• 3-operand multiplies (Nios II and MicroBlaze)
– Two instructions compute high and low parts
– Avoids replicating Hi/Lo register support
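The difference between the two styles can be sketched in Python. This is an illustrative model only: the class and function names (`HiLoState`, `mult_hilo`, `mul_lo`, `mul_hi`) are hypothetical and do not correspond to actual MIPS, Nios II or MicroBlaze mnemonics. The point is that the Hi/Lo style needs extra per-thread state, while the 3-operand style returns each half directly to a destination register.

```python
MASK32 = 0xFFFFFFFF


class HiLoState:
    """Per-thread Hi/Lo registers; must be replicated for every thread."""

    def __init__(self):
        self.hi = 0
        self.lo = 0


def mult_hilo(state, a, b):
    """Hi/Lo style: one multiply writes both halves into special registers."""
    product = a * b  # a and b are signed Python ints
    state.lo = product & MASK32
    state.hi = (product >> 32) & MASK32


def mul_lo(a, b):
    """3-operand style: low 32 bits go straight to the destination."""
    return (a * b) & MASK32


def mul_hi(a, b):
    """3-operand style: a second instruction produces the high 32 bits."""
    return ((a * b) >> 32) & MASK32


state = HiLoState()
mult_hilo(state, 0x12345678, 0x10)
print(hex(state.hi), hex(state.lo))
print(hex(mul_hi(0x12345678, 0x10)), hex(mul_lo(0x12345678, 0x10)))
```

Both styles compute the same 64-bit product; only the 3-operand version does so without any state outside the general-purpose register file.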

2- Reducing the register file

• Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
• Results in saved memory blocks
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements

[Figure: per-thread register ranges reduced from 1..N to 1..N-k]
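A back-of-the-envelope Python sketch shows how small per-thread savings can add up to a whole memory block. The block capacity of 4096 bits, the 32-bit register width and the register counts are assumptions chosen for illustration; the real mapping also depends on port configuration and on how the CAD tool packs the register file.

```python
def register_file_blocks(num_threads, regs_per_thread,
                         reg_bits=32, block_bits=4096):
    """Return the number of memory blocks needed for one register file copy."""
    total_bits = num_threads * regs_per_thread * reg_bits
    # Ceiling division: a partially used block still occupies a whole block.
    return -(-total_bits // block_bits)


# Hypothetical 5-thread processor with full and reduced register sets.
full = register_file_blocks(5, 32)     # 5 * 32 * 32 = 5120 bits
reduced = register_file_blocks(5, 24)  # 5 * 24 * 32 = 3840 bits
print(full, reduced)
```

Under these assumptions, dropping a few registers per thread moves the combined register file under a block boundary, which is the kind of saving a single thread could not reach on its own.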

Reducing the Number of Threads

• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
• Positive effect on the 5- and 7-stage processors
• Helps meet processing latency deadline (shorter round-robin)
• Gives designers more flexibility

[Figure legend: Thread 1, Thread 2, Thread 3]
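The "shorter round-robin" point can be made concrete with a small Python calculation. It assumes strict round-robin issue and, for simplicity, that the clock frequency is unchanged when a thread is removed; the function name and the 100 MHz figure are hypothetical.

```python
def per_thread_issue_interval_ns(num_threads, frequency_mhz):
    """Return the time between two issue slots of the same thread, in ns."""
    if num_threads < 1 or frequency_mhz <= 0:
        raise ValueError("num_threads must be >= 1 and frequency positive")
    cycle_time_ns = 1000.0 / frequency_mhz
    # In strict round-robin, a thread gets one slot every num_threads cycles.
    return num_threads * cycle_time_ns


# Hypothetical 5-stage processor at 100 MHz with 5 and then 4 threads.
five_threads = per_thread_issue_interval_ns(5, 100.0)
four_threads = per_thread_issue_interval_ns(4, 100.0)
print(five_threads, four_threads)
```

With fewer threads in the rotation, each thread is serviced more often, which is what helps a single thread meet a processing latency deadline.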

Conclusions

• Multithreaded SPs outperform single-threaded SPs
– Assumes independent threads
– Assumes use of on-chip memory
• 33%, 77% and 106% increase in area efficiency
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
– Multiplier support
– Number of threads
– Number of registers
• Commercial FPGA makers should have a multithreaded SP

Long term goals

• Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make it easy to target and scale up for non-HW people

Experimental Testbed: NetFPGA

• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high-bandwidth experiments

– Virtex-II Pro
– 4 x 1 Gbps Ethernet
– PCI board
– 64 MB DDR2 DRAM

Thank you

Martin Labrecque (martinl@eecg.utoronto.ca), Gregory Steffan

ECE Dept. University of Toronto

Where do threads come from?

Event processing, e.g. multiple sources of interrupts

Packet processing, e.g. CAN, RS-485, Ethernet, etc.

Systems handling requests, e.g. bus controllers

For now, we consider independent threads

SPREE vs Nios II [IEEE TCAD’07]

[Figure: SPREE processors plotted against Altera Nios II/e, Nios II/s and Nios II/f; x-axis: Area (Equivalent LEs), 500 to 1900; arrows mark the "smaller" and "faster" directions]

Architectural Parameters Used in SPREE

We focus on core microarchitecture (for now)

Multiplication support: hardware FU or software routine

Shifter implementation: flip-flops, multiplier, or LUTs

Pipelining: depth (2-7 stages), forwarding lines

Contributions on Multithreaded Soft Processors

Multithreaded SPs dominate single-threaded processors in area and IPC

Demonstrated that these benefits increase with the # of pipeline stages

Explained techniques to optimize away unpipelined multi-cycle paths

Selection of architectural features: number of threads, number of registers, multiplier support

Combination of techniques that optimize area efficiency

Unpipelined Multicycle Paths

[Figure: pipeline diagrams with stages F/D, R/EX, EX and F/D, R/EX, M, WB]

Important source of IPC improvement

Not practical in ST because of hazard detection

Example of a 3-stage pipeline with multi-cycle loads, stores, shifts and multiplies

Changing multiplication support

[Figure: area, frequency and energy per instruction for Hi/Lo vs. 3-operand (3op) multiplies on the 3-stage, 5-stage and 7-stage processors]

For multithreaded SPs, 3op-multiplies always win

Reducing the Number of Threads

[Figure: results for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T; metrics include frequency and energy per instruction]

Positive effect on the 5 and 7-stage processors

SPREE System (Soft Processor Rapid Exploration Environment)

■ Input: Processor description
■ Datapath: made of hand-coded components
■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation
■ Output: Synthesizable Verilog

Multithreading

Replace processor stalls: fill them with instructions from other threads

When to switch thread?
• Multiple techniques
• Most common: every instruction (e.g. Sun’s Niagara)

Fine-grained multithreading: 1 instr. per thread in round-robin

[Figure: interleaved instructions in the pipeline over time: T1 T2 T3 T1 T2 T3]

Area efficiency = (Million Instr. × Frequency) / (# Cycles × Area)


Removed load and branch delay slots in the code
