Reducing Issue Logic Complexity in Superscalar Microprocessors

Survey Project
CprE 585 – Advanced Computer Architecture

David Lastine
Ganesh Subramanian

Introduction

The ultimate goal of any computer architect – designing a fast machine
Approaches
  Increasing clocking rate (help from VLSI)
  Increasing bus width
  Increasing pipeline depth
  Superscalar architectures
Tradeoffs between hardware complexity and clock speed
  Given a particular technology, the more complex the hardware, the lower the clocking rate

A New Paradigm

Retaining the effective functionality of complex superscalar processors
Target the bottleneck in present day microprocessors
  Instruction scheduling is the throughput limiter
  Need to effectively handle register renaming, issue window, and wakeup and select logic
Increase the clocking rate
  Rethinking circuit design methodologies
  Modifying architectural design strategies
Wanting to have the cake and eat it too?
  Aim at reducing power consumption too

Approaches to Handle Issue Logic Complexity

Performance = IPC * Clock Frequency
Pipelining scheduling logic reduces the IPC
Non-pipelined scheduling logic reduces clocking rate
Architectural solutions

Non-pipelined scheduling with dependence queue based issue logic – Complexity Effective [1]

Pipelined scheduling with speculative wakeup [2]

Generic speed up and power conservation using tag elimination [3]
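
The tradeoff in Performance = IPC * Clock Frequency can be illustrated with a small sketch. The two design points and all of their numbers are hypothetical, chosen only to show how a lower IPC can still win if the clock is fast enough; they are not results from [1], [2] or [3].

```python
def performance(ipc: float, clock_mhz: float) -> float:
    """Millions of instructions per second: IPC times clock frequency in MHz."""
    return ipc * clock_mhz


# Hypothetical design points, chosen only to illustrate the tradeoff.
designs = {
    "non-pipelined scheduling": {"ipc": 2.0, "clock_mhz": 500.0},
    "pipelined scheduling": {"ipc": 1.8, "clock_mhz": 650.0},
}

results = {name: performance(**d) for name, d in designs.items()}
for name, mips in results.items():
    print(f"{name}: {mips:.0f} MIPS")
```

With these assumed numbers the pipelined design loses 10% of its IPC but gains 30% in clock rate, so its product is larger.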

Baseline Superscalar Model

The rename and the wake-up select stages of the generic superscalar pipeline model need to be targeted

Consider VLSI effects and decide to redesign a particular design component

Analyzing Baseline Implementations

Physical layout implementation of microprocessor circuits optimized for speed
  Usage of dynamic logic for bottleneck circuits
  Manual sizing of transistors in critical path
  Logic optimizations like two level decomposition
Components analyzed
  Register rename logic
  Wakeup logic / issue window
  Selection logic
  Bypass logic

Register Rename Logic

RAM vs. CAM
  Focus on RAM due to scalability
Decreasing feature sizes scale down logic delays, but do not correspondingly scale down wire delays
Delay is a quadratic function of issue width, but effectively linear
Need to handle wordline and bitline delays in future

Wakeup Logic

CAM is preferred
Tag drive times are quadratic functions of window size as well as issue width
Matching times are quadratic functions of issue width only
All delays are effectively linear for considered design space
Need to handle broadcast operation delays in future
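
The broadcast-and-match behaviour of a CAM-style wakeup array can be sketched as follows. This is a minimal software model, not the circuit analyzed in [1]; the names `WindowEntry` and `broadcast` are made up for illustration. The returned comparison count only shows why the work grows with both window size and issue width.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    """One issue-window entry; each source operand has its own tag comparator."""
    name: str
    src_tags: list[int]
    ready: list[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.ready:
            self.ready = [False] * len(self.src_tags)

    def all_ready(self) -> bool:
        return all(self.ready)


def broadcast(window: list[WindowEntry], result_tags: list[int]) -> int:
    """Compare every broadcast tag against every operand tag in the window.

    Sets the ready bit on a match and returns the number of comparisons,
    which is (operand slots in the window) * (tags broadcast this cycle).
    """
    comparisons = 0
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            for result in result_tags:
                comparisons += 1
                if tag == result:
                    entry.ready[i] = True
    return comparisons


window = [
    WindowEntry("add", src_tags=[1, 2]),
    WindowEntry("sub", src_tags=[2, 3]),
]
n = broadcast(window, result_tags=[1, 2])
print(n, window[0].all_ready(), window[1].all_ready())
```

Here "add" wakes up because both of its tags were broadcast, while "sub" still waits for tag 3.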

Selection Logic

Tree of arbiters
Requests flow from the issue window toward the root arbiter, while functional unit grants flow back to the issue window
Necessity of a selection policy (Oldest First / Leftmost First)
Delays proportional to the logarithm of the window size
All delays considered are logic delays
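
A tree of arbiters with a leftmost-first policy can be sketched as below. This is an illustrative model that assumes 2-input arbiter cells and a single functional unit; the function names are hypothetical and the cell fan-in used in [1] may differ.

```python
import math
from typing import List, Optional


def tree_depth(window_size: int) -> int:
    """Arbiter levels between the issue window and the root, for 2-input cells."""
    return math.ceil(math.log2(window_size)) if window_size > 1 else 0


def select_leftmost(requests: List[bool]) -> Optional[int]:
    """Return the window index that receives the grant, or None if nothing requests."""

    def grant(lo: int, hi: int) -> int:
        # Grant phase: each arbiter forwards the grant to its left half if
        # that half raised a request, otherwise to its right half.
        if hi - lo == 1:
            return lo
        mid = (lo + hi) // 2
        if any(requests[lo:mid]):
            return grant(lo, mid)
        return grant(mid, hi)

    # Request phase, collapsed: the root sees the OR of all requests.
    if not any(requests):
        return None
    return grant(0, len(requests))


granted = select_leftmost([False, False, True, False, True, False, False, False])
print(granted, tree_depth(8))
```

The number of levels a request and its grant must cross grows with the logarithm of the window size, which matches the delay trend stated above.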

Bypass Logic

Number of bypass paths dependent upon pipeline depth (linear) and issue width (quadratic)

Composed of operand muxes and buffer drivers

Delays are quadratically proportional to length of result wires and hence issue width

Insignificant compared to other delays as feature size reduces

Complexity Effective Microarchitecture Design Premises

Retain benefits of complex issue schemes but enable faster clocking

Design assumption: Should not pipeline wakeup + select, or data bypassing, as these are atomic operations (if dependent instructions are to execute in consecutive cycles)

Dependence Based Microarchitecture

Replace Issue Window by FIFOs with each queue composed of dependent instructions

Steer instructions to the appropriate FIFO in rename stage using heuristics

‘SRC_FIFO’ and ‘Reservation Table’ to handle dependencies and wakeup

IPC reduces but clocking rate increases to give a faster implementation
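
One possible steering heuristic can be sketched as follows. It is a simplified reading of the dependence-based scheme in [1], under stated assumptions: an instruction follows a producer only if that producer is still at the tail of its FIFO, otherwise it takes an empty FIFO, and it stalls if none is free. Issue (removal from the FIFO heads) is not modeled, and the names `steer` and `src_fifo` are hypothetical stand-ins, with `src_fifo` playing the role of the SRC_FIFO table.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Instr = Tuple[str, List[str]]  # (destination register, pending source registers)


def steer(fifos: List[Deque[str]], src_fifo: Dict[str, int], instr: Instr) -> Optional[int]:
    """Pick a FIFO index for instr, or return None to signal a dispatch stall."""
    dest, sources = instr
    choice = None
    for src in sources:
        idx = src_fifo.get(src)
        # Follow a producer only if it is still the tail of its FIFO.
        if idx is not None and fifos[idx] and fifos[idx][-1] == src:
            choice = idx
            break
    if choice is None:
        empty = [i for i, q in enumerate(fifos) if not q]
        if not empty:
            return None  # no suitable FIFO: stall
        choice = empty[0]
    fifos[choice].append(dest)
    src_fifo[dest] = choice
    return choice


fifos: List[Deque[str]] = [deque() for _ in range(2)]
src_fifo: Dict[str, int] = {}
placements = [
    steer(fifos, src_fifo, ("p1", [])),      # no pending sources: empty FIFO
    steer(fifos, src_fifo, ("p2", ["p1"])),  # p1 is a tail: queue behind it
    steer(fifos, src_fifo, ("p3", ["p1"])),  # p1 is no longer a tail: new FIFO
    steer(fifos, src_fifo, ("p4", ["p1"])),  # no tail match, no empty FIFO: stall
]
print(placements)
```

Because each FIFO holds a chain of dependent instructions, only the FIFO heads need to be examined for issue in this model.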

Clustering Dependence Based Microarchitectures

Reducing bypass delays by reducing length of bypass paths

Minimization of inter-cluster communication, extra cycle penalty otherwise

Clustered Microarchitecture Types

Single Window, Execution Driven Steering

Two Windows, Dispatch Driven Steering - Best

Two Windows, Random Steering

Pipelining Dynamic Instruction Scheduling Logic

Wakeup+Select was held atomic in previous implementation

Increase performance by pipelining it, but retain execution of dependent instruction in consecutive cycles

Speculate on the wakeup by predicting based on both parent and grandparent instructions

Integrated into the Tomasulo approach

Wakeup Logic Details

Tag broadcast as soon as instruction begins execution
Broadcast-to-execution-completion latency specified as shown
Match bit acts as the sticky bit to enable delay countdown
Need not always be correct due to unexpected stalls
Select logic remains as in previous work

Pipelining Rename Logic

A child instruction assumes that its parent will broadcast its tag in the next cycle if the grandparent instructions broadcast their tags

The child wakes up speculatively on receiving the grandparent tags, so that it can be selected in the next cycle

Speculative since parent selection for execution is not guaranteed

Modifications in rename map and dependency analysis logic

Wakeup and Select Logic

Wakeup request sent after looking into ready bits from the parents’ and grandparents’ tags
  A multi-cycle parent’s field can be ignored
In addition to speculative readiness signified by request line, a confirm line is activated when all parents are ready
False selection involves non-confirmed requests
Problematic only when really ready instructions are not selected
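
The request and confirm lines can be sketched as two boolean functions over ready bits. This is an illustrative model and not the circuit from [2]; the `Parent` record and the function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Parent:
    ready: bool                     # the parent's own tag has been broadcast
    grandparents_ready: List[bool]  # ready bits for that parent's own sources


def request(parents: List[Parent]) -> bool:
    """Speculative readiness: every parent is ready, or all of its parents are.

    A parent with no pending grandparents is treated as about to be selected.
    """
    return all(p.ready or all(p.grandparents_ready) for p in parents)


def confirm(parents: List[Parent]) -> bool:
    """Non-speculative readiness: every parent has actually broadcast its tag."""
    return all(p.ready for p in parents)


def is_unconfirmed_request(parents: List[Parent]) -> bool:
    """True when a selection of this instruction would be a false selection."""
    return request(parents) and not confirm(parents)


speculative = [Parent(True, []), Parent(False, [True, True])]
waiting = [Parent(False, [True, False])]
print(request(speculative), confirm(speculative), request(waiting))
```

In this model a grant given to an instruction with request asserted but confirm deasserted is a false selection, which costs performance only if it displaced an instruction that was truly ready.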

Implementation & Experimentation Details

Usage of a cycle accurate execution driven simulator for the Alpha ISA

Baseline – conventionally scheduled (2) pipeline
Budget / Deluxe – speculatively woken up scheduling
Ideal – 1 cycle scheduling pipeline

Factors like issue width and reservation station depth considered

Significant reduction in critical path with minor IPC impacts

Enables higher clock frequencies, deeper pipelines and larger instruction windows for better performance

Paradigm shift

So far we’ve added hardware to improve performance

However, the issue window can also be improved by removing hardware

Current Situation of Issue Windows

Content Addressable Memory (CAM) latency dominates instruction window latency.

Load capacitance of CAM is a major limiting factor for speed.

Parasitic capacitance also wastes power.
Issue logic uses a lot of the power budget
  16% for the Pentium Pro
  18% for the Alpha 21264

Unnecessary Circuitry

Observation: Reservation stations compare broadcast tags to both operands. Often, this is unnecessary.

Only 25% to 35% of architectural instructions have two operands.

Simulation of SPEC2000 programs shows only 10% to 20% of instructions need two comparators during runtime.

Simulation

Used SimpleScalar
Varied instruction window size: 16, 64, 256
Load/store queue of half the window size

Removing extra comparators

Specialize the reservation stations.
  Number of comparators varies by station from 2 to 0.
  Stall if no station with at least the required number of comparators is available
Remove some operands by speculating on last operand to complete.
  Needs predictor
  Mispredict penalty
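
Allocation among specialized reservation stations can be sketched as below. The policy shown, which picks the free station with the fewest comparators that still covers the instruction's unready operands, is an assumption made for illustration and is not claimed to be the exact policy of [3]; `allocate` and `free` are hypothetical names.

```python
from typing import Dict, Optional


def allocate(free: Dict[int, int], unready_operands: int) -> Optional[int]:
    """Pick a free station with enough comparators; None means dispatch stalls.

    free maps a comparator count (0, 1 or 2) to the number of free stations
    of that kind. The chosen kind is returned and its free count is reduced.
    """
    for comparators in range(unready_operands, 3):
        if free.get(comparators, 0) > 0:
            free[comparators] -= 1
            return comparators
    return None


free = {2: 1, 1: 2, 0: 1}
choices = [
    allocate(free, 1),  # one unready operand: a one-comparator station suffices
    allocate(free, 2),  # two unready operands: needs the two-comparator station
    allocate(free, 2),  # no two-comparator station left: stall
    allocate(free, 0),  # all operands ready: zero-comparator station
    allocate(free, 0),  # zero-comparator stations exhausted: fall back to one
]
print(choices)
```

The third request shows the stall case from the slide: two-comparator stations are the scarce resource once most stations have been slimmed down.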

Predictor

Paper discusses GSHARE predictor
It is based on a branch predictor not seen in class.
Idea behind it starts by noting good indexes for selecting binary predictors are
  Branch address
  Global history
Thus if both are good, XORing them together should produce an index embodying more information than either alone.
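
The XOR indexing idea can be sketched with a small table of 2-bit saturating counters. This is a minimal model in the spirit of the gshare scheme from [4]; the table size, the initial counter value and the use of the raw address without dropping low-order bits are all assumptions. For last-tag prediction the binary outcome would encode which operand arrives last rather than taken or not taken.

```python
class Gshare:
    """Table of 2-bit saturating counters indexed by address XOR global history."""

    def __init__(self, index_bits: int = 10) -> None:
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # start in the weak "False" state
        self.history = 0

    def index(self, address: int) -> int:
        # XOR folds the address and the global history into one table index.
        return (address ^ self.history) & self.mask

    def predict(self, address: int) -> bool:
        return self.table[self.index(address)] >= 2

    def update(self, address: int, outcome: bool) -> None:
        i = self.index(address)
        if outcome:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(outcome)) & self.mask


g = Gshare(index_bits=4)
before = g.predict(0b0110)
for _ in range(10):
    g.update(0b0110, True)
after = g.predict(0b0110)
print(before, after)
```

The same address maps to different counters under different histories, which is how the predictor separates contexts that an address-only index would merge.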

Predictor II

Here is how GSHARE does for various sizes of the prediction table.

Misprediction

Alpha has scoreboard of valid registers called RDY.

Check if all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction.

RDY must be expanded to have the number of read ports match the issue width.

IPC losses

Reservation stations with two ports can be exhausted. Causes stalls for SPEC2000 benchmarks like SWIM

Adding last tag prediction improves SWIM performance but causes 1-3% losses for benchmarks such as Crafty and Gcc due to misprediction

Simulation

Format shown is the number of two tag / one tag / zero tag reservation stations
Last tag predictor used only on entries with no two tag reservation stations.

Benefits of comparator removal

In most cases clock rate can be 25-45% faster since
  Tag bus no longer must reach all reservation stations
  Removing comparators removes load capacitance
Energy saved from capacitance removal is 30-60%
Power savings don’t track energy savings since clock rate can now increase.

Simulation results for benefits

References

1. Complexity-effective superscalar processors. Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith

2. On pipelining dynamic instruction scheduling logic. J. Stark, M. D. Brown, and Yale N. Patt

3. Efficient Dynamic Scheduling Through Tag Elimination. Dan Ernst and Todd Austin

4. Combining Branch Predictors. Scott McFarling

Questions?
