Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone:Cyclone:A Low-Complexity Broadcast-Free Dynamic A Low-Complexity Broadcast-Free Dynamic

Instruction SchedulerInstruction Scheduler

Dan Ernst - Andrew Hamel - Todd Austin

Advanced Computer Architecture Lab

The University of Michigan


Challenges in High-Speed Dynamic SchedulingChallenges in High-Speed Dynamic Scheduling• Broadcast-based dynamic scheduler circuits are:

– High complexity– Power-hungry– Scale poorly

• Global synchronization is becoming increasingly expensive– More Pipeline Stages + Slow Long Wires + Increasing Clock Speeds

= Difficult Global Signal Design– Example: Pipeline stalling

• Memory scheduling is a “second class citizen”– Non-deterministic latencies don’t fit well into current popular dynamic

scheduling paradigm


GoalsGoals

1) Design a competitive, completely broadcast-free scheduler- Minimize global synchronization

2) Address memory scheduling in a “first class” way

3) Minimize “loose loops”


Difference in ApproachesDifference in ApproachesFrom an instruction’s point of view… :

Scheduling is just figuring out how long to wait.

• Broadcast approach– Instruction’s schedule is “recomputed” every cycle– Polling (“can I go now? How about now?”)

• Cyclone approach– Schedule based on a single timing computation– Instruction is given an execution time once, so no re-computation needed– Put in a timed “router” to execute the schedule as best it can


Conceptual OverviewConceptual Overview

TimingPredictor

Routing/TimingNetwork

DependenceC

heck

FUI I@txI@tx+

I@tx’


Pre-scheduler DesignPre-scheduler DesignI0

I1

I2

I3

PSCHED0

max

max

+

reschedule?

timing table

PSCHED1

I0

I2I3

16

Example Schedule

a) b)

max

+

max

+

depcheck

MUX control

I1

8 6 2

717

18 8

47


Cyclone SchedulerCyclone Scheduler

replay?

fnunits

register fileready bits

bypass

REG EX/MEMSCHED

instructionpre-schedulerstore set

predictor

branchpredictor

countdown/replay queue

main queue

(includes timing information)

switchbackdatapaths

I0

2 4

5

43

2 1

1

Not ready!Ready!


Cyclone – Switchback ConflictCyclone – Switchback Conflict

replay?

fnunits


bypass

REG EX/MEMSCHED


predictor

branchpredictor


main queue


switchbackdatapaths

43

2


Cyclone – Switchback ConflictCyclone – Switchback Conflict

replay?

fnunits


bypass

REG EX/MEMSCHED


predictor

branchpredictor


main queue


switchbackdatapaths

3210


Architectural MethodologyArchitectural Methodology• Baseline architectural model

– Derived from SimpleScalar 3.0– More sophisticated scheduling support

• Separated ROB and RS• Variable-length pipelines• Selective scheduler replay on memory latency misprediction

– Store Set predictor

• Cyclone model– Replaced scheduling portion of pipeline with Cyclone model– Added timing information to store set predictor

• Simulated SPEC2000 (INT and FP)


Cyclone IPCCyclone IPC

0

0.5

1

1.5

2

2.5

3

3.5

art equake facerec mesa swim FP AVG bzip2 gap gcc gzip vortex INTAVG

Ins

tru

cti

on

s P

er

Cy

cle

32-entry 8-wide baseline

Cyclone

Cyclone w/ DoubleSwitchback PathsCyclone w/ RetimingTableCyclone w/ both


Circuit Timing and Area MethodologyCircuit Timing and Area Methodology• Timing – SPICE models

– Critical paths of Cyclone • Switchback paths were very fast• Pre-scheduler dependence check was the critical path

– CAM-style broadcast windows• Used models from last year’s ISCA (Tag Elimination)

– Both used TSMC 0.18m process at 1.8 V– Presented here as Throughput (IPns)

• Area Analysis – Register Bit Equivalent (RBE)– Process-independent analytical model of chip area– Assumed RAM/CAM area scaled quadratically with number of ports– Modeled scheduler structures and extra tables (also RF)– More information in Mulder, Quach, and Flynn. [17]


4-wide Complexity Analysis4-wide Complexity Analysis

16-entry 4-issue

32-entry 4-issue

64-entry 4-issue

128-entry 4-issue

Cyclone 4-decode 4-issue

0

1

2

3

4

5

6

7

0 10000 20000 30000 40000 50000 60000 70000 80000

Register Bit Equivalent (RBE) Area

Sc

he

du

ler

Th

rou

gh

pu

t -

(IP

ns

)


8-wide Complexity Analysis8-wide Complexity Analysis

16-entry 8-issue

32-entry 8-issue

64-entry 8-issue

128-entry 8-issue



0

1

2

3

4

5

6

7

8

0 50000 100000 150000 200000 250000 300000


Sc

he

du

ler

Th

rou

gh

pu

t (I

Pn

s)


Design Space OverviewDesign Space Overview

16-entry 4-issue

32-entry 4-issue

64-entry 4-issue

128-entry 4-issue


16-entry 8-issue

32-entry 8-issue

64-entry 8-issue

128-entry 8-issue



0

1

2

3

4

5

6

7

8

0 100000 200000 300000 400000 500000 600000 700000


Sc

he

du

ler

Th

rou

gh

pu

t (I

Pn

s)


Complexity OptionsComplexity Options

• Run at higher frequency– Deeper pipelines

• Make the total scheduler size larger– Increase IPC

• Run at same frequency– Much lower power


ConclusionsConclusions

• Competitive broadcast-free scheduling– Allows high speed circuits at the expense of IPC– Saves chip area

• Power savings…

• Alternative to stalling– Avoid broadcasting across stages by using the replay mechanism


Future Directions…Future Directions…

• Close the IPC gap– Wider queues?

• Complete Power analysis– Trade-off between size and activity rate

• Further opportunities to pipeline the control system– Global synchronization without fast global communication



Cyclone ExtrasCyclone ExtrasImpact of Schedule Optimizations on IPC

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

appl

u00

craf

ty00

eon0

0

fma3

d00

gcc0

0

gzip0

0

luca

s00

pars

er00

swim

00

wupwise

00 AVG

IPC

Cylcone 8-w ide

w / Retiming Table

w / Double Sw itch


Current/Future WorkCurrent/Future Work• Pipelined Global Control – Low Power• Razor

– Average-case design opportunities

• Simple and effective selective replay implementations (WDDD)– Spawned from previous work (Tag Elimination – ISCA ’02)

• Removing as much global control as possible from pipelines

Documents

Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction