21
Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: Cyclone: A Low-Complexity Broadcast-Free A Low-Complexity Broadcast-Free Dynamic Instruction Scheduler Dynamic Instruction Scheduler Dan Ernst - Andrew Hamel - Todd Austin Advanced Computer Architecture Lab The University of Michigan

Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Embed Size (px)

Citation preview

Page 1: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone:Cyclone:A Low-Complexity Broadcast-Free Dynamic A Low-Complexity Broadcast-Free Dynamic

Instruction SchedulerInstruction Scheduler

Dan Ernst - Andrew Hamel - Todd Austin

Advanced Computer Architecture Lab

The University of Michigan

Page 2: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Challenges in High-Speed Dynamic SchedulingChallenges in High-Speed Dynamic Scheduling• Broadcast-based dynamic scheduler circuits are:

– High complexity– Power-hungry– Scale poorly

• Global synchronization is becoming increasingly expensive– More Pipeline Stages + Slow Long Wires + Increasing Clock Speeds

= Difficult Global Signal Design– Example: Pipeline stalling

• Memory scheduling is a “second class citizen”– Non-deterministic latencies don’t fit well into current popular dynamic

scheduling paradigm

Page 3: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

GoalsGoals

1) Design a competitive, completely broadcast-free scheduler- Minimize global synchronization

2) Address memory scheduling in a “first class” way

3) Minimize “loose loops”

Page 4: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Difference in ApproachesDifference in ApproachesFrom an instruction’s point of view… :

Scheduling is just figuring out how long to wait.

• Broadcast approach– Instruction’s schedule is “recomputed” every cycle– Polling (“can I go now? How about now?”)

• Cyclone approach– Schedule based on a single timing computation– Instruction is given an execution time once, so no re-computation needed– Put in a timed “router” to execute the schedule as best it can

Page 5: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Conceptual OverviewConceptual Overview

TimingPredictor

Routing/TimingNetwork

DependenceC

heck

FUI I@txI@tx+

I@tx’

Page 6: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Pre-scheduler DesignPre-scheduler DesignI0

I1

I2

I3

PSCHED0

max

max

+

reschedule?

timing table

PSCHED1

I0

I2I3

16

Example Schedule

a) b)

max

+

max

+

depcheck

MUX control

I1

8 6 2

717

18 8

47

Page 7: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone SchedulerCyclone Scheduler

replay?

fnunits

register fileready bits

bypass

REG EX/MEMSCHED

instructionpre-schedulerstore set

predictor

branchpredictor

countdown/replay queue

main queue

(includes timing information)

switchbackdatapaths

I0

2 4

5

43

2 1

1

Not ready!Ready!

Page 8: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone – Switchback ConflictCyclone – Switchback Conflict

replay?

fnunits

register fileready bits

bypass

REG EX/MEMSCHED

instructionpre-schedulerstore set

predictor

branchpredictor

countdown/replay queue

main queue

(includes timing information)

switchbackdatapaths

43

2

Page 9: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone – Switchback ConflictCyclone – Switchback Conflict

replay?

fnunits

register fileready bits

bypass

REG EX/MEMSCHED

instructionpre-schedulerstore set

predictor

branchpredictor

countdown/replay queue

main queue

(includes timing information)

switchbackdatapaths

3210

Page 10: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Architectural MethodologyArchitectural Methodology• Baseline architectural model

– Derived from SimpleScalar 3.0– More sophisticated scheduling support

• Separated ROB and RS• Variable-length pipelines• Selective scheduler replay on memory latency misprediction

– Store Set predictor

• Cyclone model– Replaced scheduling portion of pipeline with Cyclone model– Added timing information to store set predictor

• Simulated SPEC2000 (INT and FP)

Page 11: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone IPCCyclone IPC

0

0.5

1

1.5

2

2.5

3

3.5

art equake facerec mesa swim FP AVG bzip2 gap gcc gzip vortex INTAVG

Ins

tru

cti

on

s P

er

Cy

cle

32-entry 8-wide baseline

Cyclone

Cyclone w/ DoubleSwitchback PathsCyclone w/ RetimingTableCyclone w/ both

Page 12: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Circuit Timing and Area MethodologyCircuit Timing and Area Methodology• Timing – SPICE models

– Critical paths of Cyclone • Switchback paths were very fast• Pre-scheduler dependence check was the critical path

– CAM-style broadcast windows• Used models from last year’s ISCA (Tag Elimination)

– Both used TSMC 0.18m process at 1.8 V– Presented here as Throughput (IPns)

• Area Analysis – Register Bit Equivalent (RBE)– Process-independent analytical model of chip area– Assumed RAM/CAM area scaled quadratically with number of ports– Modeled scheduler structures and extra tables (also RF)– More information in Mulder, Quach, and Flynn. [17]

Page 13: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

4-wide Complexity Analysis4-wide Complexity Analysis

16-entry 4-issue

32-entry 4-issue

64-entry 4-issue

128-entry 4-issue

Cyclone 4-decode 4-issue

0

1

2

3

4

5

6

7

0 10000 20000 30000 40000 50000 60000 70000 80000

Register Bit Equivalent (RBE) Area

Sc

he

du

ler

Th

rou

gh

pu

t -

(IP

ns

)

Page 14: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

8-wide Complexity Analysis8-wide Complexity Analysis

16-entry 8-issue

32-entry 8-issue

64-entry 8-issue

128-entry 8-issue

Cyclone 8-decode 8-issue

Cyclone 4-decode 8-issue

0

1

2

3

4

5

6

7

8

0 50000 100000 150000 200000 250000 300000

Register Bit Equivalent (RBE) Area

Sc

he

du

ler

Th

rou

gh

pu

t (I

Pn

s)

Page 15: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Design Space OverviewDesign Space Overview

16-entry 4-issue

32-entry 4-issue

64-entry 4-issue

128-entry 4-issue

Cyclone 4-decode 4-issue

16-entry 8-issue

32-entry 8-issue

64-entry 8-issue

128-entry 8-issue

Cyclone 4-decode 8-issue

Cyclone 8-decode 8-issue

0

1

2

3

4

5

6

7

8

0 100000 200000 300000 400000 500000 600000 700000

Register Bit Equivalent (RBE) Area

Sc

he

du

ler

Th

rou

gh

pu

t (I

Pn

s)

Page 16: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Complexity OptionsComplexity Options

• Run at higher frequency– Deeper pipelines

• Make the total scheduler size larger– Increase IPC

• Run at same frequency– Much lower power

Page 17: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

ConclusionsConclusions

• Competitive broadcast-free scheduling– Allows high speed circuits at the expense of IPC– Saves chip area

• Power savings…

• Alternative to stalling– Avoid broadcasting across stages by using the replay mechanism

Page 18: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Future Directions…Future Directions…

• Close the IPC gap– Wider queues?

• Complete Power analysis– Trade-off between size and activity rate

• Further opportunities to pipeline the control system– Global synchronization without fast global communication

Page 19: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Page 20: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Cyclone ExtrasCyclone ExtrasImpact of Schedule Optimizations on IPC

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

4.00

appl

u00

craf

ty00

eon0

0

fma3

d00

gcc0

0

gzip0

0

luca

s00

pars

er00

swim

00

wupwise

00 AVG

IPC

Cylcone 8-w ide

w / Retiming Table

w / Double Sw itch

Page 21: Dan Ernst – ISCA-30 – 6/10/03 Advanced Computer Architecture Lab The University of Michigan Cyclone: A Low-Complexity Broadcast-Free Dynamic Instruction

Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan

Current/Future WorkCurrent/Future Work• Pipelined Global Control – Low Power• Razor

– Average-case design opportunities

• Simple and effective selective replay implementations (WDDD)– Spawned from previous work (Tag Elimination – ISCA ’02)

• Removing as much global control as possible from pipelines