Upload
baldwin-watts
View
216
Download
0
Embed Size (px)
Citation preview
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone:Cyclone:A Low-Complexity Broadcast-Free Dynamic A Low-Complexity Broadcast-Free Dynamic
Instruction SchedulerInstruction Scheduler
Dan Ernst - Andrew Hamel - Todd Austin
Advanced Computer Architecture Lab
The University of Michigan
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Challenges in High-Speed Dynamic SchedulingChallenges in High-Speed Dynamic Scheduling• Broadcast-based dynamic scheduler circuits are:
– High complexity– Power-hungry– Scale poorly
• Global synchronization is becoming increasingly expensive– More Pipeline Stages + Slow Long Wires + Increasing Clock Speeds
= Difficult Global Signal Design– Example: Pipeline stalling
• Memory scheduling is a “second class citizen”– Non-deterministic latencies don’t fit well into current popular dynamic
scheduling paradigm
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
GoalsGoals
1) Design a competitive, completely broadcast-free scheduler- Minimize global synchronization
2) Address memory scheduling in a “first class” way
3) Minimize “loose loops”
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Difference in ApproachesDifference in ApproachesFrom an instruction’s point of view… :
Scheduling is just figuring out how long to wait.
• Broadcast approach– Instruction’s schedule is “recomputed” every cycle– Polling (“can I go now? How about now?”)
• Cyclone approach– Schedule based on a single timing computation– Instruction is given an execution time once, so no re-computation needed– Put in a timed “router” to execute the schedule as best it can
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Conceptual OverviewConceptual Overview
TimingPredictor
Routing/TimingNetwork
DependenceC
heck
FUI I@txI@tx+
I@tx’
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Pre-scheduler DesignPre-scheduler DesignI0
I1
I2
I3
PSCHED0
max
max
+
reschedule?
timing table
PSCHED1
I0
I2I3
16
Example Schedule
a) b)
max
+
max
+
depcheck
MUX control
I1
8 6 2
717
18 8
47
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone SchedulerCyclone Scheduler
replay?
fnunits
register fileready bits
bypass
REG EX/MEMSCHED
instructionpre-schedulerstore set
predictor
branchpredictor
countdown/replay queue
main queue
(includes timing information)
switchbackdatapaths
I0
2 4
5
43
2 1
1
Not ready!Ready!
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone – Switchback ConflictCyclone – Switchback Conflict
replay?
fnunits
register fileready bits
bypass
REG EX/MEMSCHED
instructionpre-schedulerstore set
predictor
branchpredictor
countdown/replay queue
main queue
(includes timing information)
switchbackdatapaths
43
2
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone – Switchback ConflictCyclone – Switchback Conflict
replay?
fnunits
register fileready bits
bypass
REG EX/MEMSCHED
instructionpre-schedulerstore set
predictor
branchpredictor
countdown/replay queue
main queue
(includes timing information)
switchbackdatapaths
3210
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Architectural MethodologyArchitectural Methodology• Baseline architectural model
– Derived from SimpleScalar 3.0– More sophisticated scheduling support
• Separated ROB and RS• Variable-length pipelines• Selective scheduler replay on memory latency misprediction
– Store Set predictor
• Cyclone model– Replaced scheduling portion of pipeline with Cyclone model– Added timing information to store set predictor
• Simulated SPEC2000 (INT and FP)
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone IPCCyclone IPC
0
0.5
1
1.5
2
2.5
3
3.5
art equake facerec mesa swim FP AVG bzip2 gap gcc gzip vortex INTAVG
Ins
tru
cti
on
s P
er
Cy
cle
32-entry 8-wide baseline
Cyclone
Cyclone w/ DoubleSwitchback PathsCyclone w/ RetimingTableCyclone w/ both
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Circuit Timing and Area MethodologyCircuit Timing and Area Methodology• Timing – SPICE models
– Critical paths of Cyclone • Switchback paths were very fast• Pre-scheduler dependence check was the critical path
– CAM-style broadcast windows• Used models from last year’s ISCA (Tag Elimination)
– Both used TSMC 0.18m process at 1.8 V– Presented here as Throughput (IPns)
• Area Analysis – Register Bit Equivalent (RBE)– Process-independent analytical model of chip area– Assumed RAM/CAM area scaled quadratically with number of ports– Modeled scheduler structures and extra tables (also RF)– More information in Mulder, Quach, and Flynn. [17]
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
4-wide Complexity Analysis4-wide Complexity Analysis
16-entry 4-issue
32-entry 4-issue
64-entry 4-issue
128-entry 4-issue
Cyclone 4-decode 4-issue
0
1
2
3
4
5
6
7
0 10000 20000 30000 40000 50000 60000 70000 80000
Register Bit Equivalent (RBE) Area
Sc
he
du
ler
Th
rou
gh
pu
t -
(IP
ns
)
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
8-wide Complexity Analysis8-wide Complexity Analysis
16-entry 8-issue
32-entry 8-issue
64-entry 8-issue
128-entry 8-issue
Cyclone 8-decode 8-issue
Cyclone 4-decode 8-issue
0
1
2
3
4
5
6
7
8
0 50000 100000 150000 200000 250000 300000
Register Bit Equivalent (RBE) Area
Sc
he
du
ler
Th
rou
gh
pu
t (I
Pn
s)
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Design Space OverviewDesign Space Overview
16-entry 4-issue
32-entry 4-issue
64-entry 4-issue
128-entry 4-issue
Cyclone 4-decode 4-issue
16-entry 8-issue
32-entry 8-issue
64-entry 8-issue
128-entry 8-issue
Cyclone 4-decode 8-issue
Cyclone 8-decode 8-issue
0
1
2
3
4
5
6
7
8
0 100000 200000 300000 400000 500000 600000 700000
Register Bit Equivalent (RBE) Area
Sc
he
du
ler
Th
rou
gh
pu
t (I
Pn
s)
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Complexity OptionsComplexity Options
• Run at higher frequency– Deeper pipelines
• Make the total scheduler size larger– Increase IPC
• Run at same frequency– Much lower power
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
ConclusionsConclusions
• Competitive broadcast-free scheduling– Allows high speed circuits at the expense of IPC– Saves chip area
• Power savings…
• Alternative to stalling– Avoid broadcasting across stages by using the replay mechanism
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Future Directions…Future Directions…
• Close the IPC gap– Wider queues?
• Complete Power analysis– Trade-off between size and activity rate
• Further opportunities to pipeline the control system– Global synchronization without fast global communication
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Cyclone ExtrasCyclone ExtrasImpact of Schedule Optimizations on IPC
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00
appl
u00
craf
ty00
eon0
0
fma3
d00
gcc0
0
gzip0
0
luca
s00
pars
er00
swim
00
wupwise
00 AVG
IPC
Cylcone 8-w ide
w / Retiming Table
w / Double Sw itch
Dan Ernst – ISCA-30 – 6/10/03Advanced Computer Architecture LabThe University of Michigan
Current/Future WorkCurrent/Future Work• Pipelined Global Control – Low Power• Razor
– Average-case design opportunities
• Simple and effective selective replay implementations (WDDD)– Spawned from previous work (Tag Elimination – ISCA ’02)
• Removing as much global control as possible from pipelines