ISCA-36 :: June 23, 2009
Decoupled Store Completion / Silent Deterministic Replay:
Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton, amir}@cis.upenn.edu
Brief Overview
• Dynamically scheduled superscalar processors
  • Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
• Latency-tolerant processors
  • CPR/CFP [Akkary03, Srinivasan04]
  • DKIP, FMC [Pericas06, Pericas07]
• Scalable load & store queues for latency-tolerant processors
  • SA-LQ/HSQ [Akkary03]
  • SRL [Gandhi05]
  • ELSQ [Pericas08]
• Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
• Decoupled Store Completion & Silent Deterministic Replay
Outline
• Background
  • CPR/CFP
  • SVW/SQIP
  • The granularity mismatch problem
• DSC/SDR
• Evaluation
CPR/CFP
Latency-tolerant: scale key window structures under LL$ miss
• Issue queue, regfile, load & store queues
CFP (Continual Flow Pipeline) [Srinivasan04]
• Scale issue queue & regfile by “slicing out” miss-dependent insns
CPR (Checkpoint Processing & Recovery) [Akkary03]
• Scale regfile by limiting recovery to pre-created checkpoints
+ Aggressive reclamation of non-checkpoint registers
– Unintended consequence? checkpoint-granularity “bulk commit”
SA-LQ (Set-Associative Load Queue) [Akkary03]
HSQ (Hierarchical Store Queue) [Akkary03]
Baseline Performance (& Area)
• ASSOC (baseline): 64/48-entry fully-associative load/store queues
• 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
– Load queue: area is fine, poor performance (set conflicts)
– Store queue: performance is fine, area-inefficient (large CAM)
SQIP
SQIP (Store Queue Index Prediction) [Sha05]
• Scales store queue/buffer by eliminating associative search
• @dispatch: load predicts store queue position of forwarding store
• @execute: load indexes store queue at this position
[Figure: instruction stream (older → younger) A:St B:Ld P:St Q:St R:Ld … S:+ T:Br, with SSNs <4>, <8>, <9> and addresses [x10], [x20]; dispatch at <ssn=9>, commit at <ssn=4>]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
• Stores named by monotonically increasing sequence numbers
• Low-order bits are store queue/buffer positions
• Global SSNs track dispatch, commit, (store) completion
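As a rough illustration (a minimal sketch, not the authors' implementation; all names are hypothetical), SSN naming and SQIP-style indexed forwarding can be modeled like this:

```python
# Hypothetical sketch of SSN-based store naming and SQIP-style indexed
# forwarding. Low-order SSN bits select a store queue slot; a load uses a
# predicted SSN to index the queue directly, with no associative (CAM) search.

SQ_SIZE = 8  # store queue entries; assumed a power of two

store_queue = [None] * SQ_SIZE   # slot -> (ssn, addr, value)
ssn_dispatch = 0                 # global SSN counter advanced at dispatch

def dispatch_store(addr, value):
    """Assign the next SSN and write the store into slot ssn % SQ_SIZE."""
    global ssn_dispatch
    ssn_dispatch += 1
    ssn = ssn_dispatch
    store_queue[ssn % SQ_SIZE] = (ssn, addr, value)
    return ssn

def execute_load(addr, predicted_ssn):
    """Index the store queue at the predicted position; forward on a match."""
    entry = store_queue[predicted_ssn % SQ_SIZE]
    if entry and entry[0] == predicted_ssn and entry[1] == addr:
        return entry[2]          # forwarded from the predicted store
    return None                  # mis-prediction: read from D$, verify later

ssn_a = dispatch_store(0x10, 111)   # A:St [x10]
ssn_p = dispatch_store(0x18, 222)   # P:St [x18]
print(execute_load(0x18, ssn_p))    # 222 (forwarded from P)
print(execute_load(0x20, ssn_a))    # None (address mismatch -> use D$)
```

The prediction itself (here passed in directly) would come from a PC-indexed predictor in a real design; mis-predictions are caught by SVW-style verification.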
SVW
Store Vulnerability Window (SVW) [Roth05]
• Scales load queue by eliminating associative search
• Load verification by in-order re-execution prior to commit
• Highly filtered: <1% of loads actually re-execute
• Address-indexed SSBF tracks [addr, SSN] of committed stores
• @commit: loads check SSBF, re-execute if possibly incorrect
[Figure: SSBF (SSN Bloom Filter) entries x?8 → x18 <8>, x?0 → x20 <9>; loads verify against the SSBF as stores <3>–<9> complete and commit]
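The SSBF check at commit can be sketched as follows (a hedged toy model, not the paper's exact structure; a real SSBF hashes partial addresses, so aliasing causes conservative false positives):

```python
# Hypothetical sketch of SVW load verification with an address-indexed SSBF.
# Each committing store writes its SSN into the entry its address maps to; a
# committing load re-executes only if that entry holds an SSN inside its
# vulnerability window (forwarding SSN, commit-time SSN].

SSBF_SIZE = 16
ssbf = [0] * SSBF_SIZE           # entry -> SSN of last committed store there

def commit_store(addr, ssn):
    ssbf[addr % SSBF_SIZE] = ssn  # real designs hash partial address bits

def load_must_replay(addr, forwarded_ssn, ssn_commit):
    """True if a store younger than the load's source may match its address."""
    seen = ssbf[addr % SSBF_SIZE]
    return forwarded_ssn < seen <= ssn_commit

commit_store(0x18, 8)            # P:St [x18] <8> commits
commit_store(0x20, 9)            # Q:St [x20] <9> commits
print(load_must_replay(0x33, 4, 9))  # False: no committed store maps there
print(load_must_replay(0x18, 4, 9))  # True: store <8> really wrote x18
print(load_must_replay(0x10, 4, 9))  # True: x10 aliases x20's entry
```

The last case is a Bloom-filter false positive: the replay is unnecessary but safe, which is why the filter can be small and why SVW re-executes so few loads.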
SVW–NAIVE
• SVW: 512-entry indexed load queue, 256-entry store queue
– Slowdowns over 8SA-LQ (mesa, wupwise)
– Some slowdowns even over ASSOC (bzip2, vortex)
• Why? Not forwarding mis-predictions … store-load serialization
• Load Y can’t verify until older store X completes to D$
Store-Load Serialization: ROB
SVW/SQIP example: SSBF verification “hole”
• Load R forwards from store <4>, vulnerable to stores <5>–<9>
• No SSBF entry for address [x10] → must replay
• Can’t search store buffer → wait until stores <5>–<8> in D$
• In a ROB processor … <8> (P) will complete (and usually quickly)
• In a CPR processor …
Store-Load Serialization: CPR
P will complete … unless it’s in the same checkpoint as R
• Deadlock: load R can’t verify → store P can’t complete
• Resolve: squash (ouch); on re-execute, create checkpoint before R
• P and R will be in separate checkpoints
• Better: learn and create checkpoints before future instances of R
• This is SVW–TRAIN
SVW–TRAIN
+ Better than SVW–NAÏVE
– But worse in some cases (art, mcf, vpr)
• Over-checkpointing holds too many registers
• Checkpoint may not be available for branches
What About Set-Associative SSBFs?
+ Higher associativity helps (reduces hole frequency), but …
– We’re replacing store queue associativity with SSBF associativity
• Trying to avoid things like this
• Want a better solution …
DSC (Decoupled Store Completion)
No fundamental reason we cannot complete stores <4>–<9>
• All older instructions have completed
What’s stopping us? The definition of commit & architected state
• CPR: commit = oldest register checkpoint (checkpoint granularity)
• ROB: commit = SVW-verify (instruction granularity)
• Restore the ROB definition
• Allow stores to complete past the oldest checkpoint
• This is DSC (Decoupled Store Completion)
[Figure: as verification advances from <6> to <9>, store completion advances from <3> to <8> without waiting for checkpoint commit]
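The idea can be condensed into one gating condition (a hypothetical sketch; names and the oracle inputs are assumptions, not the paper's microarchitecture):

```python
# Hypothetical sketch of Decoupled Store Completion: stores drain to the D$
# once all older instructions are SVW-verified, even if the oldest register
# checkpoint has not yet committed. Under checkpoint-granularity "bulk
# commit", the same stores would wait for the entire checkpoint.

def completable_stores(store_ssns, ssn_verified, ssn_checkpoint, dsc=True):
    """Return SSNs of stores allowed to write the D$.

    ssn_verified:   youngest SSN with all older instructions verified
    ssn_checkpoint: youngest SSN covered by the oldest committed checkpoint
    """
    frontier = ssn_verified if dsc else min(ssn_verified, ssn_checkpoint)
    return [s for s in store_ssns if s <= frontier]

stores = [4, 8, 9]
# Instructions through <8> are verified, but the oldest checkpoint (through
# <3>) has not committed: DSC completes <4> and <8>; bulk commit completes none.
print(completable_stores(stores, ssn_verified=8, ssn_checkpoint=3, dsc=True))   # [4, 8]
print(completable_stores(stores, ssn_verified=8, ssn_checkpoint=3, dsc=False))  # []
```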
DSC: What About Mis-Speculations?
DSC: Architected state younger than oldest checkpoint
What about mis-speculation (e.g., branch T mis-predicted)?
• Can only recover to a checkpoint
• Squash committed instructions?
• Squash stores visible to other processors? etc.
How do we recover architected state?
Silent Deterministic Replay (SDR)
Reconstruct architected state on demand
• Squash to oldest checkpoint and replay …
• Deterministically: re-produce committed values
• Silently: without generating coherence events
• How? Discard committed stores at rename (already in SB or D$)
• How? Read load values from load queue
  • Avoid WAR hazards with younger stores
  • Same thread (e.g., BQ) or different thread (coherence)
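A toy replay loop makes the two properties concrete (a minimal sketch under assumed names and a three-opcode "ISA"; not the actual recovery hardware):

```python
# Hypothetical sketch of Silent Deterministic Replay: after a squash back to
# the oldest checkpoint, already-committed instructions are re-run to rebuild
# register state. Committed stores are dropped (their values are already in
# the store buffer or D$), and committed loads read their recorded values
# from the load queue rather than the D$, so no coherence events occur.

def sdr_replay(insns, regs, load_queue):
    """insns: list of (kind, dest_or_addr, src) tuples in program order."""
    for kind, a, b in insns:
        if kind == "st":
            continue                   # silent: value already globally visible
        elif kind == "ld":
            regs[a] = load_queue[b]    # deterministic: recorded value, not D$
        elif kind == "add":
            regs[a] = regs[a] + regs[b]
    return regs

load_queue = {0x10: 7}                 # R:Ld recorded value for [x10]
regs = {"r1": 0, "r2": 3}
insns = [("st", 0x20, None),           # Q:St -- dropped during replay
         ("ld", "r1", 0x10),           # R:Ld -- reads load queue, value 7
         ("add", "r1", "r2")]          # S:+  -- recomputes deterministically
print(sdr_replay(insns, regs, load_queue))   # {'r1': 10, 'r2': 3}
```

Reading load values from the load queue is also what sidesteps WAR hazards: a younger store (same thread or remote) may have overwritten the D$ location since the load committed, but the recorded value reproduces the original execution exactly.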
Outline
• Background
• DSC/SDR (yes, that was it)
• Evaluation
  • Performance
  • Performance-area trade-offs
Performance Methodology
Workloads
• SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
Cycle-level simulator configuration
• 4-way superscalar out-of-order CPR/CFP processor
• 8 checkpoints, 32/32 INT/FP issue queue entries
• 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
• 400-cycle memory, 4Byte/cycle memory bus
SVW+DSC/SDR
+ Outperforms SVW–NAÏVE and SVW–TRAIN
+ Outperforms 8SA-LQ on average (by a lot)
– Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
• These are due to forwarding mis-speculation
Smaller, Less-Associative SSBFs
• Does DSC/SDR make set-associative SSBFs unnecessary?
• You can bet your associativity on it
Fewer Checkpoints
DSC/SDR reduce the need for large numbers of checkpoints
• Don’t need checkpoints to serialize store/load pairs
• Efficient use of D$ bandwidth even with widely spaced checkpoints
• Good: checkpoints are expensive
… And Less Area
Area methodology
• CACTI-4 [Tarjan04], 45nm
• Sum areas for load/store queues (SSBF & predictor too, if needed)
• E.g., 512-entry 8SA-LQ / 256-entry HSQ
6.6% speedup, 0.91 mm²
High-performance/low-area
How Performance/Area Was Won
+ SVW load queue: big performance gain (no conflicts) & small area loss
+ SQIP store queue: small performance loss & big area gain (no CAM)
• Big SVW performance gain offsets small SQIP performance loss
• Big SQIP area gain offsets small SVW area loss
+ DSC/SDR: big performance gain & small area gain
DSC/SDR Performance/Area
DSC/SDR improve SVW/SQIP IPC and reduce its area
• No new structures, just new ways of using existing structures
+ No SSBF checkpoints
+ No checkpoint-creation predictor
+ More tolerant to reductions in checkpoints, SSBF size
Pareto Analysis
SVW/SQIP+DSC/SDR dominates all other designs
• SVW/SQIP are low area (no CAMs)
• DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)
Related Work
SRL (Store Redo Log) [Gandhi05]
• Large associative store queue → FIFO buffer + forwarding cache
• Expands store queue only under LL$ misses → under-performs HSQ
Unordered late-binding load/store queues [Sethumadhavan08]
• Entries only for executed loads and stores• Poor match for centralized latency tolerant processors
Cherry [Martinez02]
• “Post-retirement” checkpoints
• No large load/store queues, but may benefit from DSC/SDR
Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]
Conclusions
Checkpoint granularity …
+ … register management: good
– … store commit: somewhat painful
DSC/SDR: the good parts of the checkpoint world
• Checkpoint-granularity registers + instruction-granularity stores
• Key 1: disassociate commit from oldest register checkpoint
• Key 2: reconstruct architected state silently on demand
• Committed load values available in load queue
+ Allow checkpoint processor to use SVW/SQIP load/store queues
• Performance and area advantages
+ Simplify multi-processor operation for checkpoint processors