ISCA-36 :: June 23, 2009
Decoupled Store Completion / Silent Deterministic Replay:
Enabling Scalable Data Memory for CPR/CFP Processors
Andrew Hilton, Amir Roth
University of Pennsylvania
{adhilton, amir}@cis.upenn.edu
Brief Overview
• Dynamically scheduled superscalar processors
  • Scalable load & store queues: SVW/SQIP [Roth05, Sha05]
• Latency-tolerant processors
  • CPR/CFP [Akkary03, Srinivasan04]
  • DKIP, FMC [Pericas06, Pericas07]
• Scalable load & store queues for latency-tolerant processors
  • SA-LQ/HSQ [Akkary03]
  • SRL [Gandhi05]
  • ELSQ [Pericas08]
• Granularity mismatch: checkpoint (CPR) vs. instruction (SVW/SQIP)
• Decoupled Store Completion & Silent Deterministic Replay
Outline
• Background
  • CPR/CFP
  • SVW/SQIP
  • The granularity mismatch problem
• DSC/SDR
• Evaluation
CPR/CFP
Latency-tolerant: scale key window structures under LL$ miss
• Issue queue, regfile, load & store queues
CFP (Continual Flow Pipeline) [Srinivasan04]
• Scale issue queue & regfile by “slicing out” miss-dependent insns
CPR (Checkpoint Processing & Recovery) [Akkary03]
• Scale regfile by limiting recovery to pre-created checkpoints
+ Aggressive reclamation of non-checkpoint registers
– Unintended consequence? checkpoint-granularity “bulk commit”
SA-LQ (Set-Associative Load Queue) [Akkary03]
HSQ (Hierarchical Store Queue) [Akkary03]
Baseline Performance (& Area)
• ASSOC (baseline): 64/48-entry fully-associative load/store queues
• 8SA-LQ/HSQ: 512-entry load queue, 256-entry store queue
– Load queue: area is fine, poor performance (set conflicts)
– Store queue: performance is fine, area-inefficient (large CAM)
SQIP
SQIP (Store Queue Index Prediction) [Sha05]
• Scales store queue/buffer by eliminating associative search
• @dispatch: load predicts store queue position of forwarding store
• @execute: load indexes store queue at this position
[Figure: instruction stream (older → younger) A:St B:Ld P:St Q:St R:Ld … S:+ T:Br, with SSNs <4>, <8>, <9> and addresses [x10], [x20]; dispatch at <ssn=9>, commit at <ssn=4>]
Preliminaries: SSNs (Store Sequence Numbers) [Roth05]
• Stores named by monotonically increasing sequence numbers
• Low-order bits are store queue/buffer positions
• Global SSNs track dispatch, commit, (store) completion
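As a rough illustration (a minimal sketch, not the authors' implementation; all names are hypothetical), SSN naming and SQIP-style indexed forwarding can be modeled like this:

```python
# Hypothetical sketch of SSN-based store naming and SQIP-style indexed
# forwarding. Low-order SSN bits select a store queue slot; a load uses a
# predicted SSN to index the queue directly, with no associative (CAM) search.

SQ_SIZE = 8  # store queue entries; assumed a power of two

store_queue = [None] * SQ_SIZE   # slot -> (ssn, addr, value)
ssn_dispatch = 0                 # global SSN counter advanced at dispatch

def dispatch_store(addr, value):
    """Assign the next SSN and write the store into slot ssn % SQ_SIZE."""
    global ssn_dispatch
    ssn_dispatch += 1
    ssn = ssn_dispatch
    store_queue[ssn % SQ_SIZE] = (ssn, addr, value)
    return ssn

def execute_load(addr, predicted_ssn):
    """Index the store queue at the predicted position; forward on a match."""
    entry = store_queue[predicted_ssn % SQ_SIZE]
    if entry and entry[0] == predicted_ssn and entry[1] == addr:
        return entry[2]          # forwarded from the predicted store
    return None                  # mis-prediction: read from D$, verify later

ssn_a = dispatch_store(0x10, 111)   # A:St [x10]
ssn_p = dispatch_store(0x18, 222)   # P:St [x18]
print(execute_load(0x18, ssn_p))    # 222 (forwarded from P)
print(execute_load(0x20, ssn_a))    # None (address mismatch -> use D$)
```

The prediction itself (here passed in directly) would come from a PC-indexed predictor in a real design; mis-predictions are caught by SVW-style verification.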
SVW
Store Vulnerability Window (SVW) [Roth05]
• Scales load queue by eliminating associative search
• Load verification by in-order re-execution prior to commit
• Highly filtered: <1% of loads actually re-execute
• Address-indexed SSBF tracks [addr, SSN] of committed stores
• @commit: loads check SSBF, re-execute if possibly incorrect
[Figure: SSBF (SSN Bloom Filter) entries x?8 → x18 <8>, x?0 → x20 <9>; loads verify against the SSBF as stores <3>–<9> complete and commit]
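The SSBF check at commit can be sketched as follows (a hedged toy model, not the paper's exact structure; a real SSBF hashes partial addresses, so aliasing causes conservative false positives):

```python
# Hypothetical sketch of SVW load verification with an address-indexed SSBF.
# Each committing store writes its SSN into the entry its address maps to; a
# committing load re-executes only if that entry holds an SSN inside its
# vulnerability window (forwarding SSN, commit-time SSN].

SSBF_SIZE = 16
ssbf = [0] * SSBF_SIZE           # entry -> SSN of last committed store there

def commit_store(addr, ssn):
    ssbf[addr % SSBF_SIZE] = ssn  # real designs hash partial address bits

def load_must_replay(addr, forwarded_ssn, ssn_commit):
    """True if a store younger than the load's source may match its address."""
    seen = ssbf[addr % SSBF_SIZE]
    return forwarded_ssn < seen <= ssn_commit

commit_store(0x18, 8)            # P:St [x18] <8> commits
commit_store(0x20, 9)            # Q:St [x20] <9> commits
print(load_must_replay(0x33, 4, 9))  # False: no committed store maps there
print(load_must_replay(0x18, 4, 9))  # True: store <8> really wrote x18
print(load_must_replay(0x10, 4, 9))  # True: x10 aliases x20's entry
```

The last case is a Bloom-filter false positive: the replay is unnecessary but safe, which is why the filter can be small and why SVW re-executes so few loads.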
SVW–NAIVE
• SVW: 512-entry indexed load queue, 256-entry store queue
– Slowdowns over 8SA-LQ (mesa, wupwise)
– Some slowdowns even over ASSOC (bzip2, vortex)
• Why? Not forwarding mis-predictions … store-load serialization
• Load Y can’t verify until older store X completes to D$
Store-Load Serialization: ROB
SVW/SQIP example: SSBF verification “hole”
• Load R forwards from store <4>, vulnerable to stores <5>–<9>
• No SSBF entry for address [x10] → must replay
• Can’t search store buffer → wait until stores <5>–<8> in D$
• In a ROB processor … <8> (P) will complete (and usually quickly)
• In a CPR processor …
Store-Load Serialization: CPR
P will complete … unless it’s in the same checkpoint as R
• Deadlock: load R can’t verify → store P can’t complete
• Resolve: squash (ouch); on re-execute, create checkpoint before R
• P and R will be in separate checkpoints
• Better: learn and create checkpoints before future instances of R
• This is SVW–TRAIN
SVW–TRAIN
+ Better than SVW–NAÏVE
– But worse in some cases (art, mcf, vpr)
• Over-checkpointing holds too many registers
• Checkpoint may not be available for branches
What About Set-Associative SSBFs?
+ Higher associativity helps (reduces hole frequency), but …
– We’re replacing store queue associativity with SSBF associativity
• Trying to avoid things like this
• Want a better solution …
DSC (Decoupled Store Completion)
No fundamental reason we cannot complete stores <4>–<9>
• All older instructions have completed
What’s stopping us? The definition of commit & architected state
• CPR: commit = oldest register checkpoint (checkpoint granularity)
• ROB: commit = SVW-verify (instruction granularity)
• Restore the ROB definition
• Allow stores to complete past the oldest checkpoint
• This is DSC (Decoupled Store Completion)
[Figure: as verification advances from <6> to <9>, store completion advances from <3> to <8> without waiting for checkpoint commit]
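The idea can be condensed into one gating condition (a hypothetical sketch; names and the oracle inputs are assumptions, not the paper's microarchitecture):

```python
# Hypothetical sketch of Decoupled Store Completion: stores drain to the D$
# once all older instructions are SVW-verified, even if the oldest register
# checkpoint has not yet committed. Under checkpoint-granularity "bulk
# commit", the same stores would wait for the entire checkpoint.

def completable_stores(store_ssns, ssn_verified, ssn_checkpoint, dsc=True):
    """Return SSNs of stores allowed to write the D$.

    ssn_verified:   youngest SSN with all older instructions verified
    ssn_checkpoint: youngest SSN covered by the oldest committed checkpoint
    """
    frontier = ssn_verified if dsc else min(ssn_verified, ssn_checkpoint)
    return [s for s in store_ssns if s <= frontier]

stores = [4, 8, 9]
# Instructions through <8> are verified, but the oldest checkpoint (through
# <3>) has not committed: DSC completes <4> and <8>; bulk commit completes none.
print(completable_stores(stores, ssn_verified=8, ssn_checkpoint=3, dsc=True))   # [4, 8]
print(completable_stores(stores, ssn_verified=8, ssn_checkpoint=3, dsc=False))  # []
```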
DSC: What About Mis-Speculations?
DSC: Architected state younger than oldest checkpoint
What about mis-speculation (e.g., branch T mis-predicted)?
• Can only recover to a checkpoint
• Squash committed instructions?
• Squash stores visible to other processors? etc.
How do we recover architected state?
Silent Deterministic Replay (SDR)
Reconstruct architected state on demand
• Squash to oldest checkpoint and replay …
• Deterministically: re-produce committed values
• Silently: without generating coherence events
• How? Discard committed stores at rename (already in SB or D$)
• How? Read load values from load queue
  • Avoid WAR hazards with younger stores
  • Same thread (e.g., BQ) or different thread (coherence)
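A toy replay loop makes the two properties concrete (a minimal sketch under assumed names and a three-opcode "ISA"; not the actual recovery hardware):

```python
# Hypothetical sketch of Silent Deterministic Replay: after a squash back to
# the oldest checkpoint, already-committed instructions are re-run to rebuild
# register state. Committed stores are dropped (their values are already in
# the store buffer or D$), and committed loads read their recorded values
# from the load queue rather than the D$, so no coherence events occur.

def sdr_replay(insns, regs, load_queue):
    """insns: list of (kind, dest_or_addr, src) tuples in program order."""
    for kind, a, b in insns:
        if kind == "st":
            continue                   # silent: value already globally visible
        elif kind == "ld":
            regs[a] = load_queue[b]    # deterministic: recorded value, not D$
        elif kind == "add":
            regs[a] = regs[a] + regs[b]
    return regs

load_queue = {0x10: 7}                 # R:Ld recorded value for [x10]
regs = {"r1": 0, "r2": 3}
insns = [("st", 0x20, None),           # Q:St -- dropped during replay
         ("ld", "r1", 0x10),           # R:Ld -- reads load queue, value 7
         ("add", "r1", "r2")]          # S:+  -- recomputes deterministically
print(sdr_replay(insns, regs, load_queue))   # {'r1': 10, 'r2': 3}
```

Reading load values from the load queue is also what sidesteps WAR hazards: a younger store (same thread or remote) may have overwritten the D$ location since the load committed, but the recorded value reproduces the original execution exactly.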
Outline
• Background
• DSC/SDR (yes, that was it)
• Evaluation
  • Performance
  • Performance-area trade-offs
Performance Methodology
Workloads
• SPEC2000, Alpha AXP ISA, -O4, train inputs, 2% periodic sampling
Cycle-level simulator configuration
• 4-way superscalar out-of-order CPR/CFP processor
• 8 checkpoints, 32/32 INT/FP issue queue entries
• 32KByte D$, 15-cycle 2MByte L2, 8 8-entry stream prefetchers
• 400-cycle memory, 4Byte/cycle memory bus
SVW+DSC/SDR
+ Outperforms SVW–NAÏVE and SVW–TRAIN
+ Outperforms 8SA-LQ on average (by a lot)
– Occasional slight slowdowns (eon, vortex) relative to 8SA-LQ
• These are due to forwarding mis-speculation
Smaller, Less-Associative SSBFs
• Does DSC/SDR make set-associative SSBFs unnecessary?
• You can bet your associativity on it
Fewer Checkpoints
DSC/SDR reduce the need for large numbers of checkpoints
• Don’t need checkpoints to serialize store/load pairs
• Efficient use of D$ bandwidth even with widely spaced checkpoints
• Good: checkpoints are expensive
… And Less Area
Area methodology
• CACTI-4 [Tarjan04], 45nm
• Sum areas for load/store queues (SSBF & predictor too, if needed)
• E.g., 512-entry 8SA-LQ / 256-entry HSQ
6.6% speedup, 0.91 mm²
High-performance/low-area
How Performance/Area Was Won
+ SVW load queue: big performance gain (no conflicts) & small area loss
+ SQIP store queue: small performance loss & big area gain (no CAM)
• Big SVW performance gain offsets small SQIP performance loss
• Big SQIP area gain offsets small SVW area loss
+ DSC/SDR: big performance gain & small area gain
DSC/SDR Performance/Area
DSC/SDR improve SVW/SQIP IPC and reduce its area
• No new structures, just new ways of using existing structures
+ No SSBF checkpoints
+ No checkpoint-creation predictor
+ More tolerant to reductions in checkpoints, SSBF size
Pareto Analysis
SVW/SQIP+DSC/SDR dominates all other designs
• SVW/SQIP are low area (no CAMs)
• DSC/SDR needed to match IPC of fully-associative load queue (FA-LQ)
Related Work
SRL (Store Redo Log) [Gandhi05]
• Large associative store queue → FIFO buffer + forwarding cache
• Expands store queue only under LL$ misses → under-performs HSQ
Unordered late-binding load/store queues [Sethumadhavan08]
• Entries only for executed loads and stores• Poor match for centralized latency tolerant processors
Cherry [Martinez02]
• “Post-retirement” checkpoints
• No large load/store queues, but may benefit from DSC/SDR
Deterministic replay (e.g., race debugging) [Xu04, Narayanasamy06]
Conclusions
Checkpoint granularity …
+ … register management: good
– … store commit: somewhat painful
DSC/SDR: the good parts of the checkpoint world
• Checkpoint-granularity registers + instruction-granularity stores
• Key 1: disassociate commit from oldest register checkpoint
• Key 2: reconstruct architected state silently on demand
• Committed load values available in load queue
+ Allow checkpoint processor to use SVW/SQIP load/store queues
• Performance and area advantages
+ Simplify multi-processor operation for checkpoint processors