DeLorean: Recording and Deterministically Replaying Shared
Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas
Department of Computer ScienceUniversity of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
*Department of Computer Science and EngineeringUniversity of Washington
Pablo Montesinos DeLorean
Motivation
Debugging parallel applications on a multicore is challenging:
Bugs can be very timing dependent
Their effects might manifest after many instructions
Need effective debugging techniques for parallel applications on MPs
One such technique: HW-assisted deterministic replay
2
Pablo Montesinos DeLorean
HW-Assisted Deterministic Replay
Phase I: Initial execution
HW records into a log the order of dependences between threads
This log captures the “interleaving” of threads
Phase II: Replay
Re-run the parallel application
Enforce the dependence orders stored in the log
Bacon and Goldstein. WPDD 91
3
Pablo Montesinos DeLorean
Contributions
Propose DeLorean: a new scheme for hw-assisted deterministic replay of parallel programs based on “chunk-based” execution
Records initial execution at production-run speeds
Requires only small log (0.6% of current schemes)
Replays at high-speed
Has multiple modes of execution with different speed/logging requirements
4
delorean
Pablo Montesinos DeLorean
Outline
Background on replay schemes
DeLorean: a Chunk-based replay scheme
Advanced DeLorean Execution
DeLorean Replay
Evaluation
5
Pablo Montesinos DeLorean
Point-to-Point Schemes: FDR and RTR
Record dependences between individual instructions of different threads
Optimizations to reduce log size
Transitive reduction
Regulated transitive reduction
Use Sequential Consistency (SC)
RTR proposes an algorithm to support TSO:
Record value of loads that violate SC
P1
Wb
P2
Rb
n
m
P2’s LogP2’s LogP2’s Log
P1 n m
FDR (ISCA 03)
6
RTR (ASPLOS 06)
Pablo Montesinos DeLorean
Vector-based Schemes: Strata
Stratum: vector of as many counters as processors:
Count # of memory operations executed
Log stores sequence of strata
It also uses SC
Log size is similar to RTR’s
Strata (ASPLOS 06)
Ra
P1 P2 P3Wa
Wc 1 1 2
7
Stratum
RdRb
Log
Pablo Montesinos DeLorean
Episode-Based Schemes: ReRun8
ReRun (ISCA 08)
delorean
Pablo Montesinos DeLorean
Outline
Background on replay schemes
DeLorean: a Chunk-based replay scheme
Advanced DeLorean Execution
DeLorean Replay
Evaluation
9
Pablo Montesinos DeLorean
Chunk-Based Execution
Memory ops can be reordered and overlapped within and across chunks
Memory access interleaving happens at chunk boundaries
Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)
P1
ld Xld Zld Xcommit
P0
st Xst Y
P0 P1
st Xst Y
ld Tld Zst W
re-execute
ld Xld Zld X
P0
st Xst Y
P1
ld Tld Zst W
Group consecutive dynamic instructions
commit
commit
Execute chunks atomically
Execute chunks in isolation
chunk chunk
10
P0
Limit possible interleavings
Mem
P1
P0collision
Pablo Montesinos DeLorean
A Great Substrate for MP Replay
Overlapping/reordering enables high-speed execution and replay
Performance similar to a system with Release Consistency (RC)
Limited memory interleaving minimizes logging requirements
No need to record individual dependences
11
Pablo Montesinos DeLorean
Chunk-Based Replay: DeLorean
Replaying a chunk-based system means:
Generate same chunks during initial execution and replay
And commit them in the same order
DeLorean’s Log stores total order of chunk commits and their size
DeLorean’s log is very small:
Each entry is short: (ProcID, ChunkSize)
Log updated infrequently
Aggressive designs can make it even smaller
12
commit
st Xst Y
P1P0
ld Ald Bld Ccommitst A
st Xld Zst Zst T
commit
ld Yst Bst Acommit
Log
P0 2
P1 3
P1 3
P0 5
Pablo Montesinos DeLorean
Basic DeLorean Operation13
Proc 0
Commit Request Commit Request
0
1
Ok Ok
PI Log
CS LogA Size
ProcID = 0
ProcID = 1
Chunk A
Proc 1
CS LogB Size
Chunk BArbiter
Log implemented as two different structures:
Chunk Size (CS) Log
Processor Interleaving (PI) Log
delorean
Pablo Montesinos DeLorean
Outline
Background on replay schemes
DeLorean: a Chunk-Based replay scheme
Advanced DeLorean Execution
DeLorean Replay
Evaluation
14
Pablo Montesinos DeLorean
500
500
225
500
500
75
500
450
300
500
100
400
325
Reducing DeLorean’s Log
Use large chunks1
Fixed-size chunking2
Predefined interleaving3
0
1
1
0
0
1
0
1
0
0
0
1
1
2000
1660
1580
1250
PI LogP0
CS Log
0
1
0
0
1
1
0
2000
2000
2000
2000
15
P1CS Log
1850
2000
1950
2000
2000
2000
0
1
0
1
0
1
0
Pablo Montesinos DeLorean
Reducing DeLorean’s Log
Use large chunks1Fewer entries in log
More collisions and cache overflows
+
Fixed-size chunking2No CS log
Need to handle cache overflows
+
Predefined interleaving3No PI log
Performance degradation
+
Almost no
16
Pablo Montesinos DeLorean
DeLorean Execution Modes
PicoNano
17
Predefined interleaving3
Fixed-size chunking2
Use large chunks1
Fixed-size chunking2
Use large chunks1
(tiny CS log) (tiny CS log)
(no PI Log)
delorean
Pablo Montesinos DeLorean
Outline
Background on replay schemes
DeLorean: Chunk-Based replay scheme
Advanced DeLorean Execution
DeLorean Replay
Evaluation
18
Pablo Montesinos DeLorean
DMA
DMA Log
Network
Overall System
Proc 0
PI Log
CS Log
Chunk A Arbiter
Interrupt Log
I/O Log
Proc 1
CS Log
Chunk B
Interrupt Log
I/O Log
19
Interrupt, I/O and DMA logs also appear in all past replay schemes
Pablo Montesinos DeLorean
DeLorean Replay
Virtual machine monitor loads initial system checkpoint
Interrupt Log provides initial-execution interrupts
I/O Log provides initial-execution I/O inputs
Processors execute normally (they execute in parallel):
Use CS Log to generate chunks with exceptional sizes (from cache overflows)
Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)
20
Pablo Montesinos DeLorean
Also in the paper
Rules for deterministic chunking
How to deal with interrupts, traps, cache overflows, I/O ops
Scalability of Pico
A detailed view of DeLorean’s architecture
Proof of why DeLorean’s replay is deterministic
Combination with other HW replay schemes
21
delorean
Pablo Montesinos DeLorean
Outline
Background on replay schemes
DeLorean: Chunk-Based replay scheme
Advanced DeLorean Execution
DeLorean Replay
Evaluation
22
Pablo Montesinos DeLorean
Evaluation Setup
Chunk-based simulator supports both execution and replay
8-processor CMP
1000, 2000, 3000 instructions per chunk
PI Log: 4-bit entries
CS Log: 32-bit entries
Applications: SPLASH, SPECjbb 2000 and SPECweb 2005
23
Pablo Montesinos DeLorean
0
2
4
6
8
10
1000 2000 30001000 2000 30000
2
4
6
8
10
1000 2000 3000
Log Space Requirements
Nano’s log is 16% of RTR’s (2000-instruction configuration)
With one additional optimization, Nano’s log is 7.5% of RTR’s
Picos’s log is 0.6% of RTR’s (1000-instruction configuration)
PI log (Compressed)CS log (Compressed)
Log
size
(bits
/cor
e/ki
lo-in
st)
Average RTRcompressedlog size(estimated)
1000 2000 3000
CS log
Instructions per chunk
Average RTRcompressedlog size(estimated)
Instructions per chunk
Log
size
(bits
/cor
e/ki
lo-in
st)
PicoNano
24
(Compressed)
Pablo Montesinos DeLorean
If We Were To Record a Day of Execution
FDR RTR Nano Pico
1.6
60.0
400.0
GB
/Cor
e/D
ay
400
300
200
100
0
Execution and replay of long periods of time is feasible
25
10000.0
Pablo Montesinos DeLorean
Performance
0
0.2
0.4
0.6
0.8
1.0
Spe
edup
Initial Execution
Initial execution: Nano almost as fast as RC, Pico faster than SC
Replay: comparable to initial execution speed
Initial and Replay Execution
SCPicoNanoRC Pico ReplayNano Replay
26
R PO SR R
Pablo Montesinos DeLorean
Conclusions
DeLorean is a new approach to HW replay of parallel applications
Advantages over traditional HW replay systems:
Executes at high speed (like most aggressive consistency models)
Replays at comparable speed
Summarizes execution into a very small log (0.6% of current schemes)
Great help for programmers that need to debug parallel codes
27
Pablo Montesinos DeLorean
Future Work
Adapt DeLorean to conventional multiprocessors without chunk-based execution
Take conventional replay schemes, use very aggressive SC implementations, and compare to DeLorean
Combine the best aspects of the different recording approaches (DeLorean, RTR and Strata)
28
DeLorean: Recording and Deterministically Replaying Shared
Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas
Department of Computer ScienceUniversity of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
*Department of Computer Science and EngineeringUniversity of Washington