Download pdf - Isca Presentation Final - Illinois · Isca Presentation Final Author: Pablo Montesinos Created Date: 11/28/2010 3:58:55 PM

DeLorean: Recording and Deterministically Replaying Shared

Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

*Department of Computer Science and EngineeringUniversity of Washington

Pablo Montesinos DeLorean

Motivation

Debugging parallel applications on a multicore is challenging:

Bugs can be very timing dependent

Their effects might manifest after many instructions

Need effective debugging techniques for parallel applications on MPs

One such technique: HW-assisted deterministic replay

2


HW-Assisted Deterministic Replay

Phase I: Initial execution

HW records into a log the order of dependences between threads

This log captures the “interleaving” of threads

Phase II: Replay

Re-run the parallel application

Enforce the dependence orders stored in the log

Bacon and Goldstein. WPDD 91

3


Contributions

Propose DeLorean: a new scheme for hw-assisted deterministic replay of parallel programs based on “chunk-based” execution

Records initial execution at production-run speeds

Requires only small log (0.6% of current schemes)

Replays at high-speed

Has multiple modes of execution with different speed/logging requirements

4

delorean


Outline

Background on replay schemes

DeLorean: a Chunk-based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

5


Point-to-Point Schemes: FDR and RTR

Record dependences between individual instructions of different threads

Optimizations to reduce log size

Transitive reduction

Regulated transitive reduction

Use Sequential Consistency (SC)

RTR proposes an algorithm to support TSO:

Record value of loads that violate SC

P1

Wb

P2

Rb

n

m

P2’s LogP2’s LogP2’s Log

P1 n m

FDR (ISCA 03)

6

RTR (ASPLOS 06)


Vector-based Schemes: Strata

Stratum: vector of as many counters as processors:

Count # of memory operations executed

Log stores sequence of strata

It also uses SC

Log size is similar to RTR’s

Strata (ASPLOS 06)

Ra

P1 P2 P3Wa

Wc 1 1 2

7

Stratum

RdRb

Log


Episode-Based Schemes: ReRun8

ReRun (ISCA 08)

delorean


Outline


DeLorean: a Chunk-based replay scheme


DeLorean Replay

Evaluation

9


Chunk-Based Execution

Memory ops can be reordered and overlapped within and across chunks

Memory access interleaving happens at chunk boundaries

Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)

P1

ld Xld Zld Xcommit

P0

st Xst Y

P0 P1

st Xst Y

ld Tld Zst W

re-execute

ld Xld Zld X

P0

st Xst Y

P1

ld Tld Zst W

Group consecutive dynamic instructions

commit

commit

Execute chunks atomically

Execute chunks in isolation

chunk chunk

10

P0

Limit possible interleavings

Mem

P1

P0collision


A Great Substrate for MP Replay

Overlapping/reordering enables high-speed execution and replay

Performance similar to a system with Release Consistency (RC)

Limited memory interleaving minimizes logging requirements

No need to record individual dependences

11


Chunk-Based Replay: DeLorean

Replaying a chunk-based system means:

Generate same chunks during initial execution and replay

And commit them in the same order

DeLorean’s Log stores total order of chunk commits and their size

DeLorean’s log is very small:

Each entry is short: (ProcID, ChunkSize)

Log updated infrequently

Aggressive designs can make it even smaller

12

commit

st Xst Y

P1P0

ld Ald Bld Ccommitst A

st Xld Zst Zst T

commit

ld Yst Bst Acommit

Log

P0 2

P1 3

P1 3

P0 5


Basic DeLorean Operation13

Proc 0

Commit Request Commit Request

0

1

Ok Ok

PI Log

CS LogA Size

ProcID = 0

ProcID = 1

Chunk A

Proc 1

CS LogB Size

Chunk BArbiter

Log implemented as two different structures:

Chunk Size (CS) Log

Processor Interleaving (PI) Log

delorean


Outline


DeLorean: a Chunk-Based replay scheme


DeLorean Replay

Evaluation

14


500

500

225

500

500

75

500

450

300

500

100

400

325

Reducing DeLorean’s Log

Use large chunks1

Fixed-size chunking2

Predefined interleaving3

0

1

1

0

0

1

0

1

0

0

0

1

1

2000

1660

1580

1250

PI LogP0

CS Log

0

1

0

0

1

1

0

2000

2000

2000

2000

15

P1CS Log

1850

2000

1950

2000

2000

2000

0

1

0

1

0

1

0


Reducing DeLorean’s Log

Use large chunks1Fewer entries in log

More collisions and cache overflows

+

Fixed-size chunking2No CS log

Need to handle cache overflows

+

Predefined interleaving3No PI log

Performance degradation

+

Almost no

16


DeLorean Execution Modes

PicoNano

17

Predefined interleaving3


Use large chunks1


Use large chunks1

(tiny CS log) (tiny CS log)

(no PI Log)

delorean


Outline


DeLorean: Chunk-Based replay scheme


DeLorean Replay

Evaluation

18


DMA

DMA Log

Network

Overall System

Proc 0

PI Log

CS Log

Chunk A Arbiter

Interrupt Log

I/O Log

Proc 1

CS Log

Chunk B

Interrupt Log

I/O Log

19

Interrupt, I/O and DMA logs also appear in all past replay schemes


DeLorean Replay

Virtual machine monitor loads initial system checkpoint

Interrupt Log provides initial-execution interrupts

I/O Log provides initial-execution I/O inputs

Processors execute normally (they execute in parallel):

Use CS Log to generate chunks with exceptional sizes (from cache overflows)

Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)

20


Also in the paper

Rules for deterministic chunking

How to deal with interrupts, traps, cache overflows, I/O ops

Scalability of Pico

A detailed view of DeLorean’s architecture

Proof of why DeLorean’s replay is deterministic

Combination with other HW replay schemes

21

delorean


Outline


DeLorean: Chunk-Based replay scheme


DeLorean Replay

Evaluation

22


Evaluation Setup

Chunk-based simulator supports both execution and replay

8-processor CMP

1000, 2000, 3000 instructions per chunk

PI Log: 4-bit entries

CS Log: 32-bit entries

Applications: SPLASH, SPECjbb 2000 and SPECweb 2005

23


0

2

4

6

8

10

1000 2000 30001000 2000 30000

2

4

6

8

10

1000 2000 3000

Log Space Requirements

Nano’s log is 16% of RTR’s (2000-instruction configuration)

With one additional optimization, Nano’s log is 7.5% of RTR’s

Picos’s log is 0.6% of RTR’s (1000-instruction configuration)

PI log (Compressed)CS log (Compressed)

Log

size

(bits

/cor

e/ki

lo-in

st)

Average RTRcompressedlog size(estimated)

1000 2000 3000

CS log

Instructions per chunk

Average RTRcompressedlog size(estimated)

Instructions per chunk

Log

size

(bits

/cor

e/ki

lo-in

st)

PicoNano

24

(Compressed)


If We Were To Record a Day of Execution

FDR RTR Nano Pico

1.6

60.0

400.0

GB

/Cor

e/D

ay

400

300

200

100

0

Execution and replay of long periods of time is feasible

25

10000.0


Performance

0

0.2

0.4

0.6

0.8

1.0

Spe

edup

Initial Execution

Initial execution: Nano almost as fast as RC, Pico faster than SC

Replay: comparable to initial execution speed

Initial and Replay Execution

SCPicoNanoRC Pico ReplayNano Replay

26

R PO SR R


Conclusions

DeLorean is a new approach to HW replay of parallel applications

Advantages over traditional HW replay systems:

Executes at high speed (like most aggressive consistency models)

Replays at comparable speed

Summarizes execution into a very small log (0.6% of current schemes)

Great help for programmers that need to debug parallel codes

27


Future Work

Adapt DeLorean to conventional multiprocessors without chunk-based execution

Take conventional replay schemes, use very aggressive SC implementations, and compare to DeLorean

Combine the best aspects of the different recording approaches (DeLorean, RTR and Strata)

28

DeLorean: Recording and Deterministically Replaying Shared

Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

*Department of Computer Science and EngineeringUniversity of Washington