29
DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently Pablo Montesinos, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu *Department of Computer Science and Engineering University of Washington

Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

DeLorean: Recording and Deterministically Replaying Shared

Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

*Department of Computer Science and EngineeringUniversity of Washington

Page 2: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Motivation

Debugging parallel applications on a multicore is challenging:

Bugs can be very timing dependent

Their effects might manifest after many instructions

Need effective debugging techniques for parallel applications on MPs

One such technique: HW-assisted deterministic replay

2

Page 3: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

HW-Assisted Deterministic Replay

Phase I: Initial execution

HW records into a log the order of dependences between threads

This log captures the “interleaving” of threads

Phase II: Replay

Re-run the parallel application

Enforce the dependence orders stored in the log

Bacon and Goldstein. WPDD 91

3

Page 4: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Contributions

Propose DeLorean: a new scheme for hw-assisted deterministic replay of parallel programs based on “chunk-based” execution

Records initial execution at production-run speeds

Requires only small log (0.6% of current schemes)

Replays at high-speed

Has multiple modes of execution with different speed/logging requirements

4

Page 5: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

delorean

Pablo Montesinos DeLorean

Outline

Background on replay schemes

DeLorean: a Chunk-based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

5

Page 6: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Point-to-Point Schemes: FDR and RTR

Record dependences between individual instructions of different threads

Optimizations to reduce log size

Transitive reduction

Regulated transitive reduction

Use Sequential Consistency (SC)

RTR proposes an algorithm to support TSO:

Record value of loads that violate SC

P1

Wb

P2

Rb

n

m

P2’s LogP2’s LogP2’s Log

P1 n m

FDR (ISCA 03)

6

RTR (ASPLOS 06)

Page 7: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Vector-based Schemes: Strata

Stratum: vector of as many counters as processors:

Count # of memory operations executed

Log stores sequence of strata

It also uses SC

Log size is similar to RTR’s

Strata (ASPLOS 06)

Ra

P1 P2 P3Wa

Wc 1 1 2

7

Stratum

RdRb

Log

Page 8: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Episode-Based Schemes: ReRun8

ReRun (ISCA 08)

Page 9: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

delorean

Pablo Montesinos DeLorean

Outline

Background on replay schemes

DeLorean: a Chunk-based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

9

Page 10: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Chunk-Based Execution

Memory ops can be reordered and overlapped within and across chunks

Memory access interleaving happens at chunk boundaries

Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)

P1

ld Xld Zld Xcommit

P0

st Xst Y

P0 P1

st Xst Y

ld Tld Zst W

re-execute

ld Xld Zld X

P0

st Xst Y

P1

ld Tld Zst W

Group consecutive dynamic instructions

commit

commit

Execute chunks atomically

Execute chunks in isolation

chunk chunk

10

P0

Limit possible interleavings

Mem

P1

P0collision

Page 11: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

A Great Substrate for MP Replay

Overlapping/reordering enables high-speed execution and replay

Performance similar to a system with Release Consistency (RC)

Limited memory interleaving minimizes logging requirements

No need to record individual dependences

11

Page 12: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Chunk-Based Replay: DeLorean

Replaying a chunk-based system means:

Generate same chunks during initial execution and replay

And commit them in the same order

DeLorean’s Log stores total order of chunk commits and their size

DeLorean’s log is very small:

Each entry is short: (ProcID, ChunkSize)

Log updated infrequently

Aggressive designs can make it even smaller

12

commit

st Xst Y

P1P0

ld Ald Bld Ccommitst A

st Xld Zst Zst T

commit

ld Yst Bst Acommit

Log

P0 2

P1 3

P1 3

P0 5

Page 13: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Basic DeLorean Operation13

Proc 0

Commit Request Commit Request

0

1

Ok Ok

PI Log

CS LogA Size

ProcID = 0

ProcID = 1

Chunk A

Proc 1

CS LogB Size

Chunk BArbiter

Log implemented as two different structures:

Chunk Size (CS) Log

Processor Interleaving (PI) Log

Page 14: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

delorean

Pablo Montesinos DeLorean

Outline

Background on replay schemes

DeLorean: a Chunk-Based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

14

Page 15: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

500

500

225

500

500

75

500

450

300

500

100

400

325

Reducing DeLorean’s Log

Use large chunks1

Fixed-size chunking2

Predefined interleaving3

0

1

1

0

0

1

0

1

0

0

0

1

1

2000

1660

1580

1250

PI LogP0

CS Log

0

1

0

0

1

1

0

2000

2000

2000

2000

15

P1CS Log

1850

2000

1950

2000

2000

2000

0

1

0

1

0

1

0

Page 16: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Reducing DeLorean’s Log

Use large chunks1Fewer entries in log

More collisions and cache overflows

+

Fixed-size chunking2No CS log

Need to handle cache overflows

+

Predefined interleaving3No PI log

Performance degradation

+

Almost no

16

Page 17: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

DeLorean Execution Modes

PicoNano

17

Predefined interleaving3

Fixed-size chunking2

Use large chunks1

Fixed-size chunking2

Use large chunks1

(tiny CS log) (tiny CS log)

(no PI Log)

Page 18: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

delorean

Pablo Montesinos DeLorean

Outline

Background on replay schemes

DeLorean: Chunk-Based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

18

Page 19: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

DMA

DMA Log

Network

Overall System

Proc 0

PI Log

CS Log

Chunk A Arbiter

Interrupt Log

I/O Log

Proc 1

CS Log

Chunk B

Interrupt Log

I/O Log

19

Interrupt, I/O and DMA logs also appear in all past replay schemes

Page 20: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

DeLorean Replay

Virtual machine monitor loads initial system checkpoint

Interrupt Log provides initial-execution interrupts

I/O Log provides initial-execution I/O inputs

Processors execute normally (they execute in parallel):

Use CS Log to generate chunks with exceptional sizes (from cache overflows)

Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)

20

Page 21: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Also in the paper

Rules for deterministic chunking

How to deal with interrupts, traps, cache overflows, I/O ops

Scalability of Pico

A detailed view of DeLorean’s architecture

Proof of why DeLorean’s replay is deterministic

Combination with other HW replay schemes

21

Page 22: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

delorean

Pablo Montesinos DeLorean

Outline

Background on replay schemes

DeLorean: Chunk-Based replay scheme

Advanced DeLorean Execution

DeLorean Replay

Evaluation

22

Page 23: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Evaluation Setup

Chunk-based simulator supports both execution and replay

8-processor CMP

1000, 2000, 3000 instructions per chunk

PI Log: 4-bit entries

CS Log: 32-bit entries

Applications: SPLASH, SPECjbb 2000 and SPECweb 2005

23

Page 24: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

0

2

4

6

8

10

1000 2000 30001000 2000 30000

2

4

6

8

10

1000 2000 3000

Log Space Requirements

Nano’s log is 16% of RTR’s (2000-instruction configuration)

With one additional optimization, Nano’s log is 7.5% of RTR’s

Picos’s log is 0.6% of RTR’s (1000-instruction configuration)

PI log (Compressed)CS log (Compressed)

Log

size

(bits

/cor

e/ki

lo-in

st)

Average RTRcompressedlog size(estimated)

1000 2000 3000

CS log

Instructions per chunk

Average RTRcompressedlog size(estimated)

Instructions per chunk

Log

size

(bits

/cor

e/ki

lo-in

st)

PicoNano

24

(Compressed)

Page 25: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

If We Were To Record a Day of Execution

FDR RTR Nano Pico

1.6

60.0

400.0

GB

/Cor

e/D

ay

400

300

200

100

0

Execution and replay of long periods of time is feasible

25

10000.0

Page 26: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Performance

0

0.2

0.4

0.6

0.8

1.0

Spe

edup

Initial Execution

Initial execution: Nano almost as fast as RC, Pico faster than SC

Replay: comparable to initial execution speed

Initial and Replay Execution

SCPicoNanoRC Pico ReplayNano Replay

26

R PO SR R

Page 27: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Conclusions

DeLorean is a new approach to HW replay of parallel applications

Advantages over traditional HW replay systems:

Executes at high speed (like most aggressive consistency models)

Replays at comparable speed

Summarizes execution into a very small log (0.6% of current schemes)

Great help for programmers that need to debug parallel codes

27

Page 28: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

Pablo Montesinos DeLorean

Future Work

Adapt DeLorean to conventional multiprocessors without chunk-based execution

Take conventional replay schemes, use very aggressive SC implementations, and compare to DeLorean

Combine the best aspects of the different recording approaches (DeLorean, RTR and Strata)

28

Page 29: Isca Presentation Final - Illinoisiacoma.cs.uiuc.edu/iacoma-papers/PRES/present_isca08_1.pdfIsca Presentation Final Author Pablo Montesinos Created Date 11/28/2010 3:58:55 PM

DeLorean: Recording and Deterministically Replaying Shared

Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

*Department of Computer Science and EngineeringUniversity of Washington