29
DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently Pablo Montesinos, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu *Department of Computer Science and Engineering University of Washington

Isca Presentation Final - Illinois · Isca Presentation Final Author: Pablo Montesinos Created Date: 11/28/2010 3:58:55 PM

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • DeLorean: Recording and Deterministically Replaying Shared

    Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

    Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

    http://iacoma.cs.uiuc.edu

    *Department of Computer Science and EngineeringUniversity of Washington

  • Pablo Montesinos DeLorean

    Motivation

    Debugging parallel applications on a multicore is challenging:

    Bugs can be very timing dependent

    Their effects might manifest after many instructions

    Need effective debugging techniques for parallel applications on MPs

    One such technique: HW-assisted deterministic replay

    2

  • Pablo Montesinos DeLorean

    HW-Assisted Deterministic Replay

    Phase I: Initial execution

    HW records into a log the order of dependences between threads

    This log captures the “interleaving” of threads

    Phase II: Replay

    Re-run the parallel application

    Enforce the dependence orders stored in the log

    Bacon and Goldstein. WPDD 91

    3

  • Pablo Montesinos DeLorean

    Contributions

    Propose DeLorean: a new scheme for hw-assisted deterministic replay of parallel programs based on “chunk-based” execution

    Records initial execution at production-run speeds

    Requires only small log (0.6% of current schemes)

    Replays at high-speed

    Has multiple modes of execution with different speed/logging requirements

    4

  • delorean

    Pablo Montesinos DeLorean

    Outline

    Background on replay schemes

    DeLorean: a Chunk-based replay scheme

    Advanced DeLorean Execution

    DeLorean Replay

    Evaluation

    5

  • Pablo Montesinos DeLorean

    Point-to-Point Schemes: FDR and RTR

    Record dependences between individual instructions of different threads

    Optimizations to reduce log size

    Transitive reduction

    Regulated transitive reduction

    Use Sequential Consistency (SC)

    RTR proposes an algorithm to support TSO:

    Record value of loads that violate SC

    P1

    Wb

    P2

    Rb

    n

    m

    P2’s LogP2’s LogP2’s Log

    P1 n m

    FDR (ISCA 03)

    6

    RTR (ASPLOS 06)

  • Pablo Montesinos DeLorean

    Vector-based Schemes: Strata

    Stratum: vector of as many counters as processors:

    Count # of memory operations executed

    Log stores sequence of strata

    It also uses SC

    Log size is similar to RTR’s

    Strata (ASPLOS 06)

    Ra

    P1 P2 P3Wa

    Wc 1 1 2

    7

    Stratum

    RdRb

    Log

  • Pablo Montesinos DeLorean

    Episode-Based Schemes: ReRun8

    ReRun (ISCA 08)

  • delorean

    Pablo Montesinos DeLorean

    Outline

    Background on replay schemes

    DeLorean: a Chunk-based replay scheme

    Advanced DeLorean Execution

    DeLorean Replay

    Evaluation

    9

  • Pablo Montesinos DeLorean

    Chunk-Based Execution

    Memory ops can be reordered and overlapped within and across chunks

    Memory access interleaving happens at chunk boundaries

    Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)

    P1

    ld Xld Zld Xcommit

    P0

    st Xst Y

    P0 P1

    st Xst Y

    ld Tld Zst W

    re-execute

    ld Xld Zld X

    P0

    st Xst Y

    P1

    ld Tld Zst W

    Group consecutive dynamic instructions

    commit

    commit

    Execute chunks atomically

    Execute chunks in isolation

    chunk chunk

    10

    P0

    Limit possible interleavings

    Mem

    P1

    P0collision

  • Pablo Montesinos DeLorean

    A Great Substrate for MP Replay

    Overlapping/reordering enables high-speed execution and replay

    Performance similar to a system with Release Consistency (RC)

    Limited memory interleaving minimizes logging requirements

    No need to record individual dependences

    11

  • Pablo Montesinos DeLorean

    Chunk-Based Replay: DeLorean

    Replaying a chunk-based system means:

    Generate same chunks during initial execution and replay

    And commit them in the same order

    DeLorean’s Log stores total order of chunk commits and their size

    DeLorean’s log is very small:

    Each entry is short: (ProcID, ChunkSize)

    Log updated infrequently

    Aggressive designs can make it even smaller

    12

    commit

    st Xst Y

    P1P0

    ld Ald Bld Ccommitst A

    st Xld Zst Zst T

    commit

    ld Yst Bst Acommit

    Log

    P0 2

    P1 3

    P1 3

    P0 5

  • Pablo Montesinos DeLorean

    Basic DeLorean Operation13

    Proc 0

    Commit Request Commit Request

    0

    1

    Ok Ok

    PI Log

    CS LogA Size

    ProcID = 0

    ProcID = 1

    Chunk A

    Proc 1

    CS LogB Size

    Chunk BArbiter

    Log implemented as two different structures:

    Chunk Size (CS) Log

    Processor Interleaving (PI) Log

  • delorean

    Pablo Montesinos DeLorean

    Outline

    Background on replay schemes

    DeLorean: a Chunk-Based replay scheme

    Advanced DeLorean Execution

    DeLorean Replay

    Evaluation

    14

  • Pablo Montesinos DeLorean

    500

    500

    225

    500

    500

    75

    500

    450

    300

    500

    100

    400

    325

    Reducing DeLorean’s Log

    Use large chunks1

    Fixed-size chunking2

    Predefined interleaving3

    0

    1

    1

    0

    0

    1

    0

    1

    0

    0

    0

    1

    1

    2000

    1660

    1580

    1250

    PI LogP0

    CS Log

    0

    1

    0

    0

    1

    1

    0

    2000

    2000

    2000

    2000

    15

    P1CS Log

    1850

    2000

    1950

    2000

    2000

    2000

    0

    1

    0

    1

    0

    1

    0

  • Pablo Montesinos DeLorean

    Reducing DeLorean’s Log

    Use large chunks1Fewer entries in log

    More collisions and cache overflows

    +

    Fixed-size chunking2No CS log

    Need to handle cache overflows

    +

    Predefined interleaving3No PI log

    Performance degradation

    +

    Almost no

    16

  • Pablo Montesinos DeLorean

    DeLorean Execution Modes

    PicoNano

    17

    Predefined interleaving3

    Fixed-size chunking2

    Use large chunks1

    Fixed-size chunking2

    Use large chunks1

    (tiny CS log) (tiny CS log)

    (no PI Log)

  • delorean

    Pablo Montesinos DeLorean

    Outline

    Background on replay schemes

    DeLorean: Chunk-Based replay scheme

    Advanced DeLorean Execution

    DeLorean Replay

    Evaluation

    18

  • Pablo Montesinos DeLorean

    DMA

    DMA Log

    Network

    Overall System

    Proc 0

    PI Log

    CS Log

    Chunk A Arbiter

    Interrupt Log

    I/O Log

    Proc 1

    CS Log

    Chunk B

    Interrupt Log

    I/O Log

    19

    Interrupt, I/O and DMA logs also appear in all past replay schemes

  • Pablo Montesinos DeLorean

    DeLorean Replay

    Virtual machine monitor loads initial system checkpoint

    Interrupt Log provides initial-execution interrupts

    I/O Log provides initial-execution I/O inputs

    Processors execute normally (they execute in parallel):

    Use CS Log to generate chunks with exceptional sizes (from cache overflows)

    Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)

    20

  • Pablo Montesinos DeLorean

    Also in the paper

    Rules for deterministic chunking

    How to deal with interrupts, traps, cache overflows, I/O ops

    Scalability of Pico

    A detailed view of DeLorean’s architecture

    Proof of why DeLorean’s replay is deterministic

    Combination with other HW replay schemes

    21

  • delorean

    Pablo Montesinos DeLorean

    Outline

    Background on replay schemes

    DeLorean: Chunk-Based replay scheme

    Advanced DeLorean Execution

    DeLorean Replay

    Evaluation

    22

  • Pablo Montesinos DeLorean

    Evaluation Setup

    Chunk-based simulator supports both execution and replay

    8-processor CMP

    1000, 2000, 3000 instructions per chunk

    PI Log: 4-bit entries

    CS Log: 32-bit entries

    Applications: SPLASH, SPECjbb 2000 and SPECweb 2005

    23

  • Pablo Montesinos DeLorean

    0

    2

    4

    6

    8

    10

    1000 2000 30001000 2000 30000

    2

    4

    6

    8

    10

    1000 2000 3000

    Log Space Requirements

    Nano’s log is 16% of RTR’s (2000-instruction configuration)

    With one additional optimization, Nano’s log is 7.5% of RTR’s

    Picos’s log is 0.6% of RTR’s (1000-instruction configuration)

    PI log (Compressed)CS log (Compressed)

    Log

    size

    (bits

    /cor

    e/ki

    lo-in

    st)

    Average RTRcompressedlog size(estimated)

    1000 2000 3000

    CS log

    Instructions per chunk

    Average RTRcompressedlog size(estimated)

    Instructions per chunk

    Log

    size

    (bits

    /cor

    e/ki

    lo-in

    st)

    PicoNano

    24

    (Compressed)

  • Pablo Montesinos DeLorean

    If We Were To Record a Day of Execution

    FDR RTR Nano Pico

    1.6

    60.0

    400.0

    GB

    /Cor

    e/D

    ay

    400

    300

    200

    100

    0

    Execution and replay of long periods of time is feasible

    25

    10000.0

  • Pablo Montesinos DeLorean

    Performance

    0

    0.2

    0.4

    0.6

    0.8

    1.0

    Spe

    edup

    Initial Execution

    Initial execution: Nano almost as fast as RC, Pico faster than SC

    Replay: comparable to initial execution speed

    Initial and Replay Execution

    SCPicoNanoRC Pico ReplayNano Replay

    26

    R PO SR R

  • Pablo Montesinos DeLorean

    Conclusions

    DeLorean is a new approach to HW replay of parallel applications

    Advantages over traditional HW replay systems:

    Executes at high speed (like most aggressive consistency models)

    Replays at comparable speed

    Summarizes execution into a very small log (0.6% of current schemes)

    Great help for programmers that need to debug parallel codes

    27

  • Pablo Montesinos DeLorean

    Future Work

    Adapt DeLorean to conventional multiprocessors without chunk-based execution

    Take conventional replay schemes, use very aggressive SC implementations, and compare to DeLorean

    Combine the best aspects of the different recording approaches (DeLorean, RTR and Strata)

    28

  • DeLorean: Recording and Deterministically Replaying Shared

    Memory Multiprocessor Execution EfficientlyPablo Montesinos, Luis Ceze* and Josep Torrellas

    Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

    http://iacoma.cs.uiuc.edu

    *Department of Computer Science and EngineeringUniversity of Washington