
Scalable Compression and Replay of Communication Traces in Massively Parallel Environments

Mike Noeth, Frank Mueller (North Carolina State University)
Martin Schulz, Bronis R. de Supinski (Lawrence Livermore National Laboratory)


2

Outline

- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work


3

Introduction

Contemporary HPC systems
- Size > 1,000 processors
- Example: IBM Blue Gene/L with 64k processors

Challenges on HPC Systems (large-scale scientific applications)

- Communication scaling (MPI)
- Communication analysis
- Task mapping

Procurements also require performance prediction of future systems


4

Communication Analysis

Existing approaches and their shortcomings:

- Source code analysis
  + Does not require machine time
  - Wrong abstraction level: source code often complicated
  - No dynamic information

- Lightweight statistical analysis (mpiP)
  + Low instrumentation cost
  - Less information available (i.e., aggregate metrics only)

- Fully captured (lossless) traces (Vampir, VNG)
  + Full trace available for offline analysis
  - Traces generated per task: not scalable
  - Gather only traces on a subset of nodes
  - Use a cluster for visualization



5

Our Approach

Trace-driven approach to analyze MPI communication

Goals:
- Extract entire communication trace
- Maintain structure
- Full replay (independent of original application)
- Scalable
- Lossless
- Rapid instrumentation
- MPI implementation independent


6

Design Overview

Two parts:

Recording traces
- Use MPI profiling layer
- Compress at the task level
- Compress across all nodes

Replaying traces


7

Outline

- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work


8

Intra-Node Compression Framework

Umpire [SC’00] wrapper generator for MPI profiling layer
- Initialization wrapper
- Tracing wrapper
- Termination wrapper

Intra-node compression of MPI calls
- Provides load scalability

Interoperability with the cross-node framework

Event aggregation
- Special handling of MPI_Waitsome

Maintain structure of call sequences via stack walk signatures
- XOR signature for speed
- An XOR match is necessary (but not sufficient) for the same call sequence


9

Intra-Node Compression Example

Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5

[Diagram: the op stream is scanned with a target window (target head/tail) and a merge window (merge head/tail); the trailing subsequence op3, op4, op5 matches the preceding op3, op4, op5 and the two occurrences are merged.]


10

Intra-Node Compression Example

Consider the MPI operation stream: op1, op2, op3, op4, op5, op3, op4, op5

Algorithm in the paper

Compressed result: op1, op2, (op3 op4 op5; iters = 2)
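The idea in this example can be sketched as tail-repeat detection over the recorded op stream. This is a minimal illustration, not the tool's actual code: ops are reduced to plain integers here, whereas the real framework compares full call records and stack signatures.

    /* Tail-repeat detection: if the last `len` ops equal the `len` ops just
     * before them, the two windows can be merged into one sequence with an
     * iteration count of 2 (repeated merges yield higher counts). */
    #include <stdio.h>

    static int tail_repeats(const int *ops, int n, int len)
    {
        if (2 * len > n) return 0;
        for (int i = 0; i < len; i++)
            if (ops[n - len + i] != ops[n - 2 * len + i]) return 0;
        return 1;
    }

    int main(void)
    {
        int ops[] = {1, 2, 3, 4, 5, 3, 4, 5};   /* op1..op5, op3, op4, op5 */
        int n = 8;
        for (int len = 1; len <= n / 2; len++)
            if (tail_repeats(ops, n, len))
                printf("merge ops[%d..%d], iters = 2\n", n - 2 * len, n - len - 1);
        return 0;
    }

Run on the stream above, this reports that ops[2..4] (op3, op4, op5) can be folded into a single two-iteration sequence, matching the compressed result shown on the slide.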


11

Event Aggregation

Domain-specific encoding improves compression

MPI_Waitsome(array_of_requests, incount, outcount, ...)
- Blocks until one or more requests are satisfied
- The number of Waitsome calls in a loop is nondeterministic

Take advantage of typical usage to delay compression (sketch below):
- MPI_Waitsome is not compressed until a different operation is executed
- Accumulate the output parameter outcount
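As a hedged sketch of how this aggregation could be done at the MPI profiling layer (the record_waitsome/record_op helpers and the flush placement are hypothetical, not the tool's API):

    #include <mpi.h>

    static int waitsome_pending  = 0;   /* inside a run of MPI_Waitsome calls? */
    static int waitsome_outcount = 0;   /* outcount accumulated over that run  */

    void record_waitsome(int total_outcount);   /* hypothetical trace helpers  */
    void record_op(const char *name);

    static void flush_waitsome(void)    /* emit one aggregated record, if any  */
    {
        if (waitsome_pending) {
            record_waitsome(waitsome_outcount);
            waitsome_pending = waitsome_outcount = 0;
        }
    }

    int MPI_Waitsome(int incount, MPI_Request reqs[], int *outcount,
                     int indices[], MPI_Status statuses[])
    {
        int rc = PMPI_Waitsome(incount, reqs, outcount, indices, statuses);
        waitsome_pending = 1;
        if (*outcount != MPI_UNDEFINED)
            waitsome_outcount += *outcount;   /* accumulate, do not record yet */
        return rc;
    }

    int MPI_Barrier(MPI_Comm comm)      /* every other wrapper flushes first   */
    {
        flush_waitsome();
        record_op("MPI_Barrier");
        return PMPI_Barrier(comm);
    }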


12

Outline

- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work


13

Inter-Node Framework Interoperability

Single Program, Multiple Data (SPMD) nature of MPI codes

Match operations across nodes by manipulating parameters:
- Source / destination offsets
- Request offsets


14

Location-Independent Encoding

Point-to-point communication specifies targets by MPI rank
- MPI rank parameters will not match across tasks; use offsets instead

16-processor (4x4) 2D stencil example
- MPI rank targets (source/destination)
  - Rank 9 communicates with 8, 5, 10, 13
  - Rank 10 communicates with 9, 6, 11, 14
- MPI offsets
  - Ranks 9 and 10 both communicate with -1, -4, +1, +4
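A tiny standalone illustration (plain C, not part of the tracer) of why the offset encoding makes ranks 9 and 10 produce identical records in this stencil:

    #include <stdio.h>

    int main(void)
    {
        int ranks[] = {9, 10};
        for (int i = 0; i < 2; i++) {
            int me = ranks[i];
            /* west, north, east, south neighbors in a 4x4 grid */
            int partners[] = {me - 1, me - 4, me + 1, me + 4};
            printf("rank %2d offsets:", me);
            for (int j = 0; j < 4; j++)
                printf(" %+d", partners[j] - me);   /* -1 -4 +1 +4 for both */
            printf("\n");
        }
        return 0;
    }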


15

Request Handles

Asynchronous MPI ops are associated with an MPI_Request handle
- Handles are nondeterministic across tasks, causing parameter mismatches in the inter-node framework

Handle this with a circular request buffer of pre-set size (sketch below):
- On an asynchronous MPI operation, store the handle in the buffer
- On lookup of a handle, use its offset from the current position in the buffer
- Requires special handling in the replay mechanism

[Diagram: circular request buffer holding H1, H2, H3 with a moving "current" position; after two later requests, looking up H1 yields offset -2.]
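A minimal sketch of such a circular request buffer; the fixed size of 64 entries is an arbitrary choice here, and the real tool's bookkeeping may differ:

    #include <mpi.h>

    #define REQ_BUF_SIZE 64                 /* pre-set size (assumed here)   */

    static MPI_Request req_buf[REQ_BUF_SIZE];
    static int req_cur = 0;                 /* total requests issued so far  */

    /* Called when an asynchronous op (e.g. MPI_Isend) returns a request. */
    static void remember_request(MPI_Request r)
    {
        req_buf[req_cur % REQ_BUF_SIZE] = r;
        req_cur++;
    }

    /* Called when a request is consumed (e.g. by MPI_Wait): record the
     * offset relative to the current position, e.g. -2 for "two requests
     * ago"; this offset is identical across tasks with the same pattern. */
    static int request_offset(MPI_Request r)
    {
        for (int back = 1; back <= REQ_BUF_SIZE && back <= req_cur; back++)
            if (req_buf[(req_cur - back) % REQ_BUF_SIZE] == r)
                return -back;
        return 0;                           /* not found: buffer too small  */
    }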


16

Inter-Node Compression Framework

Invoked after all computation is done (in the MPI_Finalize wrapper)

- Merges operation queues produced by the task-level framework
- Provides job-size scalability

[Diagram: per-task operation queues (op1 op2 op3, op4 op5 op6) for Tasks 0-3; matching sequences across tasks are merged.]


17

Inter-Node Compression Framework

Invoked after all computation is done (in the MPI_Finalize wrapper)

- Merges operation queues produced by the task-level framework
- Provides job-size scalability

There is more: relaxed reordering of events of different nodes (with a dependence check); see the paper.

[Diagram: per-task operation queues for Tasks 0-3; in this step, the sequence op4 op5 op6 matches and is merged.]


18

Reduction over Binary Radix Tree

Cross-node framework merges operation queues of each task

Merge algorithm merges two queues at a time (details in the paper)

Radix layout facilitates compression (constant stride between nodes)

Need a control mechanism to order the merging process (one common schedule is sketched below)
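One common control mechanism for such a reduction is a binomial-tree schedule: in each round, half of the remaining tasks ship their queues to a partner and drop out, so rank 0 ends up with the fully merged queue. The sketch below assumes hypothetical send_queue/recv_and_merge_queue helpers and is not necessarily the paper's exact ordering:

    #include <mpi.h>

    void send_queue(int dest);              /* hypothetical: ship my queue       */
    void recv_and_merge_queue(int src);     /* hypothetical: merge into my queue */

    void reduce_queues(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 1; step < size; step <<= 1) {
            if (rank & step) {              /* this round: I am a sender */
                send_queue(rank - step);
                break;                      /* done after handing off    */
            } else if (rank + step < size) {
                recv_and_merge_queue(rank + step);
            }
        }
    }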


19

Outline

- Introduction
- Intra-Node Compression Framework
- Inter-Node Compression Framework
- Replay Mechanism
- Experimental Results
- Conclusion and Future Work


20

Replay Mechanism

Motivation
- Can replay traces on any architecture
  - Useful for rapid prototyping and procurements
  - Communication tuning (Miranda, SC'05)
  - Communication analysis (patterns)
  - Communication tuning (inefficiencies)

Replay design (sketch below)
- Replays the comprehensive trace produced by the recording framework
- Parses the trace and loads task-level op queues (inverse of the merge algorithm)
- Replays on-the-fly (inverse of the compression algorithm)
- Timing deltas
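A sketch of what on-the-fly replay of a compressed record might look like; the record layout and the issue_op/apply_time_delta helpers are hypothetical placeholders for the trace format and the MPI re-issue logic:

    #include <stddef.h>

    struct record {
        int   *ops;       /* op ids in this (possibly looped) sequence        */
        size_t nops;
        int    iters;     /* 1 for a plain sequence, >1 for a compressed loop */
    };

    void issue_op(int op);           /* hypothetical: re-issue the MPI call   */
    void apply_time_delta(int op);   /* hypothetical: honor recorded deltas   */

    void replay(const struct record *recs, size_t nrecs)
    {
        /* Expand loops while replaying; the full per-task trace is never
         * materialized in memory. */
        for (size_t r = 0; r < nrecs; r++)
            for (int it = 0; it < recs[r].iters; it++)
                for (size_t i = 0; i < recs[r].nops; i++) {
                    apply_time_delta(recs[r].ops[i]);
                    issue_op(recs[r].ops[i]);
                }
    }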


21

Experimental Results

Environment
- 1024-node BG/L at Lawrence Livermore National Laboratory
- Stencil micro-benchmarks
- Raptor real-world application [Greenough’03]


22

Trace System Validation

- Uncompressed trace dumps compared
- Replay validated against matching profiles from mpiP


23

Task Scalability Performance

Varied:
- Strong (task) scaling: number of nodes

Examined metrics:
- Trace file size
- Memory usage
- Compression time (or write time)

Results for:
- 1/2/3D stencils
- Raptor application
- NAS Parallel Benchmarks


24

Trace File Size – 3D Stencil

- Constant size for fully compressed traces (inter-node)
- [Chart, log scale; annotated sizes: 100 MB, 0.5 MB, 10 kB]


25

Memory Usage – 3D Stencil

- Constant memory usage for fully compressed (inter-node)
  - min = leaves; avg = middle layer (decreases with node count); max ~ task 0
- Average memory usage decreases with more processors
- [Chart annotation: 0.5 MB]


26

Trace File Size – Raptor App

- Sub-linear increase for fully compressed traces (inter-node)
- [Chart, NOT on log scale; annotated sizes: 10 kB, 80 MB, 93 MB]


27

Memory Usage – Raptor

- Constant memory usage for fully compressed (inter-node)
- Average memory usage decreases with more processors
- [Chart annotation: 500 MB]


28

Load Scaling – 3D Stencil

- Both intra- and inter-node compression result in constant size
- [Chart, log scale]


29

Trace File Size – NAS PB Codes

- [Chart: log-scale file size in bytes, 32-512 CPUs; three categories: no / intra-node / inter-node compression; focus: blue = full compression]
- Near-constant size (EP, also IS, DT)
  - Instead of exponential growth
- Sub-linear (MG, also LU)
  - Still good
- Non-scalable (FT, also BT, CG)
  - Still 2-4 orders of magnitude smaller
  - But could improve
  - Due to complex communication patterns
    - Along the diagonal of the 2D layout
    - Even a varying number of endpoints


30

Memory Usage – NAS PB Codes

- [Chart: log-scale memory in bytes, 32-512 CPUs; categories: min, avg, max, root (task 0)]
- Near-constant size (EP, also IS, DT)
  - Also constant in memory
- Sub-linear (MG, also LU)
  - Sometimes constant in memory
- Non-scalable (FT, also BT, CG)
  - Non-scalable in memory


31

Compression/Write Overhead – NAS PB

- [Chart: log-scale time in ms, 32-512 CPUs; no / intra-node / inter-node (full) compression]
- Near-constant size (EP, also IS, DT)
  - Inter-node compression fastest
- Sub-linear (LU, also MG)
  - Intra-node faster than inter-node
- Non-scalable (FT, also BT, CG)
  - Not competitive; writing is faster with intra-node compression


32

Conclusion

Contributions:
- Scalable approach to capturing a full communication trace
  - Near-constant trace sizes for some apps (others need more work)
  - Near-constant memory requirement
- Rapid analysis via the replay mechanism
- Lossless MPI tracing of any number of nodes is feasible
  - May store and visualize MPI traces on your desktop

Future Work:
- Task layout model (e.g., Miranda)
- Post-analysis stencil identification
- Tuning: detect non-scalable MPI usage
- Support for procurements
- Scalable replay
- Offload compression to I/O nodes


33

Acknowledgements

Mike Noeth (NCSU), Martin Schulz (LLNL), Bronis R. de Supinski (LLNL), Prasun Ratn (NCSU)

Availability under BSD license: moss.csc.ncsu.edu/~mueller/scala.html

Funded in part by NSF CCF-0429653, CNS-0410203, CAREER CCR-0237570.

Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48, UCRL-???


34

Global Compression/Write Time [ms]

[Chart: average time per node and max. time for BT, CG, DT, EP, FT, IS, LU, MG]


35

Trace File Size – Raptor

- Near-constant size for fully compressed traces
- [Chart, linear scale]


36

Memory Usage – Raptor

- Results same as for the 3D stencil
- Investigating the min memory usage for 1024 tasks


37

Intra-Node Compression Algorithm

[Flowchart: intercept MPI call -> identify target -> identify merger -> match verification -> compression]


38

Call Sequence ID: Stack Walk Signature

- Maintain structure by distinguishing between operations
- XOR signature for speed (sketch below)
  - An XOR match is necessary (but not sufficient) for the same calling context
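A minimal sketch of an XOR stack-walk signature using glibc's backtrace(); the actual tool may walk the stack differently, and a full frame-by-frame comparison is still needed to confirm identical calling contexts:

    #include <execinfo.h>
    #include <stdint.h>

    #define MAX_FRAMES 32

    static uintptr_t stack_signature(void)
    {
        void *frames[MAX_FRAMES];
        int n = backtrace(frames, MAX_FRAMES);   /* collect return addresses    */
        uintptr_t sig = 0;
        for (int i = 0; i < n; i++)
            sig ^= (uintptr_t)frames[i];         /* cheap, collision-prone hash */
        return sig;
    }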


39

Inter-Node Merge Algorithm

[Flowchart: iterate through both queues -> find match -> maintain order -> compress ops]
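A simplified sketch of this merge loop; the sequence type and the helpers (seq_equal, add_participants, insert_before) stand in for the tool's real data structures:

    #include <stdbool.h>
    #include <stddef.h>

    struct seq;                                /* one compressed op sequence */
    bool seq_equal(const struct seq *a, const struct seq *b);
    void add_participants(struct seq *dst, const struct seq *src);
    void insert_before(struct seq **master, size_t pos, const struct seq *s);

    void merge_queues(struct seq **master, size_t *mlen,
                      const struct seq *slave, size_t slen)
    {
        size_t mpos = 0;                       /* search/insert position in master */
        for (size_t s = 0; s < slen; s++) {
            size_t m = mpos;
            while (m < *mlen && !seq_equal(master[m], &slave[s]))
                m++;                           /* look ahead for a match */
            if (m < *mlen) {
                add_participants(master[m], &slave[s]);   /* fold participants in   */
                mpos = m + 1;                  /* keep relative order */
            } else {
                insert_before(master, mpos, &slave[s]);   /* move unmatched op over */
                (*mlen)++;
                mpos++;
            }
        }
    }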


40

Inter-Node Merge Example

Consider two tasks, each with its own operation queue:

[Diagram: master queue holds Sequence 1 and Sequence 4 (participant: Task 0); slave queue holds Sequences 1-4 (participant: Task 1); with the iterator markers (MI, SI, SH) on the first entries, the two Sequence 1 entries match (MATCH!).]


41

Inter-Node Merge Example

Consider two tasks, each with its own operation queue:

[Diagram: same master and slave queues; the iterator markers (MI, SH, SI) advance and the next matching sequence is found (MATCH!).]


42

Cross-Node Merge Example

Consider two tasks, each with its own operation queue:

[Diagram: after merging Sequence 1, the slave queue holds Sequences 2, 3, and 4 (participant: Task 1); the iterators advance and the two Sequence 4 entries match (MATCH!).]


43

Temporal Cross-Node Reordering

Requirement: the queue maintains the order of operations

The merge algorithm maintains operation order too strictly
- Unmatched sequences in the slave queue are always moved to the master
- This results in poorer compression

Solution: only move operations that must be moved (sketch below)
- Intersect the task participation lists of matched and unmatched ops
- If the intersection is empty, there is no dependency
- Otherwise the ops must be moved
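A sketch of the dependence test on participant lists, assuming they are kept sorted by task id:

    #include <stdbool.h>
    #include <stddef.h>

    /* True if the matched and unmatched sequences share any participating
     * task; in that case their relative order must be preserved, otherwise
     * they are independent and the unmatched sequence need not be moved. */
    static bool participants_intersect(const int *a, size_t na,
                                       const int *b, size_t nb)
    {
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            if (a[i] == b[j]) return true;
            if (a[i] < b[j]) i++; else j++;
        }
        return false;
    }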


44

Dependency Example

Consider a 4-task job (tasks 1 and 3 have already merged):

[Diagram: master queue holds Sequence 1 and Sequence 2 (participant: Task 0); slave queue holds Sequence 2 (participant: Task 1) and Sequence 1 (participant: Task 3); a matching sequence is found (MATCH!).]


45

Dependency Example

Consider a 4-task job (tasks 1 and 3 have already merged):

[Diagram: same master and slave queues, with the iterator markers (MI, SH, SI) repositioned.]


46

Dependency Example

Consider a 4-task job (tasks 1 and 3 have already merged):

[Diagram: same queues; moving the unmatched Sequence 1 (participant: Task 3) into the master queue would create duplicates.]


47

Dependency Example

Consider a 4-task job (tasks 1 and 3 have already merged):

[Diagram: resulting queues after applying the dependence check; participant labels Task 3 and Task 1.]