Luiz DeRose (ldr@cray.com) © Cray Inc.

cscads.rice.edu/DeRose-CScADS-2010.pdf · 2018-04-26


Paradyn Group – University of Wisconsin

Future Technologies Group / ORNL

ADEPT / LLNL

Monash University

Allinea

August 2, 2010 Luiz DeRose (ldr@cray.com) © Cray Inc.


Peta-scale systems with hundreds of thousands of threads need a new debugging paradigm

• Innovative techniques for productivity and scalability
    Scalable Solutions based on MRNet
    o STAT - Stack Trace Analysis Tool
        » Scalable generation of a single, merged, stack backtrace tree
    o ATP - Abnormal Termination Processing
        » Scalable analysis of a sick application, delivering a STAT tree and a minimal, comprehensive, core file set.
    Comparative debugging
    o A data-centric paradigm instead of the traditional control-centric paradigm
    Fast Track Debugging
    o Debugging optimized applications

• Support for traditional debugging mechanism
    TotalView, DDT, and gdb


Tree based software overlay network

Provides efficient multicast and reduction communications for parallel and distributed tools

Uses a tree of processes between the tool's front-end and back-ends to improve group communication performance
• Internal processes are used to distribute important tool activities
    Reduce data analysis time
    Keep tool front-end loads manageable
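A minimal sketch of the idea behind a tree-based reduction, assuming each back-end holds one value and every internal process combines the results of a fixed number of children. The names `tree_reduce` and `tree_depth` and the fan-out values are illustrative assumptions, not part of the MRNet API.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(values: Sequence[T], combine: Callable[[T, T], T], fanout: int = 2) -> T:
    """Combine back-end values level by level, the way internal tree processes would."""
    if not values:
        raise ValueError("need at least one back-end value")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level: List[T] = list(values)
    while len(level) > 1:
        next_level: List[T] = []
        # Each group of `fanout` neighbours stands for the children of one internal process.
        for start in range(0, len(level), fanout):
            group = level[start:start + fanout]
            merged = group[0]
            for item in group[1:]:
                merged = combine(merged, item)
            next_level.append(merged)
        level = next_level
    return level[0]


def tree_depth(num_backends: int, fanout: int) -> int:
    """Number of tree levels needed so that the front-end can reach every back-end."""
    depth = 0
    reach = 1
    while reach < num_backends:
        reach *= fanout
        depth += 1
    return depth


# Hypothetical example: 8 back-ends each report one number, and the tree sums them.
total = tree_reduce(list(range(1, 9)), lambda a, b: a + b, fanout=2)
```

With an assumed fan-out of 32, `tree_depth(131072, 32)` gives 4 levels for 128K back-ends, so no single process ever combines more than 32 inputs.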


STAT - Stack Trace Analysis Tool
• Scalable generation of a single, merged, stack backtrace tree
    128K processes analyzed in 2.7 seconds, using MRNet
    A comprehensible view of the entire application
    Focuses and directs debugger subset attach opportunities

ATP - Abnormal Termination Processing
• Scalable analysis of a sick application, delivering a STAT tree and a minimal, comprehensive, core file set
    Minimal impact on application run
    Automated, transparent collection of data
    Ability to hold failing application for close inspection
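A merged stack backtrace tree can be pictured as a prefix tree over call paths, where each node records which ranks passed through that frame. The sketch below assumes every process reports its backtrace as a list of function names, outermost frame first. The class and function names and the sample traces are hypothetical and do not reflect STAT's actual data structures.

```python
from typing import Dict, List, Set


class TraceNode:
    """One frame in the merged tree, with the set of ranks whose backtrace includes it."""

    def __init__(self, frame: str) -> None:
        self.frame = frame
        self.ranks: Set[int] = set()
        self.children: Dict[str, "TraceNode"] = {}


def merge_backtraces(traces: Dict[int, List[str]]) -> TraceNode:
    """Merge per-rank backtraces (outermost frame first) into a single prefix tree."""
    root = TraceNode("<root>")
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            node = node.children.setdefault(frame, TraceNode(frame))
            node.ranks.add(rank)
    return root


def render(node: TraceNode, depth: int = 0) -> List[str]:
    """Return an indented text view of the tree, one line per frame."""
    lines = [f"{'  ' * depth}{node.frame} {sorted(node.ranks)}"]
    for name in sorted(node.children):
        lines.extend(render(node.children[name], depth + 1))
    return lines


# Hypothetical traces from four ranks.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute", "sqrt"],
    3: ["main", "io_write"],
}
merged = merge_backtraces(traces)
```

Ranks that share a call path collapse into one branch, which is what keeps the view readable as the process count grows.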


Applications on Cray systems use hundreds of thousands of processes

When a large scale parallel application dies, one, many, or all processes might trap!
• It is next to impossible to examine all the core files and backtraces
    No one wants that many stack backtraces
    No one wants that many core files
    o They are too slow and too big
        » Sufficient storage for all core files is a problem
    They are too much to comprehend

• A single core file or stack backtrace is usually not enough to debug either!

A single backtrace produced might not be from the process that first failed


ATP produces a single merged stack trace
• or a reduced set of core files (future work)

The benefits:
• Easy to navigate the merged stack trace
• Manageable set of core files
• Reduced amount of data saved
    Especially true in the core file situation

Requirements:
• Minimum jitter
• Scalability
• Robustness
• Small footprint
• Limited core file dumping


System of light weight back-end monitor processes on compute nodes
• Coupled together with MRNet

Leap into action on any application process trapping

STAT like analysis provides merged stack backtrace tree
• Leaf nodes of tree define a modest set of processes to core dump
    or, a set of processes to attach to with a debugger

Uses StackwalkerAPI to collect individual stack backtraces, even for optimized code
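One way to read "leaf nodes define a modest set of processes" is to pick a single representative rank for every distinct call path that ends at a leaf of the merged tree. The sketch below assumes backtraces are lists of function names, outermost frame first. Choosing the lowest rank per leaf is an illustrative policy, not ATP's documented behaviour, and `leaf_representatives` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def leaf_representatives(traces: Dict[int, List[str]]) -> Dict[Tuple[str, ...], int]:
    """Map each leaf call path of the merged tree to one representative rank."""
    by_path: Dict[Tuple[str, ...], List[int]] = {}
    for rank, frames in traces.items():
        by_path.setdefault(tuple(frames), []).append(rank)

    paths = list(by_path)

    def is_leaf(path: Tuple[str, ...]) -> bool:
        # A path is a leaf when no other path extends it further down the call stack.
        return not any(
            len(other) > len(path) and other[:len(path)] == path for other in paths
        )

    # Assumed policy: the lowest rank on each leaf path is the one to core dump or attach to.
    return {path: min(ranks) for path, ranks in by_path.items() if is_leaf(path)}


# Hypothetical traces: ranks 0 and 1 share a path; rank 3 stopped in an interior frame.
sample_traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve"],
}
selected = leaf_representatives(sample_traces)
```

Here four processes reduce to two candidates (ranks 0 and 2), one per leaf of the merged tree.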


[Figure: application development workflow. Labels: Write / Modify / Port; Compile & Link; Debug; App runs (verification); Optimize; App runs (production); Normal Termination; Abnormal Termination; ATP; Stacktrace (atpMergedBT.dot); STATview; Exit.]


Application process signal handler
o Triggers analysis
o Controls its own RLIMIT_CORE and core_pattern

Back-end monitor
o Collects backtraces via StackwalkerAPI
o Forces core dumps as directed

Front-end controller
o Coordinates analysis via MRNet
o Selects process set that is to dump core

[Figure: compute node layout. Labels: App rank 1 to App rank 4, each with an ATP Sig Handler; ATP back-end with Main, Select( ), fd, MRNet, Signal Handler, StackwalkerAPI.]
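The core-file limit mentioned above is an ordinary per-process resource limit, so a process can switch its own core dumps on or off. The sketch below uses Python's standard `resource` module on a Unix system to show the mechanism. The helper name `set_core_dumps` is hypothetical, and this is not ATP's code. On Linux, `core_pattern` is a separate kernel setting exposed at /proc/sys/kernel/core_pattern and is not touched here.

```python
import resource


def set_core_dumps(enabled: bool) -> int:
    """Enable or disable core dumps for this process and return the new soft limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # The soft limit may be moved anywhere up to the hard limit without extra privileges.
    new_soft = hard if enabled else 0
    resource.setrlimit(resource.RLIMIT_CORE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)[0]
```

A process outside the selected set would call `set_core_dumps(False)` before terminating, so that only the chosen processes write core files.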


ATP is launched via an ALPS enhancement which includes the fork/exec of a login side ATP front-end daemon
• The ATP front-end uses MRNet and the ALPS tool helper library to launch ATP back-end servers on all compute nodes associated with the application

ATP signal handler runs within an application to catch fatal errors
• It handles the following signals:
    SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGFPE, SIGBUS, SIGSEGV, SIGSYS, SIGXCPU, SIGXFSZ
    Setting the environment variables MPICH_ABORT_ON_ERROR and SHMEM_ABORT_ON_ERROR will cause a signal to be thrown and captured for MPI and SHMEM fatal errors

ATP daemon running on the compute node captures signals, starts termination processing
• Rest of the application processes are notified
• Generates a stacktrace
• Creates a single merged stack trace file
    The stack trace file is viewed with the STATview tool
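To illustrate what installing one handler for the whole list of fatal signals looks like, here is a small sketch using Python's standard `signal` module on a Unix system. It is an analogy only: the real ATP handler is native code inside the application, and a Python-level handler is not a reliable way to catch synchronous faults such as SIGSEGV. The names `install_handlers` and `restore_handlers` are hypothetical.

```python
import signal
from typing import Callable, Dict

# Signal names taken from the list above.
ATP_SIGNAL_NAMES = [
    "SIGQUIT", "SIGILL", "SIGTRAP", "SIGABRT", "SIGFPE",
    "SIGBUS", "SIGSEGV", "SIGSYS", "SIGXCPU", "SIGXFSZ",
]


def install_handlers(on_fatal: Callable[[str], None]) -> Dict[int, object]:
    """Install one handler for every listed signal; return the previous handlers."""
    previous: Dict[int, object] = {}

    def handler(signum, frame):
        # Report the signal by name; a real tool would start its analysis here.
        on_fatal(signal.Signals(signum).name)

    for name in ATP_SIGNAL_NAMES:
        signum = getattr(signal, name, None)
        if signum is None:
            continue  # signal not defined on this platform
        previous[signum] = signal.signal(signum, handler)
    return previous


def restore_handlers(previous: Dict[int, object]) -> None:
    """Put back the handlers that were active before install_handlers()."""
    for signum, old in previous.items():
        # signal.signal() reports None for handlers not installed from Python.
        signal.signal(signum, signal.SIG_DFL if old is None else old)
```

For example, after `install_handlers(print)`, calling `signal.raise_signal(signal.SIGABRT)` invokes the callback with the signal's name instead of aborting the interpreter.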


Released May 20, 2010

Manual (rather than automatic)
• Must add code to application
• Must link specially
• Must explicitly launch (CLE 2.2)

Provides backtrace of first crash to stderr

Provides merged backtrace tree

Tested at 1024 PEs


Automatic invocation of ATP
• Today users need to insert signal handler
• With next release of OS, just need to load atp module

Hold a dying application in stasis
• Gives the user an opportunity to attach a debugger to the application

Send email notification to user that job has failed

Improved scalability
• ATP stack backtraces have been produced on applications made up of about 2000 processes
• Expect to be able to handle applications with 100,000s of processes in the future


Direct control of millions of threads is unworkable
• Need to narrow down to a limited number of threads
• Direct control of limited threads, once narrowed down

Narrowing down requires a different approach
• Data-centric view, instead of the traditional control flow view

In a data-centric view, users make statements about the contents of the data structures, and the debugger checks whether those statements hold


Originally from Monash University in Australia

Helps the programmer locate errors in the program by observing the divergence in key data structures as the programs are executing

Allows comparison of a “suspect” program against a “reference” code using assertions
• Simultaneous execution of both
• Ability to assert the match of data at given points in execution
• Focus on data – not state and internal operations
• Narrow down problem without massive thread study

Data centric – no additional complexity as scale increases

Cray and Monash University are working together under a grant from the Australian government on scalability research.


Data assertions are the heart of comparative debugging
• Assert that data1 at location1 matches data2 at location2

The comparative debugger verifies this assertion as the applications execute

When the assertion fails, one has a specific region of the failing application to inspect
• This often means that the user iterates with a refined assertion to further narrow the search area

Assertions can be simple (scalar to scalar) or more complex (serial to parallel multidimensional arrays)
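A data assertion of this kind can be sketched as a check that runs at each agreed point in execution and reports the first place where the two programs disagree. The example assumes the reference program is serial, the suspect program holds the same array block-decomposed across ranks, and values match when they agree within a relative tolerance. All names, the tolerance, and the sample data are illustrative assumptions and do not represent the debugger's assertion language.

```python
import math
from typing import List, Optional, Sequence, Tuple


def gather_blocks(blocks: Sequence[Sequence[float]]) -> List[float]:
    """Reassemble a block-decomposed array (one block per rank) into serial order."""
    flat: List[float] = []
    for block in blocks:
        flat.extend(block)
    return flat


def first_divergence(
    reference: Sequence[Sequence[float]],
    suspect: Sequence[Sequence[Sequence[float]]],
    rel_tol: float = 1e-6,
) -> Optional[Tuple[int, int]]:
    """Return (checkpoint, element index) of the first mismatch, or None if all match."""
    for checkpoint, (ref, blocks) in enumerate(zip(reference, suspect)):
        sus = gather_blocks(blocks)
        if len(ref) != len(sus):
            return (checkpoint, min(len(ref), len(sus)))
        for index, (x, y) in enumerate(zip(ref, sus)):
            if not math.isclose(x, y, rel_tol=rel_tol):
                return (checkpoint, index)
    return None


# Hypothetical data: two checkpoints, serial reference versus a two-rank suspect run.
reference_run = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
suspect_run = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.1]]]
result = first_divergence(reference_run, suspect_run)
```

The result points at checkpoint 1, element 3, which is the region a user would then target with a refined assertion.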


Serial converted to parallel
Small scaling versus large scaling
One optimization level versus another
One programming language converted to another
. . .

[Figure: comparative debugging configurations. Labels: RDClient; RDServer; Serial App; Parallel App; Serial; Parallel 1; Parallel 2; Opt 1; Opt 2.]


MM5 Weather model
Serial versus parallel
Difference >0.1%
Narrowed to missing term in equation… term found and added in

Courtesy of David Abramson from Monash University


Existing implementation of comparative debugger running on Cray XT and XE Systems

Modifying to improve scalability
• Add process sets as primitive data types
• Implement set operations for setting breakpoints, etc
• Considering MRNet tree for gathering breakpoint events
• Removing reliance on socket connections to debug servers
• Simplified original data decomposition specification language to match current high level languages

Experimenting with alternative comparison techniques
• Computing signatures on decomposed data structures rather than always importing data for comparison

[Figure: signature-based comparison. Labels: Head node (Compared, Signature, Combined, Hash); Interconnection Network; Compute Nodes, each with a Debug Server and Application holding Data and Hash.]
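The signature idea can be sketched as follows: each debug server hashes its local block, and only the hashes travel to the head node, where they are combined and compared. The example assumes both programs use the same decomposition and that an exact bit-for-bit match is wanted. SHA-256, the function names, and the sample data are illustrative choices, not what the actual implementation uses.

```python
import hashlib
import struct
from typing import List, Sequence


def block_hash(block: Sequence[float]) -> bytes:
    """Hash one locally held block of doubles (computed on the compute node)."""
    return hashlib.sha256(struct.pack(f"<{len(block)}d", *block)).digest()


def combined_signature(blocks: Sequence[Sequence[float]]) -> str:
    """Combine the per-block hashes, in rank order, into one signature (head node)."""
    combined = hashlib.sha256()
    for block in blocks:
        combined.update(block_hash(block))
    return combined.hexdigest()


def differing_blocks(
    reference: Sequence[Sequence[float]], suspect: Sequence[Sequence[float]]
) -> List[int]:
    """List the block indices whose hashes differ, so only those need to be imported."""
    return [
        index
        for index, (ref, sus) in enumerate(zip(reference, suspect))
        if block_hash(ref) != block_hash(sus)
    ]


# Hypothetical data: two blocks per program, with one changed value in the second block.
reference_blocks = [[1.0, 2.0], [3.0, 4.0]]
suspect_blocks = [[1.0, 2.0], [3.0, 4.5]]
signatures_match = combined_signature(reference_blocks) == combined_signature(suspect_blocks)
mismatched = differing_blocks(reference_blocks, suspect_blocks)
```

Only 32 bytes per block cross the network in this sketch, whatever the block size. The trade-off is that a hash detects exact differences only, so a tolerance-based comparison would still need the data itself.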


How to debug parallel optimized codes

Debug flags eliminate optimizations
• Today's machines really need optimizations
• Slows down execution
• Problem might disappear

Fast Track Debugging addresses this problem


Compile such that both debug and non-debug (optimized) versions of each routine are created
• Debug and non-debug versions of each subroutine appear in the executable

Linkage such that optimized versions are used by default

User sets breakpoints or other debug constructs
• Debugger overrides default linkage when setting breakpoints and stepping into functions
• Routines automatically presented using the debug version of the routine
• Rest of program executes using optimized versions of the routines
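The dispatch behaviour described above can be modelled as a table holding two versions of every routine, with the optimized one used unless a breakpoint has been set on that routine. This is a conceptual sketch only: the real mechanism operates at link and debugger level on compiled code, and the class name, method names, and example routine are hypothetical.

```python
from typing import Callable, Dict, List, Set


class FastTrackProgram:
    """Toy model: each routine has an optimized and a debug version."""

    def __init__(self) -> None:
        self._optimized: Dict[str, Callable] = {}
        self._debug: Dict[str, Callable] = {}
        self._breakpoints: Set[str] = set()

    def register(self, name: str, optimized: Callable, debug: Callable) -> None:
        self._optimized[name] = optimized
        self._debug[name] = debug

    def set_breakpoint(self, name: str) -> None:
        # Setting a breakpoint overrides the default (optimized) linkage for this routine.
        self._breakpoints.add(name)

    def clear_breakpoint(self, name: str) -> None:
        self._breakpoints.discard(name)

    def call(self, name: str, *args):
        table = self._debug if name in self._breakpoints else self._optimized
        return table[name](*args)


calls: List[str] = []


def sum_squares_optimized(n: int) -> int:
    calls.append("optimized")
    return n * (n + 1) * (2 * n + 1) // 6  # closed form, no loop to step through


def sum_squares_debug(n: int) -> int:
    calls.append("debug")
    total = 0
    for i in range(1, n + 1):  # straightforward loop, easy to single-step
        total += i * i
    return total


program = FastTrackProgram()
program.register("sum_squares", sum_squares_optimized, sum_squares_debug)
first = program.call("sum_squares", 5)   # no breakpoint: optimized version runs
program.set_breakpoint("sum_squares")
second = program.call("sum_squares", 5)  # breakpoint set: debug version runs
```

Both calls return the same value. Only the routine under inspection pays the cost of the unoptimized code.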


Support available in the Cray Compiling Environment

Prototype in gdb
• Exercised through lgdb

Added to Allinea's DDT 2.6 (June 2010)


Compiles are slower

Executable uses more disk space

Inlining turned off
• 1.7% average slow down of all SPEC2007MPI tests
• Range of slight speedup to 19.5% slow down

Uses more memory
• 4% larger at start up
• 0.0001% larger after computation
