24
Advanced High Performance Computing Workshop HPC 201 Dr. Charles J Antonelli LSAIT RSG October 30, 2013

Advanced High Performance Computing Workshop HPC 201 Dr. Charles J Antonelli LSAIT RSG October 30, 2013

Embed Size (px)

Citation preview

Advanced High Performance

Computing WorkshopHPC 201

Dr. Charles J AntonelliLSAIT RSG

October 30, 2013

cja 2013 2

RoadmapFlux review

More advanced troubleshooting

Array & dependent scheduling

Graphical output

GPUs on Flux

Scientific applicationsR, Python, MATLAB

Parallel programming

Debugging & tracing2

11/13

cja 2013 3

Flux review

11/13

cja 2013 4

The Flux clusterLogin nodes Compute nodes

Storage…

Data transfernode

10/13

cja 2013 5

A Flux node

12-40 Intel cores

48 GB – 1 TB RAM

Local disk

10/13

8 GPUs (optional)

cja 2013 6

Programming Models

Two basic parallel programming modelsMessage-passingThe application consists of several processes running on different nodes and communicating with each other over the network

Used when the data are too large to fit on a single node, and simple synchronization is adequate

“Coarse parallelism”

Implemented using MPI (Message Passing Interface) libraries

Multi-threadedThe application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives

Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable

“Fine-grained parallelism” or “shared-memory parallelism”

Implemented using OpenMP (Open Multi-Processing) compilers and libraries

Both

10/13

cja 2013 7

Using Flux

Three basic requirements:A Flux login accountA Flux allocationAn MToken (or a Software Token)

Logging in to Fluxssh flux-login.engin.umich.eduCampus wired or MwirelessVPNssh login.itd.umich.edu first

10/13

cja 2013 8

Cluster batch workflow

You create a batch script and submit it to PBS

PBS schedules your job, and it enters the flux queue

When its turn arrives, your job will execute the batch script

Your script has access to all Flux applications and data

When your script completes, anything it sent to standard output and error are saved in files stored in your submission directory

You can ask that email be sent to you when your jobs starts, ends, or aborts

You can check on the status of your job at any time,or delete it if it’s not doing what you want

A short time after your job completes, it disappears from PBS

10/13

cja 2013 9

Loosely-coupled batch script

#PBS -N yourjobname#PBS -V#PBS -A youralloc_flux#PBS -l qos=flux#PBS -q flux#PBS –l procs=12,walltime=00:05:00#PBS -M youremailaddress#PBS -m abe#PBS -j oe

#Your Code Goes Below:cat $PBS_NODEFILEcd $PBS_O_WORKDIRmpirun ./c_ex01

10/13

cja 2013 10

Tightly-coupled batch script

#PBS -N yourjobname#PBS -V#PBS -A youralloc_flux#PBS -l qos=flux#PBS -q flux#PBS –l nodes=1:ppn=12,walltime=00:05:00#PBS -M youremailaddress#PBS -m abe#PBS -j oe

#Your Code Goes Below:cd $PBS_O_WORKDIRmatlab -nodisplay -r script

10/13

cja 2013 11

Copying dataThree ways to copy data to/from Flux

Use scp from login server:scp flux-login.engin.umich.edu:hpc201/example563.png .

Use scp from transfer host:scp flux-xfer.engin.umich.edu:hpc201/example563.png .

Use Globus Connect

10/13

12

Globus OnlineFeatures

High-speed data transfer, much faster than SCP or SFTP

Reliable & persistent

Minimal client software: Mac OS X, Linux, Windows

GridFTP EndpointsGateways through which data flow

Exist for XSEDE, OSG, …

UMich: umich#flux, umich#nyx

Add your own client endpoint!

Add your own server endpoint: contact [email protected]

More informationhttp://cac.engin.umich.edu/resources/login-nodes/globus-gridftp

10/13cja 2013

cja 2013 13

Job Arrays• Submit copies of identical jobs• Invoked via qsub –t:

qsub –t array-spec pbsbatch.txt

Where array-spec can be

m-n

a,b,c

m-n%slotlimit

e.g.

qsub –t 1-50%10 Fifty jobs, numbered 1 through 50,

only ten can run simultaneously

• $PBS_ARRAYID records array identifier

1311/13

cja 2013 14

Dependent scheduling

• Submit jobs whose scheduling depends on other jobs

• Invoked via qsub –W:qsub -W depend=type:jobid[:jobid]…

Where depend can beafter Schedule this job after jobids have startedafterany Schedule this job after jobids have finishedafterok Schedule this job after jobids have finished with no errors

afternotok Schedule this job after jobids have finished with errors

before jobids scheduled after this job startsbeforeanyjobids scheduled after this job completes

beforeok jobids scheduled after this job completes with no errorsbeforenotok jobids scheduled after this job completes with errors

1411/13

cja 2013 15

Troubleshootingshowq [-r][-i][-b][-w user=uniq] # running/idle/blocked jobs

qstat -f jobno # full info

qstat -n jobno # nodes/cores where job running

diagnose -p # job prio and components

pbsnodes # nodes, states, properties

pbsnodes -l # list nodes marked down

checkjob [-v] jobno # why job jobno not running

mdiag -a # allocs & users (flux)

tracejob jobno # info for compl jobs (not flux)

freenodes # aggregate node/core busy/free

mdiag -u uniq # allocs for uniq (flux)

mdiag -a alloc_flux # cores active, alloc (flux)

11/13

cja 2013 16

GPUs on Flux

11/13

cja 2013 17

Scientific applications

11/13

cja 2013 18

Parallel programming

11/13

cja 2013 19

Debugging & tracing

11/13

cja 2013 20

Debugging with GDB

Command-line debuggerStart programs or attach to running programsDisplay source program linesDisplay and change variables or memoryPlant breakpoints, watchpointsExamine stack frames

Excellent tutorial documentationhttp://www.gnu.org/s/gdb/documentation/

2011/13

cja 2013 21

GDB symbolsGDB requires program symbols to be generated by the compiler

GDB will work without symbolsBut you’d better be fluent in machine instructions and hexadecimal

Add –g flag to your compilation

gcc –g hello.c –o chello

gfortran –f hello.f90 –o fhello

Do not use –O with –g

Most compilers won’t optimize code for debugging

gcc and gfortran will, but you often won’t recognize the resulting source code

2111/13

cja 2013 22

Useful GDB commandsgdb exec start gdb on executable execgdb exec core start gdb on executable exec with core file corel [m,n] list sourcedisas disassemble function enclosing current instructiondisas func disassemble function funcb func set breakpoint at entry to funcb line# set breakpoint at source line#b *0xaddr set breakpoint at address addri b show breakpointsd bp# delete beakpoint bp#r [args] run program with optional argsbt show stack backtracec continue execution from breakpointstep single-step one source linenext single-step, don’t step into functionstepi single-step one instructionp var display contents of variable varp *var display value pointed to by varp &var display address of varp arr[idx] display element idx of array arrx 0xaddr display hex word at addrx *0xaddr display hex word pointed to by addrx/20x 0xaddr display 20 words in hex starting at addri r display registersi r ebp display register ebpset var = expression set variable var to expressionq quit gdb 11/13

cja 2013 23

Resourceshttp://cac.engin.umich.edu/started

Cluster news, RSS feed and outages listed here

http://cac.engin.umich.edu/Getting an account, training, basic tutorials

http://cac.engin.umich.edu/resources/systems/flux/Getting an allocation, Flux On-Demand, Flux info

For assistance: [email protected] by a team of people

Cannot help with programming questions, but can help with scheduler issues

11/13

cja 2013 24

References1. CAC supported Flux software,

http://cac.engin.umich.edu/resources/software/index.html, (accessed August 2011)

2. Free Software Foundation, Inc., “GDB User Manual,” http://www.gnu.org/s/gdb/documentation/ (accessed October 2011).

3. Infiniband, http://en.wikipedia.org/wiki/InfiniBand (accessed August 2011).4. Intel C and C++ Compiler 1.1 User and Reference Guide,

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/index.htm (accessed August 2011).

5. Intel Fortran Compiler 11.1 User and Reference Guide,http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm (accessed August 2011).

6. Lustre file system, http://wiki.lustre.org/index.php/Main_Page (accessed August 2011).

7. Torque User’s Manual, http://www.clusterresources.com/torquedocs21/usersmanual.shtml (accessed August 2011).

11/13