Upload
silas-dickerson
View
219
Download
1
Tags:
Embed Size (px)
Citation preview
Advanced High Performance
Computing WorkshopHPC 201
Dr. Charles J AntonelliLSAIT RSG
October 30, 2013
cja 2013 2
RoadmapFlux review
More advanced troubleshooting
Array & dependent scheduling
Graphical output
GPUs on Flux
Scientific applicationsR, Python, MATLAB
Parallel programming
Debugging & tracing2
11/13
cja 2013 6
Programming Models
Two basic parallel programming modelsMessage-passingThe application consists of several processes running on different nodes and communicating with each other over the network
Used when the data are too large to fit on a single node, and simple synchronization is adequate
“Coarse parallelism”
Implemented using MPI (Message Passing Interface) libraries
Multi-threadedThe application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
“Fine-grained parallelism” or “shared-memory parallelism”
Implemented using OpenMP (Open Multi-Processing) compilers and libraries
Both
10/13
cja 2013 7
Using Flux
Three basic requirements:A Flux login accountA Flux allocationAn MToken (or a Software Token)
Logging in to Fluxssh flux-login.engin.umich.eduCampus wired or MwirelessVPNssh login.itd.umich.edu first
10/13
cja 2013 8
Cluster batch workflow
You create a batch script and submit it to PBS
PBS schedules your job, and it enters the flux queue
When its turn arrives, your job will execute the batch script
Your script has access to all Flux applications and data
When your script completes, anything it sent to standard output and error are saved in files stored in your submission directory
You can ask that email be sent to you when your jobs starts, ends, or aborts
You can check on the status of your job at any time,or delete it if it’s not doing what you want
A short time after your job completes, it disappears from PBS
10/13
cja 2013 9
Loosely-coupled batch script
#PBS -N yourjobname#PBS -V#PBS -A youralloc_flux#PBS -l qos=flux#PBS -q flux#PBS –l procs=12,walltime=00:05:00#PBS -M youremailaddress#PBS -m abe#PBS -j oe
#Your Code Goes Below:cat $PBS_NODEFILEcd $PBS_O_WORKDIRmpirun ./c_ex01
10/13
cja 2013 10
Tightly-coupled batch script
#PBS -N yourjobname#PBS -V#PBS -A youralloc_flux#PBS -l qos=flux#PBS -q flux#PBS –l nodes=1:ppn=12,walltime=00:05:00#PBS -M youremailaddress#PBS -m abe#PBS -j oe
#Your Code Goes Below:cd $PBS_O_WORKDIRmatlab -nodisplay -r script
10/13
cja 2013 11
Copying dataThree ways to copy data to/from Flux
Use scp from login server:scp flux-login.engin.umich.edu:hpc201/example563.png .
Use scp from transfer host:scp flux-xfer.engin.umich.edu:hpc201/example563.png .
Use Globus Connect
10/13
12
Globus OnlineFeatures
High-speed data transfer, much faster than SCP or SFTP
Reliable & persistent
Minimal client software: Mac OS X, Linux, Windows
GridFTP EndpointsGateways through which data flow
Exist for XSEDE, OSG, …
UMich: umich#flux, umich#nyx
Add your own client endpoint!
Add your own server endpoint: contact [email protected]
More informationhttp://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
10/13cja 2013
cja 2013 13
Job Arrays• Submit copies of identical jobs• Invoked via qsub –t:
qsub –t array-spec pbsbatch.txt
Where array-spec can be
m-n
a,b,c
m-n%slotlimit
e.g.
qsub –t 1-50%10 Fifty jobs, numbered 1 through 50,
only ten can run simultaneously
• $PBS_ARRAYID records array identifier
1311/13
cja 2013 14
Dependent scheduling
• Submit jobs whose scheduling depends on other jobs
• Invoked via qsub –W:qsub -W depend=type:jobid[:jobid]…
Where depend can beafter Schedule this job after jobids have startedafterany Schedule this job after jobids have finishedafterok Schedule this job after jobids have finished with no errors
afternotok Schedule this job after jobids have finished with errors
before jobids scheduled after this job startsbeforeanyjobids scheduled after this job completes
beforeok jobids scheduled after this job completes with no errorsbeforenotok jobids scheduled after this job completes with errors
1411/13
cja 2013 15
Troubleshootingshowq [-r][-i][-b][-w user=uniq] # running/idle/blocked jobs
qstat -f jobno # full info
qstat -n jobno # nodes/cores where job running
diagnose -p # job prio and components
pbsnodes # nodes, states, properties
pbsnodes -l # list nodes marked down
checkjob [-v] jobno # why job jobno not running
mdiag -a # allocs & users (flux)
tracejob jobno # info for compl jobs (not flux)
freenodes # aggregate node/core busy/free
mdiag -u uniq # allocs for uniq (flux)
mdiag -a alloc_flux # cores active, alloc (flux)
11/13
cja 2013 20
Debugging with GDB
Command-line debuggerStart programs or attach to running programsDisplay source program linesDisplay and change variables or memoryPlant breakpoints, watchpointsExamine stack frames
Excellent tutorial documentationhttp://www.gnu.org/s/gdb/documentation/
2011/13
cja 2013 21
GDB symbolsGDB requires program symbols to be generated by the compiler
GDB will work without symbolsBut you’d better be fluent in machine instructions and hexadecimal
Add –g flag to your compilation
gcc –g hello.c –o chello
gfortran –f hello.f90 –o fhello
Do not use –O with –g
Most compilers won’t optimize code for debugging
gcc and gfortran will, but you often won’t recognize the resulting source code
2111/13
cja 2013 22
Useful GDB commandsgdb exec start gdb on executable execgdb exec core start gdb on executable exec with core file corel [m,n] list sourcedisas disassemble function enclosing current instructiondisas func disassemble function funcb func set breakpoint at entry to funcb line# set breakpoint at source line#b *0xaddr set breakpoint at address addri b show breakpointsd bp# delete beakpoint bp#r [args] run program with optional argsbt show stack backtracec continue execution from breakpointstep single-step one source linenext single-step, don’t step into functionstepi single-step one instructionp var display contents of variable varp *var display value pointed to by varp &var display address of varp arr[idx] display element idx of array arrx 0xaddr display hex word at addrx *0xaddr display hex word pointed to by addrx/20x 0xaddr display 20 words in hex starting at addri r display registersi r ebp display register ebpset var = expression set variable var to expressionq quit gdb 11/13
cja 2013 23
Resourceshttp://cac.engin.umich.edu/started
Cluster news, RSS feed and outages listed here
http://cac.engin.umich.edu/Getting an account, training, basic tutorials
http://cac.engin.umich.edu/resources/systems/flux/Getting an allocation, Flux On-Demand, Flux info
For assistance: [email protected] by a team of people
Cannot help with programming questions, but can help with scheduler issues
11/13
cja 2013 24
References1. CAC supported Flux software,
http://cac.engin.umich.edu/resources/software/index.html, (accessed August 2011)
2. Free Software Foundation, Inc., “GDB User Manual,” http://www.gnu.org/s/gdb/documentation/ (accessed October 2011).
3. Infiniband, http://en.wikipedia.org/wiki/InfiniBand (accessed August 2011).4. Intel C and C++ Compiler 1.1 User and Reference Guide,
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/index.htm (accessed August 2011).
5. Intel Fortran Compiler 11.1 User and Reference Guide,http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm (accessed August 2011).
6. Lustre file system, http://wiki.lustre.org/index.php/Main_Page (accessed August 2011).
7. Torque User’s Manual, http://www.clusterresources.com/torquedocs21/usersmanual.shtml (accessed August 2011).
11/13