North Carolina Supercomputing Center
NCSC
Introduction to the Origin2400
Course Outline
Origin2400 Architecture
Code development and optimization tools
Cache optimization
User Environment
Memory Types
[Diagram: distributed memory (each CPU has its own memory) and shared memory (several CPUs attached to common memory)]
Origin2400 Architecture
ccNUMA: cache coherent, non-uniform memory access
Physically distributed, globally addressable memory
Hardware cache coherence
Scalable shared memory systems: easy to program, easy to scale (to a point)
Bus-based shared memory systems: easy to program, hard to scale
Massively parallel distributed memory systems: hard to program, easy to scale
Node
Two R12000 Processors (400MHz)
64MB-4GB memory (1 GB)
Hub (interface)
[Diagram: two processors (P) and memory attached to a hub]
System Scaling
[Diagrams: nodes (hub, memory, two processors) connected through routers (R); successive slides show the system growing as nodes and routers are added]
Origin Node Board
Two R12000 processors
1 GB main memory
Additional directory memory - used for cache coherence
Sockets for extra directory memory for systems with more than 32 processors
Hub interconnect chip
[Diagram: hub connecting two R12000 processors (each with its L2 cache), memory, directory (extra directory for >32 processors), XIO, and NUMALink]
Origin Module
Each router has six connections, two to nodes and four to other routers.
Systems with 32 or fewer processors will have extra router ports available and can use these for “express” links
[Diagram: Node 0 through Node 3, Router 0 and Router 1, and two XBOWs, connected by NUMALink and XIO]
Cache Coherence
Directory maintains state information for each L2 cache line in memory.
States
  unowned - not cached
  exclusive - 1 r/w copy
  shared - 1+ r/o copies
  poisoned - migrated to another node
Directory includes a bit vector indicating processors with a copy of the cache line
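As a rough illustration of the bookkeeping described above, the following Python sketch keeps one state and one sharer bit vector per cache line. It is a toy model, not the Origin protocol: the class and method names are invented for illustration, the poisoned state and all message traffic are left out, and the transition rules are simplifying assumptions.

```python
from enum import Enum


class State(Enum):
    UNOWNED = "unowned"      # not cached anywhere
    SHARED = "shared"        # one or more read-only copies
    EXCLUSIVE = "exclusive"  # exactly one read/write copy


class DirectoryEntry:
    """Toy directory entry for a single cache line (hypothetical model)."""

    def __init__(self):
        self.state = State.UNOWNED
        self.sharers = 0  # bit i set => processor i holds a copy

    def read(self, cpu):
        # A read adds the reader to the bit vector; an exclusive owner
        # is downgraded so that every holder has a read-only copy.
        self.sharers |= 1 << cpu
        self.state = State.SHARED

    def write(self, cpu):
        # A write invalidates all other copies and leaves one r/w copy.
        invalidated = [i for i in range(self.sharers.bit_length())
                       if (self.sharers >> i) & 1 and i != cpu]
        self.sharers = 1 << cpu
        self.state = State.EXCLUSIVE
        return invalidated


entry = DirectoryEntry()
entry.read(0)
entry.read(3)
print(entry.state.value, bin(entry.sharers))
print(entry.write(3), entry.state.value)
```

The bit vector is what lets the directory send invalidations only to the processors that actually hold a copy, rather than broadcasting to every node.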
[Diagram: two nodes (hub, memory with directory, two processors with caches) joined by the interconnection network]
Cache Architecture
L1 D-cache is 2-way set associative, LRU, writeback, 8 word lines, non-blocking
L2 cache is 2-way set associative, LRU, writeback, 32 word lines, non-blocking, 8 MB
[Diagram: R12000 with 32 KB I-cache and 32 KB D-cache feeding the instruction and data registers, backed by the L2 cache; path labels 128, 128, and 64; ~10 cycles/miss and ~60+ cycles/miss; 780 MB/s]
Translation Lookaside Buffer
TLB is used to translate virtual addresses to physical addresses
R10000 TLB has 64 entries. Each entry can translate addresses for 2 pages (default page size is 16KB)
TLB miss costs about the same as a cache miss and causes similar performance issues.
Origin2000 Bandwidths and Latencies
Bandwidths in GB/s, latencies in ns
Memory columns: Physical Peak / Payload Peak / Read

I/O:Nodes:CPUs   Memory                 I/O             Max Latency   Ave Latency
1:1:2            0.78 / 0.68 / 0.59     1.56 / 1.25     313           313
2:2:4            1.56 / 1.37 / 1.19     3.12 / 2.5      497           405
2:4:8            3.12 / 2.73 / 2.38     6.24 / 4.99     601           528
4:8:16           6.24 / 5.47 / 4.75     12.48 / 9.98    703           641
8:16:32          12.48 / 10.94 / 9.5    24.96 / 19.97   805           710
R12000 Architecture
Superscalar
  400 MHz clock
  4 instructions/cycle
Cache
  8 MB L2 cache
  dedicated cache bus
  interleaved cache access
  non-blocking
Out-of-order execution
  3 instruction queues
Branch prediction
R12000 Architecture
Superscalar Architecture
Fetch/decode up to 4 instructions/cycle
Execute up to 4 instructions/cycle from 5 execution units
  Load/store
  ALU1
  ALU2
  FPADD
  FPMUL
Instruction set binary compatible with R8000 and R4000
32-bit and 64-bit instructions
32 integer registers
32 floating-point registers
Instruction Latencies – O2000
Latency and repeat rate in cycles

Operation                                   Latency   Repeat Rate
Load/Store
  load                                      2-3       1
  store                                     1         1
Integer
  ALU1: add, sub, logic, shift, branches    1         1
  ALU2: add, sub, logic                     1         1
  multiply (32-/64-bit)                     6/10      6/10
  divide (32-/64-bit)                       35/67     35/67
Floating Point
  add, compare                              2         1
  multiply                                  2         1
  multiply-add                              4         1
  divide (single/double)                    12/19     14/21
  sqrt (single/double)                      18/33     20/35
  rsqrt (single/double)                     30/52     20/35
Origin2400 Architecture References
www.sgi.com/origin/2000
techpubs.sgi.com
Code Development and Optimization Tools
Code Porting/Optimization Objectives
Get the right answers
Identify resource consuming code sections
Utilize optimized system libraries
Let the compiler do the work
Porting Issues (getting the right answers)
Application Binary Interface (ABI)
  32
  n32 (default/recommended for codes <2 GB total memory)
  64 (required for codes with >2 GB total memory)
Instruction Set Architecture (ISA)
  mips2
  mips3
  mips4 (default)
Defaults are read from the file /etc/compiler.defaults
Profiling Tools
perfex - overall code performance
SpeedShop - procedure level performance data
dprof - memory access patterns
R12000 Hardware Performance Registers
Can select from 32 events
Two counter registers (can fully count two events per code execution)
0 - cycles
1 - issued instructions
2 - issued loads
3 - issued stores
4 - issued conditionals
5 - failed conditionals
6 - branches resolved
7 - quadwords written back from s-cache
8 - s-cache data errors (ECC)
9 - I-cache misses
10 - L2 cache miss - instruction
11 - instruction misprediction
12 - external interventions
13 - external invalidations
14 - function unit completion cycles
15 - graduated instructions
R12000 Hardware Performance Registers
Each counter can be set to count one of 16 events
counter 0 can count events 0-15
counter 1 can count events 16-31
Counter registers are 32 bit registers. Can be set to generate an interrupt on overflow.
16 - cycles
17 - graduated instructions
18 - graduated loads
19 - graduated stores
20 - graduated store conditionals
21 - graduated floating-point instructions
22 - quadwords written back from d-cache
23 - TLB misses
24 - mispredicted branches
25 - d-cache misses
26 - s-cache misses - data
27 - data misprediction
28 - external intervention s-cache hits
29 - external invalidation s-cache hits
30 - store/prefetch excl to clean block
31 - store/prefetch excl to shared block
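The split of events between the two counters decides which pairs can be measured exactly in one run. The sketch below, with helper names invented for illustration, maps an event number to its counter and estimates how quickly a 32-bit counter wraps at the 400 MHz clock rate, which is presumably why the interrupt-on-overflow option matters.

```python
CLOCK_HZ = 400e6          # R12000 clock rate from the slides
COUNTER_BITS = 32         # counter registers are 32 bits wide


def counter_for(event):
    """Return the counter (0 or 1) that can count the given event number."""
    if not 0 <= event <= 31:
        raise ValueError("event numbers run from 0 to 31")
    return event // 16


def countable_together(event_a, event_b):
    """Two events can be counted exactly in one run only on different counters."""
    return counter_for(event_a) != counter_for(event_b)


def seconds_to_overflow(events_per_second):
    """Time until a 32-bit counter wraps at the given event rate."""
    return 2 ** COUNTER_BITS / events_per_second


print(countable_together(0, 26))    # cycles and L2 data misses: different counters
print(countable_together(25, 26))   # L1 and L2 data misses: both on counter 1
print(round(seconds_to_overflow(CLOCK_HZ), 1))  # seconds until a cycle counter wraps
```

Cycles and graduated instructions appear in both halves of the list (events 0 and 16, 15 and 17), so either can be paired with any event on the other counter.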
perfex
No special compilation needed
Can monitor two counters exactly, OR
Can monitor all counters (each 1/16th of the time); values are then multiplied by 16 to approximate full counts
Option to convert counts to estimated times
% perfex -a -y -o data code.x
  -a: all counters
  -y: estimate times
  -o data: redirect output
perfex Output
Based on 250 MHz IP27 Event definitions for cpu version 3.x
Typical
Event Counter Name Counter Value Time (sec)
=========================================================================================
0 Cycles...................................................... 898600299008 3594.401196
16 Cycles...................................................... 898600299008 3594.401196
26 Secondary data cache misses................................. 7034639424 2124.461106
7 Quadwords written back from scache.......................... 18935563200 484.750418
25 Primary data cache misses................................... 7449172608 268.468181
2 Issued loads................................................ 59030982976 236.123932
14 ALU/FPU forward progress cycles............................. 48181262304 192.725049
18 Graduated loads............................................. 46436171712 185.744687
3 Issued stores............................................... 19988999248 79.955997
22 Quadwords written back from primary data cache.............. 4971802640 76.565761
19 Graduated stores............................................ 18055579056 72.222316
6 Decoded branches............................................ 5225243088 20.900972
21 Graduated floating point instructions....................... 2699848928 10.799396
24 Mispredicted branches....................................... 1033609888 5.870904
9 Primary instruction cache misses............................ 374656 0.027005
Edited for presentation
perfex Output
23 TLB misses.................................................. 1904 0.000519
10 Secondary instruction cache misses.......................... 256 0.000077
4 Issued store conditionals................................... 160 0.000001
20 Graduated store conditionals................................ 32 0.000000
30 Store/prefetch exclusive to clean block in scache........... 32 0.000000
1 Issued instructions......................................... 147707069072 0.000000
5 Failed store conditionals................................... 0 0.000000
8 Correctable scache data array ECC errors.................... 0 0.000000
11 Instruction misprediction from scache way prediction table.. 512 0.000000
12 External interventions...................................... 2525856 0.000000
13 External invalidations...................................... 7415216 0.000000
15 Graduated instructions...................................... 136445826704 0.000000
17 Graduated instructions...................................... 136469377216 0.000000
27 Data misprediction from scache way prediction table......... 804101376 0.000000
28 External intervention hits in scache........................ 1744336 0.000000
29 External invalidation hits in scache........................ 3193680 0.000000
31 Store/prefetch exclusive to shared block in scache.......... 0 0.000000
perfex Output
Statistics
=========================================================================================
Graduated instructions/cycle................................................ 0.151843
Graduated floating point instructions/cycle................................. 0.003005
Graduated loads & stores/cycle.............................................. 0.071769
Graduated loads & stores/floating point instruction......................... 23.887170
Mispredicted branches/Decoded branches...................................... 0.197811
Graduated loads/Issued loads................................................ 0.786641
Graduated stores/Issued stores.............................................. 0.903276
Data mispredict/Data scache hits............................................ 1.939776
Instruction mispredict/Instruction scache hits.............................. 0.001368
L1 Cache Line Reuse......................................................... 7.657572
L2 Cache Line Reuse......................................................... 0.058927
L1 Data Cache Hit Rate...................................................... 0.884494
L2 Data Cache Hit Rate...................................................... 0.055648
Time accessing memory/Total time............................................ 0.737507
Time not making progress (probably waiting on memory) / Total time.......... 0.946382
L1--L2 bandwidth used (MB/s, average per process)........................... 88.449327
Memory bandwidth used (MB/s, average per process)........................... 334.799259
MFLOPS (average per process)................................................ 0.751126
Not good
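Several of the statistics above can be recomputed from the raw counter values in the event listing. The sketch below uses formulas inferred by matching the printed numbers; they reproduce this output to the printed precision but are not taken from the perfex documentation.

```python
CLOCK_HZ = 250e6  # the sample output is "Based on 250 MHz IP27"

# Counter values copied from the perfex event listing (event number: count).
counts = {
    0: 898600299008,    # cycles
    15: 136445826704,   # graduated instructions
    18: 46436171712,    # graduated loads
    19: 18055579056,    # graduated stores
    21: 2699848928,     # graduated floating point instructions
    25: 7449172608,     # primary data cache misses
    26: 7034639424,     # secondary data cache misses
}

seconds = counts[0] / CLOCK_HZ
loads_stores = counts[18] + counts[19]

stats = {
    "graduated instructions/cycle": counts[15] / counts[0],
    "graduated loads & stores/cycle": loads_stores / counts[0],
    "L1 data cache hit rate": 1 - counts[25] / loads_stores,
    "L2 data cache hit rate": 1 - counts[26] / counts[25],
    "MFLOPS": counts[21] / seconds / 1e6,
}

for name, value in stats.items():
    print(f"{name:32s} {value:.6f}")
```

Read this way, the L2 hit rate of about 0.056 says that nearly every L1 miss goes on to miss in L2 as well, which is consistent with the large share of time spent accessing memory.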
SpeedShop
No special compilation needed
Provides the following types of profiling
  Program counter sampling
  Ideal time
  User time
  Hardware counter profiling
  Floating-point exception tracing
  Heap tracing
SpeedShop
PC Sampling
Provides estimate of time spent by each function in executable
Two step process:
  execute code with ssrun
  use prof to examine results
% ssrun -pcsamp prog
% prof prog.pcsamp.4324
pcsamp output
Summary of statistical PC sampling data (pcsamp)--
13060: Total samples
130.600: Accumulated time (secs.)
10.0: Time per sample (msecs.)
2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 58.230 44.6% 44.6% 5823 zaver (prog: prog.f, 69)
[2] 37.490 28.7% 73.3% 3749 yaver (prog: prog.f, 50)
[3] 34.460 26.4% 99.7% 3446 xaver (prog: prog.f, 31)
[4] 0.420 0.3% 100.0% 42 main (prog: prog.f, 1)
130.600 100.0% 100.0% 13060 TOTAL
SpeedShop
Ideal time
Estimates best possible time the code could achieve - by routine
Useful for identifying routines with cache problems
% ssrun -ideal prog
beginning libraries
/usr/lib32/libssrt.so
/usr/lib32/libftn.so
/usr/lib32/libm.so
ending libraries, beginning prog
% prof prog.ideal.3453
ideal output
Summary of ideal time data (ideal)--
23468025764: Total number of instructions executed
26959868891: Total computed cycles
107.839: Total computed execution time (secs.)
1.149: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------
[index] excl.secs excl.% cum.% cycles instructions calls function (dso: file, line)
[1] 36.133 33.5% 33.5% 9033236300 7740175400 100 zaver (prog: prog.f, 69)
[2] 35.737 33.1% 66.6% 8934236300 7839175400 100 xaver (prog: prog.f, 31)
[3] 35.737 33.1% 99.8% 8934236300 7839175400 100 yaver (prog: prog.f, 50)
[4] 0.221 0.2% 100.0% 55184326 46134726 1 main (prog: prog.f, 1)
Hundreds more lines of library calls omitted
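One way to use the ideal times is to compare them with the sampled times from the pcsamp listing. The sketch below assumes both listings come from the same program and input, which the slides suggest but do not state; a ratio well above 1 marks a routine that runs much slower than its instruction schedule alone would predict.

```python
# (pcsamp seconds, ideal seconds) per function, copied from the two listings
times = {
    "zaver": (58.230, 36.133),
    "yaver": (37.490, 35.737),
    "xaver": (34.460, 35.737),
}

SAMPLE_SECONDS = 0.010  # pcsamp reports 10.0 msecs per sample

# Sanity check: 5823 samples at 10 ms each gives zaver's 58.23 s.
print(round(5823 * SAMPLE_SECONDS, 3))

for name, (measured, ideal) in times.items():
    print(f"{name}: measured/ideal = {measured / ideal:.2f}")
```

All three routines execute nearly the same number of instructions, yet zaver takes about 1.6 times its ideal time while the other two stay close to 1, so zaver is the natural place to look for cache problems.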
SpeedShop
Hardware Counter Profiling
prof_hwc
  Counter selected with the environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER
Most commonly used counters have experiment names
  gi_hwc – graduated instructions
  cy_hwc – cycles
  ic_hwc – L1 Icache miss
  isc_hwc – L2 Icache miss
  dc_hwc – L1 Dcache miss
  dsc_hwc – L2 Dcache miss
  tlb_hwc – TLB miss
  gfp_hwc – graduated FP instructions
SpeedShop
The –b or –gprof options to prof will generate a dynamic calling tree.
Each procedure is listed with the procedures it calls and the procedures that call it.
WorkShop
One of the WorkShop tools, cvperf, provides a GUI for viewing SpeedShop experiment results
Workshop
ssusage
  SpeedShop program that runs an executable and prints the resources used
  Useful for finding out memory use
  % ssusage mypgm
WorkShop
WorkShop also includes a debugger, cvd
The common UNIX debugger, dbx, is also available
WorkShop
Other WorkShop components include
  cvbuild – build dependency analyzer
  cvstatic – static source analyzer
  cvpav – parallel analysis for MP Fortran programs
WorkShop can be configured to work with a source code revision control system (see cvconfig)
Performance Libraries
fastm
  Fast transcendental library
  Link with -lfastm
  Faster results at the cost of some accuracy
  See man libfastm
SCSL
  Scientific Computing Software Library
  See man intro_scsl and the man pages referenced therein
  Signal processing including FFT, correlation, convolution
  LAPACK
  Linear solvers
  Matrix and vector routines
Compilers
MIPSpro compilers
  CC
  cc
  f90
  f77
Optimizations
  Software pipelining (SWP)
  Inter-procedural analysis (IPA)
  Loop nest optimizations (LNO)
Compilers
-O[n]
  0 => no optimization; use only for debugging (default!)
  1 => simple optimizations
  2 => conservative optimizations, should not alter results
  3 => SWP, LNO, and other aggressive optimizations, may alter results
  fast => -O3 -IPA -OPT:roundoff=3:alias=typed
If just -O is specified, -O2 is invoked
Compilers
-OPT
  IEEE_arithmetic=n – conformance with IEEE floating-point arithmetic
    1 (default) compliant
    2 inexact results may differ (not-a-number, infinity)
    3 allows arbitrary, mathematically valid transformations
  roundoff=n – acceptable round-off-altering optimization, 0-3, where 0 is none and 3 is any
  alias=n – pointer aliasing model
Compilers
-OPT:alias=<name>
  ANY, COMMON_SCALAR: ANY is the default
  TYPED, NO_TYPED: pointers of different base types point to distinct objects
  UNNAMED, NO_UNNAMED: pointers never point to named objects
  RESTRICT, NO_RESTRICT: distinct pointers point to distinct, non-overlapping objects
  parm, no_parm: Fortran only
Do not lie to the compiler!
Compilers
Software Pipelining
do i=1,n
y(i) = y(i) + a*x(i)
enddo
Each loop iteration contains
  2 loads, 1 store
  1 multiply-add
  2 address increments
  Loop end test, branch
Superscalar processor slots
  1 load/store
  1 ALU1, 1 ALU2
  1 FP add
  1 FP multiply
Software Pipelining
[Schedule diagram: the operations Load x, Load y, x++, madd, Store y, branch, and y++ placed in the load/store, ALU1, ALU2, FP ADD, and FP MUL slots over clock cycles 0-7]
2 flop / 8 cycles achieved
16 flop / 8 cycles peak
Running at 1/8th of peak performance
Software Pipelining
Pipelined daxpy
Load/store is bottleneck
Optimize to fully utilize load/store unit
[Schedule diagram: load/store, FP ADD, and FP MUL slots over clock cycles 0-13]
8 flop / 14 cycles achieved
28 flop / 14 cycles peak
Running at better than 1/4th of peak performance
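The fractions quoted on these two slides follow from simple ratios. The sketch below recomputes them and adds the load/store bound implied by the slot counts given earlier (2 loads and 1 store per iteration, one load/store slot per cycle); the function names are illustrative, and the MFLOPS figures assume the 400 MHz clock.

```python
from fractions import Fraction

CLOCK_MHZ = 400          # R12000 clock from the slides
PEAK_FLOP_PER_CYCLE = 2  # one FP add + one FP multiply per cycle


def fraction_of_peak(flop, cycles):
    """Achieved flop per cycle as a fraction of the 2 flop/cycle peak."""
    return Fraction(flop, cycles) / PEAK_FLOP_PER_CYCLE


def mflops(flop, cycles):
    """Achieved rate in MFLOPS at the assumed clock."""
    return float(Fraction(flop, cycles) * CLOCK_MHZ)


simple = fraction_of_peak(2, 8)       # unpipelined schedule
pipelined = fraction_of_peak(8, 14)   # software-pipelined schedule

# daxpy needs 2 loads + 1 store per iteration (2 flop) and only one
# load/store can issue per cycle, so 3 cycles per iteration is the floor.
bound = fraction_of_peak(2, 3)

print(simple, pipelined, bound)
print(mflops(2, 8), round(mflops(8, 14), 1), round(mflops(2, 3), 1))
```

The 14-cycle schedule includes pipeline fill and drain; in steady state the loop is limited by the load/store unit to about one third of peak, which is why the slide calls load/store the bottleneck.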
Software Pipelining
Use -O3 to enable pipelining
Vectorizable loops are well suited for pipelining
SWP cannot be done if the loop contains
  Function calls
  Complicated conditionals
  Branching
SWP is impeded by
  Recurrences between iterations (can use IVDEP directive)
  Very long loops (split loop)
  Register overflow (split loop)
SWP algorithms are heuristic
  Schedules are not unique
  Finding a schedule may be computationally expensive
Inter-Procedural Analysis
Analyzes entire program
Precedes other optimizations
Performs optimizations across procedure boundaries
Invoke with -IPA
Compile step will finish quickly – link step will take much longer
If any procedure changes, the full program must be recompiled
Inlining
IPA provides automatic inlining with preference to
  Small procedures
  Calls in innermost loops
  Leaf routines
  Frequent calls
Manual inlining using command line option -INLINE
Routines must be in same file
Only inlines specified routines
Inlining
Benefits
  Exposes larger context for later optimization
  Eliminates call overhead
Costs
  Longer compile time
  Additional contention for registers
  Larger code size
Restrictions
  no mismatched parameter types
  no static local variables
  no recursive routines
Cache Optimization
L2 Cache Organization
2-way set associative, i.e. each memory address can be in one of 2 different cache lines
Cache line is 128 bytes, e.g. 16 x 8 bytes or 32 x 4 bytes
Least recently used (LRU) replacement strategy
Shared instruction and data cache
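To make the consequences of this organization concrete, the sketch below computes the number of sets and the address spacing at which lines start competing for the same set. It is a simplification: it indexes the cache with the address as given and ignores virtual-to-physical translation, and the helper name is invented for illustration.

```python
CACHE_BYTES = 8 * 1024 * 1024   # 8 MB L2 cache
WAYS = 2                        # 2-way set associative
LINE_BYTES = 128                # 128-byte cache lines

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of sets
WAY_BYTES = NUM_SETS * LINE_BYTES               # spacing of aliasing addresses


def set_index(address):
    """Set selected by a byte address (simplified: uses the address directly)."""
    return (address // LINE_BYTES) % NUM_SETS


# Three arrays whose starting addresses are WAY_BYTES apart compete for
# the same set, but a set holds only 2 lines, so one is always evicted.
starts = [0, WAY_BYTES, 2 * WAY_BYTES]
print(NUM_SETS, WAY_BYTES)
print([set_index(a) for a in starts])

# Padding each array by one cache line moves them to different sets.
padded = [i * (WAY_BYTES + LINE_BYTES) for i in range(3)]
print([set_index(a) for a in padded])
```

Under these assumptions there are 32768 sets and addresses 4 MB apart collide, which is one reason large power-of-2 array dimensions tend to behave badly in a 2-way cache.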
Cache Organization
[Diagram: a memory address with its offset field, the cache lines of Set 0 and Set 1, and memory]
Cache Basics
Access data with stride one wherever possible
Group data to be used together
Avoid power-of-2 array dimensions
Standard Cache Optimization
Small stride – order loops so that innermost loop has smallest stride
Padding – pad leading dimensions of arrays to prevent overlap in cache and/or add padding between arrays in common blocks
Loop fusion – join small loops to increase cache reuse
Cache Blocking
DO J = 1, N
  DO I = 1, M
    DO K = 1, L
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO
M,N,L sec MFLOPS
-------- ----- -------------
30 1.6e-4 333.9
200 5.7e-2 282.6
1000 25.4 78.6
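The falling MFLOPS figures for larger matrices are what blocking is meant to address. The sketch below shows only the loop structure of a blocked version of the same multiply, with a check that it produces the same result as the plain triple loop. It is written with Python lists, so it says nothing about speed, and the block size `bs` is an arbitrary illustrative value rather than a tuned one.

```python
def matmul_naive(a, b, m, n, l):
    """Plain triple loop in the same J, I, K order as the Fortran above."""
    c = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, m, n, l, bs=2):
    """Same arithmetic, but visited one bs x bs tile at a time."""
    c = [[0.0] * n for _ in range(m)]
    for jj in range(0, n, bs):
        for kk in range(0, l, bs):
            for ii in range(0, m, bs):
                for j in range(jj, min(jj + bs, n)):
                    for i in range(ii, min(ii + bs, m)):
                        for k in range(kk, min(kk + bs, l)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


m, n, l = 3, 4, 5
a = [[float(i + k) for k in range(l)] for i in range(m)]
b = [[float(k * j + 1) for j in range(n)] for k in range(l)]
print(matmul_blocked(a, b, m, n, l) == matmul_naive(a, b, m, n, l))
```

On real hardware the tile size would be chosen so that the tiles of A, B, and C being worked on fit in cache together, so each loaded cache line is reused many times before it is evicted.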
TLB Misses
Caused by too few entries for amount of data to be mapped
Increasing the page size allows fixed number of TLB entries to map larger amount of data
IRIX allows two page sizes: 16KB (default) and one larger page size
dplace command allows selection of a larger page size (see man dplace)
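The effect of page size on TLB coverage is a one-line calculation. The sketch below uses the figures given earlier in the deck (64 entries, 2 pages per entry, 16 KB default pages); the 1 MB page size is a hypothetical value chosen only to show the scaling, since the slides do not say which larger size is configured.

```python
TLB_ENTRIES = 64        # entries in the TLB
PAGES_PER_ENTRY = 2     # each entry translates addresses for 2 pages
KB = 1024


def tlb_reach_bytes(page_bytes):
    """Amount of memory the TLB can map at once for a given page size."""
    return TLB_ENTRIES * PAGES_PER_ENTRY * page_bytes


default_reach = tlb_reach_bytes(16 * KB)      # default 16 KB pages
large_reach = tlb_reach_bytes(1024 * KB)      # hypothetical 1 MB pages

print(default_reach // (KB * KB), "MB")
print(large_reach // (KB * KB), "MB")
```

With the default page size the TLB covers 2 MB, less than the 8 MB L2 cache, so a working set can fit in cache and still cause TLB misses.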
Loop Nest Optimization (LNO)
Improve cache use and instruction scheduling with loop transformations
  Loop interchange
  Padding
  Loop fusion
  Cache blocking
  Prefetching
  Loop unrolling
Run by default with -O3 or -Ofast
Disable with -LNO:opt=0
Endless opportunity to tune each optimization individually with directives and flags
Loop Unrolling
Compiler option -LNO:outer_unroll=n
Directives
  Fortran: c*$* unroll(n)
  C: #pragma unroll(n)
Loop Interchange
Compiler option -LNO:interchange=off
Directives
  Fortran: C*$* no interchange, C*$* interchange(i,j,k)
  C: #pragma no interchange, #pragma interchange(i,j,k)
Cache Blocking
LNO automatically blocks loop nests to fit cache
To disable
  -LNO:blocking=off
  C*$* no blocking
  #pragma no blocking
Can also provide input on blocking size and cache model (see man LNO)
Disable blocking if
  the loop nest already fits in cache (to save blocking overhead)
  blocking is causing poor performance
Padding
LNO automatically pads locally allocated arrays
For -O3 and -Ofast, LNO automatically pads common blocks
Each routine containing the common block must be compiled with the same option
Code must not violate the Fortran standard
Disable common block padding with -OPT:reorg_common=off
Single Processor Tuning Summary
Use perfex and SpeedShop to analyze code
Choose best ISA and ABI (-mips4 -n32)
Use optimized libraries -lfastm -lscs
Inline small procedures or use IPA automatic inlining
Check compiler messages for time consuming loops; may be able to improve with -OPT or -LNO directives
Minimize cache and TLB misses
Use stride one memory accesses (or smallest possible)
Avoid power-of-2 array dimensions
Increase page size to reduce TLB misses
NCSC Origin2400 User Environment
System
48 400 MHz R12000 processors
8MB L2 cache/processor
24 GB memory
> 1 TB fast local disks
sonoma.ncsc.org
Storage
Home directory
  Fairly small quota (100 MB)
/tmp
  Temporary storage for executing jobs
  Not backed up
  Periodic purge
/dmf
  Mass storage system
  Local to sonoma
  dmls
  dmget
  Each user has a dmf account
Interactive Jobs
Interactive limits are imposed using software developed at NCSA
Interactive limits are
  30 CPU minutes
  512 MB memory
  4 processors
  Subject to change
Batch Jobs
Jobs too large to be run interactively must be submitted to the batch system
NQE is the current batch system
Create a batch request script using your favorite editor (emacs is a good choice, but jot and vi are also available)
Use the qsub command to submit the job to the batch queue
Request resources needed for the job:
  CPU time
  Memory
  Processors
Batch Request Script
Text File
Execution will begin in your home directory
Will execute your environment files by default
#QSUB -lT 7200
#QSUB -lM 1024mb
#QSUB -l mpp_p=8
setenv OMP_NUM_THREADS 8
setenv OMP_DYNAMIC false
cd /tmp/user
cp ~/executable .
./executable
mv results /dmf/edu/user
rm *
NQE
qsub
qsub -lT 7200 -lM 1024MB \
-l mpp_p=8 script.q
qstat -au $user
qdel <xxxxx>
qdel -k <xxxxx>
qs
qstat -b
qstat -f <queue_name>
At the end of the job, standard output, standard error, and the NQE log are returned as files to the directory from which the qsub command was issued
Use -o and -eo to override this behavior
Here are some options I like …
#! /bin/csh -f
# name the request rather than default to script name
#QSUB -r myOpenMP_job
#QSUB -lT 0:15:00
#QSUB -lM 1GB
#QSUB -l mpp_p=4
# send mail to [email protected] when the job ends
#QSUB -me -mu [email protected]
# redirect standard error and output
#QSUB -o batch.log -eo
date
cd $QSUB_WORKDIR
#specify number of processors to run on
setenv OMP_NUM_THREADS 4
# run the job
./a.out
date
Parallel Program Issues
Multiple programming models and APIs are supported
Many are out-of-date and have been superseded by newer models
Many use environment variables for control information
The man page pe_environ gives an up-to-date list of all these environment variables
Parallel Program Issues
Number of processors for shared memory executables
  OMP_NUM_THREADS
Number of processors is “dynamic” by default (based on number of idle processors)
  This can have undesirable side effects and may be disabled by setting OMP_DYNAMIC to FALSE
Running MPI jobs
Use mpirun (see the man page)
  mpirun -np 8 mypgm
Use the -cpr flag to checkpoint batch jobs
Running with perfex in batch
  mpirun -cpr -np 8 perfex -a -y mypgm
  similarly for ssusage, ssrun
Checkpointing
Executing jobs are checkpointed by the system at regular intervals
Some jobs will not successfully checkpoint
  3rd party applications using the FLEXlm license manager
QSUB option -nc will prevent checkpointing
mpirun option -cpr is required to enable checkpointing of MPI jobs