Research Computing
UNIVERSITY OF COLORADO
The JANUS Computing Environment
Monte [email protected]
Thursday, June 21, 12
What is JANUS?
November, 2011
1,368 Compute nodes
16,416 processors
~ 20 GB of available space
~ 800 TB of storage
2.8 GHz Intel Westmere
TFLOPS is a rate of execution, trillions of floating point operations per second
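As a rough illustration of what a TFLOPS figure means, the sketch below estimates a theoretical peak from the core count and clock rate listed on this slide. The value of 4 double-precision floating point operations per core per cycle is an assumption about Westmere, not something stated in these slides, and sustained application performance is normally well below any such peak.

```bash
#!/bin/bash
# Theoretical peak (GFLOPS) = cores x clock in GHz x flops per cycle.
cores=16416            # from the slide
clock_ghz=2.8          # from the slide
flops_per_cycle=4      # ASSUMPTION: double-precision flops per core per cycle

# awk handles the floating point arithmetic; divide by 1000 for TFLOPS.
peak_tflops=$(awk -v c="$cores" -v g="$clock_ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.1f", c * g * f / 1000 }')
echo "Estimated peak: ${peak_tflops} TFLOPS"
```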
NUMA Architecture
Resource Management and “queues”
Parallel file systems
Different architectures
Explicit environment
Lots of ways to do something...
Online resources
www.rc.colorado.edu
Overview
Access
Login, file system, data transfer
Software
Supported software, dotkits, building software
Resource Management
Queues, Moab, and Torque
Running Jobs
Single-core, load-balanced, MPI, OpenMP
Questions
Access
Login Procedure
ssh <username>@login.rc.colorado.edu
Password: Yubikeys or Cryptocards
RC Filesystem
Home directory
/home/<user_name>
2 GB, Network File System (NFS)
Project space
/projects/<user_name>
250 GB, NFS
Build software here
Scratch space
/lustre/janus_scratch/<user_name>
No quota, no backup
Lustre file system
Run software here
Snapshot
Did you accidentally remove a file or directory?
$HOME/.snapshot/hourly.[0-12]
$HOME/.snapshot/nightly.[0-6]
$HOME/.snapshot/weekly.[0-7]
Example
rm $HOME/bugreport.csh
cp $HOME/.snapshot/weekly.0/bugreport.csh $HOME
Where?
$HOME/.snapshot
/projects/<user_name>/.snapshot
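The recovery step can be scripted. The sketch below defines a hypothetical helper, find_snapshot_copy, that walks the hourly, nightly, and weekly snapshot directories in turn and prints the first copy it finds. It assumes that lower numbers are more recent snapshots, which the slide does not state, and it runs against a throwaway directory that only mimics the .snapshot layout.

```bash
#!/bin/bash
# find_snapshot_copy SNAPDIR RELPATH (hypothetical helper)
# Prints the first copy of RELPATH found under SNAPDIR, searching
# hourly.N, then nightly.N, then weekly.N, with N counting up from 0.
find_snapshot_copy() {
    local snapdir=$1 relpath=$2 kind n
    for kind in hourly nightly weekly; do
        for n in $(seq 0 12); do
            if [ -e "$snapdir/$kind.$n/$relpath" ]; then
                printf '%s\n' "$snapdir/$kind.$n/$relpath"
                return 0
            fi
        done
    done
    return 1
}

# Demonstration on a temporary directory that mimics the snapshot layout.
demo=$(mktemp -d)
mkdir -p "$demo/.snapshot/nightly.2" "$demo/.snapshot/weekly.0"
echo "old copy"   > "$demo/.snapshot/weekly.0/bugreport.csh"
echo "newer copy" > "$demo/.snapshot/nightly.2/bugreport.csh"

found=$(find_snapshot_copy "$demo/.snapshot" bugreport.csh)
cp "$found" "$demo/"
echo "restored from: $found"
```

On the real system the first argument would be $HOME/.snapshot or /projects/<user_name>/.snapshot.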
Lustre
Scalable, POSIX-compliant parallel file system designed for large, distributed-memory systems
Object Storage Targets (OST)
Store user file data
Object Storage Servers (OSS)
Control I/O access and handle network requests
Metadata Target (MDT)
Stores filenames, directories, permissions and file layout
Metadata Server (MDS)
Assigns storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs
[Figure: Lustre components connected over InfiniBand (IB): metadata server (MDS) and target (MDT); object storage server (OSS) and target (OST)]
File Access
[Figure: compute node connected over IB to the MDS/MDT and the OSS/OST]
Compute node requests storage location from the MDS
Compute node then interacts directly with OST
Striping
File - contiguous sequence of bytes
Key feature: the Lustre file system can distribute segments of a file across multiple OSTs using a technique called file striping.
A file is said to be striped when its contiguous sequence of bytes is separated into small chunks, or stripes, so that read and write operations can access multiple OSTs concurrently.
[Figure: /file shown as one contiguous sequence and as stripes spread across OSTs]
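To make the stripe arithmetic concrete, here is a minimal sketch of a round-robin layout. The helper name stripe_slot is hypothetical, and the model is a simplification: it returns a position in the file's list of OSTs (0, 1, 2, ...), not the actual OST index that Lustre assigns. The stripe size and count match the lfs example later in these slides.

```bash
#!/bin/bash
# stripe_slot OFFSET STRIPE_SIZE STRIPE_COUNT (hypothetical helper)
# Prints which of the file's OSTs (0-based position in its stripe list)
# holds the byte at OFFSET under a simple round-robin layout.
stripe_slot() {
    local offset=$1 stripe_size=$2 stripe_count=$3
    echo $(( (offset / stripe_size) % stripe_count ))
}

stripe_size=$((32 * 1024 * 1024))   # 33554432 bytes
stripe_count=3

# Walk the first four stripe boundaries of a file.
for offset in 0 33554432 67108864 100663296; do
    slot=$(stripe_slot "$offset" "$stripe_size" "$stripe_count")
    echo "offset $offset -> OST slot $slot"
done
```

With three stripes, the fourth chunk wraps back to the first OST, which is why concurrent readers and writers of different chunks can hit different servers.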
File I/O
Serial: one process writes /file
File-per-process: each process writes its own file (/file1, /file2, ..., /filen)
Shared file: all processes write to one /file
Collective Buffering: Not currently supported on JANUS
[Figure: Single processor. Write speed (Mb/s, 0 to 800) versus stripe count (1, 2, 4, 8, 15, 30, 60) for transfer sizes of 1 mb and 32 mb]
[Figure: File per processor. Write speed (Mb/s, 0 to 12000) versus processors (files), 1 to 2048]
[Figure: Shared-file with striping. Write speed (Mb/s, 1000 to 7000) versus processors, 1 to 1024]
Examples
bash-janus> mkdir temp_dir
bash-janus> lfs setstripe -c 3 temp_dir
bash-janus> touch temp_dir/temp_file
bash-janus> lfs getstripe temp_dir
temp_dir
stripe_count: 3 stripe_size: 33554432 stripe_offset: -1
temp_dir/temp_file
lmm_stripe_count: 3
lmm_stripe_size: 33554432
lmm_stripe_offset: 18
    obdidx       objid      objid    group
        18    12787913   0xc320c9        0
         7    12863377   0xc44791        0
        23    12496893   0xbeaffd        0
Data transfer
https://www.rc.colorado.edu/crcdocs/file-transfer
GridFTP
GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks
Globus Online
Large file transfers with “drag and drop archiving” to move data between long-term archival storage and compute systems
Utilities
scp, sftp, rsync
Good for small files
Access tips
Control Sockets
One-time passwords make multiple terminal sessions and file transfer painful.
mkdir -p ~/.ssh/sockets
cat >> ~/.ssh/config << EOF
Host login.rc*
  ControlMaster auto
  ControlPath ~/.ssh/sockets/%r@%h:%p
EOF
Mount Drive
http://macfusionapp.org/
Symbolic links
/project, /scratch
Software
Software support
[Figure: diagram with axes labeled "RC expertise" and "user expertise"; annotation: "less general"]
Unsupported software
Installation
Consulting
Advice on installing your software and any dependencies
Supported software
Select state-of-the-art software
Installation, verification, and training
Environment
To run an executable, you need to know where it is.
/opt/openmpi/1.4.4/bin/mpicxx
/opt/mpich2/1.5a2/bin/mpicxx
Which one does the command which mpicxx use?
PATH
What about libraries?
/opt/openmpi/1.4.4/lib/libmpi.so
/opt/mpich2/1.5a2/lib/libmpi.so
LD_LIBRARY_PATH
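The effect of PATH ordering can be tried without touching a real MPI installation. The sketch below builds two throwaway directories, each holding a fake mpicxx script, and shows that the shell resolves the command to whichever directory appears first in PATH. The directory names and the fake scripts are stand-ins invented for this illustration. The dynamic linker treats LD_LIBRARY_PATH in a similar first-match way when it looks for libraries such as libmpi.so.

```bash
#!/bin/bash
# Build two throwaway "installations", each with its own fake mpicxx.
demo=$(mktemp -d)
mkdir -p "$demo/openmpi/bin" "$demo/mpich2/bin"
printf '#!/bin/sh\necho "openmpi wrapper"\n' > "$demo/openmpi/bin/mpicxx"
printf '#!/bin/sh\necho "mpich2 wrapper"\n'  > "$demo/mpich2/bin/mpicxx"
chmod +x "$demo/openmpi/bin/mpicxx" "$demo/mpich2/bin/mpicxx"

# PATH is searched left to right; the first directory with a match wins.
# Each lookup runs in a subshell so the caller's PATH is left untouched.
first=$(PATH="$demo/openmpi/bin:$demo/mpich2/bin:$PATH"; command -v mpicxx)
second=$(PATH="$demo/mpich2/bin:$demo/openmpi/bin:$PATH"; command -v mpicxx)

echo "openmpi listed first: $first"
echo "mpich2 listed first:  $second"
```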
Dotkit
Manages your environment variables
use                     list packages in use
use -a                  list hidden packages in use
use <package_name>      add a package to environment
unuse <package_name>    remove package from environment
use -la                 list available packages
use -la <term>          list packages that contain <term>
Examples
use NCAR-Parallel-Intel
bash-janus> echo $PATH
/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_openmpi-1.4.5_intel-12.1.4/bin:
/curc/tools/free/redhat_5_x86_64/openmpi-1.4.5_intel-12.1.4/bin:
/curc/tools/free/redhat_5_x86_64/torque-2.5.8/bin:
/curc/tools/free/redhat_5_x86_64/netcdf-4.1.3_intel-12.1.4_hdf-4.2.6_hdf5-1.8.8_openmpi-1.4.5/bin:
/curc/tools/free/redhat_5_x86_64/hdf5-1.8.8_openmpi-1.4.5_intel-12.1.4/bin:
/curc/tools/nonfree/redhat_5_x86_64/intel-12.1.4/composer_xe_2011_sp1.10.319/bin/intel64:
/curc/tools/free/redhat_5_x86_64/sun_jdk-1.6.0_23-x86_64/bin:
/curc/tools/free/redhat_5_x86_64/hdf-4.2.6_ics-2012.0.032/bin:
/curc/tools/free/redhat_5_x86_64/szip-2.1/bin:
/curc/tools/nonfree/redhat_5_x86_64/moab-6.1.5/bin
Building Software
I need the Boost C++ library for my software. Where should I build this?
/home/molu8455/projects/software/boost/1.49.0
Build on a compute node (e.g. qsub -I)
Ideas
Consider sharing this with your group.
How about your own dotkit?
Build your own dotkit
cat $HOME/.kits/TeachingHPC.dk
#c Teaching HPC
#d This contains the libraries I use for teaching HPC:
#d .openmpi-1.4.3_gcc-4.5.2_torque-2.5.8_ib
#d .hdf5-1.8.6
# Dependencies
dk_op -q .torque-2.5.8
dk_op -q .openmpi-1.4.3_gcc-4.5.2_torque-2.5.8_ib
dk_op -q .hdf5-1.8.6
# Variables
dk_alter HDF5_DIR /curc/tools/free/redhat_5_x86_64/hdf5-1.8.6
dk_alter BOOST_ROOT /home/molu8455/projects/software/boost/1.49.0
dk_alter LD_LIBRARY_PATH /home/molu8455/projects/software/boost/1.49.0/lib
Resource Management
Scheduling
[Figure: jobs 1 to 7 placed on a nodes-versus-time grid]
Scheduling
[Figure: jobs 1 to 7 rearranged on the nodes-versus-time grid]
Moab and Torque
Moab
Brains of the operation
Comes up with the “schedule”
Torque
Reports information to Moab
Receives direction from Moab
Handles user requests
Provides job query facilities
Commands
showq -u <username>          Show jobs in the queue
canceljob <job_id> or ALL    Cancel your job(s)
checkjob <job_id>            Information about your job
qsub                         Submit jobs
showstart <job_id>           When will your job start?
qsub
Request a resource for your job
1) batch or 2) interactive
Makes environment variables available to your job
PBS_O_*
PBS_O_WORKDIR
PBS_NODEFILE
Options
-q <queue_name>
-l <resource_list>
-I                   interactive
-N <name>
-e <error_path>
-o <output_path>
-j <join_path>
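One common use of PBS_NODEFILE is to find out what the job was actually given. Torque writes one line per allocated processor slot, so a node requested with ppn=12 appears twelve times. The sketch below defines a hypothetical helper, summarize_nodefile, and exercises it on a fake node file because PBS_NODEFILE is only set inside a job; the host names are made up.

```bash
#!/bin/bash
# summarize_nodefile FILE (hypothetical helper)
# Counts total lines (processor slots) and distinct host names (nodes).
summarize_nodefile() {
    local nodefile=$1 cores nodes
    cores=$(( $(wc -l < "$nodefile") ))
    nodes=$(( $(sort -u "$nodefile" | wc -l) ))
    echo "$nodes nodes, $cores cores"
}

# Fake node file: 2 made-up hosts with 3 slots each.
fake_nodefile=$(mktemp)
for host in node0101 node0102; do
    for slot in 1 2 3; do
        echo "$host"
    done
done > "$fake_nodefile"

summary=$(summarize_nodefile "$fake_nodefile")
echo "$summary"
```

Inside a batch or interactive job the call would be summarize_nodefile "$PBS_NODEFILE".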
Queues
Name            Nodes     Max Time    Node Sharing
janus-debug     1-480     1 hour
janus-short     1-480     4 hours
janus-long      1-80      7 days
janus-small     1-20      1 day
janus-normal    21-80     1 day
janus-wide      81-480    1 day
Running Jobs
Process
How many processors do I need?
Approximately how long will this take?
showstart 1024@30:00
showstart 16@16:00:00
Which queue best fits these criteria?
[Figure: nodes-versus-time diagram with blocks labeled 2 and 4]
Name            Nodes     Max Time    Node Sharing
janus-debug     1-480     1 hour
janus-short     1-480     4 hours
janus-long      1-80      7 days
janus-small     1-20      1 day
janus-normal    21-80     1 day
janus-wide      81-480    1 day
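The lookup against the queue table can be written down as a small function. This is only a sketch of one possible rule: it returns the first queue whose node range and time limit fit the request, trying the shortest time limits first. The function name pick_queue is hypothetical, the limits are copied from the table, and site policy or current load (see showstart) may well suggest a different choice.

```bash
#!/bin/bash
# pick_queue NODES MINUTES (hypothetical helper)
# Prints the first queue from the table whose node range and time limit
# fit, trying the shortest time limits first. Returns 1 if nothing fits.
pick_queue() {
    local nodes=$1 minutes=$2
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt 480 ]; then
        return 1
    fi
    if [ "$minutes" -le 60 ]; then            # 1 hour
        echo janus-debug
    elif [ "$minutes" -le 240 ]; then         # 4 hours
        echo janus-short
    elif [ "$minutes" -le 1440 ]; then        # 1 day
        if   [ "$nodes" -le 20 ]; then echo janus-small
        elif [ "$nodes" -le 80 ]; then echo janus-normal
        else                           echo janus-wide
        fi
    elif [ "$minutes" -le 10080 ] && [ "$nodes" -le 80 ]; then   # 7 days
        echo janus-long
    else
        return 1
    fi
}

pick_queue 4 30       # 4 nodes for 30 minutes
pick_queue 100 600    # 100 nodes for 10 hours
pick_queue 16 2880    # 16 nodes for 2 days
```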
Serial Jobs
#!/bin/bash
#PBS -N example_1
#PBS -q janus-debug
#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=1
#PBS -e errfile
#PBS -o outfile
cd $PBS_O_WORKDIR
# run trial 1 of the simulator
./simulator 1 > sim.1
Pack the node
#!/bin/bash
#PBS -N example_2
#PBS -q janus-debug
#PBS -l walltime=0:00:30,nodes=1:ppn=12
cd $PBS_O_WORKDIR
./simulator 1 > sim.1 &
./simulator 2 > sim.2 &
./simulator 3 > sim.3 &
./simulator 4 > sim.4 &
./simulator 5 > sim.5 &
./simulator 6 > sim.6 &
./simulator 7 > sim.7 &
./simulator 8 > sim.8 &
./simulator 9 > sim.9 &
./simulator 10 > sim.10 &
./simulator 11 > sim.11 &
./simulator 12 > sim.12 &
wait
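The twelve background launches above follow one pattern, so they can also be written as a loop. In the sketch below, simulator is a stand-in shell function so the example runs anywhere; in a real job script it would be the ./simulator executable, and the output would go to the job's working directory rather than a temporary one.

```bash
#!/bin/bash
# Stand-in for ./simulator so the sketch runs anywhere (hypothetical).
simulator() { echo "trial $1 done"; }

workdir=$(mktemp -d)
cores_per_node=12     # matches ppn=12 in the job script

# Start one background trial per core, each writing its own output file.
for trial in $(seq 1 "$cores_per_node"); do
    simulator "$trial" > "$workdir/sim.$trial" &
done

# Block until every background trial has exited; without this the job
# script would end and the scheduler would reclaim the node.
wait

echo "output files: $(ls "$workdir" | wc -l)"
```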
Multi-node serial jobs?
Consider using our load-balancing tool.
https://www.rc.colorado.edu/tutorials/loadbalance
#!/bin/bash
#PBS -N example_1
#PBS -q janus-debug
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse LoadBalance
mpirun load_balance -f cmd_lines
Contents of cmd_lines:
./simulator 1 > sim.1
./simulator 2 > sim.2
./simulator 3 > sim.3
./simulator 4 > sim.4
./simulator 5 > sim.5
./simulator 6 > sim.6
./simulator 7 > sim.7
./simulator 8 > sim.8
./simulator 9 > sim.9
./simulator 10 > sim.10
...
./simulator 2000 > sim.2000
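A command file with 2000 near-identical lines is easier to generate than to type. The sketch below writes one line per trial in the same form as the listing above. It writes to a temporary directory so it can be run anywhere; in practice the file would be created next to the job script.

```bash
#!/bin/bash
# Generate the command file for the load-balancing tool: one line per trial.
outdir=$(mktemp -d)
cmd_file="$outdir/cmd_lines"

for trial in $(seq 1 2000); do
    echo "./simulator $trial > sim.$trial"
done > "$cmd_file"

# Show the first and last entries as a quick sanity check.
head -n 1 "$cmd_file"
tail -n 1 "$cmd_file"
```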
MPI
#!/bin/bash
#PBS -N example_4
#PBS -q janus-debug
#PBS -l walltime=0:10:00
#PBS -l nodes=3:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse .openmpi-1.4.5_intel-12.1.4
# run the simulator on all 36 cores (3 nodes x 12 cores)
mpirun -np 36 ./simulator
# alternative without an explicit process count:
# mpirun ./simulator
Non-Uniform Memory Access (NUMA)
Each socket has a dedicated memory area for high speed access
Also has an interconnect to other sockets for slower access to the other sockets' memory
[Figure: two sockets, each with its own memory control and attached memory]
MPI OpenMP / High Memory
#!/bin/bash
#PBS -N example_5
#PBS -q janus-debug
#PBS -l walltime=0:10:00
#PBS -l nodes=3:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse .openmpi-1.4.5_intel-12.1.4
# Option 1: one MPI process per node, 12 OpenMP threads each
export OMP_NUM_THREADS=12
mpirun --bind-to-core --bynode --npernode 1 ./simulator
# Option 2: one MPI process per socket, 6 OpenMP threads each
export OMP_NUM_THREADS=6
mpirun --bind-to-socket --bysocket --npersocket 1 ./simulator
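The thread counts in the two variants follow from dividing the cores on a node among the MPI processes placed on it. A minimal sketch of that arithmetic, assuming 12 cores per node (16,416 processors over 1,368 nodes) and using a hypothetical helper named threads_per_rank:

```bash
#!/bin/bash
# threads_per_rank CORES_PER_NODE RANKS_PER_NODE (hypothetical helper)
# Prints the number of OpenMP threads each MPI process can use so that
# threads x ranks equals the cores on the node.
threads_per_rank() {
    local cores=$1 ranks=$2
    if [ "$ranks" -lt 1 ] || [ $(( cores % ranks )) -ne 0 ]; then
        echo "ranks per node must divide the core count evenly" >&2
        return 1
    fi
    echo $(( cores / ranks ))
}

cores_per_node=12     # ASSUMPTION: 16,416 processors / 1,368 nodes

# One rank per socket on a two-socket node means 2 ranks per node.
OMP_NUM_THREADS=$(threads_per_rank "$cores_per_node" 2)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Running fewer ranks per node than cores is also how a high-memory job gets a larger share of the node's memory per process.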
Summary
Access
Use control sockets for login
Filesystem
Build software in /projects/<username>
Run your jobs in /lustre/janus_scratch/<user_name>
Recover files with .snapshot
Consider striping when using shared-file access.
Data Transfer
Large files: Globus Online, GridFTP
Smaller files: sftp, scp
Software
Build on compute node.
Manage environment with your own dotkits.
Resource Management
Familiarize yourself with the queues
When you have choices... showstart
Running Jobs
Request what you need and manage with LoadBalance
OpenMP: be aware of NUMA
Limit the number of processes per node for hybrid and high memory
Questions?
Collective buffering
At large core counts, I/O performance can be hindered by:
MDS contention (file-per-process)
file system contention (shared-file)
Use a subset of application processes to perform I/O.
limits the number of files (file-per-process)
limits the number of processes accessing file system resources (shared-file).
Offloads work from the file system to the application
A subset of processors write - reducing contention