High Performance Computing Workshop HPC 101 Dr. Charles J Antonelli LSAIT ARS February, 2014

cja 2014 2

Credits

Contributors:

Brock Palen (CAEN HPC)

Jeremy Hallum (MSIS)

Tony Markel (MSIS)

Bennet Fauber (CAEN HPC)

Mark Montague (LSAIT ARS)

Nancy Herlocher (LSAIT ARS)

LSAIT ARS

CAEN HPC

2/14

Roadmap

High Performance Computing

Flux Architecture

Flux Mechanics

Flux Batch Operations

Introduction to Scheduling

High Performance Computing

Cluster HPC

A computing cluster is a number of computing nodes, connected via special hardware and software, that together can solve large problems.

A cluster is much less expensive than a single supercomputer (e.g., a mainframe)

Using clusters effectively requires parallel support in scientific software applications (e.g., MATLAB's Parallel Computing Toolbox or R's snow package) or custom code

Programming Models

Two basic parallel programming models

Message-passing

The application consists of several processes running on different nodes and communicating with each other over the network

Used when the data are too large to fit on a single node, and simple synchronization is adequate

“Coarse parallelism”

Implemented using MPI (Message Passing Interface) libraries

Multi-threaded

The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives

Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable

“Fine-grained parallelism” or “shared-memory parallelism”

Implemented using OpenMP (Open Multi-Processing) compilers and libraries

Both
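As an illustration of the multi-threaded model (not part of the workshop materials, which use OpenMP), the following minimal Python sketch runs several threads inside one process. The threads share a single accumulator and coordinate through a lock, which is one kind of synchronization primitive. The function name `parallel_sum` and the default of four threads are assumptions made for the example; CPython threads show the structure of the model rather than its speedup.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum `values` using several threads that share one accumulator."""
    total = 0
    lock = threading.Lock()  # synchronization primitive guarding `total`

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)  # private work: no sharing needed here
        with lock:            # only one thread updates shared state at a time
            total += partial

    # Deal the data out round-robin, one slice per thread.
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before reading the result
    return total
```

In a message-passing program each worker would instead be a separate process, possibly on another node, and the partial sums would travel over the network rather than through shared memory.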

Amdahl’s Law
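Amdahl's Law bounds the speedup of a program when only part of it can run in parallel: if a fraction p of the run time is parallelizable and n cores are used, the speedup is at most 1 / ((1 - p) + p / n). The sketch below evaluates that formula. The function name and the sample values are illustrative assumptions, not figures from the workshop.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Upper bound on speedup from Amdahl's Law.

    parallel_fraction: share of the serial run time that can be parallelized (0..1)
    n_cores: number of cores applied to the parallel part (>= 1)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Even with many cores, the serial 10% caps the speedup below 10x.
    for cores in (1, 12, 120, 1200):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

The serial fraction dominates quickly, so reducing it usually pays off more than adding cores.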

Flux Architecture

Flux

Flux is a university-wide shared computational discovery / high-performance computing service.

Provided by Advanced Research Computing at U-M

Operated by CAEN HPC

Procurement, licensing, billing by U-M ITS

Interdisciplinary since 2010

http://arc.research.umich.edu/resources-services/flux/

The Flux cluster

[Figure: the Flux cluster, comprising login nodes, compute nodes, storage, and a data transfer node]

A Flux node

12-16 Intel cores

48-64 GB RAM

Local disk

Ethernet InfiniBand

A Large Memory Flux node

1 TB RAM

Local disk

Ethernet InfiniBand

32-40 Intel cores

Coming soon: A Flux GPU node

16 Intel cores

64 GB RAM

Local disk

8 GPUs

Each GPU contains 2,688 GPU cores

Flux software

Licensed and open software:

Abaqus, BLAST, BWA, bowtie, ANSYS, Java, Mason, Mathematica, Matlab, R, RSEM, Stata SE, …

See http://cac.engin.umich.edu/resources

C, C++, Fortran compilers:

Intel (default), PGI, GNU toolchains

You can choose software using the module command

Flux network

All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network

The Flux login nodes are also connected to the campus backbone network

The Flux data transfer node is connected over a 10 Gbps connection to the campus backbone network

This means

The Flux login nodes can access the Internet

The Flux compute nodes cannot

If InfiniBand is not available for a compute node, code on that node will fall back to Ethernet communications

Flux data

Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes

640 TB of short-term storage for batch jobs

Large, fast, short-term

NFS filesystems mounted on /home and /home2 on all nodes

80 GB of storage per user for development & testing

Small, slow, long-term

Flux data

Flux does not provide large, long-term storage

Alternatives:

Value Storage (NFS)

$20.84 / TB / month (replicated, no backups)

$10.42 / TB / month (non-replicated, no backups)

LSA Large Scale Research Storage

2 TB free to researchers (replicated, no backups)

Faculty members, lecturers, postdocs, GSI/GSRA

Additional storage $30 / TB / year (replicated, no backups)

Departmental server

CAEN can mount your storage on the login nodes

Copying data

Three ways to copy data to/from Flux

From Linux or Mac OS X, use scp:

scp localfile login@flux-xfer.engin.umich.edu:remotefile
scp login@flux-login.engin.umich.edu:remotefile localfile
scp -r localdir login@flux-xfer.engin.umich.edu:remotedir

From Windows, use WinSCP

U-M Blue Disc: http://www.itcs.umich.edu/bluedisc/

Use Globus Connect

Globus Connect

Features

High-speed data transfer, much faster than SCP or SFTP

Reliable & persistent

Minimal client software: Mac OS X, Linux, Windows

GridFTP Endpoints

Gateways through which data flow

Exist for XSEDE, OSG, …

UMich: umich#flux, umich#nyx

Add your own client endpoint!

Add your own server endpoint: contact flux-support@umich.edu

More information

http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp

Flux Mechanics

Using Flux

Three basic requirements to use Flux:

1. A Flux account
2. A Flux allocation
3. An MToken (or a Software Token)

Using Flux

1. A Flux account

Allows login to the Flux login nodes

Develop, compile, and test code

Available to members of U-M community, free

Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication

Using Flux

2. A Flux allocation

Allows you to run jobs on the compute nodes

Some units cost-share Flux rates

Regular Flux: $11.72/core/month
LSA, Engineering, Medical School: $6.60/month

Large Memory Flux: $23.82/core/month
LSA, Engineering, Medical School: $13.30/month

GPU Flux: $107.10/2 CPU cores and 1 GPU/month
LSA, Engineering, Medical School: $60/month

Flux Operating Environment: $113.25/node/month
LSA, Engineering, Medical School: $63.50/month

Flux pricing at http://arc.research.umich.edu/flux/hardware-services/

Rackham grants are available for graduate students

Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/

To inquire about Flux allocations please email flux-support@umich.edu
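To budget an allocation, the arithmetic is cores times months times the monthly rate. The sketch below is a hypothetical helper, not an official Flux calculator: the function name is an assumption, and it treats the rate as a per-core monthly price, as the Regular Flux rate above is stated. Decimal is used so that currency values are not subject to binary floating-point rounding.

```python
from decimal import Decimal


def allocation_cost(cores, months, rate_per_core_month):
    """Estimated cost in dollars of `cores` cores held for `months` months.

    rate_per_core_month is given as a string such as "11.72" so that it
    converts to Decimal exactly.
    """
    if cores < 0 or months < 0:
        raise ValueError("cores and months must be non-negative")
    return Decimal(cores) * Decimal(months) * Decimal(rate_per_core_month)


if __name__ == "__main__":
    # Twelve Regular Flux cores for three months at $11.72/core/month.
    print(allocation_cost(12, 3, "11.72"))
```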

Using Flux

3. An MToken (or a Software Token)

Required for access to the login nodes

Improves cluster security by requiring a second means of proving your identity

You can use either an MToken or an application for your mobile device (called a Software Token) for this

Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa

Logging in to Flux

ssh flux-login.engin.umich.edu

MToken (or Software Token) required

You will be randomly connected to a Flux login node

Currently flux-login1 or flux-login2

Firewalls restrict access to flux-login. To connect successfully, either

Physically connect your ssh client platform to the U-M campus wired or MWireless network, or

Use VPN software on your client platform, or

Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there

Modules

The module command allows you to specify what versions of software you want to use

module list -- Show loaded modules
module load name -- Load module name for use
module avail -- Show all available modules
module avail name -- Show versions of module name
module unload name -- Unload module name
module -- List all options

Enter these commands at any time during your session

A configuration file allows default module commands to be executed at login

Put module commands in file ~/privatemodules/default

Don’t put module commands in your .bashrc / .bash_profile

Flux environment

The Flux login nodes have the standard GNU/Linux toolkit:

make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, …

Watch out for source code or data files written on non-Linux systems

Use these tools to analyze and convert source files to Linux format:

file

dos2unix
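The usual problem with files written on Windows is that each line ends in a carriage return plus line feed (CRLF) rather than the single line feed Linux expects. As a rough illustration of what that conversion involves, the sketch below detects and rewrites CRLF line endings in a byte string. The function names are assumptions made for this example; on Flux you would normally just run file and dos2unix.

```python
def has_crlf(data: bytes) -> bool:
    """Return True if the data contains Windows-style CRLF line endings."""
    return b"\r\n" in data


def to_unix(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF, leaving other bytes alone."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"x <- 1\r\nprint(x)\r\n"  # two lines with Windows line endings
    print(has_crlf(sample))             # the file needs conversion
    print(to_unix(sample))
```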

Lab 1

Task: Invoke R interactively on the login node

module load R
module list

R
q()

Please run only very small computations on the Flux login nodes, e.g., for testing

Lab 2

Task: Run R in batch mode

module load R

Copy sample code to your login directory
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code

Examine Rbatch.pbs and Rbatch.R

Edit Rbatch.pbs with your favorite Linux editor

Change #PBS -M email address to your own

Lab 2

Task: Run R in batch mode

Submit your job to Flux
qsub Rbatch.pbs

Watch the progress of your job
qstat -u uniqname

where uniqname is your own uniqname

When complete, look at the job’s output
less Rbatch.out

Copy your results to your local workstation (change uniqname to your own uniqname)
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rbatch.out Rbatch.out

Lab 3

Task: Use the multicore package

The multicore package allows you to use multiple cores on the same node

module load R
cd ~/hpc-sample-code

Examine Rmulti.pbs and Rmulti.R

Edit Rmulti.pbs with your favorite Linux editor

Change #PBS -M email address to your own

Lab 3

Task: Use the multicore package

Submit your job to Flux
qsub Rmulti.pbs

Watch the progress of your job
qstat -u uniqname

where uniqname is your own uniqname

When complete, look at the job’s output
less Rmulti.out

Copy your results to your local workstation (change uniqname to your own uniqname)
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rmulti.out Rmulti.out

Compiling Code

Assuming default module settings

Use mpicc/mpiCC/mpif90 for MPI code

Use icc/icpc/ifort with -openmp for OpenMP code

Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90

Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c

MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog

Lab 4

Task: compile and execute simple programs on the Flux login node

Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code

Examine, compile & execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello

Examine, compile & execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello

Examine, compile & execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01

Makefiles

The make command automates your code compilation process

Uses a makefile to specify dependencies between source and object files

The sample directory contains a sample makefile

To compile c_ex01:
make c_ex01

To compile all programs in the directory:
make

To remove all compiled programs:
make clean

To make all the programs using 8 compiles in parallel:
make -j8

Flux Batch Operations

Portable Batch System

All production runs are run on the compute nodes using the Portable Batch System (PBS)

PBS manages all aspects of cluster job execution except job scheduling

Flux uses the Torque implementation of PBS

Flux uses the Moab scheduler for job scheduling

Torque and Moab work together to control access to the compute nodes

PBS puts jobs into queues

Flux has a single queue, named flux

Cluster workflow

You create a batch script and submit it to PBS

PBS schedules your job, and it enters the flux queue

When its turn arrives, your job will execute the batch script

Your script has access to any applications or data stored on the Flux cluster

When your job completes, anything it sent to standard output and standard error is saved and returned to you

You can check on the status of your job at any time, or delete it if it’s not doing what you want

A short time after your job completes, it disappears

Basic batch commands

Once you have a script, submit it:
qsub scriptfile

$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu

You can check on the job status:
qstat jobid
qstat -u user

$ qstat -u cja

nyx.engin.umich.edu:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6023521.nyx.engi     cja      flux     hpc101i          --     1     1   --     00:05 Q --

To delete your job:
qdel jobid

$ qdel 6023521
$

Loosely-coupled batch script

#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=12,pmem=1gb,walltime=01:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

#Your Code Goes Below:
cd $PBS_O_WORKDIR
mpirun ./c_ex01
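When many similar jobs are needed, it can be convenient to generate batch scripts like this one from a template instead of editing each by hand. The sketch below is one possible way to do that in Python. The helper name `make_pbs_script` and its parameters are assumptions made for this example and are not part of PBS or Flux; it only assembles the directives shown in this script.

```python
def make_pbs_script(job_name, allocation, email, resources, commands):
    """Return the text of a PBS batch script for the flux queue.

    resources is the value of the resource request, for example
    "procs=12,pmem=1gb,walltime=01:00:00"; commands is a list of shell
    command lines to run from the submission directory.
    """
    lines = [
        f"#PBS -N {job_name}",
        "#PBS -V",
        f"#PBS -A {allocation}",
        "#PBS -l qos=flux",
        "#PBS -q flux",
        f"#PBS -l {resources}",
        f"#PBS -M {email}",
        "#PBS -m abe",
        "#PBS -j oe",
        "",
        "cd $PBS_O_WORKDIR",  # start in the directory qsub was run from
    ]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_pbs_script(
        "hpc101", "youralloc_flux", "uniqname@umich.edu",
        "procs=12,pmem=1gb,walltime=01:00:00", ["mpirun ./c_ex01"]))
```

The returned text can be written to a file and passed to qsub in the usual way.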

Tightly-coupled batch script

#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=12,mem=47gb,walltime=02:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

#Your Code Goes Below:
cd $PBS_O_WORKDIR
matlab -nodisplay -r script

Lab 5

Task: Run an MPI job on 8 cores

Compile c_ex05
cd ~/cac-intro-code
make c_ex05

Edit file run with your favorite Linux editor

Change #PBS -M address to your own

I don’t want Brock to get your email!

Change #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired

Change #PBS -l qos to flux

Submit your job
qsub run

PBS attributes

As always, man qsub is your friend

-N : sets the job name, can’t start with a number
-V : copy shell environment to compute node
-A youralloc_flux : sets the allocation you are using
-l qos=flux : sets the quality of service parameter
-q flux : sets the queue you are submitting to
-l : requests resources, like number of cores or nodes
-M : whom to email, can be multiple addresses
-m : when to email: a=job abort, b=job begin, e=job end
-j oe : join STDOUT and STDERR to a common file

-I : allow interactive use
-X : allow X GUI use

PBS resources (1)

A resource (-l) can specify:

Request wallclock (that is, running) time
-l walltime=HH:MM:SS

Request C MB of memory per core
-l pmem=Cmb

Request T MB of memory for entire job
-l mem=Tmb

Request M cores on arbitrary node(s)
-l procs=M

Request a token to use licensed software
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
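Walltime values are easy to get wrong by a factor of sixty, so it can help to convert a request to seconds before submitting. The sketch below is a hypothetical helper, not part of PBS. It assumes the colon-separated form shown above and also accepts the shorter MM:SS and SS forms, reading the fields from the right.

```python
def walltime_to_seconds(walltime):
    """Convert "HH:MM:SS", "MM:SS" or "SS" to a number of seconds."""
    parts = walltime.split(":")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized walltime: {walltime!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth sixty times the one after it.
        seconds = seconds * 60 + int(part)
    return seconds


if __name__ == "__main__":
    for request in ("01:00:00", "30:00", "45"):
        print(request, walltime_to_seconds(request))
```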

PBS resources (2)

A resource (-l) can specify:

For multithreaded code:

Request M nodes with at least N cores per node
-l nodes=M:ppn=N

Request M cores with exactly N cores per node (note the difference vis-à-vis ppn syntax and semantics!)
-l nodes=M,tpn=N
(you’ll only use this for specific algorithms)

Interactive jobs

You can submit jobs interactively:

qsub -I -X -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux

This queues a job as usual

Your terminal session will be blocked until the job runs

When your job runs, you'll get an interactive shell on one of your nodes

Invoked commands will have access to all of your nodes

When you exit the shell your job is deleted

Interactive jobs allow you to

Develop and test on cluster node(s)

Execute GUI tools on a cluster node

Utilize a parallel debugger interactively

Lab 6

Task: Run an interactive job

Enter this command (all on one line):
qsub -I -V -l procs=1 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux

When your job starts, you’ll get an interactive shell

Copy and paste the batch commands from the “run” file, one at a time, into this shell

Experiment with other commands

After thirty minutes, your interactive shell will be killed

Lab 7

Task: Run Matlab interactively

module load matlab

Start an interactive PBS session
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux

Run Matlab in the interactive PBS session
matlab -nodisplay

Introduction to Scheduling

The Scheduler (1/3)

Flux scheduling policies:

The job’s queue determines the set of nodes you run on

The job’s account and qos determine the allocation to be charged

If you specify an inactive allocation, your job will never run

The job’s resource requirements help determine when the job becomes eligible to run

If you ask for unavailable resources, your job will wait until they become free

There is no pre-emption

The Scheduler (2/3)

Flux scheduling policies:

If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run:

How long you have waited for the resource

How much of the resource you have used so far

This is called “fairshare”

The scheduler will reserve nodes for a job with sufficient priority

This is intended to prevent starving jobs with large resource requirements

The Scheduler (3/3)

Flux scheduling policies:

If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps

This is called “backfill”
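The idea behind backfill can be sketched in a few lines. The toy model below is an assumption-laden illustration and not Moab's actual algorithm: it describes a gap as a number of idle cores that stay free for a given number of minutes, then walks the waiting jobs in queue order and starts each one that fits in both dimensions. The `Job` class and `backfill` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    walltime_min: int  # requested wallclock time in minutes


def backfill(gap_cores, gap_minutes, waiting):
    """Return names of waiting jobs that can run inside the gap.

    A job fits if it needs no more cores than are still idle and its
    requested walltime ends before the gap closes.
    """
    chosen = []
    free = gap_cores
    for job in waiting:  # queue order stands in for priority order
        if job.cores <= free and job.walltime_min <= gap_minutes:
            chosen.append(job.name)
            free -= job.cores
    return chosen


if __name__ == "__main__":
    queue = [Job("big", 16, 600), Job("small1", 4, 30),
             Job("long", 2, 120), Job("small2", 4, 45)]
    print(backfill(8, 60, queue))
```

This is also why an accurate walltime request helps: a job that asks for less time fits into more gaps.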

[Figure: backfill schedule, plotted as cores versus time]

Gaining insight

There are several commands you can run to get some insight into the scheduler’s actions:

freenodes : shows the number of free nodes and cores currently available

mdiag -a youralloc_name : shows resources defined for your allocation and who can run against it

showq -w acct=yourallocname: shows jobs using your allocation (running/idle/blocked)

checkjob jobid : Can show why your job might not be starting

showstart -e all jobid : Gives you a coarse estimate of job start time; use the smallest value returned

More advanced scheduling

Job Arrays

Dependent Scheduling

Job Arrays

• Submit copies of identical jobs

• Invoked via qsub -t:

qsub -t array-spec pbsbatch.txt

Where array-spec can be

m-n

a,b,c

m-n%slotlimit

e.g.

qsub -t 1-50%10    Fifty jobs, numbered 1 through 50, only ten can run simultaneously

• $PBS_ARRAYID records array identifier
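To check what an array-spec will expand to before submitting, the three forms above can be parsed in a few lines. The sketch below is a hypothetical helper, not part of Torque; it returns the list of array identifiers (the values each job would see in $PBS_ARRAYID) together with the slot limit, if one was given.

```python
def expand_array_spec(spec):
    """Expand an array-spec such as "1-50%10" or "2,4,6".

    Returns (ids, slot_limit), where slot_limit is None when the spec
    has no %slotlimit suffix.
    """
    slot_limit = None
    if "%" in spec:
        spec, limit = spec.split("%", 1)
        slot_limit = int(limit)
    ids = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            ids.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            ids.append(int(part))
    return ids, slot_limit


if __name__ == "__main__":
    ids, limit = expand_array_spec("1-50%10")
    print(len(ids), "jobs, at most", limit, "running at once")
```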

Dependent scheduling

• Submit jobs whose execution scheduling depends on other jobs

• Invoked via qsub -W:

qsub -W depend=type:jobid[:jobid]…

Where type can be

after         Schedule this job after jobids have started

afterok       Schedule this job after jobids have finished, only if no errors

afternotok    Schedule this job after jobids have finished, only if errors

afterany      Schedule this job after jobids have finished, regardless of status

before, beforeok, beforenotok, beforeany

Dependent scheduling

Where type can be (cont’d)

before         When this job has started, jobids will be scheduled

beforeok       After this job completes without errors, jobids will be scheduled

beforenotok    After this job completes with errors, jobids will be scheduled

beforeany      After this job completes, regardless of status, jobids will be scheduled
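When a workflow chains several jobs, the dependency argument is usually assembled by a script from the job IDs that earlier qsub calls printed. The sketch below shows one way to build and validate that argument. The function name `depend_option` is an assumption made for this example, and only the dependency types listed on these slides are accepted.

```python
DEPEND_TYPES = {
    "after", "afterok", "afternotok", "afterany",
    "before", "beforeok", "beforenotok", "beforeany",
}


def depend_option(dep_type, job_ids):
    """Build the value passed to qsub -W, e.g. "depend=afterok:6023521"."""
    if dep_type not in DEPEND_TYPES:
        raise ValueError(f"unknown dependency type: {dep_type!r}")
    if not job_ids:
        raise ValueError("at least one job ID is required")
    return "depend=" + dep_type + ":" + ":".join(str(j) for j in job_ids)


if __name__ == "__main__":
    # Argument list for a job that runs only if job 6023521 succeeds;
    # "postprocess.pbs" is a placeholder script name.
    print(["qsub", "-W", depend_option("afterok", ["6023521"]), "postprocess.pbs"])
```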

Some Flux Resources

http://arc.research.umich.edu/resources-services/flux/

U-M Advanced Research Computing Flux pages

http://cac.engin.umich.edu/

CAEN HPC Flux pages

http://www.youtube.com/user/UMCoECAC

CAEN HPC YouTube channel

For assistance: flux-support@umich.edu

Read by a team of people including unit support staff

Cannot help with programming questions, but can help with operational Flux and basic usage questions

Summary

The Flux cluster is just a collection of similar Linux machines connected together to run your code, much faster than your desktop can

Command-line scripts are queued by a batch system and executed when resources become available

Some important commands are

qsub
qstat -u username
qdel jobid
checkjob

Develop and test, then submit your jobs in bulk and let the scheduler optimize their execution

Any Questions?

Charles J. Antonelli
LSAIT Advocacy and Research Support
cja@umich.edu
http://www.umich.edu/~cja
734 763 0607
