Page 1: Using ITaP clusters for large scale statistical analysis with R

Using ITaP clusters for large scale statistical analysis with R

Doug Crabill, Purdue University

Page 2: Using ITaP clusters for large scale statistical analysis with R

Topics

• Running multiple R jobs on departmental Linux servers serially, and in parallel
• Cluster concepts and terms
• Use cluster to run same R program many times
• Use cluster to run same R program many times with different parameters
• Running jobs on an entire cluster node

Page 3: Using ITaP clusters for large scale statistical analysis with R

Invoking R in a batch mode

• R CMD BATCH vs Rscript
• Rscript t.R > t.out (a sample t.R is shown below)
• Rscript t.R > t.out & # run in background
• nohup Rscript t.R > t.out & # run in background, even after logout
• Can launch several such jobs with &’s simultaneously, but best to stick with between 2 and 8 jobs per departmental server
• Invoking manually doesn’t scale well for running dozens of jobs
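For concreteness, t.R here can be any R script; the one used later in this talk (Page 16) is a single line:

summary(1 + rgeom(10^7, 1/1000))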

Page 4: Using ITaP clusters for large scale statistical analysis with R

Launching several R jobs serially
• Create a file like “run.sh” that contains several Rscript invocations. The first job will run until it completes, then the second job will run, etc.
• Do NOT use “&” at the end or it could crash the server

Rscript t.R > tout.001
Rscript t.R > tout.002
Rscript t.R > tout.003
Rscript t.R > tout.004
Rscript t.R > tout.005
Rscript t.R > tout.006
Rscript t.R > tout.007
Rscript t.R > tout.008

Page 5: Using ITaP clusters for large scale statistical analysis with R

Creating the “run.sh” script programmatically
• Extra cool points for creating the “run.sh” script using R:

> sprintf("Rscript t.R > tout.%03d", 1:8)
[1] "Rscript t.R > tout.001" "Rscript t.R > tout.002" "Rscript t.R > tout.003"
[4] "Rscript t.R > tout.004" "Rscript t.R > tout.005" "Rscript t.R > tout.006"
[7] "Rscript t.R > tout.007" "Rscript t.R > tout.008"
> write(sprintf("Rscript t.R > tout.%03d", 1:8), "run.sh")

Page 6: Using ITaP clusters for large scale statistical analysis with R

Invoking “run.sh”

• sh run.sh # run every job in run.sh one at a time
• nohup sh run.sh & # run every job in run.sh one at a time, and keep running even after logout
• nohup xargs -d '\n' -n1 -P4 sh -c < run.sh & # run every job in run.sh, keeping 4 jobs running simultaneously, and keep running even after logout

Page 7: Using ITaP clusters for large scale statistical analysis with R

Supercomputers and clusters
• Supercomputer = a collection of computer nodes in a cluster, managed by a scheduler
• Node = one computer in a cluster (dozens to hundreds)
• Core = one CPU core in a node (often 8 to 48 cores per node)
• Front end = one or more computers used for launching jobs on the cluster
• PBS / Torque is the scheduling software. PBS is like the maître d’, seating groups of various sizes for varying times at available tables, with a waiting list, a bar, reservations, and bad customers that spoil the party

Page 8: Using ITaP clusters for large scale statistical analysis with R

ITaP / RCAC clusters

• Conte – was the fastest supercomputer on any academic campus in the world when built in June 2013. Has Intel Phi coprocessors
• Carter – has NVIDIA GPU-accelerated nodes
• Hansen, Rossmann, Coates
• Scholar – uses part of Carter, for instructional use
• Hathi – Hadoop
• Radon – accessible by all researchers on campus
• Anecdotes… <cue patriotic music>

Page 9: Using ITaP clusters for large scale statistical analysis with R

More info on radon cluster

• https://www.rcac.purdue.edu/ then select Computation->Radon

• Read the User’s Guide in the left sidebar

Sub-Cluster: Radon-D
Number of Nodes: 30
Processors per Node: Two 2.33 GHz Quad-Core Intel E5410
Cores per Node: 8
Memory per Node: 16 GB
Interconnect: 1 GigE
Theoretical Peak TeraFLOPS: 58.2

Page 10: Using ITaP clusters for large scale statistical analysis with R

Logging into radon

• Get accounts on the RCAC website previously mentioned, or ask me
• Make an SSH connection to radon.rcac.purdue.edu
• From Linux (or a Mac terminal), type this to log into one of the cluster front ends (as user dgc):
• ssh -X radon.rcac.purdue.edu -l dgc
• Do not run big jobs on the front ends! They are only used for submitting jobs to the cluster and light testing and debugging

Page 11: Using ITaP clusters for large scale statistical analysis with R

File storage on radon

• Home directory quota is ~10GB (type “myquota”)
• Can be increased to 100GB via Boiler Backpack settings at http://www.purdue.edu/boilerbackpack
• Scratch storage of around 1TB per user. This directory differs per user, and is accessible via the $RCAC_SCRATCH environment variable:

radon-fe01 ~ $ cd $RCAC_SCRATCH
radon-fe01 /scratch/radon/d/dgc $

• All nodes can see all files in home and scratch
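The same environment variable works from inside R, e.g. to point large output at scratch rather than the quota-limited home directory (a minimal sketch; the results object and filename are hypothetical):

scratch <- Sys.getenv("RCAC_SCRATCH")        # e.g. /scratch/radon/d/dgc
results <- rnorm(10^6)                       # stand-in for real output
saveRDS(results, file.path(scratch, "results.rds"))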

Page 12: Using ITaP clusters for large scale statistical analysis with R

Software on radon

• A list of applications installed on radon can be found in the user’s guide previously mentioned
• The module command is used to “load” software packages for use by the current login session
• module avail # See the list of applications available
• module load r # Add “R”
• The module load r line must be included in every R job run on the cluster

Page 13: Using ITaP clusters for large scale statistical analysis with R

PBS scheduler commands

• qstat # See list of jobs in the queue
• qstat -u dgc # See list of jobs in the queue submitted by dgc
• qsub jobname.sh # Submit jobname.sh to run on the cluster
• qdel JOBIDNUMBER # Delete a previously submitted job from the queue

Page 14: Using ITaP clusters for large scale statistical analysis with R

Simple qsub submission file
• qsub accepts command line arguments, or embedded #PBS directives that are ignored by the shell but honored by qsub
• The JOBID of this particular job is 683369

radon-fe01 ~/cluster $ cat myjob.sh
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
/bin/hostname
radon-fe01 ~/cluster $ qsub myjob.sh
683369.radon-adm.rcac.purdue.edu

Page 15: Using ITaP clusters for large scale statistical analysis with R

Viewing status and the results
• Use qstat or qstat -u dgc to check job status
• Output of job 683369 goes to myjob.sh.o683369
• Errors from job 683369 go to myjob.sh.e683369
• It is inconvenient to collect results from a dynamically named file like myjob.sh.o683369. Best to write output to filenames of your choosing, either directly from your R program or by redirecting output to a file in your job submission file
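For example, the job’s R program can name its own output file instead of relying on the PBS default (a sketch using the t.R computation from Page 16; the filename “out.001” is just an illustration):

res <- summary(1 + rgeom(10^7, 1/1000))
capture.output(res, file = "out.001")   # results land in out.001, not myjob.sh.o683369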

Page 16: Using ITaP clusters for large scale statistical analysis with R

Our first R job submission

• Say we want to run the R program t.R on radon:

summary(1 + rgeom(10^7, 1/1000))

• Create R1.sh with contents:

#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
module add r
Rscript t.R > out1

• Submit using qsub R1.sh

Page 17: Using ITaP clusters for large scale statistical analysis with R

Let’s do that 100 times

• Using our “R1.sh” file as a template, create files prog001.sh through prog100.sh, changing the output file for each job to be out.NNN. In R:

s <- scan("R1.sh", what='c', sep="\n")
# s[6] is the Rscript line (scan skips blank lines by default)
sapply(1:100, function(i) { s[6] <- sprintf("Rscript t.R > out.%03d", i); write(s, sprintf("prog%03d.sh", i)) })
write(sprintf("qsub prog%03d.sh", 1:100), "runall.sh")

• Submit all 100 jobs by typing sh -x runall.sh
• Generating the files using bash instead (all on one line):

for i in `seq -w 1 100`; do (head -6 R1.sh; echo "Rscript t.R > out.$i") > prog$i.sh; echo "qsub prog$i.sh"; done > runall.sh
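Once the queue drains, a quick R check (a sketch, assuming the out.NNN naming above) confirms all 100 output files actually arrived before you collect results:

sum(file.exists(sprintf("out.%03d", 1:100)))   # should print 100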

Page 18: Using ITaP clusters for large scale statistical analysis with R

Coupon collector problem

• I want to solve the coupon collector problem with large parameters, but it will take much too long on a single computer (around 2.5 days):

sum(sapply(1:10000, function(y) {mean(1 + rgeom(10^8, y/10000))}))

• The obvious approach is to break it into 10,000 smaller R jobs and submit them to the cluster.
• Better to break it into 250 jobs, each operating on 40 numbers.
• Create an R script that accepts command line arguments to process many numbers at a time. Estimate walltime carefully!
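As a sanity check on the simulation (not from the slides): since 1 + rgeom(n, y/10000) has mean 10000/y, the quantity being estimated is exactly sum(10000/y) for y in 1:10000, i.e. 10000 times the 10000th harmonic number:

sum(10000 / (1:10000))   # about 97876, the exact expected number of draws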

Page 19: Using ITaP clusters for large scale statistical analysis with R

Coupon collector R code

• t2.R reads the arguments into “args”, then processes each:

args <- commandArgs(TRUE)
sapply(as.integer(args), function(y) {mean(1 + rgeom(10^8, y/10000))})

• Can test via (change reps from 10^8 to 10^5 for the test):

Rscript t2.R 100 125 200

• Generate 250 scripts with 40 arguments each:

s <- scan("R2.sh", what='c', sep="\n")
sapply(1:250, function(y) { s[6] <- sprintf("Rscript t2.R %s > out.%03d", paste((y*40-39):(y*40), collapse=" "), y); write(s, sprintf("prog%03d.sh", y)) })
write(sprintf("qsub prog%03d.sh", 1:250), "runall.sh")

Page 20: Using ITaP clusters for large scale statistical analysis with R

Coupon collector results

• Output is in the files out.001 through out.250:

radon-fe00 ~/cluster/R2done $ cat out.001
 [1] 9999.8856 5000.4830 3333.0443 2499.8564 1999.7819 1666.2517 1428.6594
 [8] 1249.9841 1110.9790 1000.0408  909.1430  833.3409  769.1818  714.2486
[15]  666.6413  624.9357  588.3044  555.5487  526.3795  500.0021  476.2695
[22]  454.5702  434.7949  416.6470  399.9255  384.5739  370.3412  357.1366
[29]  344.8375  333.2978  322.5507  312.5258  303.0307  294.1573  285.7368
[36]  277.8168  270.2709  263.1612  256.3872  249.9905

• It’s hard to read 250 files with that stupid leading column. UNIX tricks to the rescue!

sum(scan(pipe("cat out* | colrm 1 5"))) # works for small indexes only
sum(scan(pipe("cat out* | sed -e 's/.*]//'"))) # works for all index sizes

• Cha-ching!
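A pure-R alternative to the shell pipes (a sketch, assuming all 250 files match out.*): strip everything through the “]” on each line, then parse the remaining numbers:

vals <- unlist(lapply(list.files(pattern = "^out\\."), function(f) {
  scan(text = sub(".*\\]", "", readLines(f)), quiet = TRUE)   # drop the [NN] index column
}))
sum(vals)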

Page 21: Using ITaP clusters for large scale statistical analysis with R

Using all cores on a single node
• When running your job on a single core of a node shared with strangers, some may misbehave and use too much RAM or CPU. The solution is to request entire nodes and fill them with just your jobs, so you never share a node with anyone else.
• Job submission file should include:

#PBS -l nodes=1:ppn=8

• This forces PBS to schedule a node exclusively for you. If it runs a single R job, you are using just one core! Must use xargs or a similar trick to launch 8 simultaneous R jobs, and only submit 1/8th the jobs.

Page 22: Using ITaP clusters for large scale statistical analysis with R

All cores example one

#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
module add r
Rscript t3.R >out1 &
Rscript t3.R >out2 &
Rscript t3.R >out3 &
Rscript t3.R >out4 &
Rscript t3.R >out5 &
Rscript t3.R >out6 &
Rscript t3.R >out7 &
Rscript t3.R >out8 &
wait # do not exit until all 8 background jobs finish

Page 23: Using ITaP clusters for large scale statistical analysis with R

All cores example two

#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
module add r
xargs -d '\n' -n1 -P8 sh -c < batch1.sh

• Where batch1.sh contains (could be > 8 lines!):

Rscript t3.R >out1
Rscript t3.R >out2
Rscript t3.R >out3
Rscript t3.R >out4
Rscript t3.R >out5
Rscript t3.R >out6
Rscript t3.R >out7
Rscript t3.R >out8
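As on Page 5, batch1.sh itself can be generated from R rather than typed by hand (a one-line sketch):

write(sprintf("Rscript t3.R >out%d", 1:8), "batch1.sh")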

Page 24: Using ITaP clusters for large scale statistical analysis with R

Thanks!

• Thanks to Prof. Mark Daniel Ward for all his help with the examples used in this talk!
• URL for these notes:
• http://www.stat.purdue.edu/~dgc/cluster.pptx
• http://www.stat.purdue.edu/~dgc/cluster.pdf (copy and paste works poorly with PDF!)