Using ITaP clusters for large scale statistical analysis with R

Doug Crabill, Purdue University
Topics

• Running multiple R jobs on departmental Linux servers serially, and in parallel
• Cluster concepts and terms
• Use cluster to run the same R program many times
• Use cluster to run the same R program many times with different parameters
• Running jobs on an entire cluster node
Invoking R in batch mode

• R CMD BATCH vs Rscript
• Rscript t.R > t.out
• Rscript t.R > t.out &            # run in background
• nohup Rscript t.R > t.out &      # run in background, even after logout
• Can launch several such jobs simultaneously with &'s, but best to stick with between 2 and 8 jobs per departmental server.
• Invoking jobs manually doesn't scale well for running dozens of jobs
Launching several R jobs serially

• Create a file like "run.sh" that contains several Rscript invocations. The first job will run until it completes, then the second job will run, etc.
• Do NOT use "&" at the end of each line, or it could crash the server

Rscript t.R > tout.001
Rscript t.R > tout.002
Rscript t.R > tout.003
Rscript t.R > tout.004
Rscript t.R > tout.005
Rscript t.R > tout.006
Rscript t.R > tout.007
Rscript t.R > tout.008
Creating the "run.sh" script programmatically

• Extra cool points for creating the "run.sh" script using R

> sprintf("Rscript t.R > tout.%03d", 1:8)
[1] "Rscript t.R > tout.001" "Rscript t.R > tout.002" "Rscript t.R > tout.003"
[4] "Rscript t.R > tout.004" "Rscript t.R > tout.005" "Rscript t.R > tout.006"
[7] "Rscript t.R > tout.007" "Rscript t.R > tout.008"
> write(sprintf("Rscript t.R > tout.%03d", 1:8), "run.sh")
Invoking "run.sh"

• sh run.sh           # run every job in run.sh, one at a time
• nohup sh run.sh &   # run every job in run.sh one at a time, and keep running even after logout
• nohup xargs -d '\n' -n1 -P4 sh -c < run.sh &   # run every job in run.sh, keeping 4 jobs running simultaneously; keeps running even after logout
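The xargs recipe can be tried safely with toy jobs before pointing it at real Rscript invocations. A minimal sketch, with echo commands standing in for Rscript, and run_toy.sh / toy.NNN as made-up names for illustration:

```shell
# Build a toy run.sh: four one-line jobs, echo standing in for Rscript.
printf 'echo job%d > toy.%03d\n' 1 1 2 2 3 3 4 4 > run_toy.sh
# -d '\n' = one job per line; -n1 = one line per sh; -P2 = 2 jobs at a time.
xargs -d '\n' -n1 -P2 sh -c < run_toy.sh
cat toy.001 toy.004    # prints: job1, then job4
```

Note that -P controls only how many jobs run at once; all lines of the file still run exactly once each.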
Supercomputers and clusters

• Supercomputer = a collection of computer nodes in a cluster managed by a scheduler
• Node = one computer in a cluster (dozens to hundreds per cluster)
• Core = one CPU core in a node (often 8 to 48 cores per node)
• Front end = one or more computers used for launching jobs on the cluster
• PBS / Torque is the scheduling software. PBS is like the maître d', seating groups of various sizes for varying times at available tables, with a waiting list, a bar, reservations, and bad customers that spoil the party
ITaP / RCAC clusters

• Conte – was the fastest supercomputer on any academic campus in the world when built in June 2013. Has Intel Phi coprocessors
• Carter – has NVIDIA GPU-accelerated nodes
• Hansen, Rossmann, Coates
• Scholar – uses part of Carter, for instructional use
• Hathi – Hadoop
• Radon – accessible by all researchers on campus
• Anecdotes… <cue patriotic music>
More info on the radon cluster

• https://www.rcac.purdue.edu/ then select Computation -> Radon
• Read the User's Guide in the left sidebar
Sub-Cluster | Number of Nodes | Processors per Node                | Cores per Node | Memory per Node | Interconnect | Theoretical Peak TeraFLOPS
Radon-D     | 30              | Two 2.33 GHz Quad-Core Intel E5410 | 8              | 16 GB           | 1 GigE       | 58.2
Logging into radon

• Get accounts on the RCAC website previously mentioned, or ask me
• Make an SSH connection to radon.rcac.purdue.edu
• From Linux (or a Mac terminal), type this to log into one of the cluster front ends (as user dgc):
• ssh -X radon.rcac.purdue.edu -l dgc
• Do not run big jobs on the front ends! They are only for submitting jobs to the cluster and light testing and debugging
File storage on radon

• Home directory quota is ~10GB (type "myquota")
• Can be increased to 100GB via Boiler Backpack settings at http://www.purdue.edu/boilerbackpack
• Scratch storage of around 1TB per user. This directory differs per user, and is accessible via the $RCAC_SCRATCH environment variable:
• All nodes can see all files in home and scratch

radon-fe01 ~ $ cd $RCAC_SCRATCH
radon-fe01 /scratch/radon/d/dgc $
Software on radon

• A list of applications installed on radon can be found in the User's Guide previously mentioned
• The module command is used to "load" software packages for use by the current login session
• module avail    # See the list of available applications
• module load r   # Add "R"
• "module load r" must be included as part of every R job to be run on the cluster
PBS scheduler commands

• qstat              # See the list of jobs in the queue
• qstat -u dgc       # See the list of jobs in the queue submitted by dgc
• qsub jobname.sh    # Submit jobname.sh to run on the cluster
• qdel JOBIDNUMBER   # Delete a previously submitted job from the queue
Simple qsub submission file

• qsub accepts command line arguments, or embedded #PBS comments that are ignored by the shell but honored by qsub.
• The JOBID of this particular job is 683369

radon-fe01 ~/cluster $ cat myjob.sh
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
/bin/hostname
radon-fe01 ~/cluster $ qsub myjob.sh
683369.radon-adm.rcac.purdue.edu
Viewing status and the results

• Use qstat or qstat -u dgc to check job status
• Output of job 683369 goes to myjob.sh.o683369
• Errors from job 683369 go to myjob.sh.e683369
• It is inconvenient to collect results from a dynamically named file like myjob.sh.o683369. Best to write output to a filename of your choosing, either directly in your R program or by redirecting output to a file in your job submission file
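The redirect-to-a-chosen-filename advice can be sketched with a toy job script body. Here /bin/hostname stands in for the real Rscript invocation, and out.mine / err.mine are made-up filenames for illustration:

```shell
# Redirect inside the job script itself, so results land in a file you chose
# (out.mine) rather than the scheduler-named myjob.sh.o683369.
/bin/hostname > out.mine 2> err.mine   # stand-in for: Rscript t.R
test -s out.mine && echo "captured"    # prints: captured
```

With this pattern the scheduler's .oJOBID / .eJOBID files stay empty and all real output is in files you can name, glob, and collect predictably.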
Our first R job submission

• Say we want to run the R program t.R on radon, where t.R contains:

summary(1 + rgeom(10^7, 1/1000))

• Create R1.sh with contents:

#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
module add r
Rscript t.R > out1

• Submit using qsub R1.sh
Let's do that 100 times

• Using our "R1.sh" file as a template, create files prog001.sh through prog100.sh, changing the output file for each job to out.NNN. In R:

s <- scan("R1.sh", what='c', sep="\n")
sapply(1:100, function(i) { s[6] <- sprintf("Rscript t.R > out.%03d", i); write(s, sprintf("prog%03d.sh", i)) })
write(sprintf("qsub prog%03d.sh", 1:100), "runall.sh")

• Submit all 100 jobs by typing sh -x runall.sh
• Generating the files using bash instead (all on one line):

for i in `seq -w 1 100`; do (head -6 R1.sh; echo "Rscript t.R > out.$i") > prog$i.sh; echo "qsub prog$i.sh"; done > runall.sh
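The seq -w in that loop does the zero-padding that makes prog001.sh through prog100.sh (and out.001 through out.100) sort correctly. A quick check of what it emits:

```shell
# -w equalizes widths by zero-padding, so 1..100 become 001..100.
seq -w 1 100 | head -3    # prints: 001, 002, 003
```

Without -w, out.1 and out.10 would interleave badly under shell globbing and ls.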
Coupon collector problem

• I want to solve the coupon collector problem with large parameters, but it will take much too long on a single computer (around 2.5 days):

sum(sapply(1:10000, function(y) {mean(1 + rgeom(10^8, y/10000))}))

• The obvious approach is to break it into 10,000 smaller R jobs and submit them to the cluster.
• Better to break it into 250 jobs, each operating on 40 numbers.
• Create an R script that accepts command line arguments to process many numbers at a time. Estimate walltime carefully!
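One way to follow the "estimate walltime carefully" advice is to time a scaled-down run and multiply. A rough sketch under assumed numbers (the 90-second measurement is hypothetical, not from the talk):

```shell
# Suppose one job at 10^5 reps took about 90 s (measured with `time`).
# The real jobs use 10^8 reps, i.e. 1000x the work per number.
small_secs=90
scale=1000
est_hours=$(( small_secs * scale / 3600 ))   # 90*1000/3600 = 25
echo "request walltime above ${est_hours} hours, plus a safety margin"
```

Requesting too little walltime kills the job mid-run; requesting far too much can delay scheduling, so a measured estimate plus a margin is the practical middle ground.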
Coupon collector R code

• t2.R reads its arguments into "args", then processes each:

args <- commandArgs(TRUE)
sapply(as.integer(args), function(y) {mean(1 + rgeom(10^8, y/10000))})

• Can test via:

Rscript t2.R 100 125 200   # Change reps from 10^8 to 10^5 for the test

• Generate 250 scripts with 40 arguments each:

s <- scan("R2.sh", what='c', sep="\n")
sapply(1:250, function(y) { s[6] <- sprintf("Rscript t2.R %s > out.%03d", paste((y*40-39):(y*40), collapse=" "), y); write(s, sprintf("prog%03d.sh", y)) })
write(sprintf("qsub prog%03d.sh", 1:250), "runall.sh")
Coupon collector results

• Output is in the files out.001 through out.250:

radon-fe00 ~/cluster/R2done $ cat out.001
 [1] 9999.8856 5000.4830 3333.0443 2499.8564 1999.7819 1666.2517 1428.6594
 [8] 1249.9841 1110.9790 1000.0408  909.1430  833.3409  769.1818  714.2486
[15]  666.6413  624.9357  588.3044  555.5487  526.3795  500.0021  476.2695
[22]  454.5702  434.7949  416.6470  399.9255  384.5739  370.3412  357.1366
[29]  344.8375  333.2978  322.5507  312.5258  303.0307  294.1573  285.7368
[36]  277.8168  270.2709  263.1612  256.3872  249.9905

• It's hard to read 250 files with that stupid leading column. UNIX tricks to the rescue!

sum(scan(pipe("cat out* | colrm 1 5")))         # works for small indexes only
sum(scan(pipe("cat out* | sed -e 's/.*]//'")))  # works for all index sizes

• Cha-ching!
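The sed trick can be checked on a single fabricated line mimicking the R output above: the pattern deletes everything through the closing "]", leaving only the numbers for scan() to read.

```shell
# Strip the leading "[n]" index column so only the numbers remain.
echo '[36]  277.8168  270.2709' | sed -e 's/.*]//'
# prints: "  277.8168  270.2709" (leading spaces kept; scan() ignores them)
```

This is why it works for all index sizes: unlike colrm, which removes a fixed number of columns, the regex adapts to however wide "[36]" or "[1036]" happens to be.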
Using all cores on a single node

• When running your job on a single core of a node shared with strangers, some may misbehave and use too much RAM or CPU. The solution is to request entire nodes and fill them with just your jobs, so you never share a node with anyone else.
• The job submission file should include:

#PBS -l nodes=1:ppn=8

• This forces PBS to schedule a node exclusively for you. If it runs a single R job, you are using just one core! Must use xargs or a similar trick to launch 8 simultaneous R jobs. Only submit 1/8th as many jobs.
All cores example one

#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
module add r
Rscript t3.R > out1 &
Rscript t3.R > out2 &
Rscript t3.R > out3 &
Rscript t3.R > out4 &
Rscript t3.R > out5 &
Rscript t3.R > out6 &
Rscript t3.R > out7 &
Rscript t3.R > out8 &
wait
All cores example two

#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
module add r
xargs -d '\n' -n1 -P8 sh -c < batch1.sh

• Where batch1.sh contains (could be more than 8 lines!):

Rscript t3.R > out1
Rscript t3.R > out2
Rscript t3.R > out3
Rscript t3.R > out4
Rscript t3.R > out5
Rscript t3.R > out6
Rscript t3.R > out7
Rscript t3.R > out8
Thanks!

• Thanks to Prof. Mark Daniel Ward for all his help with the examples used in this talk!
• URL for these notes:
• http://www.stat.purdue.edu/~dgc/cluster.pptx
• http://www.stat.purdue.edu/~dgc/cluster.pdf (copy and paste works poorly with PDF!)