When and How to Use Large-Scale Computing: CHTC and HTCondor
Lauren Michael, Research Computing Facilitator
Center for High Throughput Computing
STAT 692, November 15, 2013
Slide 2
Topics We'll Cover Today
  Why to Access Large-Scale Computing Resources
  CHTC Services and Campus-Shared Computing
  What is High-Throughput Computing (HTC)?
  What is HTCondor and How Do You Use It?
  Maximizing Computational Throughput
  How to Run R on Campus-Shared Resources
Slide 3
When should you use outside computing resources?
  1. Your computing work won't run at all on your computer(s) (insufficient RAM, disk, etc.)
  2. Your computing work will take too long on your own computer(s)
  3. You would like to off-load certain processes in favor of running others on your computer(s)
Slide 4
CHTC Services
  Center for High Throughput Computing, est. 2006
  Large-scale, campus-shared computing systems
    high-throughput computing (HTC) grid and high-performance computing (HPC) cluster
    all standard services provided free-of-charge
    automatic access to the national Open Science Grid (OSG)
    hardware buy-in options for priority access
    information about other computing resources
  Support for using our systems
    consultation services, training, and proposal assistance
    solutions for numerous software packages (including Python, Matlab, R)
Slide 5
HTCondor: CHTC's R&D Arm
  R&D for HTCondor and other HTC software
  Services provided to the campus community
    HTC Software
      HTCondor: manage your compute cluster
      DAGMan: manage computing workflows
      Bosco: submit locally, run globally
    Software Engineering Expertise & Consulting
    CHTC-operated Build-and-Test Lab (BaTLab)
    Software Security Consulting
  Your Problems become Our Research!
Slide 6
Quick Facts (Jul10-Jun11, Jul11-Jun12, Jul12-Jun13)
  Million Hours Served: 45, 70, 97
  Research Projects: 54, 106, 120
  Departments: 35, 52
  Off-Campus Projects: 10, 13, 15
  Researchers who use the CHTC are located all over campus (red buildings)
  http://chtc.cs.wisc.edu
Slide 7
CHTC Staff
  Director: Miron Livny, [email protected] (also OSG Technical Director and WID's CTO)
  Campus Support: [email protected]
    2+ Research Computing Facilitators; Lauren Michael (lead), [email protected]
    3 Systems Administrators
    +4-8 Part-time Students
  HTCondor Development Team
  OSG Software Team
Slide 8
HTC versus HPC
  high-throughput computing (HTC)
    many independent processes that can run on 1 or a few processors (cores or threads) on the same computer
    mostly standard programming methods
    best accelerated by: access to as many cores as possible
  high-performance computing (HPC)
    sharing the workload of interdependent processes over multiple cores to reduce overall compute time
    OpenMP and MPI programming methods, or multi-threading
    requires: access to many servers of cores within the same tightly-networked cluster; access to shared files
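To see why HTC is "best accelerated by access to as many cores as possible", it can help to picture independent jobs running in waves across the available cores. The sketch below is a simplified model, not something from the slides: the function name is made up, and it assumes equal-length single-core jobs with no queue wait, interruptions, or transfer time.

```python
import math

def estimated_wall_hours(n_jobs, hours_per_job, n_cores):
    """Idealized wall-clock time for independent, equal-length, single-core jobs.

    Assumes no queue wait, no interruptions, and no data-transfer overhead.
    """
    if n_jobs < 0 or n_cores < 1:
        raise ValueError("n_jobs must be >= 0 and n_cores must be >= 1")
    waves = math.ceil(n_jobs / n_cores)  # jobs run in waves of n_cores at a time
    return waves * hours_per_job

# 1,000 two-hour jobs on one core versus 250 cores
print(estimated_wall_hours(1000, 2, 1))    # 2000
print(estimated_wall_hours(1000, 2, 250))  # 8
```

Under these assumptions the same work drops from 2000 hours to 8, which is the kind of gain HTC aims for; real throughput will be lower because cores are shared and jobs vary in length.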
Slide 9
"parallel" is confusing
  essentially means: spread computing work out over multiple processors
  Use of the words "parallel" and "parallelize" can apply to HTC or HPC when referring to programs
  It's important to be clear!
Slide 11
What is HTCondor?
  match-maker of computing work and computers
  job scheduler
    matches are made based upon necessary RAM, CPUs, disk space, etc., as requested by the user
    jobs are re-run if interrupted
  works beyond clusters to coordinate distributed computers for maximum throughput
    coordinates data transfers between users and distributed computers
    can coordinate servers, desktops, and laptops
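The match-making idea can be illustrated with a toy example. This is only a sketch of matching on requested CPUs, memory, and disk; it is not HTCondor's actual negotiation algorithm, which also weighs priorities and other ClassAd attributes. All class names, function names, and machine values here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    cpus: int
    memory_mb: int
    disk_kb: int

@dataclass
class Job:
    job_id: str
    request_cpus: int
    request_memory_mb: int
    request_disk_kb: int

def fits(job, machine):
    """True if the machine offers at least what the job requests."""
    return (machine.cpus >= job.request_cpus
            and machine.memory_mb >= job.request_memory_mb
            and machine.disk_kb >= job.request_disk_kb)

def match(jobs, machines):
    """Greedy first-fit: give each job the first free machine that fits it."""
    free = list(machines)
    assignments = {}
    for job in jobs:
        for machine in free:
            if fits(job, machine):
                assignments[job.job_id] = machine.name
                free.remove(machine)  # one job per machine in this toy model
                break
    return assignments

machines = [
    Machine("desktop", 1, 512, 200_000),
    Machine("server", 8, 16_384, 5_000_000),
]
jobs = [
    Job("1.0", 1, 100, 100_000),
    Job("1.1", 1, 1_000, 100_000),
    Job("1.2", 1, 100, 100_000),
]
print(match(jobs, machines))
```

Job 1.2 stays unmatched in this run because both machines are taken; in a real pool it would simply wait in the queue until a suitable slot frees up.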
Slide 12
How HTCondor Works
  [Diagram labels]
  Submit Node(s) (where jobs are submitted): Queue holding job1.1 (user1), job1.2 (user1), job2.1 (user2); Job ClassAd
  Central Manager (of the pool)
  Execute Node(s) (where jobs run): Machine ClassAd
  input and output are transferred between the submit and execute nodes
Slide 14
Submit nodes available to YOU
  Submit host           CS Pool   CHTC Pool   Campus Grid   Open Science Grid
  Stat dept servers     default
  simon.stat.wisc.edu             default
  CHTC submit nodes               default     flocking      glidein
Slide 15
Basic HTCondor Submission
  Prepare programs and files
  Write submit file(s)
  Submit jobs to the queue
  Monitor the jobs
  (Remove bad jobs)
Slide 16
Preparing Programs and Files
  Make programs portable
    compile code to a simple binary
    statically link code dependencies
    consider CHTC's tools for packaging Matlab, Python, and R
  Consider using a shell script (or other wrapper) to run multiple commands for you
    create a local install of software
    set environment variables
    then, run your code
  Stage all files on a submit node
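The slide suggests a shell script "or other wrapper". As one possible illustration of the "set environment variables, then run your code" steps, here is a minimal Python wrapper sketch. The function name, the variable MY_SOFTWARE_HOME, and the "software" folder are hypothetical stand-ins, and the command being run is just a placeholder that prints the variable.

```python
import os
import subprocess
import sys

def run_with_environment(command, extra_env):
    """Run `command` (a list of strings) with extra environment variables set.

    Returns the command's exit code so the caller can pass it along.
    """
    env = dict(os.environ)   # start from the current environment
    env.update(extra_env)    # add or override job-specific settings
    completed = subprocess.run(command, env=env)
    return completed.returncode

# Hypothetical example: point a program at a software install that was
# unpacked in the job's working directory.
local_install = os.path.join(os.getcwd(), "software")
exit_code = run_with_environment(
    [sys.executable, "-c", "import os; print(os.environ['MY_SOFTWARE_HOME'])"],
    {"MY_SOFTWARE_HOME": local_install},
)
print("exit code:", exit_code)
```

Returning the exit code matters because the scheduler only sees the wrapper's result; a wrapper that hides a failure makes a bad job look like a good one.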
Slide 17
HTC Components
  1. Cut up computing work into many independent pieces (CHTC can consult)
  2. Make programs portable, minimize dependencies (CHTC can consult, or may have prepared solutions)
  3. Learn how to submit jobs (CHTC can help you a lot!)
  4. Maximize your overall throughput on available computational resources (CHTC can help you a lot!)
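Step 1, cutting work into independent pieces, often amounts to splitting a list of inputs into chunks that each become one job. The sketch below assumes a hypothetical list of simulation seeds and a made-up helper name; how to split real work depends on the problem.

```python
def split_into_jobs(items, items_per_job):
    """Split a list of work items into consecutive, independent chunks."""
    if items_per_job < 1:
        raise ValueError("items_per_job must be at least 1")
    return [items[i:i + items_per_job] for i in range(0, len(items), items_per_job)]

seeds = list(range(1, 11))             # hypothetical: 10 simulation seeds
job_chunks = split_into_jobs(seeds, 4) # each chunk becomes one job's input
for job_number, chunk in enumerate(job_chunks):
    print(job_number, chunk)
```

This only works when the pieces do not depend on each other's results; work with such dependencies is the HPC case described earlier.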
Slide 18
Basic HTCondor Submit File
  # This is a comment
  universe = vanilla
  output = process.out
  error = process.err
  log = process.log
  executable = cosmos
  arguments = cosmos.in 4
  should_transfer_files = YES
  transfer_input_files = cosmos.in
  when_to_transfer_output = ON_EXIT
  request_memory = 100
  request_disk = 100000
  request_cpus = 1
  queue

  basic jobs are "vanilla" universe
  executable is your single program or a shell script
  log is where HTCondor stores info about how your job ran
  output and error are where system output and error will go
  The program will be run as: ./cosmos cosmos.in 4
  queue with no number after it will submit only one job
  memory in MB and disk in KB
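When many similar submit files are needed, they can be generated from a small script. The following is a minimal sketch that reproduces only the submit commands shown on the slide; the function name and its parameters are assumptions for illustration, and it writes a single-job file (a bare queue line).

```python
def render_submit_file(executable, arguments, input_files, memory_mb, disk_kb, cpus=1):
    """Return the text of a vanilla-universe submit file like the one on the slide."""
    lines = [
        "universe = vanilla",
        "output = process.out",
        "error = process.err",
        "log = process.log",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "should_transfer_files = YES",
        f"transfer_input_files = {','.join(input_files)}",
        "when_to_transfer_output = ON_EXIT",
        f"request_memory = {memory_mb}",  # memory is given in MB
        f"request_disk = {disk_kb}",      # disk is given in KB
        f"request_cpus = {cpus}",
        "queue",                          # no number: submit one job
    ]
    return "\n".join(lines) + "\n"

submit_text = render_submit_file("cosmos", "cosmos.in 4", ["cosmos.in"], 100, 100000)
print(submit_text)
```

Naming the parameters memory_mb and disk_kb keeps the units visible at the call site, which is where the MB versus KB mix-up usually happens.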
Slide 22
Submitting Jobs
  [lmichael@simon test]$ condor_submit submit.txt
  Submitting job(s)...
  3 job(s) submitted to cluster 29747.
  [lmichael@simon test]$
Slide 23
Checking the Queue
  [lmichael@simon test]$ condor_q lmichael
  -- Submitter: simon.stat.wisc.edu : : simon.stat.wisc.edu
   ID       OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE CMD
   29747.0  lmichael  2/15 09:06   0+00:01:34 R  0   9.8  cosmos cosmos.in
   29747.1  lmichael  2/15 09:06   0+00:00:00 I  0   9.8  cosmos cosmos.in
   29747.2  lmichael  2/15 09:06   0+00:00:00 I  0   9.8  cosmos cosmos.in
  3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended
  [lmichael@simon test]$

  View all user jobs in the queue: condor_q
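For scripted monitoring, the summary line at the end of the condor_q output can be turned into counts. This sketch assumes the summary format shown on the slide; other HTCondor versions may print it differently, and the function name is made up.

```python
import re

def parse_queue_summary(line):
    """Turn a condor_q summary line into a dict of label -> count."""
    counts = {}
    for number, label in re.findall(r"(\d+)\s+([a-z]+)", line):
        counts[label] = int(number)
    return counts

summary = "3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended"
counts = parse_queue_summary(summary)
print(counts["running"], "running,", counts["idle"], "idle")
```

A nonzero "held" count is usually the first thing to check, since held jobs will not run until the problem is fixed.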
Slide 24
Log Files
  000 (29747.001.000) 02/15 09:29:17 Job submitted from host:...
  001 (29747.001.000) 02/15 09:33:59 Job executing on host:...
  005 (29747.001.000) 02/15 09:39:01 Job terminated.
      (1) Normal termination (return value 0)
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
      0  -  Total Bytes Sent By Job
      0  -  Total Bytes Received By Job
      Partitionable Resources :    Usage  Request  Allocated
         Cpus                 :                 1          1
         Disk (KB)            :   225624   100000     645674
         Memory (MB)          :       85     1000       1024
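The resource table at the end of the log is what you read to tune requests. As a sketch, the rows can be pulled out with a regular expression; the excerpt layout below is reconstructed from the slide, so the exact spacing is an assumption, and the function name is made up.

```python
import re

LOG_EXCERPT = """\
Partitionable Resources :    Usage  Request  Allocated
   Cpus                 :                 1          1
   Disk (KB)            :   225624   100000     645674
   Memory (MB)          :       85     1000       1024
"""

ROW = re.compile(r"^\s*(Disk \(KB\)|Memory \(MB\))\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_resources(log_text):
    """Return usage, request, and allocation for the disk and memory rows."""
    rows = {}
    for line in log_text.splitlines():
        found = ROW.match(line)
        if found:
            name, usage, request, allocated = found.groups()
            rows[name] = {
                "usage": int(usage),
                "request": int(request),
                "allocated": int(allocated),
            }
    return rows

resources = parse_resources(LOG_EXCERPT)
for name, values in resources.items():
    print(name, values)
```

In this excerpt the job used 225624 KB of disk against a request of 100000 KB, and 85 MB of memory against a request of 1000 MB, so the disk request was too low and the memory request much higher than needed.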
Slide 25
Removing Jobs
  Remove a single job: condor_rm 29747.0
  Remove all jobs of a cluster: condor_rm 29747
  Remove all of your jobs: condor_rm lmichael
Slide 27
Maximizing Throughput
  The Philosophy of HTC
  The Art of HTC
  Other Best-Practices
Slide 28
The Philosophy of HTC
  break up your work into many smaller jobs
    single CPU, short run times, small input/output data
  run on as many processors as possible
    single CPU and low RAM needs
    take everything with you; make programs portable
    use the right submit node for the right resources
  automate as much as you can
  (share your processors with others to increase everyone's throughput)
Slide 29
Success Stories
  Edgar Spalding: studies the effect of genes on plant growth outcomes
  GeoDeepDive Project: extracts and compiles "dark data" from PDFs of publications in Geosciences
  We want HTC to revolutionize your research!
Slide 30
The Art of HTC
  carrying out the philosophy, well
  Tuning job requests for memory and disk
  Matching run times to the maximum number of available processors
  Automation
Slide 31
Tuning Job Resource Requests
  Problem: Don't know what your job needs?
  If you don't ask for enough memory and disk:
    Your jobs will be kicked off for going over, and will have to be retried (though HTCondor will automatically request more for you)
  If you ask for too much:
    Your jobs won't match to as many available slots as they could
Slide 32
Tuning Job Resource Requests
  Solution: Testing is Key!!!
  1. Run just a few jobs at first to determine memory and disk needs from log files
     If your first request is not enough, HTCondor will retry the jobs and request more until they finish.
     It's okay to request a lot (1 GB each) for a few tests.
  2. Change the request lines to a better value
  3. Submit a large batch
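Step 2, choosing "a better value", can be done by taking the largest usage seen in the test jobs and adding a margin. The sketch below is one possible rule, not a CHTC recommendation: the 20% headroom, the rounding steps, the function name, and the test-job usage numbers are all hypothetical.

```python
def suggest_request(observed_usages, headroom_percent=20, round_to=1):
    """Largest observed usage plus a margin, rounded up to a multiple of round_to.

    Expects integer usages (for example MB of memory or KB of disk).
    """
    if not observed_usages:
        raise ValueError("need at least one observed usage")
    peak = max(observed_usages)
    numerator = peak * (100 + headroom_percent)
    denominator = 100 * round_to
    multiples = -(-numerator // denominator)  # ceiling division on integers
    return multiples * round_to

# Hypothetical usage values read from the logs of three test jobs
memory_mb_used = [85, 90, 78]
disk_kb_used = [225624, 230100, 219870]

print("request_memory =", suggest_request(memory_mb_used, round_to=50))
print("request_disk =", suggest_request(disk_kb_used, round_to=10000))
```

With these numbers the suggestion is 150 MB of memory and 280000 KB of disk: enough to avoid being kicked off for going over, without asking for so much that fewer slots match.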
Slide 33
Submit host    CS Pool (4 hrs?)    CHTC Pool
Slide 37
Non-Throughput Considerations
  Remember that you are sharing with others
  "Be Kind to Your Submit Node"
    avoid transfers of large files through the submit node (large: >10 GB per batch; ~10 MB/job x 1000+ jobs)
      transfer files from another server as part of your job (wget and curl)
      compress where appropriate; delete unnecessary files
      remember: new files are copied back to submit nodes
    avoid running multiple CPU-intensive executables
  Test all new batches, and scale up gradually
    3 jobs, then 100s, then 1000s, then ...
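The "large" threshold is simple arithmetic: about 10 MB per job times 1000 jobs is about 10 GB per batch. A small sketch of that check follows; the function names are made up, and it uses 1 GB = 1000 MB to match the slide's round numbers.

```python
def batch_transfer_gb(mb_per_job, n_jobs):
    """Total data moved through the submit node for a batch, in GB (1 GB = 1000 MB)."""
    return mb_per_job * n_jobs / 1000

def is_large_batch(mb_per_job, n_jobs, threshold_gb=10):
    """Apply the slide's rule of thumb: about 10 GB or more per batch counts as large."""
    return batch_transfer_gb(mb_per_job, n_jobs) >= threshold_gb

print(batch_transfer_gb(10, 1000))  # 10.0
print(is_large_batch(10, 1000))     # True
print(is_large_batch(1, 1000))      # False
```

Remember to count output as well as input, since new files are copied back to the submit node when jobs finish.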
Slide 39
Running R on HTC Resources: The Best Way
  Problem: R programs don't easily compile to a binary
  Solution: Take R with your job!
  CHTC has tools just for R (and Python, and Matlab)
    Installed on CS/Stat submit nodes, simon, and CHTC submit nodes
Slide 41
1. Build R Code with chtc_buildRlibs
  Copy your R code and any R library tar.gz files to the submit node
  Run the following command:
    chtc_buildRlibs --rversion=sl5-R-2.10.1 \
      --.tar.gz,.tar.gz
  R versions supported: 2.10.1, 2.13.1, 2.15.1 (use the closest version below yours)
  Get back sl5-RLIBS.tar.gz and sl6-RLIBS.tar.gz (you'll use these in the next step)
Slide 43
2. Download the ChtcRun Package
  download ChtcRun.tar.gz, according to the guide (wget)
  un-tar it: tar xzf ChtcRun.tar.gz
  View ChtcRun contents:
    process.template (submit file template)
    mkdag (script that will create jobs based upon your staged data)
    Rin/ (example data staging folder)
Slide 44
3. Prepare data and process.template
  Stage data as such:
    ChtcRun/
      data/
        1/
          input.in
        2/
          input.in
        job3/
          input.in
        test4/
          input.in
        shared/
          .R
  Modify process.template with respect to:
    request_memory and request_disk, if you know
    +WantFlocking = true OR +WantGlidein = true
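Staging one folder per job plus a shared folder is easy to automate. The sketch below builds that layout in a temporary directory so it can be tried safely; the function name, the input contents, and the shared file name analysis.R are hypothetical placeholders, not names from the ChtcRun package.

```python
from pathlib import Path
import tempfile

def stage_jobs(root, job_inputs, shared_files):
    """Create root/<job name>/input.in for each job, plus root/shared/ for common files."""
    root = Path(root)
    for job_name, text in job_inputs.items():
        job_dir = root / job_name
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "input.in").write_text(text)
    shared_dir = root / "shared"
    shared_dir.mkdir(parents=True, exist_ok=True)
    for file_name, text in shared_files.items():
        (shared_dir / file_name).write_text(text)
    # Report what was staged, as paths relative to the data folder
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())

with tempfile.TemporaryDirectory() as tmp:
    staged = stage_jobs(
        Path(tmp) / "data",
        {"1": "seed=1\n", "2": "seed=2\n", "job3": "seed=3\n", "test4": "seed=4\n"},
        {"analysis.R": "# hypothetical shared R script\n"},
    )
    print(staged)
```

As the slide's example shows, job folder names do not need to follow a pattern; each folder simply holds the input for one job.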
Slide 45
4. Run mkdag and submit jobs
  In ChtcRun, execute the mkdag script (examples at the top of ./mkdag --help):
    ./mkdag --data=Rin --outputdir=Rout \
      --cmdtorun=soartest.R --type=R \
      --version=R-2.10.1 --pattern=meanx
  "pattern" indicates a portion of a filename that you expect to be created by successful completion of any single job
  A successful mkdag run will instruct you to navigate to the outputdir, and submit the jobs as a single DAG: condor_submit_dag mydag.dag
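The idea behind the pattern option, that a job counts as successful when it produced a file whose name contains the pattern, can be checked by hand after a run. This sketch only illustrates that idea and is not how mkdag or DAGMan performs the check; the function name and the file name meanx.txt are hypothetical.

```python
from pathlib import Path
import tempfile

def unfinished_jobs(output_dir, pattern):
    """Return names of job folders that contain no file whose name includes `pattern`."""
    missing = []
    for job_dir in sorted(Path(output_dir).iterdir()):
        if not job_dir.is_dir():
            continue
        has_output = any(pattern in f.name for f in job_dir.iterdir() if f.is_file())
        if not has_output:
            missing.append(job_dir.name)
    return missing

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "1").mkdir()
    (out / "1" / "meanx.txt").write_text("0.5\n")  # job 1 produced its result
    (out / "2").mkdir()
    (out / "2" / "process.log").write_text("")     # job 2 has only a log so far
    still_missing = unfinished_jobs(out, "meanx")
    print(still_missing)
```

Choosing a pattern that only appears in a finished result file is what makes the check meaningful; a pattern that also matches log files would mark every job as done.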
Slide 46
5. Monitor Job Completion
  Check jobs in the queue as they're gradually added and completed (condor_q)
  Check other files in your outputdir:
    Rout/
      mydag.dag.dagman.out (updated table of job stats)
      1/
        process.log
        process.out,err
        ChtcWrapper1.out
      2/
        process.log
        process.out,err
        ChtcWrapper2.out
      /
  After testing a small number of jobs, submit many! (up to many 10,000s; # submitted is throttled for you)
Slide 47
What Next?
  1. Use a Stat server to submit shorter jobs to the CS pool.
  2. Obtain access to simon.stat.wisc.edu from Mike Camilleri ([email protected]), and submit longer jobs to the CHTC [email protected]
  3. Meet with the CHTC to submit jobs to the entire UW Grid and to the national Open Science Grid.
     chtc.cs.wisc.edu, click "Get Started"
  User support for HTCondor users at UW: [email protected]