
Ian C. Smith*

Introduction to research computing using the High Performance Computing

facilities and Condor

*Advanced Research Computing

University of Liverpool

Overview

Introduction to research computing

High Performance Computing (HPC)

HPC facilities at Liverpool

High Throughput Computing using Condor

University of Liverpool Condor Service

Some research computing examples using Condor

Next steps

What’s special about research computing ?

Often researchers need to tackle problems which are far too demanding for a typical PC or laptop computer

Programs may take too long to run or …

require too much memory or …

too much storage (disk space) or …

all of these !

Special computer systems and programming methods can help overcome these barriers

Speeding things up

Key to reducing run times is parallelism - splitting large problems into smaller tasks which can be tackled at the same time (i.e. "in parallel" or "concurrently")

Two main types of parallelism:

data parallelism

functional parallelism (pipelining)

Tasks may be independent or inter-dependent (this eventually limits the speed up which can be achieved)

Fortunately many problems in medical/bio/life science exhibit data parallelism and tasks can be performed independently

This can lead to very significant speed ups !

Some sources of parallelism

Analysing patient data from clinical trials

Repeating calculations with different random numbers e.g. bootstrapping and Monte Carlo methods

Dividing sequence data by chromosome

Splitting chromosome sequences into smaller parts

Partitioning large BLAST (or other sequence) databases and/or query files (see the splitting sketch below)
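As an illustration of this kind of data parallelism, the sketch below splits a large FASTA query file into eight roughly equal parts so that each part can be searched by a separate job. The file names (query_complete.fasta, query_part_*.fasta) are invented for the example, and this is not necessarily how the files in the BLAST walkthrough later on were produced; the point is that whole records are kept intact by only switching output files at '>' header lines.

# split query_complete.fasta into 8 parts, assigning whole records
# (header line plus its sequence lines) to the parts in round-robin order
awk -v parts=8 '
    /^>/ { out = "query_part_" ((n++ % parts) + 1) ".fasta" }
         { print > out }
' query_complete.fasta

Each query_part_N.fasta can then be handed to its own job, much as in the parallel BLAST example later on.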

High Performance Computing (HPC)

Uses powerful special purpose systems called HPC clusters

Contain large numbers of processors acting in parallel

Each processor may contain multiple processing elements (cores) which can also work in parallel

Provide lots of memory and large amounts of fast (parallel) disk storage – ideal for data-intensive applications

Almost all clusters run the UNIX operating system

Typically run parallel programs containing inter-dependent tasks (e.g. finite element analysis codes) but also suitable for biostatistics and bioinformatics applications

HPC cluster hardware (architecture)

[Diagram: several compute nodes and a head node connected through a network switch; a high speed internal network links the nodes to a shared parallel filestore; a standard (ethernet) network and the head node provide the connection to the outside world]

Typical cluster use

[Diagram sequence (1)-(6): (1) log in to the head node from outside and upload the input data files to the parallel filestore; (2) submit jobs (programs) from the head node; (3) the compute nodes read the input data from the parallel filestore; (4) the compute nodes process the data in parallel, with task synchronisation only if needed; (5) the output data (results) are written back to the parallel filestore; (6) log in to the head node again and download the results (output data files)]

Parallel BLAST example

login as: ian
ian@bioinf1's password:
Last login: Tue Feb 24 14:45:31 2015 from uxa.liv.ac.uk
[ian@bioinf1 ~]$ cd /users/ian/chris/perl        #change folder
[ian@bioinf1 perl]$ ls -lh farisraw_*.fasta      #list files
-rw-r--r-- 1 ian ph 1.2G Feb 23 12:25 farisraw_1.fasta
-rw-r--r-- 1 ian ph 1.4G Feb 23 12:25 farisraw_2.fasta
-rw-r--r-- 1 ian ph 1.2G Feb 23 12:26 farisraw_3.fasta
-rw-r--r-- 1 ian ph 1.3G Feb 23 12:26 farisraw_4.fasta
-rw-r--r-- 1 ian ph 1.2G Feb 23 12:27 farisraw_5.fasta
-rw-r--r-- 1 ian ph 1.3G Feb 23 12:27 farisraw_6.fasta
-rw-r--r-- 1 ian ph 1.2G Feb 23 12:28 farisraw_7.fasta
-rw-r--r-- 1 ian ph 1.2G Feb 23 12:24 farisraw_8.fasta
-rw-r--r-- 1 ian ph 9.6G Feb 23 11:18 farisraw_complete.fasta
[ian@bioinf1 perl]$

(farisraw_complete.fasta is the original query file; farisraw_1.fasta to farisraw_8.fasta are the partial query files it has been split into)

Parallel BLAST example (continued)

[ian@bioinf1 perl]$ cat blast.sub                #show file contents
#!/bin/bash
#$ -cwd -V
#$ -o stdout
#$ -e stderr
#$ -pe smp 8

blastn -query farisraw_${SGE_TASK_ID}.fasta \
       -db /users/ian/chris/faris/dbs/RM_allfarisgenes_linuxbldb \
       -out output${SGE_TASK_ID}.txt \
       -word_size 11 -evalue .00001 -culling_limit 1 -max_target_seqs 5 \
       -num_threads 8 \
       -outfmt "6 qseqid sseqid qlen length qstart qend sstart send mismatch \
       gaps qseq sseq pident evalue"
[ian@bioinf1 perl]$ qsub -t 1-8 blast.sub        #submit jobs
Your job-array 20164.1-8:1 ("blast.sub") has been submitted
[ian@bioinf1 perl]$

(blast.sub is the job file: the #$ lines are job options, with -pe smp 8 requesting all 8 cores on each compute node so that BLAST can use them in parallel; -query names the BLAST query file, -db the BLAST database and -out the output file; ${SGE_TASK_ID} takes on the values [1..8] when the jobs are submitted with qsub -t 1-8)

Parallel BLAST example (continued)

[ian@bioinf1 perl]$ qstat                        #show job status
job-ID  prior    name       user  state  submit/start at      queue                slots  ja-task-ID
-----------------------------------------------------------------------------------------------------
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp00.liv.ac.uk  8      1
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp07.liv.ac.uk  8      2
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp01.liv.ac.uk  8      3
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp04.liv.ac.uk  8      4
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp03.liv.ac.uk  8      5
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp02.liv.ac.uk  8      6
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp05.liv.ac.uk  8      7
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp06.liv.ac.uk  8      8

(the state "r" indicates that a job is running; the queue column gives the name of the compute node the job is running on)

Running qstat again later shows only three tasks still running, and once it produces no output all of the jobs have finished:

[ian@bioinf1 perl]$ qstat
job-ID  prior    name       user  state  submit/start at      queue                slots  ja-task-ID
-----------------------------------------------------------------------------------------------------
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp00.liv.ac.uk  8      1
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp07.liv.ac.uk  8      2
 20157  0.55500  blast.sub  ian   r      02/26/2015 14:32:49  bs@comp02.liv.ac.uk  8      6
[ian@bioinf1 perl]$ qstat
[ian@bioinf1 perl]$

Parallel BLAST example (continued)

[ian@bioinf1 perl]$ ls -lh output*.txt           #list output files
-rw-r--r-- 1 ian ph 45M Feb 26 14:38 output1.txt
-rw-r--r-- 1 ian ph 22M Feb 26 14:38 output2.txt
-rw-r--r-- 1 ian ph 59M Feb 26 14:38 output3.txt
-rw-r--r-- 1 ian ph 20M Feb 26 14:38 output4.txt
-rw-r--r-- 1 ian ph 28M Feb 26 14:38 output5.txt
-rw-r--r-- 1 ian ph 13M Feb 26 14:38 output6.txt
-rw-r--r-- 1 ian ph 30M Feb 26 14:38 output7.txt
-rw-r--r-- 1 ian ph 30M Feb 26 14:38 output8.txt
[ian@bioinf1 perl]$ cat output*.txt > output_complete.txt     #combine partial results files
[ian@bioinf1 perl]$ ls -lh output_complete.txt
-rw-r--r-- 1 ian ph 499M Feb 26 14:44 output_complete.txt
[ian@bioinf1 perl]$

(output1.txt to output8.txt are the partial output files; output_complete.txt holds the combined results)
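In practice it is worth a quick check that every task actually produced some output and that nothing was written to the error file before relying on the combined results. A minimal sketch, using the file names from the example above (output1.txt to output8.txt and the stderr file named in blast.sub):

# warn about any task whose output file is missing or empty
for i in $(seq 1 8); do
    [ -s output${i}.txt ] || echo "task ${i} produced no output"
done
# the job file directs error messages to a file called stderr - ideally it is empty
[ -s stderr ] && echo "warning: stderr is not empty - check it for BLAST errors"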

Some HPC clusters available at Liverpool

bioinf1

System bought by the Institute of Translational Medicine for use in biomedical research about 5 years ago

9 compute nodes each with 8 cores and 32 GB of memory (one node has 128 GB of memory)

76 TB of main (parallel) storage

chadwick

Main CSD HPC cluster for research use

118 nodes each with 16 cores and 64 GB memory (one node has 2 TB of memory)

Total of 135 TB of main (parallel) storage

Fast (40 GB/s) internal network

High Throughput Computing (HTC) using Condor

No dedicated hardware - uses ordinary classroom PCs to run jobs when they would otherwise be idle (usually evenings and weekends)

Jobs may be interrupted by users logging into Condor PCs – works best for short running jobs (10-20 minutes ideally)

Only suitable for applications which use independent tasks (inter-dependent tasks need to use HPC)

No shared storage – all data files must be transferred to/from the Condor PCs

Limited memory and disk space available since Condor uses only commodity PCs

However… Condor is well suited to many statistical and data-intensive applications !
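To give a flavour of how Condor jobs are described, below is a minimal sketch of a submit description file for a batch of 100 independent tasks. All the file names here (analyse.sub, analyse.bat, data_*.csv) are invented for illustration, and in practice the University service provides its own tools for packaging R and MATLAB work, so real submissions may look rather different.

# analyse.sub - hypothetical Condor submit description file
# each of the 100 jobs runs the (invented) script analyse.bat on one data partition;
# $(Process) expands to 0, 1, 2, ... 99, giving one value per job
universe                = vanilla
executable              = analyse.bat
arguments               = data_$(Process).csv
# there is no shared storage, so input files are shipped out to the Condor PC
# and the results are transferred back when the job exits
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data_$(Process).csv
output                  = job_$(Process).out
error                   = job_$(Process).err
log                     = analyse.log
queue 100

The whole batch would then be submitted from the server with condor_submit analyse.sub and its progress monitored with condor_q.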

A “typical” Condor pool

[Diagram sequence: a desktop PC is used to log in to the Condor server and upload the input data; the Condor server sends jobs out to the execute hosts (classroom PCs); the execute hosts return their results to the Condor server; the results are then downloaded back to the desktop PC]

University of Liverpool Condor Pool

Contains around 750 classroom PCs running the CSD Managed Windows 7 Service

Each PC can support a maximum of 4 jobs concurrently giving a theoretical capacity of 3000 parallel jobs

Typical spec: 3.3 GHz Intel i3 dual-core processor, 8 GB memory, 128 GB disk space

Tools are available to help in running large numbers of R and MATLAB jobs (other software may work but not commercial packages such as SAS and Stata)

Single job submission point for Condor jobs provided by powerful UNIX server

Service can also be accessed from a Windows PC/laptop using Desktop Condor (even from off-campus)

Desktop Condor (1)-(3)

[Image-only slides illustrating the Desktop Condor interface]

Personalised Medicine example

The project is an example of a Genome-Wide Association Study

aims to identify genetic predictors of response to anti-epileptic drugs

try to identify positions in the human genome that differ between individuals (referred to as Single Nucleotide Polymorphisms or SNPs)

800 patients genotyped at 500 000 SNPs along the entire genome

Statistically test the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)

large data-parallel problem using R – ideal for Condor

divide datasets into small partitions so that individual jobs run for 15-30 minutes

batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC

Radiotherapy example

large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods

tried running the simulation on 56 cores of a high performance computing cluster but no progress had been made after 5 weeks

divided problem into 250 then 5 000 and eventually 50 000 Condor jobs

required ~ 2 600 days of CPU time (equivalent to ~ 3.5 years on dual core PC)

Condor simulation completed in less than one week

average run time was ~ 70 min

Summary

Parallelism can help speed up the solution of many research computing problems by dividing large problems into many smaller ones which can be tackled at the same time

High Performance Computing clusters

Typically used for small numbers of long running jobs

Ideal for applications requiring lots of memory and disk storage space

Almost all systems are UNIX-based

Condor High Throughput Computing Service

Typically used for large/very large numbers of short running jobs

Limited memory and storage available on Condor PCs

Support available for applications using R (and MATLAB)

No UNIX knowledge needed with Desktop Condor

Next steps

Condor Service information: http://condor.liv.ac.uk

Information on bioinf1 and HPC clusters: http://clusterinfo.liv.ac.uk

Information on the Advanced Research Computing (ARC) facilities: http://www.liv.ac.uk/csd/advanced-research-computing

To contact the ARC team email: arc-support@liverpool.ac.uk

To request an account on Condor or chadwick use: http://www.liv.ac.uk/media/livacuk/computingservices/help/eScienceform.pdf

For an account on bioinf1 – just ask me ! (i.c.smith@liverpool.ac.uk)
