Research Computing: The Apollo HPC Cluster

Jeremy Maris, Research Computing, IT Services, University of Sussex



Page 1: Research Computing: The Apollo HPC Cluster

Page 2: High Performance Computing?

Generically: computing systems made up of multiple processors linked together in a single system, used to solve problems beyond the scope of the desktop.

High performance computing
– Maximising the number of cycles per second, usually parallel

High throughput computing
– Maximising the number of cycles per year, usually serial

Facilitating the storage, access and processing of data
– Coping with the massive growth in data

Page 3: High Performance Computing

High performance computing
– tasks must run quickly
– a single problem split across many processors

• Weather forecasting
• Markov models
• Image processing

– 3D image reconstruction
– 4D visualisation

• Sequence assembly
• Whole genome analysis

Page 4: High throughput computing

A lot of work done over a long time frame
– One program run many times, e.g. searching a large data set

– Loosely coupled
• ATLAS analysis

• Genomics (sequence alignment, BLAST, etc.)

• Virtual screening (e.g. in drug discovery)

• Parameter exploration (simulations)

• Statistical analysis (e.g. bootstrap analysis)

Page 5: Apollo Cluster - Aims

Computing provision beyond the capability of the desktop
Priority access determined by an Advisory Group
Shared infrastructure and support from IT Services
Extension of the facility by departments:

– Storage (adding to Lustre, access to SAN)

– CPU

– Software Licenses

More compute nodes will be purchased by IT Services as budgets allow

Page 6: Apollo Cluster - Hardware

Total 264 cores
– 14 x 12-core 2.67 GHz Intel nodes - 168 cores
– 2 x 48-core 2.2 GHz AMD nodes - 96 cores

(17 x 8-core blades, GigE (Informatics) - 136 cores)

48 GB RAM for Intel nodes, 256 GB RAM for AMD nodes
20 TB home NFS file system (backed up)
80 TB Lustre file system (scratch, not backed up)
QDR (40 Gb/s) InfiniBand interconnect

Page 7: Apollo Cluster - Hardware

Page 8: Apollo Cluster - Filesystems 1

Home - 20 TB, RAID6 set of 12 x 2 TB disks
– Exported via NFS

– Backed up, keep your valuable data here

– Easily overloaded if many cores read/write at same time

Lustre parallel file system - /mnt/lustre/scratch
– Redundant metadata server

– Three object servers, each with 3 x 10 TB RAID6 OST

– Data striping configured by file or directory

– Can stripe across all 108 disks; aggregate data rate ~3.8 GB/s

– NOT backed up, for temporary storage

Local 400GB scratch disk per node
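A quick way to check how full each area is before a large run (a minimal sketch; lfs df is a standard Lustre command, and the home path is simply your login directory):

df -h ~                    # home (NFS, backed up)
lfs df -h /mnt/lustre      # Lustre OSTs and MDT (scratch, not backed up)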

Page 9: Apollo Cluster - Filesystems 2

(Figure © Alces Software Limited)

Page 10: Apollo Cluster - Lustre Performance

(IOZONE benchmark summary figure © 2010 Alces Software Ltd)

Page 11: Apollo Cluster - Lustre 1

(Figure © 2009 Cray Inc)

Page 12: Apollo Cluster - Lustre 2

Storing a single file across multiple OSTs (striping) offers two benefits:
• an increase in the bandwidth available when accessing the file
• an increase in the available disk space for storing the file

Striping also has disadvantages:
• increased overhead due to network operations and server contention
• increased risk of file damage due to hardware malfunction

Given the tradeoffs involved, the Lustre file system allows users to specify the striping policy for each file or directory of files using the lfs utility.
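A minimal sketch of setting and inspecting striping with lfs (the stripe count of 4 and the directory path under scratch are illustrative):

lfs setstripe -c 4 /mnt/lustre/scratch/$USER/results   # new files here stripe across 4 OSTs
lfs getstripe /mnt/lustre/scratch/$USER/results        # show the stripe count/size/OSTs in use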

Page 13: Apollo Cluster - Lustre 3

Further reading …

– Lustre FAQ: http://wiki.lustre.org/index.php/Lustre_FAQ

– Lustre 1.8 operations manual: http://wiki.lustre.org/manual/LustreManual18_HTML/index.html

– I/O tips - Lustre striping and parallel I/O: http://www.nics.tennessee.edu/io-tips

Page 14: Apollo Cluster - Software

Module system used for defining paths and libraries
– Need to load the software required

– module avail, module add XX, module unload XX (see the examples below)

– Access optimised math libraries, MPI stacks, compilers etc

– Latest versions of packages, e.g. python, gcc, easily installed
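For example (a sketch; the gcc version shown is the one loaded in the parallel job script later in this talk):

module avail                 # list everything available
module add gcc/4.3.4         # load a package into your environment
module list                  # show what is currently loaded
module unload gcc/4.3.4      # remove it again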

Intel Parallel Studio suite
– C, C++, Fortran, MKL, performance analysis, etc.

gcc and Open64 compilers
We will compile/install software for users if asked
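Putting a compiler and MPI stack together, a build might look like this (a sketch; the module names are taken from the parallel job script, and the source/binary names are illustrative):

module add gcc/4.3.4 qlogic/openmpi/gcc
mpicc  -O2 -o hello_mpi   hello_mpi.c      # C MPI program built against the loaded stack
mpif90 -O2 -o hello_mpi_f hello_mpi.f90    # Fortran equivalent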

Page 15: Apollo Cluster - Software

Compilers/tools: ant, gcc suite, Intel (C, C++, Fortran), git, jdk 1.6_024, mercurial, open64, python 2.6.6, sbcl

Libraries: acml, atlas, blas, gotoblas, cloog, fftw2, fftw3, gmp, gsl, hpl, lapack, scalapack, MVAPICH, MVAPICH2, mpfr, nag, openMPI, ppl

Programs: adf, aimpro, alberta, gadget, gap, Gaussian 09, hdf5, idl, matlab, mricron, netcdf, paraview, stata, WRF

Page 16: Apollo Cluster - Queues 1

Sun Grid Engine is used for the batch system

parallel.q for MPI and OpenMP jobs
– Intel nodes
– Slot limit of 48 cores per user at present

serial.q for serial and OpenMP jobs
– AMD nodes
– Slot limit of 36 cores per user

inf.q for serial or OpenMP jobs - Informatics use only

No other limits at present - user input on queue configuration is welcome (see the submission example below)
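Submission to a particular queue is a standard SGE qsub (a sketch; the script names and job id are illustrative):

qsub -q serial.q   myjob.sh      # serial/OpenMP work on the AMD nodes
qsub -q parallel.q mpijob.sh     # MPI work (the -pe request lives in the script)
qstat -u $USER                   # state of your jobs
qdel 12345                       # remove a queued or running job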

Page 17: Apollo Cluster - serial job script

The default queue is serial.q.

#!/bin/sh
#$ -N sleep
#$ -S /bin/sh
#$ -cwd
#$ -M [email protected]
#$ -m bea
echo Starting at: `date`
sleep 60
echo Now it is: `date`
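Submitting and checking the job above (a sketch; the script filename is illustrative, and the output file follows SGE's <jobname>.o<jobid> convention in the working directory):

qsub sleep.sh          # submit; SGE prints the job id
qstat                  # watch it queue and run
cat sleep.o12345       # stdout of the finished job (12345 = the job id)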

Page 18: Apollo Cluster - parallel job script

For parallel jobs you must specify the parallel environment (pe) in the job script. The parallel environments are openmpi and openmp; mvapich2 is also available but often less efficient.

#!/bin/sh
#$ -N JobName
#$ -M [email protected]
#$ -m bea
#$ -cwd
#$ -pe openmpi NUMBER_OF_CPUS    # eg 12-36
#$ -q parallel.q
#$ -S /bin/bash
# source modules environment:
. /etc/profile.d/modules.sh
module add gcc/4.3.4 qlogic/openmpi/gcc
mpirun -np $NSLOTS -machinefile $TMPDIR/machines /path/to/exec
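For comparison, a shared-memory job using the openmp parallel environment might look like this (a minimal sketch, assuming the openmp PE keeps all slots on one node and sets $NSLOTS; the job name, core count and executable path are illustrative):

#!/bin/sh
#$ -N OmpJob
#$ -cwd
#$ -S /bin/bash
#$ -q serial.q                     # OpenMP jobs also fit the AMD serial.q nodes
#$ -pe openmp 12                   # request 12 cores
export OMP_NUM_THREADS=$NSLOTS     # one thread per granted slot
/path/to/exec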

Page 19: Apollo Cluster - Monitoring

Ganglia statistics for apollo and feynman (the EPP cluster) are at http://feynman.hpc.susx.ac.uk

Page 20: Apollo Cluster - Accounting

Simply: qacct -o
Accounting is monitored to show fair usage, etc.
Statistics may be posted on the web.
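Typical qacct queries (the job id is illustrative):

qacct -o             # usage summary for every owner
qacct -o $USER       # your own cumulative usage
qacct -j 12345       # full record for one finished job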

Page 21: Apollo Cluster - Documentation

To get an account on the cluster, email [email protected]

Userguides and example scripts are available on the cluster in /cm/shared/docs/USERGUIDE

Other documentation is under /cm/shared/docs

Page 22: Apollo Cluster - Adding nodes

C6100 chassis + four servers, each 2.67 GHz, 12-core, 48 GB RAM, IB card, licences

R815: 48-core 2.2 GHz AMD

Departments are guaranteed 90% exclusive use of their nodes; 10% is shared with others, plus backfill of idle time.

Contact [email protected] for latest pricing

Page 23: Apollo Cluster - Adding storage

Configuration of an additional object server:
– Server: R510, 8-core, 24 GB RAM + 12 x 2 TB disks

– JBOD: MD1200 + 12 x 2 TB disks + H800 RAID card

– The minimum increment is the server, which provides about 19 TB after formatting and RAID6

– Two JBODs give another 38 TB

– Total storage ~57 TB in 6 OSTs

Contact [email protected] for pricing

Page 24: Questions?