
Institut de calcul intensif et de stockage de masse

Introduction to HPC at UCL
● Technical reminders and available equipment

• source code
• compiling
• optimized libraries: BLAS, LAPACK
• OpenMP
• MPI

● Job submission: (SGE), Torque/PBS, Condor, SLURM

● CISM & CECI: working principles, management, access

● Data Center

● From algorithm to computer program: optimization and parallel code

October 21st, 2015
Damien François and Bernard Van Renterghem


CISM Cache memory

Program execution = information exchange between the CPU and RAM (program instructions and data)

RAM is slow and instructions are mostly executed in sequence → cache memory: instructions and/or data are read in entire blocks transferred from RAM to the cache

Cache L1, L2, L3


CISM Clusters

Large number of standard (low-cost) elements

Network performance critical

(Diagram: several low-cost computers connected by a network)


CISM Symmetric multi-processors


CISM Equipment: servers


CISM Equipment: servers

CISM servers

● Manneback
● (Green)
● (Ingrid CP3)
● « exotic » machines

UCL CECI servers

● Hmem
● Lemaitre2

CECI servers

● Vega ULB
● Hercules UNamur
● Dragon1 UMons
● NIC4 ULg


CISM Equipment: clusters

● Charles Manneback (1894-1975), a friend of Georges Lemaître

● 79 nodes (632 cores) from the old Green cluster of 2008, but still powerful
● 21-node Indus partition (336 cores)
● 9-node Oban partition (288 cores)
● 8-node Ivy partition (128 cores)

● 145 nodes, 1,616 cores, 5,800 GB RAM, 1.7 Tflops

● Installed compilers: GCC, Intel, PGI

● OS: GNU/Linux CentOS 6.6 (kernel 2.6.32)

● Batch system: SLURM 14.11.8

Manneback


CISM Equipment: clusters

Manneback

Welcome to Manneback (ASCII-art login banner)

Charles Manneback, Lemaitre fellow cluster

(GNU/Linux CentOS 6.6) front-end: 2× 8-core E5-2650 @ 2GHz / 64 GB RAM

mb007      1 node    8-core X5500 / 24 GB RAM, 2 GPU Tesla C1060
mb008-019  12 nodes   8-core L5520 / 24 GB, Infiniband SDR 10 Gbps
mb020-035  14 nodes   8-core L5420 / 16 GB
mb040      1 node   16-core E5-2660 / 64 GB, GPU Tesla M2090, Xeon Phi
mb059-095  21 nodes   8-core L5420 / 16 GB (old Green)
mb197-058  42 nodes   8-core L5420 / 32 GB (old Green)
mb101-121  21 nodes  16-core E5-2650 / 64 GB (Indus, NAPS project)
mb151-156  6 nodes  32-core AMD Opteron 6276 / 128 GB (Oban)
mb158-160  3 nodes  32-core E5-4640 / 128 or 256 GB (Oban)
mb161-168  8 nodes  16-core E5-2650v2 / 64 GB (Ivy)

Total: 1,488 cores / 5,544 GB RAM


CISM Equipment: clusters

Green

● Installed in 2008 and now deprecated
● Still 16 nodes, 128 cores, 512 GB RAM, 1280 Gflops
● 8-core Xeon L5420 @ 2.5 GHz
● 32 GB RAM per node
● 1 NFS server with 14 TB of storage (SATA disks) for /home
● Installed compilers: GCC, Intel, PGI
● OS: Scientific Linux 5.9 with ClusterVision OS (GNU/Linux 2.6.18)
● Batch system: Sun Grid Engine 6.1


CISM Equipment: clusters

Ingrid CP3, for CERN CMS/LHC (516 cores)

● 1 front-end node (AMD Opteron 2.2 GHz) with 4 GB RAM
● 17 nodes with 2 dual-core CPUs (Xeon 5160 3.0 GHz) and 4 GB RAM
● 16 nodes with 2 quad-core CPUs (Xeon 5345 2.3 GHz) and 16 GB RAM
● 12 nodes with dual CPUs (AMD Opteron 248 2.2 GHz) and 3 GB RAM
● 32 nodes with 2 quad-core CPUs (Xeon E5420 2.5 GHz) and 16 GB RAM
● 8 nodes with 2 dual-core CPUs (Xeon 2.6 GHz) and 4 GB RAM
● Storage: ingrid-fs 11 TB + 6 × 11 TB (CMS) + 3 × 36 TB (CMS)
● OS: Scientific Linux CERN 4.7 (GNU/Linux 2.6.9)
● Gigabit Ethernet
● Batch system: Condor


CISM Equipment: exotic machines

Other peculiar machines

● Lm9: interactive Matlab, Thermo-Calc, R, ...
2× 6-core [email protected] GHz, 144 GB RAM

● Lmgpu (= mb007): dual quad-core [email protected] (85 Gflops) + (2×) Tesla M1060 = 240 GPU cores, 624 SP Gflops, 77 DP Gflops (GPU)

● mb040: dual octo-core [email protected] + (2×) Tesla M2090 = 512 GPU cores, 1332 SP Gflops, 666 DP Gflops (GPU) + Xeon Phi, 61 [email protected], 1011 DP Gflops

● SCMS-SAS 1 & 2: for SAS, Stata, R, ... dual hexa-core [email protected] + Tesla C2075 = 512 GPU cores, 127 Gflops + 1332 SP Gflops (GPU)

● LmPp001-003: Lemaitre2 post-processing nodes, Nvidia Quadro 4000 = 256 GPU cores, 486 SP Gflops, 243 DP Gflops (GPU)


CISM Equipment CECI Clusters

● 16 Dell PowerEdge R815 + 1 HP + 3 Ttec

● 17x48 core AMD Opteron 6174(Magny-Cours) @2.2GHz

+ 2x8 core AMD Opteron 8222 @3GHz (24h partition)

● RAM: 2 nodes with 512 GB, 7 with 256 GB, 8 with 128 GB, 3 with 128 GB

● /scratch 3.2TB or 1.7TB

● Infiniband 40Gb/s

● SLURM batch queuing system

● Total: 832 cores, 4,000 GB RAM, 22 TB /scratch, 11 TB /home, 7,468 Gflops

Hmem ( www.ceci-hpc.be )


CISM Equipment CECI Clusters

● 112 HP DL380 with 2x6 core [email protected] 48GB RAM

● /scratch lustreFS 120TB, /tmp 325GB

● Infiniband 40Gb/s

● SLURM batch queuing system

● Total: 1,344 cores, 5.25 TB RAM, 120 TB /scratch, 30 TB /home, 13.6 Tflops

Lemaitre2 ( www.ceci-hpc.be )


CISM Equipment CECI Clusters

ULB + UNamur + UMons + UCL + ULg = CECI


See www.ceci-hpc.be/clusters.html


CISM To reduce computing time…

… improve your code

● choice of algorithm
● source code
● optimized compiling
● optimized libraries

… use parallel computation

● OpenMP (mostly on SMP machines)
● MPI


CISM Source code

• Algorithm choice: does the volume of calculation grow with n, n × n, …? Is it numerically stable?

● indirect addressing (pointers) is expensive
● fetching order of array elements matters for optimal use of the cache memory (see the sketch after this list)
● loop efficiency: move all unnecessary work out of loops

• Programming language: FORTRAN, C, C++,…?

• Coding practice
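To illustrate the point about fetching order, here is a minimal sketch (not taken from the slides; the array size is arbitrary). Fortran stores arrays column by column, so the inner loop should run over the first index:

program loop_order
  implicit none
  integer, parameter :: n = 2000
  real*8, allocatable :: a(:,:)
  real*8  :: s
  integer :: i, j

  allocate(a(n,n))
  call random_number(a)
  s = 0.0d0

  ! cache-friendly: the inner loop runs over the first (fastest-varying) index,
  ! so consecutive iterations touch consecutive memory locations
  do j = 1, n
     do i = 1, n
        s = s + a(i,j)
     end do
  end do
  print *, 'sum =', s
end program loop_order

Swapping the two loops (inner loop over j) strides through memory and is typically noticeably slower for large n.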


CISM Compiling

• The compiler…

• Optimization options: -O1, -O2, -O3

• Different qualities of compilers !!

● translates an instruction list written in a high-level language into a machine-readable (binary) file [= the object file]

e.g. ifc -c myprog.f90 generates the object file myprog.o

● links binary object files to produce an executable file

e.g. ifc -o myprog module1.o libmath.a myprog.o generates the executable file (= program) myprog


CISM Optimized libraries: BLAS

Basic Linear Algebra Subroutines

● set of optimized subroutines to handle vector x vector, matrix x vector, matrix x matrix operations (for real and complex numbers, single or double precision)

● the subroutines are optimized for a specific CPU/OS

● See http://www.netlib.org/blas
● Example…


CISM Optimized libraries: BLAS

● compiling from BLAS source: ifc -o mvm sgemv.f mvm.f

● compiling with pre-compiled BLAS library (optimized for Intel CPU):

ifc -o mvm mvm.f sblasd13d.a

real matlxc(nl,nc)
real vectc(nc), result(nl)

call random_number(matlxc)
call random_number(vectc)

do i = 1, nl
   result(i) = 0.0
   do j = 1, nc
      result(i) = result(i) + matlxc(i,j)*vectc(j)
   end do
end do

call SGEMV('N', nl, nc, 1.0, matlxc, nl, vectc, 1, 0.0, result, 1)


CISM Optimized libraries: BLAS

Performance comparison of the Intel and PGI FORTRAN compilers, for a hand-written loop, BLAS compiled from source, and a pre-compiled optimized BLAS library (10,000 × 5,000 matrix)

Compiler      Subroutine      Options   Mflops
Intel (ifc)   DO loop         -O0       11
                              -O3       11
              BLAS source     -O0       42
                              -O3       115
              BLAS compiled   -O0       120
                              -O3       120
PGI (pgf90)   DO loop         -O0       11
                              -O3       11
              BLAS source     -O0       48
                              -O3       57
              BLAS compiled   -O0       116
                              -O3       119


CISM Optimized libraries: LAPACK

• LAPACK (Linear Algebra PACKage) provides routines for:

● linear equation systems: Ax = b
● least squares: min ||Ax - b||²
● eigenvalue problems: Ax = λx, Ax = λBx
● real or complex data, single or double precision
● all utility routines included (LU factorization, Cholesky, …)

• Built on top of BLAS (LAPACK itself does not depend on the hardware, yet always benefits from the optimized BLAS)

• See http://www.netlib.org/lapack
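As a minimal sketch (not from the slides) of what a LAPACK call looks like: solving a small system Ax = b with DGESV; the matrix and right-hand-side values are made up for illustration.

program solve_system
  implicit none
  integer, parameter :: n = 3
  real*8  :: A(n,n), b(n)
  integer :: ipiv(n), info

  ! illustrative values only
  A = reshape((/ 4.d0, 1.d0, 0.d0,  &
                 1.d0, 3.d0, 1.d0,  &
                 0.d0, 1.d0, 2.d0 /), (/ n, n /))
  b = (/ 1.d0, 2.d0, 3.d0 /)

  ! DGESV overwrites A with its LU factors and b with the solution x
  call DGESV(n, 1, A, n, ipiv, b, n, info)
  if (info /= 0) print *, 'DGESV failed, info =', info
  print *, 'x =', b
end program solve_system

Link against LAPACK and BLAS when building, e.g. something like ifc -o solve solve.f90 -llapack -lblas (the exact library names depend on the installation).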


CISM OpenMP

• OpenMP (Open Multi-Processing): a standard (compiler directives, runtime functions, environment variables) for shared-memory architectures (OpenMP 2.0)

• Principle: compiler directives → the details of the parallelism are left to the compiler → fast implementation

fork-and-join model…

!$OMP PARALLEL DO
DO i = 1, 1000
   a(i) = b(i)*c(i)
END DO
…
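A self-contained sketch (not from the slides) wrapping the directive above in a compilable program; the array names and size are illustrative. Compile with your compiler's OpenMP flag (e.g. -openmp for ifc/ifort, -fopenmp for gfortran).

program omp_demo
  use omp_lib              ! OpenMP runtime functions
  implicit none
  integer, parameter :: n = 1000
  real*8  :: a(n), b(n), c(n)
  integer :: i

  call random_number(b)
  call random_number(c)

  !$OMP PARALLEL DO
  do i = 1, n
     a(i) = b(i)*c(i)      ! iterations are shared among the threads
  end do
  !$OMP END PARALLEL DO

  print *, 'threads available:', omp_get_max_threads(), '  a(1) =', a(1)
end program omp_demo

The number of threads is typically controlled with the OMP_NUM_THREADS environment variable.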


CISM MPI environment

• MPI = Message Passing Interface (2.0)

• Principle: the program itself distributes the work and has full control over the data exchange and communication between nodes

• Widely used standard for clusters (but also exists for SMP boxes)

…
REAL a(100)
…
C     Process 0 sends, process 1 receives:
      if ( myrank .eq. 0 ) then
         call MPI_SEND(a, 100, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
      else if ( myrank .eq. 1 ) then
         call MPI_RECV(a, 100, MPI_REAL, 0, 17, MPI_COMM_WORLD, status, ierr)
      endif
…
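For context, a minimal complete program built around this fragment (a sketch, not the original slide code); compile it with the MPI wrapper (mpif90 or similar) and run it with two processes, e.g. mpirun -np 2.

program mpi_demo
  implicit none
  include 'mpif.h'
  real    :: a(100)
  integer :: myrank, ierr, status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

  if ( myrank .eq. 0 ) then
     a = 1.0
     call MPI_SEND(a, 100, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
  else if ( myrank .eq. 1 ) then
     call MPI_RECV(a, 100, MPI_REAL, 0, 17, MPI_COMM_WORLD, status, ierr)
     print *, 'rank 1 received a(1) =', a(1)
  end if

  call MPI_FINALIZE(ierr)
end program mpi_demo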


CISM Job submission

• Goal: one single task per CPU

• Principle: the user hands the program over to an automatic job-management system, specifying the requirements (memory, architecture, number of CPUs, …). When the requested resources become available, the job is dispatched and starts running.

• Several batch systems are used at CISM:

● Condor (on Ingrid, the CP3 Tier-2)
● SGE: Sun Grid Engine (on Green)
● SLURM: on Manneback, Hmem, Lemaitre2 and the other CECI clusters


CISM Job submission

• Submission script examples…

• To submit your job with SLURM: sbatch myscript (a SLURM sketch follows the SGE example below)

#!/bin/sh
# SGE example
#$ -pe mpich 8
#$ -l h_vmem=2G,num_proc=2
#$ -M [email protected]
... < mydata ...
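For SLURM (Manneback, Hmem, Lemaitre2 and the other CECI clusters), a minimal submission script could look as follows; the job name, resource values and program name are illustrative, not taken from the slides.

#!/bin/bash
# SLURM example (illustrative values)
#SBATCH --job-name=myjob
#SBATCH --ntasks=8             # 8 tasks (e.g. MPI processes)
#SBATCH --mem-per-cpu=2048     # 2 GB of RAM per core
#SBATCH --time=01:00:00        # 1 hour of wall time

srun ./myprogram < mydata

Submit it with sbatch myscript, as shown above.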


CISM CISM: research environment

(Diagram: research entities using CISM: ELEN, TERM, SC, TOPO, ELIENAPS, RDGN, RECI, BSMA, IMAP, MOST, MEMA, BIB, COMU, LICR, INGI, INMA, ELIC, INFM, CP3, FACM, NAPS, LOCI, GERU, PAMO, LSM, ECON, RSPO)


CISM CISM

• Equipment and support available for any UCL (and CECI) member

• Equipment is acquired through projects

• Goal: joining forces to acquire and manage more powerful equipment

• Institut de Calcul Intensif et de Stockage de Masse:

● a management committee composed of representatives of the user entities debates and decides on strategies; its chairman is elected for four years; CUCISM meets 6×/year and CDCISM 2×/year

● offices in Mercator; machine rooms in Pythagore and Marc de Hemptinne

● daily management by the technical team, under the leadership of the CISM Director (elected for four years)


CISM CISM management team

Thomas Keutgen, CISM Director

Luc Sindic, system administrator for mass storage

Bernard Van Renterghem, system administrator & user support

Damien François, system administrator & user support


CISM CECI: FNRS Consortium

David Colignon, logistician

• A consortium of HPC equipment (Consortium des Équipements de Calcul Intensif)

● UCL: CISM, PCPM, FYNU
● ULB: IIHE and SMN
● UNamur: iSCF
● UMons: CRMM and SCMN
● ULg: SEGI (NIC)

Juan Cabrera, logistician


CISM Environmental challenges


CISM Environmental challenges

• two 60 kW water chillers

Aquarium

• water cooling (rack-based)


CISM Environmental challenges

• total hosting capacity of 120 kW

• electrical redundancy and 200 kVA UPS protection

• 5 m³ buffer tank

• redundant pumps, electrical feed through an independent UPS

Aquarium


CISM DC3: CISM-CP3 Data Center

DC3 will be ready in 2016...
