The Need for Speed: Parallelization in GEM
Michel Desgagné Recherche en Prévision Numérique Environment Canada - MSC/RPN
Many thanks to Michel Valin
The reason behind parallel programming

Single processor limitations
● Processor clock speed is limited
● Physical size of the processor limits speed because signal speed cannot exceed the speed of light
● Single processor speed is limited by integrated-circuit feature size (propagation delays and thermal problems)
● Memory (size and speed - especially latency)
● The amount of logic on a processor chip is limited by real estate considerations (die size / transistor size)
● Algorithm limitations
Parallel computing: a solution
● Increase parallelism within the processor (multi-operand functional units such as vector units)
● Increase parallelism on the chip (multiple processors on a chip)
● Multi-processor computers
● Multi-computer systems using a communication network (latency and bandwidth considerations)
Parallel computing paradigms
● Memory taxonomy:
● SMP Shared Memory Parallelism
● One processor can “see” another's memory
● Cray X-MP, single-node NEC SX-3/4/5/6
● DMP Distributed Memory Parallelism
● Processors exchange “messages”
● Cray T3D, IBM SP, ES-40, ASCI machines
● Hardware taxonomy:
SISD (Single Instruction Single Data)
SIMD (Single Instruction Multiple Data)
MISD (Multiple Instruction Single Data)
MIMD (Multiple Instruction Multiple Data)
● Programmer taxonomy:
SPMD : Single Program Multiple Data
MPMD : Multiple Program Multiple Data
SMP architectures
[Diagram: two SMP nodes, each with several CPUs and memories; in one the CPUs and memories share a bus, in the other they are connected through a network / crossbar]
SMP: OpenMP (microtasking / autotasking)
OpenMP works at small granularity, often at the loop level: multiple CPUs execute the same code in a shared memory space.
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
OpenMP uses the fork-join model of parallel execution:
OpenMP Basic features (FORTRAN “comments”)
      PROGRAM VEC_ADD_SECTIONS
      INTEGER ni, I, n
      PARAMETER (ni=1000)
      REAL A(ni), B(ni), C(ni)
! Some initializations
      n = 4
      DO I = 1, ni
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
! At the Fortran level:
      call omp_set_num_threads(n)
! Parallel region
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$omp do
      DO I = 1, ni
         C(I) = A(I) + B(I)
      ENDDO
!$omp enddo
!$OMP END PARALLEL
      END

2 ways to initiate threads:
● At the shell level: n=4 ; export OMP_NUM_THREADS=$n
● At the Fortran level: call omp_set_num_threads(n)
OpenMP

!$omp parallel
!$omp do
      do n = 1, omp_get_max_threads()
         call itf_phy_slb ( n, F_stepno, obusval, cobusval,
     $                      pvptr, cvptrp, cvptrm, ndim, chmt_ntr,
     $                      trp, trm, tdu, tdv, tdt, kmm, ktm,
     $                      LDIST_DIM, l_nk )
      enddo
!$omp enddo
!$omp end parallel
!$omp critical
      jdo = jdo + 1
!$omp end critical

!$omp single
      call vexp (expf_8, xmass_8, nij)
!$omp end single
SMP: General remarks
● Shared memory parallelism at the loop level can often be implemented after the fact when only a moderate level of parallelism is desired
● It can also be done, to a lesser extent, at the thread level in some cases, but reentrancy, data scope (thread-local vs. global) and race conditions can be a problem
● Does NOT scale all that well
● Limited to the real estate of a node
DMP architecture
[Diagram: several nodes, each an SMP node with multiple CPUs and memories, connected by a high speed interconnect (network / crossbar)]
2D domain decomposition: regular horizontal block partitioning
[Diagram: a Gni x Gnj global grid partitioned over a PE topology with npex=2, npey=2; PEs #0 to #3 (the PE matrix, rank order Pe(0,0), Pe(1,0), Pe(0,1), Pe(1,1)) each hold an Lni x Lnj subdomain, shown in both global and local indexing, with compass directions N / S / E / W]
High level operations
● Halo exchange: what is a halo? Why and when is it necessary to exchange a halo?
● Data transpose: what is a data transpose? Why and when is it necessary to transpose data?
● Collective and Reduction operations
2D array layout with halos
[Diagram: local 2D array layout with halos; the private data covers (1:Lni, 1:Lnj) and the full array covers (Mini:Maxi, Minj:Maxj); inner and outer halos of widths Halo x and Halo y surround the private data, with compass directions N / S / E / W]
Halo exchange: why and when?
● Need to access neighboring data in order to perform the local computation
● In general, any stencil-type discrete operator, e.g.
  dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
● Halo width depends on the operator
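As a minimal illustration (assumed variable names, plain MPI rather than the RPN_COMM routines used in GEM), the north and south halo rows can be exchanged with MPI_SENDRECV as sketched below; the east-west exchange is analogous, but the columns are strided and need packing or a derived datatype.

! Sketch only (assumed names, not GEM code): exchange a 1-point halo in y
! with the north and south neighbour PEs so that a stencil in y has the
! neighbour values it needs. lni, lnj are the local dimensions;
! north_pe / south_pe are neighbour ranks (MPI_PROC_NULL at a physical wall).
      include 'mpif.h'
      real    f(0:lni+1, 0:lnj+1)
      integer north_pe, south_pe, ierr, status(MPI_STATUS_SIZE)

! rows are contiguous in Fortran, so a simple count of lni+2 reals works:
! send my first interior row to the south, receive my north halo row
      call MPI_SENDRECV( f(0,1),     lni+2, MPI_REAL, south_pe, 1,
     $                   f(0,lnj+1), lni+2, MPI_REAL, north_pe, 1,
     $                   MPI_COMM_WORLD, status, ierr )
! send my last interior row to the north, receive my south halo row
      call MPI_SENDRECV( f(0,lnj),   lni+2, MPI_REAL, north_pe, 2,
     $                   f(0,0),     lni+2, MPI_REAL, south_pe, 2,
     $                   MPI_COMM_WORLD, status, ierr )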
Halo exchange
How many neighbor PEs must the local PE exchange data with in order to get the data in the shaded area (the outer halo)?
[Diagram: PE topology npex=3, npey=3; the local PE is surrounded by its North, South, East, West, North-West, North-East, South-West and South-East neighbors]
Data Transposition (PE topology: npex=4, npey=4)
[Diagram: the X-Y-Z data cube distributed over the npex x npey PE topology, and the transposes T1 and T2 that redistribute it so that each PE holds complete data along a different axis]
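In practice, a transpose amounts to an all-to-all exchange among the PEs of one row (or column) of the PE topology. A minimal sketch (assumed names and buffer packing, not the GEM implementation), redistributing data so that each PE ends up with the complete x dimension for a subset of levels:

! Sketch only (assumed names): transpose along x over the npex PEs of one
! row of the PE topology, using a row communicator row_comm.
! Before: this PE holds lni points in x for all gnk levels.
! After : this PE holds all gni points in x for lnk = gnk/npex levels.
! sendbuf must be packed so that the block destined to PE p is contiguous.
      call MPI_ALLTOALL( sendbuf, lni*lnk, MPI_REAL,
     $                   recvbuf, lni*lnk, MPI_REAL,
     $                   row_comm, ierr )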
What is MPI ?
● A Message Passing Interface
● Communications through messages can be
● Cooperative send / receive (democratic)
● One sided get / put (autocratic)
● Bindings defined for FORTRAN, C, C++
● For parallel computers, clusters, heterogeneous networks
● Full featured (but can be used in simple fashion)
[Diagram: message transfer time vs. message length; a fixed startup (latency) cost plus Tw = cost / word]
● include 'mpif.h'
● call MPI_INIT(ierr)
● call MPI_FINALIZE(ierr)
● call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
● call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
● call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
● call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)
Collective operations: MPI_gather, MPI_allgather, MPI_scatter, MPI_alltoall, MPI_bcast, MPI_reduce, MPI_allreduce
Reduction operations: mpi_sum, mpi_min, mpi_max
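Putting the basic calls together, a minimal self-contained example (not GEM code): every PE prints its rank, and PE 0 sends one value to PE 1.

      program mpi_minimal
      include 'mpif.h'
      integer rank, npes, ierr, status(MPI_STATUS_SIZE)
      real    buf

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
      print *, 'PE', rank, 'of', npes

      if (rank .eq. 0) then
         buf = 1.0
         call MPI_SEND(buf, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(buf, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD,
     $                 status, ierr)
      endif

      call MPI_FINALIZE(ierr)
      end

Run with, for example, mpirun -np 4: each PE prints its rank, but only PEs 0 and 1 exchange a message.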
The RPN_COMM toolkit (Michel Valin)
● NO INCLUDE FILE NEEDED (like mpif.h)
● Higher level of abstraction
● Initialization / termination of communications
● Topology determination
● Point to point operations
● Halo exchange (direct message to N/S/W/E neighbors)
● Collective operations
● Transpose
● Gather / distribute
● Data reduction
● Equivalent calls to most frequently used MPI routines
● MPI_[something] => RPN_COMM_[something]
Partitioning Global Data
Gni=62, Gnj=25, PE topology: npex=4, npey=3
[Diagram: the 62 x 25 global grid partitioned over the 4 x 3 PE topology, with each PE's subdomain dimensions (lni x lnj) shown for two different partitionings; lni ranges from 14 to 16 and lnj from 7 to 9]
Valin partitioning: (Gni + npex – 1) / npex (vs. Thomas partitioning)
Dimensions of the largest subdomain are NOT affected
checktopo -gni 62 -gnj 25 -gnk 58 -npx 4 -npy 2 -pil 7 -hblen 10
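One possible reading of the ceiling-division rule above, as a sketch (assumed names, not the actual GEM / RPN_COMM code): every PE column except the last gets (Gni + npex - 1) / npex points and the last column gets the remainder.

! Sketch only (assumed names): size of the subdomain owned by PE column mex
! (0-based), using the ceiling-division rule (Gni + npex - 1) / npex.
      integer function local_ni (Gni, npex, mex)
      integer Gni, npex, mex, maxni
      maxni    = (Gni + npex - 1) / npex        ! largest subdomain dimension
      local_ni = min (maxni, Gni - mex*maxni)   ! last column gets the remainder
      return
      end

With Gni=62 and npex=4 this gives 16, 16, 16, 14, and with Gnj=25, npey=3 it gives 9, 9, 7, consistent with one of the two partitionings in the figure above.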
DMP Scalability
Scaling up with an optimum subdomain dimension:
● Size: 500 x 50 on vector processor systems
● Size: 100 x 50 on cache systems
● Time to solution should remain the same

Scaling up on a fixed size problem:
● Time to solution should decrease linearly with the # of CPUs
MC2 Performance on NEC SX4 and Fujitsu VPP700
[Plot: flop rate per PE (MFlops/sec., roughly 700 to 900) vs. number of PEs (0 to 35) for SX4 (1 node and 2 nodes, npx=2) and VPP700 (npx=1); grid: 513 x 433 x 41]

IFS Performance on NEC SX4 and Fujitsu VPP700
[Plot: forecast days / day vs. number of PEs for the SX-4 and VPP700]
Amdahl's law for parallel programming
The speedup factor is influenced very much by the residual serial (non-parallelizable) work. As the number of processors grows, so does the damage caused by the non-parallelizable work.
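In its usual form, with f the serial (non-parallelizable) fraction of the work and N the number of processors, the speedup is

S(N) = \frac{1}{f + (1 - f)/N} \le \frac{1}{f}

so even a small serial fraction caps the achievable speedup: with f = 0.05, no more than a factor of 20 can be gained no matter how many processors are used.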
Scalability: limiting factors
● Any algorithm requiring global communications
  – One should THINK LOCAL
● SL transport on a global-configuration lat-lon grid-point model: numerical poles (GEM)
● 2-time-level fully implicit discretization leading to an elliptic problem: the direct solver requires a data transpose
● Any algorithm producing inherent load imbalance
DMP - General remarks
● More difficult but more powerful programming paradigm
● Easily combined with SMP (on all MPI processes)
● Distributed memory parallelism does not happen, it must be DESIGNED.
● One does not parallelize a code: the code must be rebuilt (and often redesigned) taking into account the constraints imposed on the dataflow by message passing. Array dimensioning and loop indexing are likely to be VERY HEAVILY IMPACTED.
● One may get lucky and HPF or an automatic parallelizing compiler will solve the problem (if one believes in miracles, Santa Claus, the tooth fairy or all of them).
Web sites and Books
● http://pollux.cmc.ec.gc.ca/~armnmfv/MPI_workshop
● http://www.llnl.gov/ , OpenMP, threads, MPI, ...
● http://hpcf.nersc.gov/
● http://www.idris.fr/ (in French), OpenMP, MPI, F90
● Using MPI, Gropp et al., ISBN 0-262-57204-8
● MPI: The Complete Reference, Snir et al., ISBN 0-262-69184-1
● MPI: The Complete Reference, vol. 2, Gropp et al., ISBN 0-262-57123-4