The Need for Speed: Parallelization in GEM
Michel Desgagné Recherche en Prévision Numérique Environment Canada - MSC/RPN
Many thanks to Michel Valin
The reason behind parallel programming

Single processor limitations
● Processor clock speed is limited
● Physical size of the processor limits speed because signal speed cannot exceed the speed of light
● Single processor speed is limited by integrated-circuit feature size (propagation delays and thermal problems)
● Memory (size and speed - especially latency)
● The amount of logic on a processor chip is limited by real estate considerations (die size / transistor size)
● Algorithm limitations
Parallel computing: a solution
● Increase parallelism within the processor (multi-operand functional units such as vector units)
● Increase parallelism on the chip (multiple processors on a chip)
● Multi-processor computers
● Multi-computer systems using a communication network (latency and bandwidth considerations)
Parallel computing paradigms
● Memory taxonomy:
● SMP Shared Memory Parallelism
● One processor can “see” another's memory
● Cray X-MP, single-node NEC SX-3/4/5/6
● DMP Distributed Memory Parallelism
● Processors exchange “messages”
● Cray T3D, IBM SP, ES-40, ASCI machines
● Hardware taxonomy:
SISD (Single Instruction Single Data)
SIMD (Single Instruction Multiple Data)
MISD (Multiple Instruction Single Data)
MIMD (Multiple Instruction Multiple Data)
● Programmer taxonomy:
SPMD : Single Program Multiple Data
MPMD : Multiple Program Multiple Data
SMP architectures
[Diagram: two SMP nodes, each with several CPUs and memories; in one the CPUs and memories share a bus, in the other they are connected through a network / crossbar]
SMP: OpenMP (microtasking / autotasking)
OpenMP works at small granularity, often at the loop level: multiple CPUs execute the same code in a shared memory space.
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
OpenMP uses the fork-join model of parallel execution:
OpenMP Basic features (FORTRAN “comments”)
      PROGRAM VEC_ADD_SECTIONS
      INTEGER ni, I, n
      PARAMETER (ni=1000)
      REAL A(ni), B(ni), C(ni)
! Some initializations
      n = 4
      DO I = 1, ni
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
! At the Fortran level:
      call omp_set_num_threads(n)
! Parallel region
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$omp do
      DO I = 1, ni
         C(I) = A(I) + B(I)
      ENDDO
!$omp enddo
!$OMP END PARALLEL
      END

2 ways to initiate threads:
● At the shell level: n=4 ; export OMP_NUM_THREADS=$n
● At the Fortran level: call omp_set_num_threads(n)
OpenMP

!$omp parallel
!$omp do
      do n = 1, omp_get_max_threads()
         call itf_phy_slb ( n, F_stepno, obusval, cobusval,
     $                      pvptr, cvptrp, cvptrm, ndim, chmt_ntr,
     $                      trp, trm, tdu, tdv, tdt, kmm, ktm,
     $                      LDIST_DIM, l_nk )
      enddo
!$omp enddo
!$omp end parallel
!$omp critical
      jdo = jdo + 1
!$omp end critical

!$omp single
      call vexp (expf_8, xmass_8, nij)
!$omp end single
SMP: General remarks
● Shared memory parallelism at the loop level can often be implemented after the fact when only a moderate level of parallelism is desired
● It can also be done, to a lesser extent, at the thread level in some cases, but reentrancy, data scope (thread-local vs. global) and race conditions can be a problem
● Does NOT scale all that well
● Limited to the real estate of a node
DMP architecture
[Diagram: several nodes, each an SMP node with multiple CPUs and memories, connected by a high speed interconnect (network / crossbar)]
2D domain decomposition: regular horizontal block partitioning
[Diagram: a Gni x Gnj global grid partitioned over a PE topology with npex=2, npey=2; PEs #0 to #3 (the PE matrix, rank order Pe(0,0), Pe(1,0), Pe(0,1), Pe(1,1)) each hold an Lni x Lnj subdomain, shown in both global and local indexing, with compass directions N / S / E / W]
High level operations
● Halo exchange: what is a halo? Why and when is it necessary to exchange a halo?
● Data transpose: what is a data transpose? Why and when is it necessary to transpose data?
● Collective and Reduction operations
2D array layout with halos
[Diagram: local 2D array layout with halos; the private data covers (1:Lni, 1:Lnj) and the full array covers (Mini:Maxi, Minj:Maxj); inner and outer halos of widths Halo x and Halo y surround the private data, with compass directions N / S / E / W]
Halo exchange: why and when?
● Need to access neighboring data in order to perform the local computation
● In general, any stencil-type discrete operator, e.g.
  dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
● Halo width depends on the operator
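As a minimal illustration (assumed variable names, plain MPI rather than the RPN_COMM routines used in GEM), the north and south halo rows can be exchanged with MPI_SENDRECV as sketched below; the east-west exchange is analogous, but the columns are strided and need packing or a derived datatype.

! Sketch only (assumed names, not GEM code): exchange a 1-point halo in y
! with the north and south neighbour PEs so that a stencil in y has the
! neighbour values it needs. lni, lnj are the local dimensions;
! north_pe / south_pe are neighbour ranks (MPI_PROC_NULL at a physical wall).
      include 'mpif.h'
      real    f(0:lni+1, 0:lnj+1)
      integer north_pe, south_pe, ierr, status(MPI_STATUS_SIZE)

! rows are contiguous in Fortran, so a simple count of lni+2 reals works:
! send my first interior row to the south, receive my north halo row
      call MPI_SENDRECV( f(0,1),     lni+2, MPI_REAL, south_pe, 1,
     $                   f(0,lnj+1), lni+2, MPI_REAL, north_pe, 1,
     $                   MPI_COMM_WORLD, status, ierr )
! send my last interior row to the north, receive my south halo row
      call MPI_SENDRECV( f(0,lnj),   lni+2, MPI_REAL, north_pe, 2,
     $                   f(0,0),     lni+2, MPI_REAL, south_pe, 2,
     $                   MPI_COMM_WORLD, status, ierr )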
Halo exchange
How many neighbor PEs must the local PE exchange data with in order to get the data in the shaded area (the outer halo)?
[Diagram: PE topology npex=3, npey=3; the local PE is surrounded by its North, South, East, West, North-West, North-East, South-West and South-East neighbors]
Data Transposition (PE topology: npex=4, npey=4)
[Diagram: the X-Y-Z data cube distributed over the npex x npey PE topology, and the transposes T1 and T2 that redistribute it so that each PE holds complete data along a different axis]
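In practice, a transpose amounts to an all-to-all exchange among the PEs of one row (or column) of the PE topology. A minimal sketch (assumed names and buffer packing, not the GEM implementation), redistributing data so that each PE ends up with the complete x dimension for a subset of levels:

! Sketch only (assumed names): transpose along x over the npex PEs of one
! row of the PE topology, using a row communicator row_comm.
! Before: this PE holds lni points in x for all gnk levels.
! After : this PE holds all gni points in x for lnk = gnk/npex levels.
! sendbuf must be packed so that the block destined to PE p is contiguous.
      call MPI_ALLTOALL( sendbuf, lni*lnk, MPI_REAL,
     $                   recvbuf, lni*lnk, MPI_REAL,
     $                   row_comm, ierr )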
What is MPI ?
● A Message Passing Interface
● Communications through messages can be
● Cooperative send / receive (democratic)
● One sided get / put (autocratic)
● Bindings defined for FORTRAN, C, C++
● For parallel computers, clusters, heterogeneous networks
● Full featured (but can be used in simple fashion)
[Diagram: message transfer time vs. message length; a fixed startup (latency) cost plus Tw = cost / word]
● include 'mpif.h'
● call MPI_INIT(ierr)
● call MPI_FINALIZE(ierr)
● call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
● call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
● call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
● call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)
Collective operations: MPI_gather, MPI_allgather, MPI_scatter, MPI_alltoall, MPI_bcast, MPI_reduce, MPI_allreduce
Reduction operations: mpi_sum, mpi_min, mpi_max
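Putting the basic calls together, a minimal self-contained example (not GEM code): every PE prints its rank, and PE 0 sends one value to PE 1.

      program mpi_minimal
      include 'mpif.h'
      integer rank, npes, ierr, status(MPI_STATUS_SIZE)
      real    buf

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
      print *, 'PE', rank, 'of', npes

      if (rank .eq. 0) then
         buf = 1.0
         call MPI_SEND(buf, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(buf, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD,
     $                 status, ierr)
      endif

      call MPI_FINALIZE(ierr)
      end

Run with, for example, mpirun -np 4: each PE prints its rank, but only PEs 0 and 1 exchange a message.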
The RPN_COMM toolkit (Michel Valin)
● NO INCLUDE FILE NEEDED (like mpif.h)
● Higher level of abstraction
● Initialization / termination of communications
● Topology determination
● Point to point operations
● Halo exchange (direct message to N/S/W/E neighbors)
● Collective operations
● Transpose
● Gather / distribute
● Data reduction
● Equivalent calls to most frequently used MPI routines
● MPI_[something] => RPN_COMM_[something]
Partitioning Global Data
Gni=62, Gnj=25, PE topology: npex=4, npey=3
[Diagram: the 62 x 25 global grid partitioned over the 4 x 3 PE topology, with each PE's subdomain dimensions (lni x lnj) shown for two different partitionings; lni ranges from 14 to 16 and lnj from 7 to 9]
Valin partitioning: (Gni + npex – 1) / npex (vs. Thomas partitioning)
Dimensions of the largest subdomain are NOT affected
checktopo -gni 62 -gnj 25 -gnk 58 -npx 4 -npy 2 -pil 7 -hblen 10
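One possible reading of the ceiling-division rule above, as a sketch (assumed names, not the actual GEM / RPN_COMM code): every PE column except the last gets (Gni + npex - 1) / npex points and the last column gets the remainder.

! Sketch only (assumed names): size of the subdomain owned by PE column mex
! (0-based), using the ceiling-division rule (Gni + npex - 1) / npex.
      integer function local_ni (Gni, npex, mex)
      integer Gni, npex, mex, maxni
      maxni    = (Gni + npex - 1) / npex        ! largest subdomain dimension
      local_ni = min (maxni, Gni - mex*maxni)   ! last column gets the remainder
      return
      end

With Gni=62 and npex=4 this gives 16, 16, 16, 14, and with Gnj=25, npey=3 it gives 9, 9, 7, consistent with one of the two partitionings in the figure above.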
DMP Scalability
Scaling up with an optimum subdomain dimension:
● Size: 500 x 50 on vector processor systems
● Size: 100 x 50 on cache systems
● Time to solution should remain the same

Scaling up on a fixed size problem:
● Time to solution should decrease linearly with the # of CPUs
MC2 Performance on NEC SX4 and Fujitsu VPP700
[Plot: flop rate per PE (MFlops/sec., roughly 700 to 900) vs. number of PEs (0 to 35) for SX4 (1 node and 2 nodes, npx=2) and VPP700 (npx=1); grid: 513 x 433 x 41]

IFS Performance on NEC SX4 and Fujitsu VPP700
[Plot: forecast days / day vs. number of PEs for the SX-4 and VPP700]
Amdahl's law for parallel programming
The speedup factor is influenced very much by the residual serial (non-parallelizable) work. As the number of processors grows, so does the damage caused by the non-parallelizable work.
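In its usual form, with f the serial (non-parallelizable) fraction of the work and N the number of processors, the speedup is

S(N) = \frac{1}{f + (1 - f)/N} \le \frac{1}{f}

so even a small serial fraction caps the achievable speedup: with f = 0.05, no more than a factor of 20 can be gained no matter how many processors are used.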
Scalability: limiting factors
● Any algorithm requiring global communications
  – One should THINK LOCAL
● SL transport on a global-configuration lat-lon grid-point model: numerical poles (GEM)
● 2-time-level fully implicit discretization leading to an elliptic problem: the direct solver requires a data transpose
● Any algorithm producing inherent load imbalance
DMP - General remarks
● More difficult but more powerful programming paradigm
● Easily combined with SMP (on all MPI processes)
● Distributed memory parallelism does not happen, it must be DESIGNED.
● One does not parallelize a code: the code must be rebuilt (and often redesigned) taking into account the constraints imposed on the dataflow by message passing. Array dimensioning and loop indexing are likely to be VERY HEAVILY IMPACTED.
● One may get lucky and HPF or an automatic parallelizing compiler will solve the problem (if one believes in miracles, Santa Claus, the tooth fairy or all of them).
Web sites and Books
● http://pollux.cmc.ec.gc.ca/~armnmfv/MPI_workshop
● http://www.llnl.gov/ , OpenMP, threads, MPI, ...
● http://hpcf.nersc.gov/
● http://www.idris.fr/ (in French), OpenMP, MPI, F90
● Using MPI, Gropp et al., ISBN 0-262-57204-8
● MPI: The Complete Reference, Snir et al., ISBN 0-262-69184-1
● MPI: The Complete Reference, vol. 2, Gropp et al., ISBN 0-262-57123-4