Intro Multicore Distributed Conclusion
Parallel Programming for Multicore and Distributed Systems
10th Summer School in Statistics for Astronomers
Pierre-Yves Taunay
Research Computing and Cyberinfrastructure
224A Computer Building
The Pennsylvania State University, University Park
June 2014
Objectives
1. Have a good understanding of
   1.1 Shared memory programs executed on multicore machines, and
   1.2 Distributed programs executed on multiple multicore machines
2. Get familiar with programming syntax for
   2.1 Shared memory programming, and
   2.2 Distributed programming
Recap from last session – 1/2
- Multicore programming
  → Possible to have 2 or more cores in a processor cooperate through threads
  → Thread libraries: OpenMP, POSIX, Boost
- Distributed programming
  → Multiple compute nodes are cooperating
  → MPI as the communication paradigm
Recap from last session – 2/2
Shared memory                           Distributed memory
Limited to 1 machine                    No limit
1 process = multiple cores              1 process = 1 core
Multiple cores can access same memory   Memory local to processor core
Data exchange through memory bus        Data exchange through network
Disclaimers
- C for multicore programming
- C for distributed programming
OpenMP: General concepts – 1/2

- Shared memory: all threads can access the same data pool
- Still possible to have private data for each thread
OpenMP: General concepts – 2/2

- Fork-join model of execution
- Directive based: give "hints" to the compiler
OpenMP: Compiling a program

- Intel compiler: -openmp flag
- GCC: -fopenmp flag
gcc -fopenmp -O3 myprogram.c -o myprogram
OpenMP: Running a program

- Environment variables are set to control the program
- Number of threads to launch: OMP_NUM_THREADS

export OMP_NUM_THREADS=4    (bash)

setenv OMP_NUM_THREADS 4    (csh)
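As a quick sanity check that the variable took effect, a minimal sketch (the helper name count_threads and the fallback stubs are ours; the stubs let the file also build without -fopenmp, where the pragma is simply ignored):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so this sketch also builds without -fopenmp */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Reports how many threads OpenMP actually launched,
   as seen from inside a parallel region */
int count_threads(void) {
    int n = 1;
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            n = omp_get_num_threads();
    }
    return n;
}
```

Built with gcc -fopenmp and OMP_NUM_THREADS=4, count_threads() should report 4; without -fopenmp it reports 1.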
OpenMP: Timing parts of a program

- Function omp_get_wtime()

double tic = omp_get_wtime();
// Code to time...
double toc = omp_get_wtime();
printf("Result: %f\n", toc - tic);
Parallel control structure
- In a parallel block, code is executed by all threads

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    printf("Hello world!\n");

    return 0;
}
Work sharing: for directive – 1/3

- Multiple ways to share work
- Simple "for" loop: iterations are distributed amongst threads

#pragma omp parallel
#pragma omp for
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i];
Work sharing: for directive – 2/3

- Additional options:
  → reduction: to perform a reduction
  → ordered: iterations are executed in the same order as in serial code
  → nowait: threads do not synchronize at the end of the loop

#pragma omp parallel
#pragma omp for reduction(+:out)
for (int i = 0; i < N; i++)
    out = out + a[i];
Work sharing: for directive – 3/3

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    #pragma omp for
    for (int i = 0; i < N; i++)
        z[i] = x[i] + y[i];
} /* End of parallel */
Work sharing: sections directive

- Distributes a set of structured blocks amongst a "team" of threads
- Does not have to have iterations, as in the for directive
- Synchronization is implied at the end of sections
- Stand-alone section directives can be nested in a sections block
- Each section is executed once by a thread
Work sharing: sections directive

#pragma omp parallel
{
    #pragma omp sections nowait
    {
        #pragma omp section
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        #pragma omp section
        printf("Hello!\n");
    } /* End of sections */
} /* End of parallel */
Work sharing: single directive

- single: only one thread executes that block

#pragma omp parallel
{
    #pragma omp single
    {
        printf("Starting operations...\n");
    } /* End of single */

    #pragma omp sections nowait
    {
        #pragma omp section
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        #pragma omp section
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
    } /* End of sections */
} /* End of parallel */
Parallel work sharing
- Combine directives

#pragma omp parallel for
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
Data model
- Threads can access a global shared data pool
- Data can also be private to a thread
Data control: private directive

- Control of data is done with pragmas too

int my_thread_num = 0;

#pragma omp parallel private(my_thread_num)
{
    // Get the thread number
    my_thread_num = omp_get_thread_num();

    printf("Hello from thread %d\n", my_thread_num);
} /* End parallel */

- Remark: private creates a private variable for each thread, but does not copy the data. Use firstprivate or lastprivate to do so.
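A sketch of the difference (the function name firstprivate_demo is ours): with plain private the per-thread copy starts uninitialized, while firstprivate initializes each copy from the value the variable had before the region:

```c
/* Each thread gets its own copy of x, initialized to 10 thanks to
   firstprivate; with plain private the copies would be uninitialized. */
int firstprivate_demo(void) {
    int x = 10;
    int out = 0;
    #pragma omp parallel firstprivate(x) reduction(+:out)
    {
        x = x + 1;   /* private copy: 10 -> 11 in every thread */
        out += x;    /* each thread contributes 11 */
    }
    return out;      /* 11 * (number of threads); 11 if built without OpenMP */
}
```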
Data control: Dangers of shared resources

#pragma omp parallel shared(a, b, c, x, y, z)
{
    #pragma omp for nowait
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    #pragma omp for
    for (int i = 0; i < N; i++)
        z[i] = a[i]*x[i] + y[i];
} /* End parallel */

- Cannot ensure that "a" was updated before being used again: race condition
Detecting race conditions

- valgrind with the helgrind tool
- Compile with -g

gcc -g race.c -o race -fopenmp -std=c99

- Run valgrind

valgrind --tool=helgrind ./race
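The race.c referenced by the commands above is not shown on the slide; a minimal racy program in its spirit (our own sketch) that helgrind should flag:

```c
/* race.c sketch -- a deliberate data race: 'sum' is shared and updated
   without synchronization, so increments from different threads can be
   lost; helgrind reports the conflicting accesses. */
int race_sum(int n) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum = sum + 1;   /* unsynchronized read-modify-write */
    return sum;          /* may be < n when built with -fopenmp */
}
```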
Controlling the execution flow: barrier directive

#pragma omp parallel shared(a, b, c, x, y, z)
{
    #pragma omp for nowait
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    #pragma omp barrier

    #pragma omp for
    for (int i = 0; i < N; i++)
        z[i] = a[i]*x[i] + y[i];
} /* End parallel */

- Expensive
- Wasted resources
- ...but sometimes necessary!
Controlling the execution flow: Other directives

- master
- critical
- atomic
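These three are listed without code on the slide; a brief sketch of how they differ (the function name safe_sum is ours, and the fallback stub lets it build without -fopenmp):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stub so this sketch also builds without -fopenmp */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Unlike the racy version, every shared update is protected */
int safe_sum(int n) {
    int sum = 0;
    int master_id = -1;
    #pragma omp parallel
    {
        /* master: executed by the master thread (thread 0) only */
        #pragma omp master
        master_id = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < n; i++) {
            /* atomic: protects this single memory update; a critical
               block would also work, but it serializes a whole region
               and is more expensive for one operation */
            #pragma omp atomic
            sum += 1;
        }
    }
    (void)master_id;   /* always 0 here; shown only for the directive */
    return sum;        /* always n, with or without OpenMP */
}
```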
Worked example I

- Want to calculate the log-likelihood on several processes, at once
int main(int argc, char *argv[]) {
    ...
    // Parse the command line
    ret = parse_command_line(argc, argv,
                             &nobs, &sizex, &nsample, &location);

    // Parse the data on master
    double *buffer_X = (double*)malloc(nobs*sizex*sizeof(double));
    double *isigma = (double*)malloc(sizex*sizex*sizeof(double));
    double *mu = (double*)malloc(sizex*sizeof(double));
    double det_sigma = 0.0;

    ret = read_data(buffer_X, isigma, &det_sigma, mu,
                    &nobs, &sizex, location);

    // Thread variables
    int nthreads = 1;
    int th_num = 0;
    int th_nobs = nobs;
    nthreads = get_num_threads();
    // Timing variables
    double tic, toc, tot_time = 0.0;
Worked example II

    // Arrays for all threads
    // The pool is allocated inside the shared memory
    double *pool_LV   = (double*)malloc(nobs*sizex*sizeof(double)); // Left hand side vector (X-mu)
    double *pool_tmp  = (double*)malloc(nobs*sizex*sizeof(double)); // Temporary holder for (X-mu)*SIG
    double *pool_ones = (double*)malloc(nobs*sizeof(double));       // Temporary holder to create LV
    double *pool_res  = (double*)malloc(nthreads*sizeof(double));   // Each thread puts its result in pool_res

    // Use pointers to get the correct location in the array
    double *LV = NULL;
    double *tmp = NULL;
    double *ones = NULL;
    double *X = NULL;

    // Holder for final sum
    double final_sum = 0.0;

    // Main driver
Worked example III

    #pragma omp parallel private(ones, LV, tmp, X, th_num, th_nobs) default(shared)
    {
        // Get thread number
        th_num = omp_get_thread_num();
        // Total number of observations for that thread
        th_nobs = nobs / nthreads;

        // Use the address to point to the correct location in the vector
        X = &buffer_X[th_num*nobs*sizex/nthreads];
        LV = &pool_LV[th_num*th_nobs*sizex];
        tmp = &pool_tmp[th_num*th_nobs*sizex];
        ones = &pool_ones[th_num*th_nobs];

        // Each thread can now calculate the term in the
        // exponent for a subset of random vectors
        log_likelihood(X, isigma, mu, det_sigma, th_nobs,
                       sizex, &pool_res[th_num], LV, tmp, ones);

        #pragma omp barrier

        // Reduction: sum all the intermediary results
        #pragma omp for reduction(+:final_sum)
        for (int i = 0; i < nthreads; i++)
            final_sum = final_sum + pool_res[i];
Worked example IV
    } /* End of parallel */
    toc = omp_get_wtime();
    tot_time += toc - tic;
    ...
}
Summary
- Multicore / shared memory: use OpenMP
- Language extensions for C, C++, and Fortran
- Compiler directives

Pros:
+ Relatively easy to use: speedup with not too much effort
+ Good scalability on 1 node
+ No network communication – low latency, high BW

Cons:
- Threads are heavy
- Limited to one compute node
- Overhead if not enough work for threads
Going further with OpenMP
- OpenMP documentation: http://openmp.org/wp/resources/
- LLNL OpenMP tutorial: http://computing.llnl.gov/tutorials/openMP/
- RCC HPC Essentials: http://rcc.its.psu.edu/education/seminars/pages/hpc_essentials/HPC2.pdf
MPI: General concepts – 1/3

- Process: work unit on a single core
- Rank: process number
- Distributed memory: multiple compute nodes
- Message Passing: the communication paradigm
- Send / Receive
- Buffer: space where data is stored for send / receive ops
- Synchronous / Asynchronous: wait / don't wait for the transfer to be complete (ACK from receiver)
- Blocking / Non-blocking: completion of comm. is (in)dependent of events (e.g. the buffer that contained data is available for reuse)
- Communicators, groups: can specify topology – MPI_COMM_WORLD
MPI: General concepts – 2/3

- Distributed memory: each process has its own data
- Collaboration through message exchange
MPI: General concepts – 3/3

- A process is started on each specified core
MPI: Compiling a program

- Choice 1: Use the MPI wrappers to compile and link

mpicc -c myfile1.c -o myfile1.o
mpicc -c myfile2.c -o myfile2.o
mpicc myfile1.o myfile2.o -o myprogram

- Choice 2: Use any compiler; you have to provide the include file location and the libraries location as well

gcc -c myfile1.c -o myfile1.o -I/path/to/mpi/include
gcc -c myfile2.c -o myfile2.o -I/path/to/mpi/include
gcc myfile1.o myfile2.o -o myprogram -L/path/to/mpi/libs -lmpi
MPI: Running a program

- Use the command "mpirun"

> mpirun ./hello
MPI: Program structure – 1/2

1. Include file

   #include <mpi.h>

2. Program start...

   int main(int argc, char *argv[]) {

3. Initialize the MPI environment

   MPI_Init(&argc, &argv);

4. Do stuff...

5. "Finalize" the MPI environment

   MPI_Finalize();
MPI: Program structure – 2/2

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {

    MPI_Init(&argc, &argv);
    printf("Hello world!\n");
    MPI_Finalize();

    return 0;
}

Output:
Hello world!
Hello world!
MPI: Size and rank – 1/2

- What is the process ID?

MPI_Comm_rank(comm, *rank);
// Usage:
int my_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

- How many processes did we launch?

MPI_Comm_size(comm, *size);
// Usage:
int number_proc;
MPI_Comm_size(MPI_COMM_WORLD, &number_proc);
MPI: Size and rank – 2/2

MPI_Init(&argc, &argv);
int my_rank, number_proc;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &number_proc);

printf("Hello from process %d of %d!\n", my_rank, number_proc);
MPI_Finalize();

Output:
Hello from process 1 of 2!
Hello from process 0 of 2!
MPI: Send / Recv – 1/4

- Message Passing: point-to-point

// Sending data:
MPI_Send(*buffer, count, type, destination, tag, comm);
// Receiving data:
MPI_Recv(*buffer, count, type, source, tag, comm, status);
MPI: Send / Recv – 2/4

int my_rank, number_proc;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// Process 0 wants to send 10 integers to process 1
if (my_rank == 0) {
    // Send 10 integers from vector V
    MPI_Send(V, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (my_rank == 1) {
    // Receive 10 integers from vector V; store in U
    MPI_Status status;
    MPI_Recv(U, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
MPI: Send / Recv – 3/4

if (my_rank == 0) {
    int *V = (int*)malloc(NELEM*sizeof(int));
    for (int i = 0; i < NELEM; i++)
        V[i] = rand();

    print_vector(V, NELEM, my_rank);

    int dest = 1;
    MPI_Send(V, NELEM, MPI_INT, dest, 0, MPI_COMM_WORLD);
    printf("New vector V\n");
    print_vector(V, NELEM, my_rank);

    free(V);

} else if (my_rank == 1) {
    int *U = (int*)malloc(NELEM*sizeof(int));

    print_vector(U, NELEM, my_rank);

    int src = 0;
    MPI_Status status;
    MPI_Recv(U, NELEM, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
    printf("New vector U\n");
    print_vector(U, NELEM, my_rank);

    free(U);
}
MPI: Send / Recv – 4/4

Output:

Vector V
Process 0
V[0] = 1918581883
V[1] = 1004453085
V[2] = 786820889

Vector U
Process 1
V[0] = 24259120
V[1] = 0
V[2] = 24265808

New vector V
Process 0
V[0] = 1918581883
V[1] = 1004453085
V[2] = 786820889

New vector U
Process 1
V[0] = 1918581883
V[1] = 1004453085
V[2] = 786820889
MPI: A word on distributed memory

- Watch where you allocate memory!

int *V;
if (my_rank == 0) {
    V = (int*)malloc(N*sizeof(int));
}
...
// Seg. fault error -- only process 0 has allocated memory!
if (my_rank == 1) {
    V[0] = 1;
}
MPI: Collectives

- Involves all processes in a communicator
- Examples: broadcast, gather, scatter
MPI: Collectives

- Message Passing: broadcast

MPI_Bcast(*data, count, type, root, comm)

...
int data = rand();
// Process 0 sends its value to everyone
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
...
MPI: Collectives

- Message Passing: scatter

MPI_Scatter(*send_data, send_count, send_datatype,
            *recv_data, recv_count, recv_datatype,
            root, comm)

...
int *data = NULL;        // Size = 2*N elements
int *recv_buffer = NULL; // Size = 2 elements
...
for (int i = 0; i < 2*N; i++)
    data[i] = rand();

// Process 0 sends 2 values to everyone
// 0: [0,1], 1: [2,3], etc.
MPI_Scatter(data, 2, MPI_INT, recv_buffer, 2, MPI_INT, 0, MPI_COMM_WORLD);
...
MPI: Collectives

- Message Passing: gather

MPI_Gather(*send_data, send_count, send_datatype,
           *recv_data, recv_count, recv_datatype,
           root, comm)

...
int *data = NULL;        // Size = 2 elements
int *recv_buffer = NULL; // Size = 2*N elements
...
for (int i = 0; i < 2; i++)
    data[i] = rand();

// Process 0 receives 2 elements from each of the N processes
// (recv_count is the count received per process, not the total)
MPI_Gather(data, 2, MPI_INT, recv_buffer, 2, MPI_INT, 0, MPI_COMM_WORLD);
...
MPI: Collectives

- Other efficient collective: reductions
Worked example I

- Want to calculate the log-likelihood on several processes, at once

int main(int argc, char *argv[]) {
    // Start MPI
    MPI_Init(&argc, &argv);
    // Get rank and number of proc.
    int myrank, nproc = 0;
    MPI_Comm_rank(COMM, &myrank);
    MPI_Comm_size(COMM, &nproc);

    int nobs, sizex, nsample = 0;
    char *location = NULL;
    int ret = 0;

    // Parse the command line and set proc_nobs
    // Number of observations per process
    int proc_nobs;
    ret = parse_command_line(argc, argv, &nobs, &sizex, &nsample,
                             &location, &proc_nobs, myrank, nproc);

    // Allocate the data for ALL processes.
    double *isigma = (double*)malloc(sizex*sizex*sizeof(double));
    double *mu = (double*)malloc(sizex*sizeof(double));
    double det_sigma = 0.0;
Worked example II

    // Only the first process will parse the data. Others will wait
    double *buffer_X = NULL;
    if (myrank == 0) {
        int ret = 0;
        buffer_X = (double*)malloc(nobs*sizex*sizeof(double));
        ret = read_data(buffer_X, isigma, &det_sigma,
                        mu, &nobs, &sizex, location);
    }
    // Wait for process 0 to finish parsing the files
    MPI_Barrier(COMM);

    // Timing variables
    double tic, toc, tot_time = 0.0;

    // Sums
    double final_sum = 0.0;
    double my_sum = 0.0;

    // Individual arrays
    double *X = (double*)malloc(proc_nobs*sizex*sizeof(double));
    double *LV = (double*)malloc(proc_nobs*sizex*sizeof(double));
    double *tmp = (double*)malloc(proc_nobs*sizex*sizeof(double));
    double *ones = (double*)malloc(proc_nobs*sizeof(double));

    // Process 0 sends data to everyone with collectives
Worked example III

    MPI_Bcast(isigma, sizex*sizex, MPI_DOUBLE, 0, COMM);
    MPI_Bcast(mu, sizex, MPI_DOUBLE, 0, COMM);
    MPI_Bcast(&det_sigma, 1, MPI_DOUBLE, 0, COMM);
    MPI_Scatter(buffer_X, proc_nobs*sizex, MPI_DOUBLE,
                X, proc_nobs*sizex, MPI_DOUBLE, 0, COMM);

    tic = omp_get_wtime();
    final_sum = 0.0;
    log_likelihood(X, isigma, mu, det_sigma, proc_nobs, sizex,
                   &my_sum, LV, tmp, ones);

    // Combine all the intermediary sums in a single one
    MPI_Reduce(&my_sum, &final_sum, 1, MPI_DOUBLE, MPI_SUM, 0, COMM);
    MPI_Barrier(COMM);
    toc = omp_get_wtime();
    tot_time += toc - tic;
    ...
    if (myrank == 0) {
        free(buffer_X);
    }
    // Etc.
    MPI_Finalize();
    return EXIT_SUCCESS;
}
Summary
- Distributed programming: use MPI
- Library for C, C++, and Fortran
- Programmer responsible for a lot of stuff!

Pros:
+ Standard
+ Optimized collective communication
+ Can overlap communication and computation

Cons:
- Steep learning curve
- Difficult (?) to debug
- Communication through the network = optimization issues
Going further with MPI
- OpenMPI documentation: http://www.open-mpi.org/doc/
- MPI tutorial: http://mpitutorial.com/
- LLNL MPI tutorial: https://computing.llnl.gov/tutorials/mpi/
- RCC HPC Essentials: http://rcc.its.psu.edu/education/seminars/pages/hpc_essentials/HPC3.pdf
Conclusion
- Covered two approaches, for shared and distributed memory
- Shared memory or distributed memory?
  → Depends on problem size
  → Amdahl's law!
- Combine the best of both worlds: hybrid approach
- Other easier (?) distributed approaches
  → Python – MPI4py
  → R – Rmpi
  → Julia