Jacobi solver status
Lucian Anton, Saif Mulla, Stef Salvini
CCP_ASEARCH meeting, October 8, 2013
Daresbury
Outline
• Code structure
– Front end
– Numerical kernels
– Data collection
• Performance data
– Intel SB
– Xeon Phi
– BlueGeneQ
– GPU
8/10/13 Jacobi test program 2
Code structure
• Read input from the command line
– Grid sizes, length of an iteration block, number of iteration blocks, ...
– Algorithm to use
– Output format (header, test iterations, …)
• Initialize the grid with an eigenvector of the Jacobi smoother
• Run several iteration blocks
• Collect min, max and average times
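The last two steps above can be sketched in C as follows. This is a minimal illustration, not the program's actual code: the names `TimeStats`, `wall_seconds`, `collect_stats` and `time_blocks` are hypothetical, and the C11 `timespec_get` timer is an assumption (an OpenMP build might well use `omp_get_wtime` instead).

```c
#include <stddef.h>
#include <time.h>

/* Min, mean and max of the wall-clock times of the iteration blocks. */
typedef struct {
    double min, mean, max;
} TimeStats;

/* Wall-clock time in seconds, using the C11 timespec_get call. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Reduce n >= 1 block timings to their min, mean and max. */
static TimeStats collect_stats(const double *t, size_t n)
{
    TimeStats s = { t[0], 0.0, t[0] };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (t[i] < s.min) s.min = t[i];
        if (t[i] > s.max) s.max = t[i];
        sum += t[i];
    }
    s.mean = sum / (double)n;
    return s;
}

/* Time nblocks calls of block(arg); times[] must hold nblocks entries.
   block stands in for one block of smoother iterations. */
static TimeStats time_blocks(void (*block)(void *), void *arg,
                             size_t nblocks, double *times)
{
    for (size_t b = 0; b < nblocks; ++b) {
        const double t0 = wall_seconds();
        block(arg);
        times[b] = wall_seconds() - t0;
    }
    return collect_stats(times, nblocks);
}
```

Timing whole blocks rather than single iterations keeps the timer overhead small relative to the measured work, which is presumably why the block length is a run parameter.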
Build model
• Uses a generic Makefile + platform/*.inc files

    F90 := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
           source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && mpiifort

    CC := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
          source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && icc

    LANG = C

    ifdef USE_MIC
    FMIC = -mmic
    endif

    ifdef USE_MPI
    FMPI=-DUSE_MPI
    endif

    ifdef USE_DOUBLE_PRECISION
    DOUBLE=-DUSE_DOUBLE_PRECISION
    endif

    ifdef USE_VEC1D
    VEC1D = -DUSE_VEC1D
    endif

    #FC = module add intel/comp intel/mpi && mpiifort
Command line parameters
    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -help
    Usage: [-ng <grid-size-x> <grid-size-y> <grid-size-z> ] [ -nb <block-size-x> <block-size-y> <block-size-z>] [-np <num-proc-x> <num-proc-y> <num-proc-z>] [-niter <num-iterations>] [-biter <iterations-block-size>] [-malign <memory-alignment> ] [-v] [-t] [-pc] [-model <model_name> [num-waves] [threads-per-column]] [-nh] [-help]

    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -model help
    possible values for model parameter:
    baseline
    baseline-opt
    blocked
    wave num-waves threads-per-column
    basegpu
    optgpu

• Note for wave model: if threads-per-column == 0 diagonal wave kernel is used.
README file
A full explanation of the command line options is provided in the README:

    The following flags can be used to set the grid sizes and other run parameters:
    -ng <nx> <ny> <nz> set the global grid sizes
    -nb <bx> <by> <bz> set the computational block size, relevant only for the blocked model.
    Notes: 1) no sanity checks are done, you are on your own.
           2) for the blocked model the OpenMP parallelism is done over
              computational blocks. One must ensure that there is
              enough work for all threads by setting suitable
              block sizes.
Correctness check
• The -t flag checks whether the ratios of successive norms are close to the Jacobi smoother eigenvalue
    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -t -niter 7
    Correctness check
    iteration, norm ratio, deviation from eigenvalue
     0 6.36918e+01 6.26966e+01
     1 9.95185e-01 2.55054e-08
     2 9.95185e-01 1.50473e-08
     3 9.95185e-01 2.57243e-08
     4 9.95185e-01 3.27436e-08
     5 9.95185e-01 1.96427e-08
     6 9.95185e-01 3.17978e-08
    # Last norm 6.187368259733268e+01
    #==================================================================#
    # NThs    Nx    Ny    Nz  NITER    minTime    meanTime     maxTime #
    #==================================================================#
         8    33    33    33      1  1.299e-04   1.487e-04   1.690e-04
Algorithms
• Basic three-loop iteration over the grid
– OpenMP parallelism applied to the outer loop
– If-condition eliminated from the inner loop
• Blocked iterations
• Wave iterations
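A minimal sketch of the first two variants is given below. It rests on several assumptions that the slides do not confirm: `Real` is taken to be `double`, the grid is stored with `i` fastest and `k` slowest, and the eliminated if-condition is taken to be a boundary test, which disappears once the loops run over interior points only. In the blocked variant the OpenMP loop runs over computational blocks, as the README notes describe. All function names are hypothetical.

```c
#include <stddef.h>

typedef double Real;                    /* assumption: double precision build */
static const Real sixth = 1.0 / 6.0;

/* Linear index of (i,j,k), i fastest; this layout is an assumption. */
static size_t idx(int i, int j, int k, int nx, int ny)
{
    return (size_t)i + (size_t)nx * ((size_t)j + (size_t)ny * (size_t)k);
}

/* Mean of the six nearest neighbours of (i,j,k). */
static Real neighbour_average(const Real *u, int i, int j, int k, int nx, int ny)
{
    return sixth * (u[idx(i - 1, j, k, nx, ny)] + u[idx(i + 1, j, k, nx, ny)] +
                    u[idx(i, j - 1, k, nx, ny)] + u[idx(i, j + 1, k, nx, ny)] +
                    u[idx(i, j, k - 1, nx, ny)] + u[idx(i, j, k + 1, nx, ny)]);
}

static int imin(int a, int b) { return a < b ? a : b; }

/* Baseline: three nested loops over interior points, outer loop threaded.
   No boundary test is needed inside the loops. */
static void jacobi_baseline(const Real *u, Real *w, int nx, int ny, int nz)
{
    #pragma omp parallel for
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                w[idx(i, j, k, nx, ny)] = neighbour_average(u, i, j, k, nx, ny);
}

/* Blocked: the threaded loops run over bx*by*bz blocks of interior points,
   so the block sizes must leave enough blocks for all threads. */
static void jacobi_blocked(const Real *u, Real *w, int nx, int ny, int nz,
                           int bx, int by, int bz)
{
    #pragma omp parallel for collapse(3)
    for (int kb = 1; kb < nz - 1; kb += bz)
        for (int jb = 1; jb < ny - 1; jb += by)
            for (int ib = 1; ib < nx - 1; ib += bx) {
                const int kend = imin(kb + bz, nz - 1);
                const int jend = imin(jb + by, ny - 1);
                const int iend = imin(ib + bx, nx - 1);
                for (int k = kb; k < kend; ++k)
                    for (int j = jb; j < jend; ++j)
                        for (int i = ib; i < iend; ++i)
                            w[idx(i, j, k, nx, ny)] =
                                neighbour_average(u, i, j, k, nx, ny);
            }
}
```

Both functions compute the same values; only the traversal order differs. Without an OpenMP flag the pragmas are ignored and the code runs serially.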
Algorithms: wave details
[Figure: wave iteration diagram in the Y-Z plane, with planes labelled Old and New]
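Since the diagram is not reproduced here, the following C sketch shows one common way to organise wave iterations. It is an assumption about the general technique, not the program's kernel: several sweeps advance through the grid together, with sweep s working one z-plane behind sweep s-1, while the values alternate between two arrays. The sketch is serial and only demonstrates that the dependency order is valid. The names `update_plane`, `jacobi_plain` and `jacobi_wave` are hypothetical, and both arrays must start with identical boundary values.

```c
#include <stddef.h>

typedef double Real;                    /* assumption: double precision build */
static const Real sixth = 1.0 / 6.0;

/* Update the interior of plane k: dst gets the neighbour mean of src. */
static void update_plane(const Real *src, Real *dst, int nx, int ny, int k)
{
    const size_t sy = (size_t)nx, sz = (size_t)nx * (size_t)ny;
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i) {
            const size_t c = (size_t)i + sy * (size_t)j + sz * (size_t)k;
            dst[c] = sixth * (src[c - 1] + src[c + 1] + src[c - sy] +
                              src[c + sy] + src[c - sz] + src[c + sz]);
        }
}

/* Reference: nsteps complete sweeps, alternating between a and b.
   Returns the array that holds the final values. */
static Real *jacobi_plain(Real *a, Real *b, int nx, int ny, int nz, int nsteps)
{
    Real *buf[2] = { a, b };
    for (int s = 0; s < nsteps; ++s)
        for (int k = 1; k < nz - 1; ++k)
            update_plane(buf[s % 2], buf[(s + 1) % 2], nx, ny, k);
    return buf[nsteps % 2];
}

/* Wave: at wave position p, sweep s updates plane p - s. Within one
   position the sweeps run in increasing s, so sweep s finds planes
   k-1, k, k+1 already updated by sweep s-1 and not yet overwritten. */
static Real *jacobi_wave(Real *a, Real *b, int nx, int ny, int nz, int nsteps)
{
    Real *buf[2] = { a, b };
    const int last = (nz - 2) + (nsteps - 1);
    for (int p = 1; p <= last; ++p)
        for (int s = 0; s < nsteps; ++s) {
            const int k = p - s;
            if (k < 1 || k > nz - 2)
                continue;               /* this sweep is outside the grid */
            update_plane(buf[s % 2], buf[(s + 1) % 2], nx, ny, k);
        }
    return buf[nsteps % 2];
}
```

The benefit of this ordering is that a plane is reused by several sweeps while it is still in cache. A threaded version needs extra synchronisation or a larger plane offset between sweeps, which this sketch does not attempt.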
Algorithms: helping vectorisation
The inner loop can be replaced with a function that is easier to vectorize:

    // 1D loop that helps the compiler to vectorize
    static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                              const Real uWest[], const Real uEast[],
                              const Real uBottom[], const Real uTop[],
                              Real w[] )
    {
      int i;
    #ifdef __INTEL_COMPILER
    #pragma ivdep
    #endif
    #ifdef __IBMC__
    #pragma ibm independent_loop
    #endif
      for (i=0; i < n; ++i)
        w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i] + uBottom[i] + uTop[i]);
    }
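As a usage sketch, the function above can be called once per grid row, passing six shifted views of the same input array. The driver `sweep_with_1d_kernel` is hypothetical, as are the definitions of `Real` and `sixth` and the mapping of north/south, west/east and bottom/top to the j, i and k directions. The sum is symmetric in the six neighbours, so the mapping does not affect the result.

```c
#include <stddef.h>

typedef double Real;                    /* assumption: double precision build */
static const Real sixth = 1.0 / 6.0;    /* assumption: defined like this */

/* 1D kernel as shown on the slide, repeated so the sketch is self-contained. */
static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                          const Real uWest[], const Real uEast[],
                          const Real uBottom[], const Real uTop[],
                          Real w[])
{
    int i;
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
#ifdef __IBMC__
#pragma ibm independent_loop
#endif
    for (i = 0; i < n; ++i)
        w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i] +
                        uBottom[i] + uTop[i]);
}

/* Hypothetical driver: one kernel call per interior row of the grid.
   u and w must be distinct arrays of nx*ny*nz values, i fastest. */
static void sweep_with_1d_kernel(const Real *u, Real *w, int nx, int ny, int nz)
{
    const size_t sy = (size_t)nx;                /* stride between rows   */
    const size_t sz = (size_t)nx * (size_t)ny;   /* stride between planes */
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j) {
            /* first interior point of row (j,k) */
            const size_t c = 1 + sy * (size_t)j + sz * (size_t)k;
            vec_oneD_loop(nx - 2,
                          u + c + sy, u + c - sy,   /* north, south: j +/- 1 */
                          u + c - 1,  u + c + 1,    /* west, east:   i -/+ 1 */
                          u + c - sz, u + c + sz,   /* bottom, top:  k -/+ 1 */
                          w + c);
        }
}
```

Passing the neighbours as separate array arguments gives the compiler a plain unit-stride loop, and the independence pragmas tell it that the output does not overlap the inputs. That holds here only because `w` and `u` are distinct arrays.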
Algorithms: CUDA
• Base laplace3D (from Mike’s lecture notes)
• Shared memory in XY plane
• … more to come
Data collection
With such a large parameter space we have a big-ish data problem.
Bash script + gnuplot
    index=0
    for exe in $exe_list
    do
      for model in $model_list
      do
        for nth in $threads_list
        do
          export OMP_NUM_THREADS=$nth
          for ((linsize=10; linsize <= max_linsize; linsize += step))
          do
            biter=$(((10*max_linsize)/linsize))
            niter=5
            if [ "$model" = wave ]
            then
              nwave="$biter $((nth<biter?nth:biter))"
              echo "model $model $nwave"
            else
              nwave=""
            fi

            if [ "$blk_x" -eq 0 ] ; then blk_xt=$linsize ; else blk_xt=$blk_x ; fi
            if [ "$blk_y" -eq 0 ] ; then blk_yt=$linsize ; else blk_yt=$blk_y ; fi
            if [ "$blk_z" -eq 0 ] ; then blk_zt=$linsize ; else blk_zt=$blk_z ; fi

            echo "./"$exe" -ng $linsize $linsize $linsize -nb $blk_xt $blk_yt $blk_zt -model $model $nwave
SandyBridge baseline
SB: blocked and wave
BGQ
Xeon Phi vs SandyBridge
Fermi data
Conclusions & To do
• We have an integrated set of Jacobi smoother algorithms
– OpenMP, CUDA, MPI (almost)
– Flexible build system
– Run parameters can be selected from the command line and preprocessor flags
– Correctness check
– Scripted data collection
– README file
• Tested on several systems (iDataPlex, BGQ, Emerald, …, MacOS laptop)
• GPU needs further improvements
• ….