Jacobi solver status

Lucian Anton, Saif Mulla, Stef Salvini

CCP_ASEARCH meeting, October 8, 2013

Daresbury

Outline

• Code structure
– Front end
– Numerical kernels
– Data collection
• Performance data
– Intel SB
– Xeon Phi
– BlueGeneQ
– GPU

8/10/13 Jacobi test program 2

Code structure


• Read input from command line
– Grid sizes, length of iteration block, # of iteration blocks, ...
– Algorithm to use
– Output format (header, test iterations, …)
• Initialize grid with an eigenvector of the Jacobi smoother
• Run several iteration blocks
• Collect min, max, average times.
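The last step, reducing the per-block timings to min, mean and max, could look roughly like the sketch below. The struct and function names are hypothetical and are not taken from the test program.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the wall-clock times measured for the iteration blocks. */
typedef struct {
    double min;
    double mean;
    double max;
} TimeStats;

/* Reduce n block timings (in seconds) to min, mean and max. Requires n > 0. */
TimeStats collect_time_stats(const double times[], size_t n)
{
    assert(n > 0);
    TimeStats s = { times[0], 0.0, times[0] };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (times[i] < s.min) s.min = times[i];
        if (times[i] > s.max) s.max = times[i];
        sum += times[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

Each entry of times would hold the elapsed time of one iteration block, for example measured with omp_get_wtime() around the block.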

Build model


• Uses a generic Makefile + platform/*.inc files

    F90 := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
           source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && mpiifort

    CC := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
          source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && icc

    LANG = C

    ifdef USE_MIC
    FMIC = -mmic
    endif

    ifdef USE_MPI
    FMPI=-DUSE_MPI
    endif

    ifdef USE_DOUBLE_PRECISION
    DOUBLE=-DUSE_DOUBLE_PRECISION
    endif

    ifdef USE_VEC1D
    VEC1D = -DUSE_VEC1D
    endif

    #FC = module add intel/comp intel/mpi && mpiifort
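On the C side, a flag such as -DUSE_DOUBLE_PRECISION would typically select the floating-point type at compile time. The sketch below assumes that this is how the Real type and the constant sixth used by the kernels are defined. The helper real_size_bytes is hypothetical.

```c
/* Assumed precision switch driven by -DUSE_DOUBLE_PRECISION. */
#ifdef USE_DOUBLE_PRECISION
typedef double Real;
#else
typedef float Real;
#endif

/* Weight of the 7-point Jacobi average: six neighbours, each weighted 1/6. */
static const Real sixth = (Real)(1.0 / 6.0);

/* Size in bytes of one grid value, e.g. for estimating memory traffic. */
unsigned long real_size_bytes(void)
{
    return (unsigned long)sizeof(Real);
}
```

Building with make USE_DOUBLE_PRECISION=1 would then switch every kernel to double without touching the sources, assuming the Makefile passes $(DOUBLE) to the compiler.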

Command line parameters


    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -help
    Usage: [-ng <grid-size-x> <grid-size-y> <grid-size-z> ] [ -nb <block-size-x> <block-size-y> <block-size-z>] [-np <num-proc-x> <num-proc-y> <num-proc-z>] [-niter <num-iterations>] [-biter <iterations-block-size>] [-malign <memory-alignment> ] [-v] [-t] [-pc] [-model <model_name> [num-waves] [threads-per-column]] [-nh] [-help]

    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -model help
    possible values for model parameter:
    baseline
    baseline-opt
    blocked
    wave num-waves threads-per-column
    basegpu
    optgpu

• Note for wave model: if threads-per-column == 0 diagonal wave kernel is used.
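Flags of this shape can be read with a simple scan over argv. The following is a minimal sketch for -ng and -niter only. RunParams and parse_args are hypothetical names, and the real front end may be organised differently.

```c
#include <stdlib.h>
#include <string.h>

/* Run parameters filled from the command line; field names are hypothetical. */
typedef struct {
    int ng[3];   /* global grid sizes: -ng <nx> <ny> <nz> */
    int niter;   /* number of iterations: -niter <n>      */
} RunParams;

/* Scan argv for -ng and -niter. Returns 0 on success and -1 if a flag
   is missing its values. Flags not handled here are skipped. */
int parse_args(int argc, char *argv[], RunParams *p)
{
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "-ng") == 0) {
            if (i + 3 >= argc) return -1;      /* need three values */
            for (int d = 0; d < 3; ++d)
                p->ng[d] = atoi(argv[++i]);
        } else if (strcmp(argv[i], "-niter") == 0) {
            if (i + 1 >= argc) return -1;      /* need one value */
            p->niter = atoi(argv[++i]);
        }
    }
    return 0;
}
```

The caller is expected to fill the struct with default values before calling parse_args, so that flags that are not given keep their defaults.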

README file


A full explanation of the command line options is provided in the README:

    The following flags can be used to set the grid sizes and other run parameters:

    -ng <nx> <ny> <nz> set the global grid sizes

    -nb <bx> <by> <bz> set the computational block size, relevant only for blocked model.

    Notes: 1) no sanity checks are done, you are on your own.
           2) for blocked model the OpenMP parallelism is done over
              computational blocks. One must ensure that there is
              enough work for all threads by setting suitable
              block sizes.
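Since the program itself does no sanity checks, note 2 can be checked by hand or with a small helper before a run. The sketch below assumes the blocks tile the whole grid and that "enough work" means at least one block per thread. All function names are hypothetical.

```c
#include <assert.h>

/* Number of blocks needed to cover n points with blocks of size b. */
int blocks_along(int n, int b)
{
    assert(n > 0 && b > 0);
    return (n + b - 1) / b;   /* ceiling division */
}

/* Total number of computational blocks for grid ng[] and block size nb[]. */
long total_blocks(const int ng[3], const int nb[3])
{
    long total = 1;
    for (int d = 0; d < 3; ++d)
        total *= blocks_along(ng[d], nb[d]);
    return total;
}

/* Nonzero if each of nthreads threads can be given at least one block. */
int enough_work(const int ng[3], const int nb[3], int nthreads)
{
    return total_blocks(ng, nb) >= nthreads;
}
```

For example, a 128 x 128 x 128 grid with blocks of 128 x 64 x 64 gives only 4 blocks, so a run with 8 threads would leave half of them idle.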

Correctness check


• The -t flag checks whether the norm ratios are close to the Jacobi smoother eigenvalue

    arcmport01:~/Projects/HOMB>./homb_c_gcc_debug_gpu.exe -t -niter 7
    Correctness check
    iteration, norm ratio, deviation from eigenvalue
     0 6.36918e+01 6.26966e+01
     1 9.95185e-01 2.55054e-08
     2 9.95185e-01 1.50473e-08
     3 9.95185e-01 2.57243e-08
     4 9.95185e-01 3.27436e-08
     5 9.95185e-01 1.96427e-08
     6 9.95185e-01 3.17978e-08
    # Last norm 6.187368259733268e+01
    #==========================================================================================================#
    # NThs Nx Ny Nz NITER minTime meanTime maxTime
    #==========================================================================================================#
    8 33 33 33 1 1.299e-04 1.487e-04 1.690e-04

Algorithms


• Basic three-loop iteration over the grid
– OpenMP parallelism applied to the outermost loop
– If condition eliminated from the inner loop
• Blocked iterations
• Wave iterations
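The first two variants can be sketched as follows. This is an illustration under assumptions, not the program's actual kernels: the grid is stored with i fastest, the loops run over interior points only (so the inner loop needs no boundary test), and all names are hypothetical. The wave variant is not shown.

```c
#include <stddef.h>

/* Linear index of point (i,j,k) in an nx*ny*nz grid stored with i fastest. */
static inline size_t idx(int nx, int ny, int i, int j, int k)
{
    return ((size_t)k * ny + j) * nx + i;
}

static inline int min_int(int a, int b) { return a < b ? a : b; }

/* Average of the six neighbours of point (i,j,k). */
static inline double avg6(int nx, int ny, const double *u, int i, int j, int k)
{
    return (u[idx(nx, ny, i - 1, j, k)] + u[idx(nx, ny, i + 1, j, k)] +
            u[idx(nx, ny, i, j - 1, k)] + u[idx(nx, ny, i, j + 1, k)] +
            u[idx(nx, ny, i, j, k - 1)] + u[idx(nx, ny, i, j, k + 1)]) / 6.0;
}

/* Baseline: three nested loops, OpenMP splits the outermost (k) loop. */
void jacobi_baseline(int nx, int ny, int nz, const double *u, double *w)
{
    #pragma omp parallel for
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                w[idx(nx, ny, i, j, k)] = avg6(nx, ny, u, i, j, k);
}

/* Blocked: the interior is tiled into bx*by*bz blocks and OpenMP
   distributes whole blocks among the threads. */
void jacobi_blocked(int nx, int ny, int nz, int bx, int by, int bz,
                    const double *u, double *w)
{
    #pragma omp parallel for collapse(3)
    for (int kb = 1; kb < nz - 1; kb += bz)
        for (int jb = 1; jb < ny - 1; jb += by)
            for (int ib = 1; ib < nx - 1; ib += bx) {
                const int kend = min_int(kb + bz, nz - 1);
                const int jend = min_int(jb + by, ny - 1);
                const int iend = min_int(ib + bx, nx - 1);
                for (int k = kb; k < kend; ++k)
                    for (int j = jb; j < jend; ++j)
                        for (int i = ib; i < iend; ++i)
                            w[idx(nx, ny, i, j, k)] = avg6(nx, ny, u, i, j, k);
            }
}
```

Both functions write the same values, so comparing their outputs on a small grid is a cheap regression test when tuning block sizes. Without OpenMP enabled the pragmas are ignored and the code runs serially.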

Algorithms: wave details


[Figure: wave iteration in the Y-Z plane, with grid regions labelled Old and New]

Algorithms: helping vectorisation


The inner loop can be replaced with an easier to vectorize function:

    // 1D loop that helps the compiler to vectorize
    static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                              const Real uWest[], const Real uEast[],
                              const Real uBottom[], const Real uTop[],
                              Real w[] )
    {
      int i;
    #ifdef __INTEL_COMPILER
    #pragma ivdep
    #endif
    #ifdef __IBMC__
    #pragma ibm independent_loop
    #endif
      for (i=0; i < n; ++i)
        w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i] + uBottom[i] + uTop[i]);
    }
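On the calling side, each x-row of the grid can be handed to this function by passing pointers offset to the six neighbouring rows. The sketch below repeats the kernel so that it is self-contained. The definitions of Real and sixth, the function jacobi_sweep_vec, the storage order (i fastest) and the mapping of north/south/west/east to grid directions are all assumptions.

```c
#include <stddef.h>

typedef float Real;                        /* assumption: single precision */
static const Real sixth = 1.0f / 6.0f;     /* assumption: 1/6 weight       */

/* 1D kernel as on the slide: w[i] is the average of the six neighbours. */
static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                          const Real uWest[], const Real uEast[],
                          const Real uBottom[], const Real uTop[], Real w[])
{
    int i;
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
#ifdef __IBMC__
#pragma ibm independent_loop
#endif
    for (i = 0; i < n; ++i)
        w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] + uEast[i]
                        + uBottom[i] + uTop[i]);
}

/* One sweep over the interior; every x-row goes through the 1D kernel. */
void jacobi_sweep_vec(int nx, int ny, int nz, const Real *u, Real *w)
{
    const size_t row = (size_t)nx, plane = (size_t)nx * ny;
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j) {
            /* index of the first interior point (i = 1) of this row */
            const size_t s = (size_t)k * plane + (size_t)j * row + 1;
            vec_oneD_loop(nx - 2,
                          &u[s + row],   &u[s - row],    /* j+1, j-1 */
                          &u[s - 1],     &u[s + 1],      /* i-1, i+1 */
                          &u[s - plane], &u[s + plane],  /* k-1, k+1 */
                          &w[s]);
        }
}
```

Because the kernel sees seven plain arrays and a unit-stride loop, the compiler does not have to reason about the 3D indexing, and the pragmas tell it that w does not alias the inputs.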

Algorithms: CUDA


• Base laplace3D (from Mike’s lecture notes)
• Shared memory in XY plane
• … more to come

Data collection


With such a large parameter space we have a big-ish data problem.

Bash script + gnuplot

    index=0
    for exe in $exe_list
    do
      for model in $model_list
      do
        for nth in $threads_list
        do
          export OMP_NUM_THREADS=$nth
          for ((linsize=10; linsize <= max_linsize; linsize += step))
          do
            biter=$(((10*max_linsize)/linsize))
            niter=5
            if [ "$model" = wave ]
            then
              nwave="$biter $((nth<biter?nth:biter))"
              echo "model $model $nwave"
            else
              nwave=""
            fi

            if [ "$blk_x" -eq 0 ] ; then blk_xt=$linsize ; else blk_xt=$blk_x ; fi
            if [ "$blk_y" -eq 0 ] ; then blk_yt=$linsize ; else blk_yt=$blk_y ; fi
            if [ "$blk_z" -eq 0 ] ; then blk_zt=$linsize ; else blk_zt=$blk_z ; fi

            echo "./"$exe" -ng $linsize $linsize $linsize -nb $blk_xt $blk_yt $blk_zt -model $model $nwave

SandyBridge baseline


SB: blocked and wave


BGQ


Xeon Phi vs SandyBridge


Fermi data


Conclusions & To do


• We have an integrated set of Jacobi smoother algorithms
– OpenMP, CUDA, MPI (almost)
– Flexible build system
– Run parameters can be selected from command line and preprocessor flags
– Correctness check
– Scripted data collection
– README file
• Tested on several systems (iDataPlex, BGQ, Emerald, …, MacOS laptop)
• GPU needs further improvements
• ….
