Issues in parallel languages for future HPC systems · 2016. 8. 9. · •efficiency driven • In power management which implies efficient usage of resources • … strong scaled

Issues in parallel languages for future HPC systems

Jesús Labarta Director Computer Science Department

Barcelona Supercomputing Center

•  … efficiency driven

•  In power management, which implies efficient usage of resources

•  … strong scaled

•  The easy times of weak scaling are over

•  … extreme variability

•  System (hw & sw)

•  Load

•  … hierarchical

•  Driven by what can be done

•  … heterogeneous

•  On purpose or not (manufacturing process, faults,..)

J. Labarta et al., “BSC Vision towards Exascale”, IJHPCA vol. 23, no. 4, Nov 2009

Application

Algorithm

Progr. Model

Run time

Architecture

Is any of them more important than the others? Which?

•  Language

•  Transport / integrate information

•  Experience → Good “predictor”

•  Structures mind

•  We probably all agree on the general objectives, desirable characteristics, …

•  Portability, abstraction, flexibility, modularity, locality,...

•  We may differ in details …

•  …although the difference between “detail” and “fundamental” is often subtle, subjective,…

• … and details are extremely important

• Very “simple/small/subtle” differences → very important impact

Neanderthal high larynx

H. sapiens low larynx

“Now the whole earth had one language and the same words” …

Book of Genesis

…”Come, let us make bricks, and burn them thoroughly. ”…

…"Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves”…

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

The computer age

Fortran & MPI ++, Fortress, StarSs, OpenMP, MPI, X10, Sequoia, CUDA, Sisal, CAF, SDK, UPC, Cilk++, Chapel, HPF, ALF, RapidMind

•  A personal view on …

•  … some of the issues …

•  Productivity, complexity, programmability

•  Concurrency

•  Memory

•  … how they are approached by different programming models/languages …

•  … including our StarSs proposal

Productivity: complexity, programmability, portability,…

•  Number of concepts

•  Many of them not related to algorithmic issues. First solution to a given problem.

•  New concept → new calls, arguments, interactions between concepts, … that can be used and have to be implemented

•  Implementation overhead/complexity ∝ (man pages, size of summary sheet)

CUDA: texture / shared / constant memory; thread, warp, block, grid; cudaMalloc, cudaMemcpy, …; __syncthreads

X10: activities; places; storage classes (activity-local, place-local, partitioned global, immutable); points, regions, distributions, arrays; operations, algebra, restriction; async, foreach, at, ateach, finish, atomic, when; clocks (declaration, clocked, next)

Chapel: locales (here; var.id, .numCores, .physicalMemory, .locale, …); domain, subdomain; distribution (Block, cyclic, blockCyc, CSR, DistGPU); begin, cobegin, forall, coforall, serial; local, on; full/empty, futures, sync; atomic sections

UPC: shared / local data; distribution; pointers: (resides, points to) x (shared, local); memory copy/set operations; collective / non-collective memory allocation; forall, threadof

MPI: ~150/~6 calls; communicator; buffered, synchronous; blocking / nonblocking; status, request; short/long protocol; data types; collectives; topology; one-sided; I/O; dynamic process mgmt.

•  Programming practices: frequent trend towards complexity

•  Enabled by programming model

•  Pushed by the dazzle of performance

•  Examples

•  RTM Stencil @ VMX

•  30 → 1000 lines; 20% performance improvement

•  Linpack

•  Include directory: 19 files, 1946 code lines

•  src directory: 113 files, 11945 code lines

•  Parameters: P, Q, Broadcast, lookahead

•  Optimize performance (platform) Distribution: U implementations

•  Potential to introduce improvements in general language characteristics, compile time checks and optimizations, …

•  Application porting cost: entry barrier; limitations to incremental parallelization, reuse of existing components,…

•  Environment development cost:

•  Complexity of additional compiler.

•  Support for different target platforms: compiler and runtime

!$acc region
do i = 2, n-1
  do j = 2, m-1
    b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + (1.0-w(i))*a(i,j)
  enddo
enddo
!$acc end region

Generating copyin(w(2:n-1)), copyin(a(1:n,1:m)), copyout(b(2:n-1,2:m-1))
!$acc do parallel, vector(16)
!$acc do parallel, vector(16)

PGI Accelerator

forall(i,j) in D do A(i,j) = i + j/10.0;

A = [(i,j) in D] i + j/10.0;

bigDiff= max reduce [i in InnerD] abs(A(i)-B(i));

Chapel

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / threads.x, HC / threads.y);

// execute the kernel
matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

// copy result from device to host
cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

n_workgroups = array_size / (vector_width * local_work_group_size);
…
context = clCreateContext(0, (cl_uint) 1, &device_list, NULL, NULL, &rc);
cmd_queue = clCreateCommandQueue(context, device_list, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &rc);
posix_memalign((void **) &rawbuf, 128, sizeof(cl_double) * 7 * malloc_array_size);
…
memobjs[0] = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, memsize, cpflag, &rc);
…
bs_source = load_program(kernel_source_file, &rc);
program = clCreateProgramWithSource(context, 1, (const char**)(&bs_source), NULL, &rc);
rc = clBuildProgram(program, 0, NULL, kernel_selector, NULL, NULL);
…
kernel = clCreateKernel(program, "bsop_kernel", &rc);
…
rc = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
…
rc = bsop_rangeLS(double_flag, array_size, local_work_group_size);

OpenCL

Concurrency & synchronization

•  How to generate work

•  Synchronization

•  Hybrid/nested

•  Sequential code is important

•  Is overhead important

•  Malleability

•  Who generates

•  Application launch

•  Master

•  Nested

•  Flat concurrent

•  Hierarchical

•  Static/dynamic

•  Overhead/Granularity

•  Relation to variables declaration/scope

•  Scalar expansion/privatization

•  Relation to scheduling

Loop / Data parallel

Nested fork-join

SPMD

Futures DAG – data flow

Tasks

OpenMP 3.0 Cilk++

HMPP X10, Chapel

StarSs

OpenMP 2.5 CUDA, OpenCL

Ct MPI, CAF

•  “Kernel” based programming models

•  Sometimes (often) perceived as simple, natural.

•  May be limited

BIGF<<<dimGrid, dimBlock>>> smallf( … )

smallf (… )

i= whoami_i(); j= whoami_j(); k= whoami_k(); compute (i,j,k, … );

for (I=0; I<dimGrid[0]; I+=dimBlock[0])
  for (J=0; J<dimGrid[1]; J+=dimBlock[1])
    for (K=0; K<dimGrid[2]; K+=dimBlock[2])
      for (i=I; i<I+dimBlock[0]; i++)
        for (j=J; j<J+dimBlock[1]; j++)
          for (k=K; k<K+dimBlock[2]; k++)
            compute(i, j, k, … );

Many potential interchanges with high performance impact
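One way to read the kernel launch is as a traversal of a 3-D index space in which only the set of points is fixed, not the order. The sketch below, written in Python for brevity, enumerates the same points in a blocked order and in a plain i-j-k order. The names `extent`, `block`, `blocked_order` and `flat_order` are illustrative assumptions and belong to no programming model mentioned here; following the loop nest on the slide, `extent` is treated as the total index range.

```python
from itertools import product

def blocked_order(extent, block):
    """Visit every (i, j, k) block by block, as in the six-deep loop nest."""
    points = []
    starts = (range(0, e, b) for e, b in zip(extent, block))
    for I, J, K in product(*starts):
        for i in range(I, min(I + block[0], extent[0])):
            for j in range(J, min(J + block[1], extent[1])):
                for k in range(K, min(K + block[2], extent[2])):
                    points.append((i, j, k))
    return points

def flat_order(extent):
    """Visit the same index space with a plain i-j-k nest (assumed baseline)."""
    return list(product(*(range(e) for e in extent)))
```

Both functions cover exactly the same points, so a runtime or compiler is free to pick either order (or any other interchange) when the kernel carries no dependences between points; which order performs best depends on the memory system.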

!$acc region
do i = 2, n-1
  do j = 2, m-1
    b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + (1.0-w(i))*a(i,j)
  enddo
enddo
!$acc end region

!$acc do parallel, vector(16)
!$acc do parallel, vector(16)

PGI Accelerator

•  Synchronization

•  Enforce partial order

•  Atomicity

•  Implicit in work generation scheme

•  Barriers at end of parallel

•  Nested fork – join

•  Explicit

•  Point to point

•  Global : clocks, barrier,

SPMD

•  Propagates perturbation

•  Barriers and pipelines

•  Micro load imbalance

•  Can the synchronization structure absorb it?

•  Special impact in hierarchical parallelism

•  PGAS: separate data transfer and synchronization

•  Move from point to point to global synchronization. Good ?
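The effect of micro load imbalance under global synchronization can be illustrated with a toy model. The sketch below is an idealized simulation, not a measurement: `durations[step][proc]` is a hypothetical table of per-step compute times, and `decoupled_time` assumes no dependences at all between processes, which gives a lower bound that real codes rarely reach.

```python
import random

def barrier_time(durations):
    """Global barrier after every step: each step costs the slowest process."""
    return sum(max(step) for step in durations)

def decoupled_time(durations):
    """No inter-process synchronization: the slowest total decides."""
    n_procs = len(durations[0])
    return max(sum(step[p] for step in durations) for p in range(n_procs))

def noisy_steps(n_steps, n_procs, base=1.0, noise=0.1, seed=0):
    """Equal work per process plus a small random perturbation (assumed model)."""
    rng = random.Random(seed)
    return [[base + rng.uniform(0.0, noise) for _ in range(n_procs)]
            for _ in range(n_steps)]
```

With barriers, every step pays for whichever process happened to be slowest in that step, so small random delays accumulate; without them, delays on different processes can average out. The gap between the two numbers is the perturbation that the synchronization structure failed to absorb.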

•  Asynchrony is crucial

•  to tackle variability

•  To tolerate latency/bandwidth limitations

•  How:

•  MPI: Isend, Irecv -- Wait/Waitall

•  HMPP: call -- synchronize

•  X10: async

•  Chapel: sync

•  UPC: upc_notify, upc_wait

•  Need to schedule waits in the source code:

•  Periodic probing during computations generic and totally independent/unaware of ongoing concurrency

•  Synchronize in what order?

Futures

Serialized explicit syncs
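The question of the order in which to wait can be illustrated with futures from the Python standard library. This is only an analogy to the models listed above: `work` and the two wait functions are hypothetical names, and sleeping threads stand in for asynchronous operations.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(name, seconds):
    """Stand-in for an asynchronous operation of unknown duration."""
    time.sleep(seconds)
    return name

def wait_in_program_order(jobs):
    """Serialized explicit syncs: results are consumed in issue order."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(work, n, s) for n, s in jobs]
        return [f.result() for f in futures]

def wait_in_completion_order(jobs):
    """Results are consumed as they become available."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(work, n, s) for n, s in jobs]
        return [f.result() for f in as_completed(futures)]
```

In the first variant a slow first operation delays the handling of every later result, even those that finished long ago; the second variant lets the order adapt to what actually happened at run time, which is the behaviour a data-flow runtime provides without the programmer scheduling the waits.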

•  Data - flow Asynchrony

•  Graph dynamically generated at run time from execution of sequential program

•  Based on annotated arguments with directionality clauses

void Cholesky( float *A ) {
  int i, j, k;
  for (k=0; k<NT; k++) {
    chol_spotrf (A[k*NT+k]);
    for (i=k+1; i<NT; i++)
      chol_strsm (A[k*NT+k], A[k*NT+i]);
    // update trailing submatrix
    for (i=k+1; i<NT; i++) {
      for (j=k+1; j<i; j++)
        chol_sgemm (A[k*NT+i], A[k*NT+j], A[j*NT+i]);
      chol_ssyrk (A[k*NT+i], A[i*NT+i]);
    }
  }
}

#pragma css task inout (A[TS][TS])
void chol_spotrf (float *A);
#pragma css task input (A[TS][TS]) inout (C[TS][TS])
void chol_ssyrk (float *A, float *C);
#pragma css task input (A[TS][TS], B[TS][TS]) inout (C[TS][TS])
void chol_sgemm (float *A, float *B, float *C);
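How a task graph can be derived from a sequential task stream with directionality clauses is sketched below in Python. This is a minimal illustration under stated assumptions, not the StarSs runtime: `build_task_graph` and `cholesky_tasks` are hypothetical names, tiles are identified by their linear index, and the direction assumed for chol_strsm (first argument input, second inout) is a guess, since its pragma is not shown above.

```python
def build_task_graph(tasks):
    """tasks: list of (name, inputs, inouts) in sequential program order.

    Returns {task index: set of earlier task indices it must wait for}.
    """
    last_writer = {}   # datum -> index of the last task that wrote it
    readers = {}       # datum -> tasks that read it since that write
    deps = {}
    for idx, (_name, inputs, inouts) in enumerate(tasks):
        d = set()
        for x in list(inputs) + list(inouts):   # read after write
            if x in last_writer:
                d.add(last_writer[x])
        for x in inouts:                         # write after read
            d.update(readers.get(x, ()))
        d.discard(idx)
        deps[idx] = d
        for x in inputs:
            readers.setdefault(x, set()).add(idx)
        for x in inouts:
            last_writer[x] = idx
            readers[x] = set()
    return deps

def cholesky_tasks(nt):
    """Task stream generated by the tiled Cholesky loop nest for nt x nt tiles."""
    t = []
    for k in range(nt):
        t.append(("spotrf", [], [k * nt + k]))
        for i in range(k + 1, nt):
            t.append(("strsm", [k * nt + k], [k * nt + i]))
        for i in range(k + 1, nt):
            for j in range(k + 1, i):
                t.append(("sgemm", [k * nt + i, k * nt + j], [j * nt + i]))
            t.append(("ssyrk", [k * nt + i], [i * nt + i]))
    return t
```

For example, `build_task_graph(cholesky_tasks(3))` shows that the two strsm tasks of the first iteration depend only on the first spotrf, so they may run concurrently even though the source program lists them one after the other.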

•  Not to mention algorithmic complexity

float bsop_reference_float (unsigned int cpflag, float S0, float K, float r, float sigma, float T) {
  float d1, d2, …;
  d1 = logf(S0/K) + (r + 0.5*sigma*sigma)*T;
  d1 /= (sigma * sqrt(T));
  …
  answer = cpflag ? c : p;
  return answer;
}

float4 bsop_ref (unsigned int4 cpflag, float4 S0, float4 K, float4 r, float4 sigma, float4 T) {
  float4 d1, d2, …;
  int4 flag1, flag2;
  d1 = log(S0/K) + (r + HALF * sigma*sigma)*T;
  d1 /= (sigma * sqrt(T));
  …

HMPP: code restructuring pragmas to improve “sequential”

Leverage OpenCL kernels from StarSs “Vector” data types & intrinsics

•  StarSs version allowing strided and aliased references

•  Extremely simple application

•  Extremely complex dependence detection

void Cholesky (int N, int BS, float A[N][N]) {
  for (int j = 0; j < N; j+=BS) {
    for (int k = 0; k < j; k+=BS)
      for (int i = j+BS; i < N; i+=BS)
        sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
    for (int i = 0; i < j; i+=BS)
      ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
    spotrf_tile(BS, N, &A[j][j]);
    for (int i = j+BS; i < N; i+=BS)
      strsm_tile(BS, N, &A[j][j], &A[j][i]);
  }
}

#pragma css task input(T{0:BS}{0:BS}, BS, N) inout(B{0:BS}{0:BS})
void strsm_tile(integer BS, integer N, float T[N][N], float B[N][N]) {
  unsigned char LO='L', TR='T', NU='N', RI='R';
  float DONE=1.0;
  integer LDT = sizeof(*T)/sizeof(float);
  integer LDB = sizeof(*B)/sizeof(float);
  …

void gs (float A[(NB+2)*BS][(NB+2)*BS]) {
  int it, i, j;

  for (it=0; it<NITERS; it++)
    for (i=0; i<N-2; i+=BS)
      for (j=0; j<N-2; j+=BS)
        gs_tile(&A[i][j]);
}

#pragma css task \
    input(A{0}{1:BS}, A{BS+1}{1:BS}, \
          A{1:BS}{0}, A{1:BS}{BS+1}) \
    inout(A{1:BS}{1:BS})
void gs_tile (float A[N][N]) {
  for (int i=1; i <= BS; i++)
    for (int j=1; j <= BS; j++)
      A[i][j] = 0.2*(A[i][j] + A[i-1][j] + …

•  Large overheads may not be a problem

•  If not in the critical path

•  And compensated by huge flexibilities

•  Should overhead drive designs?

•  Of course be considered, but drive ?

Even if ties up a significant fraction of a core, not yet the bottleneck

Acceptable fraction of total time

Significant reduction of task performance The importance of Locality aware scheduling

HUGE absolute value

•  How to allocate resources at the different levels

•  Number of processes and threads per process

•  Often done independently and for different (local) reasons

•  Strong interaction between schedules

•  Sweep3D example

[Figure: Sweep3D sweep over the i, j, k axes; arrays Phijb(i,k,m), Phikb(i,j,m), Phiib(j,k,m), Phii(i), Phi(i)]

•  Hybrid parallelization (128 procs)

•  Speedup SELECTED regions by the CPUratio factor

•  We do need to overcome the hybrid Amdahl’s law

•  asynchrony + Load balancing mechanisms !!!

[Figure: % elapsed time per code region; labelled values 93.67%, 97.49%, 99.11%]
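The hybrid Amdahl's law argument can be made concrete with the usual formula: if a fraction f of the elapsed time lies in regions that are sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below applies it to the three percentages above, assuming (an interpretation, not something the slide states) that they are the fractions of elapsed time covered by the selected regions; the speedup factors 4, 16 and 64 are arbitrary.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time runs `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

for f in (0.9367, 0.9749, 0.9911):
    limit = 1.0 / (1.0 - f)   # bound reached as the factor grows without limit
    row = ", ".join(f"x{s}: {amdahl_speedup(f, s):.1f}" for s in (4, 16, 64))
    print(f"f={f:.4f}  {row}  limit: {limit:.1f}")
```

Whatever the factor, the speedup stays below 1 / (1 - f), so the part left outside the accelerated regions quickly dominates; this is why asynchrony and load balancing across the two levels matter.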

•  Linpack example

•  Overlap communication/computation

•  Extend asynchronous data-flow execution to outer level

•  Automatic lookahead

for (k=0; k<N; k++) {
  if (mine) {
    Factor_panel(A[k]);
    send (A[k]);
  } else {
    receive (A[k]);
    if (necessary) resend (A[k]);
  }
  for (j=k+1; j<N; j++)
    update (A[k], A[j]);
}

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);
#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE])
void send(float *A);
#pragma css task output(A[SIZE])
void receive(float *A);
#pragma css task input(A[SIZE])
void resend(float *A);

P0 P1 P2

•  Performance

•  High at small problem sizes

•  Improved load balance (fewer processes)

•  Higher IPC

•  Overlap communication/computation

•  Tolerance to bandwidth and OS noise

•  Flexible (dynamic) parallelization structure of an application

•  Allows responsiveness to dynamic characteristics of a computation and resource availability

•  Malleability requires

•  Separation between

•  Algorithm: problem logic. Programmer responsibility

•  Scheduling: efficiency. System responsibility (hints may help)

•  Frequent control points

•  Issue of:

•  Programming model support

•  Programming practices

C$OMP PARALLEL
      WhoAmI = RunTimeCall()
      myBlock = f(WhoAmI)

      Call Compute1(myBlock)

      DO iters = 1, #iters
        Call Compute2(myBlock)
      END DO

C$OMP END PARALLEL

C$OMP PARALLEL DO
      DO Block = 1, #blocks
        Call Compute1(Block)
      END DO
C$OMP END PARALLEL DO
      …
      DO iter = 1, #iters
C$OMP PARALLEL DO
        DO Block = 1, #blocks
          Call Compute2(Block)
        END DO
C$OMP END PARALLEL DO
      END DO
      …

Scheduling decisions: Once for all

Explicit code only related to parallelism

myBlock ∈ [1, #processors] ← f(resources)

Block ∈ [1, #blocks] ← f(algorithm)
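The difference between the two styles can be shown with a small makespan model. The sketch is an idealized simulation under stated assumptions: `costs` and `speeds` are hypothetical inputs (work per block and relative speed of each worker), scheduling overhead is ignored, and the function names are illustrative.

```python
def static_assignment(n_blocks, n_workers):
    """Fix the block-to-worker map once, as with myBlock = f(WhoAmI)."""
    return {w: list(range(w, n_blocks, n_workers)) for w in range(n_workers)}

def static_makespan(costs, speeds):
    """Finish time when every worker keeps the blocks it was given at start-up."""
    plan = static_assignment(len(costs), len(speeds))
    return max(sum(costs[b] for b in blocks) / speeds[w]
               for w, blocks in plan.items())

def dynamic_makespan(costs, speeds):
    """Greedy self-scheduling: the next block goes to the worker that frees up first."""
    free_at = [0.0] * len(speeds)
    for c in costs:
        w = min(range(len(speeds)), key=lambda i: free_at[i])
        free_at[w] += c / speeds[w]
    return max(free_at)
```

With eight equal blocks and one of two workers running at half speed, `static_makespan([1.0] * 8, [1.0, 0.5])` gives 8.0 while `dynamic_makespan` gives 6.0: when blocks are named by the algorithm rather than by the resources, the runtime can rebalance at every control point.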

Dependences: not all arguments in directionality clause

Different implementations

Heterogeneous devices

Separation dependences/transfers

Memory

•  is THE problem

•  16 → 32 → 64 bits: A simple thing for a computer architect that has caused huge pain for everybody !!!!

•  Lessons from history: Virtual memory:

•  Decoupling logical and physical, automatically handled, programmer awareness at a quite abstract/simple level.

•  took 10 years to be accepted

•  Capacity: power, use?

•  Structure: Hierarchy

•  Transfers

•  Coherence / consistency

•  Association

Road Runner GPU

•  Moving from explicit API calls to automatic

float* h_A = (float*) malloc(mem_size_A);

float* d_A; float* d_B; float* d_C;

cudaMalloc((void**) &d_A, mem_size_A);

cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);

matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

•  Avoiding replicated transfers, overlapping,..
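What "automatic" might mean here is sketched below with a toy bookkeeping model: the runtime remembers where a valid copy of each buffer lives and issues a transfer only when a task needs data that is not already local. `TransferTracker` and its methods are hypothetical names, not an existing API, and real runtimes also have to handle capacity, overlap and partial regions.

```python
class TransferTracker:
    """Toy model of a runtime that tracks where valid copies of buffers live."""

    def __init__(self):
        self.valid = {}      # buffer name -> locations holding a valid copy
        self.transfers = []  # log of (name, src, dst) copies actually issued

    def allocate(self, name, location="host"):
        self.valid[name] = {location}

    def _ensure(self, name, location):
        if location not in self.valid[name]:
            src = next(iter(self.valid[name]))
            self.transfers.append((name, src, location))
            self.valid[name].add(location)

    def run(self, location, inputs=(), outputs=()):
        """Run a task at `location`; copy in only what is not already there."""
        for name in inputs:
            self._ensure(name, location)
        for name in outputs:
            self.valid[name] = {location}   # copies elsewhere become stale
```

Two consecutive device tasks that both read A then cause a single host-to-device copy of A, and an intermediate result produced and consumed on the device never travels to the host at all.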

•  2D (nD) logical structure → 1D “physical” structure

•  Objective: support efficient translation of object name to storage container

•  “linear”

•  Column/row-major order: by now is probably in our genes.

•  Knowledge of how to nest loops to avoid paging, cache misses, …

•  Hardwired in widely used libraries, e.g. BLAS.

•  Blocked

•  Sequential

•  Parallel
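The translation from a logical 2D name to a 1D storage offset can be written down directly for the linear and blocked associations. The sketch below is illustrative: the function names are assumptions, the blocked variant stores tiles one after another in row-major tile order, and it assumes the block size divides both dimensions.

```python
def row_major(i, j, n_rows, n_cols):
    """Offset of element (i, j) when rows are stored contiguously (C order)."""
    return i * n_cols + j

def col_major(i, j, n_rows, n_cols):
    """Offset of element (i, j) when columns are stored contiguously (Fortran order)."""
    return j * n_rows + i

def blocked(i, j, n_rows, n_cols, bs):
    """Offset when bs x bs tiles are stored contiguously, one tile after another.

    Assumes bs divides both n_rows and n_cols.
    """
    tiles_per_row = n_cols // bs
    tile = (i // bs) * tiles_per_row + (j // bs)
    return tile * bs * bs + (i % bs) * bs + (j % bs)
```

All three are bijections onto the same range of offsets; they differ only in which logical neighbours end up close in memory, which is what determines paging and cache behaviour for a given loop order.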

•  Data distribution/layout (HPF, UPC,….)

•  Impacts performance: address generation, locality (distributed memory, cache), …

•  Often fixes schedules (owner computes)

•  Typically too early (declaration time) and static

•  “Convenient” association may vary with program phase → explicit reassociations

•  Workarrays

•  Data redistribution

•  A very important issue

•  Dependence on number of processors?

•  Predictability of performance?

•  Automatic handling of “convenient/possible” association

•  Transformation of address generation

shared [2] int A[4][5]

shared [2][2] int A[4][5]
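The address transformation behind these two declarations can be sketched as follows. For a one-dimensional block size, UPC deals blocks of consecutive elements of the linearized (row-major) array round-robin over the threads, so element number n has affinity to thread (n / block size) mod THREADS. The two-dimensional form is treated here as a hypothetical tiled variant, with tiles dealt round-robin in row-major tile order; that reading, the function names and the choice of 4 threads in the usage note are assumptions.

```python
def upc_affinity(index, block_size, threads):
    """Thread owning element `index` of a linearized block-cyclic shared array."""
    return (index // block_size) % threads

def layout_1d_blocking(n_rows, n_cols, block_size, threads):
    """Owner of every element for a declaration like shared [block_size] int A[n_rows][n_cols]."""
    return [[upc_affinity(r * n_cols + c, block_size, threads)
             for c in range(n_cols)] for r in range(n_rows)]

def layout_2d_blocking(n_rows, n_cols, block, threads):
    """Hypothetical reading of a 2D block size: tiles dealt round-robin."""
    br, bc = block
    tiles_per_row = -(-n_cols // bc)   # ceiling division
    return [[((r // br) * tiles_per_row + (c // bc)) % threads
             for c in range(n_cols)] for r in range(n_rows)]
```

With 4 threads, `layout_1d_blocking(4, 5, 2, 4)` lets blocks straddle row boundaries because 5 is not a multiple of 2, whereas the tiled variant keeps 2 x 2 neighbourhoods on one thread; in both cases the owner of an element changes with the number of threads.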

•  Automatic association management

•  Workarrays & Reshaping

void Cholesky (int N, int BS, float A[N][N]) {
  for (int j = 0; j < N; j+=BS) {
    for (int k = 0; k < j; k+=BS)
      for (int i = j+BS; i < N; i+=BS)
        sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
    for (int i = 0; i < j; i+=BS)
      ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
    spotrf_tile(BS, N, &A[j][j]);
    for (int i = j+BS; i < N; i+=BS)
      strsm_tile(BS, N, &A[j][j], &A[j][i]);
  }
}

#pragma css task input(T{0:BS}{0:BS}, BS, N) inout(B{0:BS}{0:BS})
void strsm_tile(integer BS, integer N, float T[N][N], float B[N][N]) {
  unsigned char LO='L', TR='T', NU='N', RI='R';
  float DONE=1.0;
  integer LDT = sizeof(*T)/sizeof(float);
  integer LDB = sizeof(*B)/sizeof(float);
  strsm_(&RI, &LO, &TR, &NU, &BS, &BS, &DONE, T, &LDT, B, &LDB);
}

[Figure: Using MKL kernels/tiles. Legend: SMPS - NUMA; SMPS interleaved memory allocation; SMPS first touch serial initialization; MKL 10.1]

•  The potential of

•  Data distribution

•  Dynamic association management

•  Automatically achieved

[Figure: IPC histogram. Panel labels: Reshaped Regions; Regions, data distributed across nodes; Regions; NUMA-aware Reshaped Regions]

Invitation

•  Towards EXaflop applicaTions

•  Demonstrate that Hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way.

•  Deploy at supercomputing centers: Julich, EPCC, HLRS, BSC

•  Port Applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms.

•  Develop additional environment capabilities

•  tools (debug, performance)

•  improvements in runtime systems (load balance and GPUSs)

•  Support other users

•  Identify users of TEXT applications

•  Identify and support interested application developers

•  Contribute to Standards (OpenMP ARB, PERI-XML)

•  Language

•  Transport / integrate information

•  Experience → Good “predictor”

•  Structures mind

•  We probably all agree on the general objectives, desirable characteristics, …

•  Portability, abstraction, flexibility, modularity, locality,...

•  We may differ in details …

•  …although the difference between “detail” and “fundamental” is often subtle, subjective,…

• … and details are extremely important

• Very “simple/small/subtle” differences → very important impact

Neanderthal high larynx

H. sapiens low larynx

Wish we were able to develop language:

Communicated to humans Express Ideas Exploitable by machines

Separate

Algorithm, Ideas

Performance optimization, hints
