Issues in parallel languages for future HPC systems Jesús Labarta Director Computer Science Department Barcelona Supercomputing Center

Issues in parallel languages for future HPC systems · 2016. 8. 9. · •efficiency driven • In power management which implies efficient usage of resources • … strong scaled

Page 1

Issues in parallel languages for future HPC systems

Jesús Labarta Director Computer Science Department

Barcelona Supercomputing Center

Page 2

•  … efficiency driven

•  In power management which implies efficient usage of resources

•  … strong scaled

•  The easy times of weak scaling are over

•  … extreme variability

•  System (hw & sw)

•  Load

•  … hierarchical

•  Driven by what can be done

•  … heterogeneous

•  On purpose or not (manufacturing process, faults,..)

J. Labarta et al., “BSC Vision Towards Exascale,” IJHPCA, vol. 23, no. 4, Nov. 2009

Page 3

Application

Algorithm

Progr. Model

Run time

Architecture

Is any of them more important than the others? Which?

Page 4

•  Language

•  Transport / integrate information

•  Experience → good “predictor”

•  Structures mind

•  We probably all agree on the general objectives, desirable characteristics, …

•  Portability, abstraction, flexibility, modularity, locality,...

•  We may differ in details …

•  …although the difference between “detail” and “fundamental” is often subtle, subjective,…

• … and details are extremely important

• Very “simple/small/subtle” differences → very important impact

Neanderthal high larynx

H. sapiens low larynx

Page 5

“Now the whole earth had one language and the same words” …

Book of Genesis

… “Come, let us make bricks, and burn them thoroughly.” …

… “Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves” …

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."

The computer age

Fortran & MPI

++, Fortress, StarSs, OpenMP, MPI, X10, Sequoia, CUDA, Sisal, CAF, SDK, UPC, Cilk++, Chapel, HPF, ALF, RapidMind

Page 6

•  A personal view on …

•  … some of the issues …

•  Productivity, complexity, programmability

•  Concurrency

•  Memory

•  … how they are approached by different programming models/languages …

•  … including our StarSs proposal

Page 7

Productivity: complexity, programmability, portability,…

Page 8

•  Number of concepts

•  Many of them not related to algorithmic issues. First solution to a given problem.

•  New concept → new calls, arguments, interactions between concepts, … that can be used and have to be implemented

•  Implementation overhead/complexity ∝ (man pages, size of summary sheet)

Texture / shared / constant memory; thread, warp, block, grid; cudaMalloc, cudaMemcpy, …; __syncthreads

CUDA

Activities; Places; Storage classes: activity-local, place-local, partitioned global, immutable; Points, regions, distributions, arrays; Operations, algebra, restriction; async, foreach, at, ateach, finish, atomic, when; clocks (declaration, clocked, next)

X10

Locales (here; var.id, .numCores, .physicalMemory, .locale, …); Domain, subdomain; Distribution: Block, cyclic, blockCyc, CSR, DistGPU; begin, cobegin, forall, coforall, serial; Local, On; full/empty, futures, sync; Atomic sections

Chapel

Shared / local data; Distribution; Pointers: (resides, points to) x (shared, local); Memory copy/set operations; Collective / non-collective memory allocation; forall, threadof

UPC

~150 / ~6 calls; Communicator; Buffered, synchronous; Blocking / nonblocking; Status, request; Short/long protocol; Data types; Collectives; Topology; One-sided; I/O; Dynamic process mgmt.

MPI

Page 9

•  Programming practices: frequent trend towards complexity

•  Enabled by programming model

•  Pushed by the dazzle of performance

•  Examples

•  RTM Stencil @ VMX

•  30 → 1000 lines; 20% performance improvement

•  Linpack

•  Include directory: 19 files, 1946 code lines

•  src directory: 113 files, 11945 code lines

•  Parameters: P, Q, Broadcast, lookahead

•  Optimize performance (platform) Distribution: U implementations

Page 10

•  Potential to introduce improvements in general language characteristics, compile time checks and optimizations, …

•  Application porting cost: entry barrier; limitations to incremental parallelization, reuse of existing components,…

•  Environment development cost:

•  Complexity of additional compiler.

•  Support for different target platforms: compiler and runtime

!$acc region
    do i = 2, n-1
        do j = 2, m-1
            b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + (1.0-w(i))*a(i,j)
        enddo
    enddo
!$acc end region

Generating copyin(w(2:n-1)), copyin(a(1:n,1:m)), copyout(b(2:n-1,2:m-1))
!$acc do parallel, vector(16)
!$acc do parallel, vector(16)

PGI Accelerator

forall(i,j) in D do A(i,j) = i + j/10.0;

A = [(i,j) in D] i + j/10.0;

bigDiff= max reduce [i in InnerD] abs(A(i)-B(i));

Chapel

Page 11

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / threads.x, HC / threads.y);

// execute the kernel
matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

// copy result from device to host
cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

n_workgroups = array_size / (vector_width * local_work_group_size);
…
context = clCreateContext(0, (cl_uint) 1, &device_list, NULL, NULL, &rc);
cmd_queue = clCreateCommandQueue(context, device_list, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &rc);
posix_memalign((void **) &rawbuf, 128, sizeof(cl_double) * 7 * malloc_array_size);
…
memobjs[0] = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, memsize, cpflag, &rc);
…
bs_source = load_program(kernel_source_file, &rc);
program = clCreateProgramWithSource(context, 1, (const char**)(&bs_source), NULL, &rc);
rc = clBuildProgram(program, 0, NULL, kernel_selector, NULL, NULL);
…
kernel = clCreateKernel(program, "bsop_kernel", &rc);
…
rc = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
…
rc = bsop_rangeLS(double_flag, array_size, local_work_group_size);

OpenCL

Page 12

Concurrency & synchronization

Page 13

•  How to generate work

•  Synchronization

•  Hybrid/nested

•  Sequential code is important

•  Is overhead important?

•  Malleability

Page 14

•  Who generates

•  Application launch

•  Master

•  Nested

•  Flat concurrent

•  Hierarchical

•  Static/dynamic

•  Overhead/Granularity

•  Relation to variables declaration/scope

•  Scalar expansion/privatization

•  Relation to scheduling

Loop / Data parallel

Nested fork-join

SPMD

Futures DAG – data flow

Tasks

OpenMP 3.0 Cilk++

HMPP X10, Chapel

StarSs

OpenMP 2.5 CUDA, OpenCL

Ct MPI, CAF

Page 15

•  “Kernel” based programming models

•  Sometimes (often) perceived as simple, natural.

•  May be limited

BIGF<<<dimGrid, dimBlock>>> smallf( … )

smallf (… )
    i = whoami_i(); j = whoami_j(); k = whoami_k();
    compute (i, j, k, … );

for (I=0; I<dimGrid[0]; I+=dimBlock[0])
  for (J=0; J<dimGrid[1]; J+=dimBlock[1])
    for (K=0; K<dimGrid[2]; K+=dimBlock[2])
      for (i=I; i<I+dimBlock[0]; i++)
        for (j=J; j<J+dimBlock[1]; j++)
          for (k=K; k<K+dimBlock[2]; k++)
            compute (i, j, k, … );

Many potential interchanges with high performance impact

!$acc region
    do i = 2, n-1
        do j = 2, m-1
            b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + (1.0-w(i))*a(i,j)
        enddo
    enddo
!$acc end region

!$acc do parallel, vector(16)
!$acc do parallel, vector(16)

PGI Accelerator

Page 16

•  Synchronization

•  Enforce partial order

•  Atomicity

•  Implicit in work generation scheme

•  Barriers at end of parallel

•  Nested fork – join

•  Explicit

•  Point to point

•  Global : clocks, barrier,

SPMD

Page 17

•  Propagates perturbation

•  Barriers and pipelines

•  Micro load imbalance

•  Can the synchronization structure absorb it?

•  Special impact in hierarchical parallelism

•  PGAS: separate data transfer and synchronization

•  Move from point to point to global synchronization. Good ?
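How a barrier propagates micro load imbalance can be illustrated with a small, idealized cost model. The sketch below is only an illustration: the function names and the flat `t[p * niters + it]` layout are hypothetical, and the model assumes the processes are otherwise independent, which real pipelines are not.

```c
#include <assert.h>
#include <stddef.h>

/* t[p * niters + it] is the compute time of process p in iteration it. */

/* A global barrier after every iteration: each iteration lasts as long
 * as its slowest process, so the imbalance is paid in every iteration. */
double time_with_barriers(const double *t, size_t nprocs, size_t niters)
{
    assert(t != NULL && nprocs > 0);
    double total = 0.0;
    for (size_t it = 0; it < niters; it++) {
        double slowest = 0.0;
        for (size_t p = 0; p < nprocs; p++) {
            double v = t[p * niters + it];
            if (v > slowest)
                slowest = v;
        }
        total += slowest;
    }
    return total;
}

/* A single synchronization at the end: a process that was slow in one
 * iteration can catch up in another, so only the largest per-process
 * sum matters. */
double time_with_final_sync(const double *t, size_t nprocs, size_t niters)
{
    assert(t != NULL && nprocs > 0);
    double longest = 0.0;
    for (size_t p = 0; p < nprocs; p++) {
        double sum = 0.0;
        for (size_t it = 0; it < niters; it++)
            sum += t[p * niters + it];
        if (sum > longest)
            longest = sum;
    }
    return longest;
}
```

With two processes whose iteration times are (1, 3) and (3, 1), the barrier version takes 6 time units and the final-sync version takes 4: the synchronization structure decides whether the imbalance is absorbed.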

Page 18

•  Asynchrony is crucial

•  to tackle variability

•  To tolerate latency/bandwidth limitations

•  How:

•  MPI: Isend/Irecv -- Wait/Waitall

•  HMPP: call -- synchronize

•  X10: async

•  Chapel: sync

•  UPC: upc_notify, upc_wait

•  Need to schedule waits in the source code:

•  Periodic probing during computations generic and totally independent/unaware of ongoing concurrency

•  Synchronize on what order?

Futures

Serialized explicit syncs

Page 19

•  Data - flow Asynchrony

•  Graph dynamically generated at run time from execution of sequential program

•  Based on annotated arguments with directionality clauses

void Cholesky( float *A )
{
    int i, j, k;
    for (k=0; k<NT; k++) {
        chol_spotrf (A[k*NT+k]);
        for (i=k+1; i<NT; i++)
            chol_strsm (A[k*NT+k], A[k*NT+i]);
        // update trailing submatrix
        for (i=k+1; i<NT; i++) {
            for (j=k+1; j<i; j++)
                chol_sgemm (A[k*NT+i], A[k*NT+j], A[j*NT+i]);
            chol_ssyrk (A[k*NT+i], A[i*NT+i]);
        }
    }
}

#pragma css task inout (A[TS][TS])
void chol_spotrf (float *A);
#pragma css task input (A[TS][TS]) inout (C[TS][TS])
void chol_ssyrk (float *A, float *C);
#pragma css task input (A[TS][TS], B[TS][TS]) inout (C[TS][TS])
void chol_sgemm (float *A, float *B, float *C);
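How a runtime can derive the graph from directionality clauses can be sketched with a "last writer" table. This is a minimal illustration, not the StarSs implementation: all names (`task_graph_t`, `graph_add_arg`, …) are hypothetical, blocks are identified by small integers instead of addresses, and only read-after-write edges are recorded (anti-dependences, which a runtime could remove by renaming, are ignored).

```c
#include <assert.h>
#include <string.h>

#define MAX_BLOCKS 16
#define MAX_TASKS  16

typedef enum { DIR_INPUT, DIR_INOUT } direction_t;

typedef struct {
    int last_writer[MAX_BLOCKS];              /* task that last wrote each block, -1 if none */
    unsigned char dep[MAX_TASKS][MAX_TASKS];  /* dep[a][b] != 0: task b waits for task a */
    int ntasks;
} task_graph_t;

void graph_init(task_graph_t *g)
{
    memset(g->dep, 0, sizeof g->dep);
    for (int b = 0; b < MAX_BLOCKS; b++)
        g->last_writer[b] = -1;
    g->ntasks = 0;
}

/* Called when the sequential program reaches a task invocation. */
int graph_new_task(task_graph_t *g)
{
    assert(g->ntasks < MAX_TASKS);
    return g->ntasks++;
}

/* Register one annotated argument of `task`. */
void graph_add_arg(task_graph_t *g, int task, int block, direction_t dir)
{
    assert(task >= 0 && task < g->ntasks);
    assert(block >= 0 && block < MAX_BLOCKS);
    int writer = g->last_writer[block];
    if (writer >= 0 && writer != task)
        g->dep[writer][task] = 1;       /* read-after-write edge */
    if (dir == DIR_INOUT)
        g->last_writer[block] = task;   /* later readers wait for this task */
}
```

Replaying the call sequence of the Cholesky loop for NT = 2 (spotrf on block 0, strsm reading block 0 and updating block 1, ssyrk reading block 1 and updating block 3, spotrf on block 3) yields the chain spotrf → strsm → ssyrk → spotrf, with no direct edge between the two spotrf tasks. The strsm directionality used here is an assumption, since the slide does not list its pragma.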

Page 20

•  Not to mention algorithmic complexity

float bsop_reference_float (unsigned int cpflag, float S0, float K, float r, float sigma, float T)
{
    float d1, d2, …;
    d1 = logf(S0/K) + (r + 0.5*sigma*sigma)*T;
    d1 /= (sigma * sqrt(T));
    …
    answer = cpflag ? c : p;
    return answer;
}

float4 bsop_ref (unsigned int4 cpflag, float4 S0, float4 K, float4 r, float4 sigma, float4 T)
{
    float4 d1, d2, …;
    int4 flag1, flag2;
    …
    d1 = log(S0/K) + (r + HALF * sigma*sigma)*T;
    d1 /= (sigma * sqrt(T));
    …

HMPP: code restructuring pragmas to improve “sequential”

Leverage OpenCL kernels from StarSs “Vector” data types & intrinsics

Page 21

•  StarSs version allowing strided and aliased references

•  Extremely simple application

•  Extremely complex dependence detection

void Cholesky (int N, int BS, float A[N][N])
{
    for (int j = 0; j < N; j+=BS) {
        for (int k = 0; k < j; k+=BS)
            for (int i = j+BS; i < N; i+=BS)
                sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
        for (int i = 0; i < j; i+=BS)
            ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
        spotrf_tile(BS, N, &A[j][j]);
        for (int i = j+BS; i < N; i+=BS)
            strsm_tile(BS, N, &A[j][j], &A[j][i]);
    }
}

#pragma css task input(T{0:BS}{0:BS}, BS, N) inout(B{0:BS}{0:BS})
void strsm_tile(integer BS, integer N, float T[N][N], float B[N][N])
{
    unsigned char LO='L', TR='T', NU='N', RI='R';
    float DONE=1.0;
    integer LDT = sizeof(*T)/sizeof(float);
    integer LDB = sizeof(*B)/sizeof(float);
    …

void gs (float A[(NB+2)*BS][(NB+2)*BS])
{
    int it,i,j;

    for (it=0; it<NITERS; it++)
        for (i=0; i<N-2; i+=BS)
            for (j=0; j<N-2; j+=BS)
                gs_tile(&A[i][j]);
}

#pragma css task \
    input(A{0}{1:BS}, A{BS+1}{1:BS}, \
          A{1:BS}{0}, A{1:BS}{BS+1}) \
    inout(A{1:BS}{1:BS})
void gs_tile (float A[N][N])
{
    for (int i=1; i <= BS; i++)
        for (int j=1; j <= BS; j++)
            A[i][j] = 0.2*(A[i][j] + A[i-1][j] + …

Page 22

•  Large overheads may not be a problem

•  If not in the critical path

•  And compensated by huge flexibilities

•  Should overhead drive designs?

•  Of course be considered, but drive ?

Even if ties up a significant fraction of a core, not yet the bottleneck

Acceptable fraction of total time

Significant reduction of task performance The importance of Locality aware scheduling

HUGE absolute value

Page 23

Page 24

•  How to allocate resources at the different levels

•  Number of processes and threads per process

•  Often done independently and for different (local) reasons

•  Strong interaction between schedules

•  Sweep3D example

[Figure: Sweep3D; axes i, j, k; labels Phijb(i,k,m), Phikb(i,j,m), Phiib(j,k,m), Phii(i), Phi(i)]

Page 25

•  Hybrid parallelization (128 procs)

•  Speedup SELECTED regions by the CPUratio factor

•  We do need to overcome the hybrid Amdahl’s law

•  asynchrony + Load balancing mechanisms !!!

[Figure: % elapsed time per code region; values 93.67%, 97.49%, 99.11%]
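The "hybrid Amdahl's law" can be made concrete with the classic formula. The sketch below assumes, as a reading of the figure that the slide does not state explicitly, that each percentage is the fraction of elapsed time covered by the selected regions, and that those regions are sped up by a factor `ratio` (the CPUratio) while everything else is unchanged. The function name is hypothetical.

```c
#include <assert.h>

/* Amdahl-style speedup when a fraction `fraction` of the elapsed time
 * is accelerated by `ratio` and the remainder runs at the old speed. */
double amdahl_speedup(double fraction, double ratio)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && ratio > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / ratio);
}
```

Under that assumption the bound is 1 / (1 - fraction) no matter how large the ratio becomes: with 93.67% of the time covered, the speedup cannot exceed about 15.8, which is why asynchrony and load balancing are needed on the remaining part as well.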

Page 26

•  Linpack example

•  Overlap communication/computation

•  Extend asynchronous data-flow execution to outer level

•  Automatic lookahead

for (k=0; k<N; k++) {
    if (mine) {
        Factor_panel(A[k]);
        send (A[k]);
    } else {
        receive (A[k]);
        if (necessary) resend (A[k]);
    }
    for (j=k+1; j<N; j++)
        update (A[k], A[j]);
}

#pragma css task inout(A[SIZE])
void Factor_panel(float *A);
#pragma css task input(A[SIZE]) inout(B[SIZE])
void update(float *A, float *B);

#pragma css task input(A[SIZE])
void send(float *A);
#pragma css task output(A[SIZE])
void receive(float *A);
#pragma css task input(A[SIZE])
void resend(float *A);

P0 P1 P2

Page 27

•  Performance

•  High at small problem sizes

•  Improved load balance (fewer processes)

•  Higher IPC

•  Overlap communication/computation

•  Tolerance to bandwidth and OS noise

Page 28

•  Flexible (dynamic) parallelization structure of an application

•  Allows responsiveness to dynamic characteristics of a computation and resource availability

•  Malleability requires

•  Separation between

•  Algorithm: Problem logic. Programmer responsibility

•  Scheduling: efficiency. System responsibility (hints may help)

•  Frequent control points

•  Issue of:

•  Programming model support

•  Programming practices

Page 29

C$OMP PARALLEL
      WhoAmI=RunTimeCall()
      myBlock=f(WhoAmI)

      Call Compute1(myBlock)

      DO iters=1, #iters
         Call Compute2(myBlock)
      END DO

C$OMP END PARALLEL

C$OMP PARALLEL DO
      DO Block=1, #blocks
         Call Compute1(Block)
      END DO
C$OMP END PARALLEL DO
      …
      DO iter=1, #iters
C$OMP PARALLEL DO
         DO Block=1, #blocks
            Call Compute2(Block)
         END DO
C$OMP END PARALLEL DO
      END DO
      …

Scheduling decisions: Once for all

Explicit code only related to parallelism

myBlock ∈ [1, #processors] ← f(resources)

Block ∈ [1, #blocks] ← f(algorithm)
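The second style, where the block count comes from the algorithm and the mapping of blocks to workers is left to the system, can be sketched in C with OpenMP. This is only an illustration: `block_bounds` and `sum_squares_blocked` are hypothetical names, and the pragma is guarded so that the code also builds and runs serially without OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Algorithm-defined decomposition: block b of nblocks covers [lo, hi) of n. */
static void block_bounds(int b, int nblocks, int n, int *lo, int *hi)
{
    *lo = (int)((long)b * n / nblocks);
    *hi = (int)((long)(b + 1) * n / nblocks);
}

/* Sum of squares of x[0..n). The number of blocks is a property of the
 * algorithm; which worker runs which block is decided at run time. */
double sum_squares_blocked(const double *x, int n, int nblocks)
{
    assert(x != NULL && n >= 0 && nblocks > 0);
    double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic) reduction(+:total)
#endif
    for (int b = 0; b < nblocks; b++) {
        int lo, hi;
        block_bounds(b, nblocks, n, &lo, &hi);
        double partial = 0.0;
        for (int i = lo; i < hi; i++)
            partial += x[i] * x[i];
        total += partial;
    }
    return total;
}
```

Nothing in this code refers to the number of threads, so the runtime is free to change it between parallel loops; each loop is a control point where a scheduling decision can be revisited.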

Page 30

Dependences: not all arguments in directionality clause

Different implementations

Heterogeneous devices

Separation dependences/transfers

Page 31

Memory

Page 32

•  is THE problem

•  16 → 32 → 64 bits: a simple thing for a computer architect that has caused huge pain for everybody!

•  Lessons from history: Virtual memory:

•  Decoupling logical and physical, automatically handled, programmer awareness at a quite abstract/simple level.

•  took 10 years to be accepted

Page 33

•  Capacity: power, use?

•  Structure: Hierarchy

•  Transfers

•  Coherence / consistency

•  Association

Road Runner GPU

Page 34

•  Moving from explicit API calls to automatic

float* h_A = (float*) malloc(mem_size_A);

float* d_A; float* d_B; float* d_C;

cudaMalloc((void**) &d_A, mem_size_A);

cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);

matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

Page 35

•  Avoiding replicated transfers, overlapping,..
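One way a runtime can avoid replicated transfers is to track which copy of each buffer is valid and copy only on demand. The sketch below is a hypothetical illustration, not the mechanism of any particular runtime: `tracked_buf_t` and its functions are invented names, and "device memory" is simulated by an ordinary host array so that the example runs anywhere.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    float *host;
    float *device;      /* stand-in for device memory (a plain array here) */
    size_t bytes;
    int host_valid;     /* host copy holds the latest data */
    int device_valid;   /* device copy holds the latest data */
    int transfers;      /* number of copies actually performed */
} tracked_buf_t;

void buf_init(tracked_buf_t *b, float *host, float *device, size_t bytes)
{
    assert(b != NULL && host != NULL && device != NULL);
    b->host = host;
    b->device = device;
    b->bytes = bytes;
    b->host_valid = 1;
    b->device_valid = 0;
    b->transfers = 0;
}

/* Call before a device task uses the buffer; copies only if needed. */
void buf_use_on_device(tracked_buf_t *b, int writes)
{
    if (!b->device_valid) {
        memcpy(b->device, b->host, b->bytes);
        b->transfers++;
        b->device_valid = 1;
    }
    if (writes)
        b->host_valid = 0;      /* host copy becomes stale */
}

/* Call before host code uses the buffer; copies only if needed. */
void buf_use_on_host(tracked_buf_t *b, int writes)
{
    if (!b->host_valid) {
        memcpy(b->host, b->device, b->bytes);
        b->transfers++;
        b->host_valid = 1;
    }
    if (writes)
        b->device_valid = 0;    /* device copy becomes stale */
}
```

Two consecutive device tasks that only read the same buffer trigger a single transfer, whereas the explicit style would typically issue one copy per kernel launch unless the programmer tracks validity by hand.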

Page 36

•  2D (nD) logical structure → 1D “physical” structure

•  Objective: support efficient translation object name to storage container

•  “linear”

•  Column/row-major order: by now it is probably in our genes.

•  Knowledge of how to nest loops to avoid paging, cache misses, …

•  Hardwired in widely used libraries, e.g. BLAS.

•  Blocked

•  Sequential

•  Parallel

•  Data distribution/layout (HPF, UPC,….)

•  Impacts performance: address generation, locality (distributed memory, cache), …

•  Often fixes schedules (owner computes)

•  Typically too early (declaration time) and static

•  “Convenient” association may vary with program phase → explicit reassociations

•  Workarrays

•  Data redistribution
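The translation from a 2D object name to a 1D storage position can be written down for the two layouts mentioned above. This is a minimal sketch with hypothetical function names; the blocked variant assumes a square n × n matrix whose size is a multiple of the block size, with blocks stored one after another and row-major order inside each block.

```c
#include <assert.h>
#include <stddef.h>

/* "Linear" row-major layout: element (i, j) of a matrix with ncols columns. */
size_t row_major_index(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}

/* Blocked layout: bs x bs tiles stored contiguously, tiles ordered
 * row-major, elements row-major within a tile. */
size_t blocked_index(size_t i, size_t j, size_t n, size_t bs)
{
    assert(bs > 0 && n % bs == 0 && i < n && j < n);
    size_t tiles_per_row = n / bs;
    size_t tile = (i / bs) * tiles_per_row + (j / bs);
    size_t within = (i % bs) * bs + (j % bs);
    return tile * bs * bs + within;
}
```

The blocked form needs divisions and remainders for every access, which is the address-generation cost the slide refers to; in exchange, each tile occupies one contiguous range of memory.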

Page 37

•  A very important issue

•  Dependence on number of processors?

•  Predictability of performance?

•  Automatic handling of “convenient/possible” association

•  Transformation of address generation

shared [2] int A[4][5]

shared [2][2] int A[4][5]
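For the one-dimensional blocking factor, UPC assigns consecutive blocks of the linearized array to threads in round-robin order. The sketch below computes the owning thread under that rule; the function names are hypothetical, the thread count is a free parameter, and the multi-dimensional `[2][2]` form is not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of element `linear_index` of an array declared with blocking
 * factor `block`, when running with `nthreads` threads. */
size_t owner_thread(size_t linear_index, size_t block, size_t nthreads)
{
    assert(block > 0 && nthreads > 0);
    return (linear_index / block) % nthreads;
}

/* Same for a 2D array stored in row-major order. */
size_t owner_2d(size_t i, size_t j, size_t ncols, size_t block, size_t nthreads)
{
    assert(j < ncols);
    return owner_thread(i * ncols + j, block, nthreads);
}
```

For A[4][5] with blocking factor 2 and, say, 4 threads, A[0][4] and A[1][0] land on the same thread because a block straddles the row boundary. The owner of a given element changes with the number of threads, which is the dependence on processor count raised above.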

Page 38

•  Automatic association management

•  Workarrays & Reshaping

void Cholesky (int N, int BS, float A[N][N])
{
    for (int j = 0; j < N; j+=BS) {
        for (int k = 0; k < j; k+=BS)
            for (int i = j+BS; i < N; i+=BS)
                sgemm_tile(BS, N, &A[k][i], &A[k][j], &A[j][i]);
        for (int i = 0; i < j; i+=BS)
            ssyrk_tile(BS, N, &A[i][j], &A[j][j]);
        spotrf_tile(BS, N, &A[j][j]);
        for (int i = j+BS; i < N; i+=BS)
            strsm_tile(BS, N, &A[j][j], &A[j][i]);
    }
}

#pragma css task input(T{0:BS}{0:BS}, BS, N) inout(B{0:BS}{0:BS})
void strsm_tile(integer BS, integer N, float T[N][N], float B[N][N])
{
    unsigned char LO='L', TR='T', NU='N', RI='R';
    float DONE=1.0;
    integer LDT = sizeof(*T)/sizeof(float);
    integer LDB = sizeof(*B)/sizeof(float);
    strsm_(&RI, &LO, &TR, &NU, &BS, &BS, &DONE, T, &LDT, B, &LDB);
}

Using MKL kernels/tiles

SMPS - NUMA

SMPS interleaved memory allocation

SMPS first touch serial initialization

MKL 10.1
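The workarray idea, copying a strided tile into contiguous storage before a kernel runs on it and copying it back afterwards, can be sketched as follows. The names are hypothetical and this is what a programmer would write by hand; the point of the slide is that a runtime can apply the same reshaping automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a bs x bs tile out of a matrix with leading dimension lda
 * into a contiguous workarray. `a` points at the tile's first element. */
void tile_copy_in(float *work, const float *a, size_t lda, size_t bs)
{
    assert(work != NULL && a != NULL && lda >= bs);
    for (size_t i = 0; i < bs; i++)
        for (size_t j = 0; j < bs; j++)
            work[i * bs + j] = a[i * lda + j];
}

/* Copy the contiguous workarray back into its place in the matrix. */
void tile_copy_out(float *a, const float *work, size_t lda, size_t bs)
{
    assert(work != NULL && a != NULL && lda >= bs);
    for (size_t i = 0; i < bs; i++)
        for (size_t j = 0; j < bs; j++)
            a[i * lda + j] = work[i * bs + j];
}
```

A tile kernel that receives the workarray can use a leading dimension of `bs` instead of `N`, so all its accesses fall in one compact region; the copies are the price, which pays off when the tile is reused by several tasks.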

Page 39

•  The potential of

•  Data distribution

•  Dynamic association management

•  Automatically achieved

[Figure: IPC histogram; panels: Regions, Reshaped Regions, NUMA-aware Reshaped Regions; data distributed across nodes]

Page 40

Invitation

Page 41

•  Towards EXaflop applicaTions

•  Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way.

•  Deploy at supercomputing centers: Jülich, EPCC, HLRS, BSC

•  Port Applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms.

•  Develop additional environment capabilities

•  tools (debug, performance)

•  improvements in runtime systems (load balance and GPUSs)

•  Support other users

•  Identify users of TEXT applications

•  Identify and support interested application developers

•  Contribute to Standards (OpenMP ARB, PERI-XML)

Page 42

•  Language

•  Transport / integrate information

•  Experience → good “predictor”

•  Structures mind

•  We probably all agree on the general objectives, desirable characteristics, …

•  Portability, abstraction, flexibility, modularity, locality,...

•  We may differ in details …

•  …although the difference between “detail” and “fundamental” is often subtle, subjective,…

• … and details are extremely important

• Very “simple/small/subtle” differences → very important impact

Neanderthal high larynx

H. sapiens low larynx

Wish we were able to develop language:

Communicated to humans Express Ideas Exploitable by machines

Separate

Algorithm, Ideas

Performance optimization, hints