PARRAY: The Array-Based GPU Programming Technology

Yifeng Chen

School of EECS, Peking University, China.

Two Conflicting Approaches for Programmability in HPC

Top-down Approach
• Core programming model is high-level (e.g. a functional parallel language).
• Must rely on heavy heuristic runtime optimization.
• Adds low-level program constructs to improve low-level control.
• Risks:
  • Programmers tend to avoid using “extra” constructs.
  • Low-level controls do not fit well into the core model.

Bottom-up Approach (PARRAY)
• Core programming model exposes the memory hierarchy.
• Same algorithm, same performance, same intellectual challenge, but shorter code.
• Runtime optimization is possible, but not part of the core model.

Basic Notation

• Dimensions in a tree
• A dimension may refer to another array type.
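One way to read this notation: a type such as `[2][[2048][4096]]`, which appears in the examples later in this deck, is a tree whose second dimension is itself split into two sub-dimensions. The C sketch below is not PARRAY code and not its generated output. It is a hypothetical illustration, assuming a row-major layout, of how one index per leaf dimension could flatten to a memory offset. The names `dim_t`, `dim_size` and `dim_offset` are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dimension-tree node (not part of PARRAY itself).
 * A leaf has nsub == 0 and a fixed extent; an inner node is the
 * row-major product of its sub-dimensions. */
typedef struct dim {
    size_t extent;           /* used only by leaves */
    size_t nsub;             /* number of sub-dimensions */
    const struct dim *sub;   /* array of nsub children */
} dim_t;

/* Total number of elements covered by a dimension. */
size_t dim_size(const dim_t *d) {
    if (d->nsub == 0) return d->extent;
    size_t n = 1;
    for (size_t i = 0; i < d->nsub; ++i) n *= dim_size(&d->sub[i]);
    return n;
}

/* Flatten one index per leaf (depth-first, left to right) into a
 * row-major offset. *idx is advanced past the indices consumed. */
size_t dim_offset(const dim_t *d, const size_t **idx) {
    if (d->nsub == 0) {
        size_t i = *(*idx)++;
        assert(i < d->extent);
        return i;
    }
    size_t off = 0;
    for (size_t k = 0; k < d->nsub; ++k) {
        size_t sub_off = dim_offset(&d->sub[k], idx);
        off = off * dim_size(&d->sub[k]) + sub_off;
    }
    return off;
}
```

Under these assumptions, the index (1, 3, 5) into `[2][[2048][4096]]` flattens to 1 × (2048 × 4096) + 3 × 4096 + 5. Treating the inner `[[2048][4096]]` node as a single dimension of 2048 × 4096 elements is what lets another array type refer to it as a whole.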

Motivating Examples for PARRAY

Thread Arrays

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G
float* host;
_pa_pthd* p;
#mainhost{
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}

Labels on the slide: pthread_create, sem_post, sem_wait, pthread_join

Generating CUDA+Pthread
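The labels pthread_create, sem_post, sem_wait and pthread_join suggest the POSIX calls that the thread-array example expands into. The slides do not show the generated code, so the following is only a plausible hand-written shape for it, under stated assumptions: `memcpy` stands in for the CUDA host-to-device copy, `CHUNK` is shrunk from 2048*4096 to 8, and `worker_t` and `scatter_with_threads` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <string.h>

#define NTHREADS 2   /* mirrors pthd [2] */
#define CHUNK    8   /* small stand-in for 2048*4096 floats */

typedef struct {
    int tid;
    const float *host;   /* whole host array: NTHREADS*CHUNK floats */
    float dev[CHUNK];    /* stand-in for a per-GPU device buffer */
    sem_t go, done;
} worker_t;

static void *worker(void *arg) {
    worker_t *w = arg;
    sem_wait(&w->go);    /* wait until the main thread enters the detour */
    /* A real build would issue a CUDA copy here; memcpy is a stand-in. */
    memcpy(w->dev, w->host + (size_t)w->tid * CHUNK, sizeof w->dev);
    sem_post(&w->done);  /* hand control back to the main thread */
    return NULL;
}

/* Scatter host[] across the workers; returns 0 on success, -1 otherwise. */
int scatter_with_threads(const float *host, float out[NTHREADS][CHUNK]) {
    assert(host != NULL && out != NULL);
    worker_t w[NTHREADS];
    pthread_t th[NTHREADS];
    int started = 0;
    for (int t = 0; t < NTHREADS; ++t) {
        w[t].tid = t;
        w[t].host = host;
        sem_init(&w[t].go, 0, 0);
        sem_init(&w[t].done, 0, 0);
    }
    for (int t = 0; t < NTHREADS; ++t) {
        if (pthread_create(&th[t], NULL, worker, &w[t]) != 0) break;
        ++started;
    }
    for (int t = 0; t < started; ++t) sem_post(&w[t].go);
    for (int t = 0; t < started; ++t) sem_wait(&w[t].done);
    for (int t = 0; t < started; ++t) {
        pthread_join(th[t], NULL);
        memcpy(out[t], w[t].dev, sizeof w[t].dev);
    }
    for (int t = 0; t < NTHREADS; ++t) {
        sem_destroy(&w[t].go);
        sem_destroy(&w[t].done);
    }
    return started == NTHREADS ? 0 : -1;
}
```

Each worker receives the contiguous slice of the host array selected by its thread id. That is the scatter-like movement the `DataTransfer(dev, G, host, H)` line appears to describe when `G` is `[#P][#D]`.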

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host;
_pa_mpi* m;

#mainhosts{
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}

Generating MPI or IB/verbs

MPI_Scatter

ALLTOALL

BCAST

Other Communication Patterns

One-Line CUDA Code

• Large-scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem

(Before Nov 2011)

Direct Simulation of Turbulent Flows

Scale
• Up to 14336 3D single-precision
• 12 distributed arrays, each with 11 TB of data (128 TB total)
• Entire Tianhe-1A with 7168 nodes

Progress
• 4096 3D completed
• 8192 3D half-way, and 14336 3D tested for performance.

Software Technologies
• PARRAY (ACM PPoPP’12): code is only 300 lines.
• Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
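The storage figures quoted for this simulation can be checked with a few lines of arithmetic. The sketch below assumes that "14336 3D" denotes a 14336 × 14336 × 14336 grid of 4-byte single-precision values, which the slide does not spell out. The helper names `grid_bytes` and `to_tib` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed by one n x n x n grid with elem_bytes per element. */
uint64_t grid_bytes(uint64_t n, uint64_t elem_bytes) {
    /* Bounds keep n*n*n*elem_bytes within 64 bits. */
    assert(n <= (1u << 20) && elem_bytes <= 8);
    return n * n * n * elem_bytes;
}

/* Convert a byte count to TiB (2^40 bytes). */
double to_tib(uint64_t bytes) {
    return (double)bytes / 1099511627776.0;
}
```

Under that assumption, one array takes 10.71875 TiB and twelve take 128.625 TiB, in line with the quoted "11 TB" per array and "128 TB total". Spread over 7168 nodes, that is about 18.4 GiB per node.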

Discussions

Can other programming models benefit from PARRAY ideas?
• MPI (more expressive datatypes)
• OpenACC (optimization for coalescing accesses)
• PGAS (generating PGAS library calls)
• IB/verbs (directly generating zero-copy IB calls)

PARRAY helps, if you can write it down!
• Any index expressions using add/mul/mod/div
• Irregular structures must be encoded into arrays to benefit from PARRAY.
• Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
• Macros are compiled out: no performance loss.
• Typical training = 3 days; friendly to engineers, geophysicists…
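As an illustration of an index expression built only from add, mul, mod and div, the sketch below maps a linear position in the transposed view of a row-major array back to its physical offset. It is hand-written C, not PARRAY output, and the function names `transposed_offset` and `transpose` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Physical offset, in a row-major rows x cols array, of the element
 * at linear position j of its transposed (cols x rows) view. */
size_t transposed_offset(size_t j, size_t rows, size_t cols) {
    assert(rows > 0 && j < rows * cols);
    size_t c = j / rows;   /* row index within the transposed view */
    size_t r = j % rows;   /* column index within the transposed view */
    return r * cols + c;
}

/* Out-of-place transpose driven purely by the index expression. */
void transpose(const float *src, float *dst, size_t rows, size_t cols) {
    for (size_t j = 0; j < rows * cols; ++j)
        dst[j] = src[transposed_offset(j, rows, cols)];
}
```

Because the mapping is closed-form arithmetic, a tool could in principle inline it at compile time, which fits the point that macros are compiled out with no performance loss.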
