Directive-based approach to heterogeneous computing
Ruyman Reyes Castro
High Performance Computing Group, University of La Laguna
December 19, 2012
TOP500 Performance Development List
2 / 83
Applications Used in HPC Centers
Usage of HECToR by Area of Expertise
3 / 83
Real HPC Users
Most Used Applications in HECToR
Application          % of total jobs   Language   Prog. Model
VASP                 17%               Fortran    MPI+OpenMP
CP2K                 7%                Fortran    MPI+OpenMP
Unified Model (UM)   7%                Fortran    MPI
GROMACS              4%                C++        MPI+OpenMP
I Large code-bases
I Complex algorithms implemented
I Mixture of different Fortran flavours
4 / 83
Knowledge of Programming
Survey conducted in the Swiss National Supercomputing Centre (2011)
5 / 83
Are application developers using the proper tools?
6 / 83
Complexity Arises (I)
7 / 83
Directives: Enhancing Legacy Code (I)
OpenMP Example
...
#pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5 * dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}
...
8 / 83
Complexity Arises (II)
9 / 83
Re-compiling the code is no longer enough to continue improving the performance
10 / 83
Porting Applications To New Architectures
Programming CUDA (Host Code)
float a_host[n], b_host[n];
float *a, *b;
// Allocate device memory
cudaMalloc((void **)&a, n * sizeof(float));
cudaMalloc((void **)&b, n * sizeof(float));
// Transfer input data to the device
cudaMemcpy(a, a_host, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b, b_host, n * sizeof(float), cudaMemcpyHostToDevice);
// Define grid shape
int blocks = 100;
int threads = 128;
// Execute
kernel<<<blocks, threads>>>(a, b, c);
// Copy the result back
cudaMemcpy(b_host, b, n * sizeof(float), cudaMemcpyDeviceToHost);
// Clean up
cudaFree(a);
cudaFree(b);
11 / 83
Porting Applications To New Architectures
Programming CUDA (Kernel Source)
// Kernel code
__global__ void kernel(float *a, float *b, float c)
{
    // Get the index of this thread
    unsigned int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Do the computation
    b[index] = a[index] * c;
    // Wait for all threads in the block to finish
    __syncthreads();
}
12 / 83
Programmers need faster ways to migrate existing code
13 / 83
Why not use directive-based approaches for these new heterogeneous architectures?
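For contrast, here is a minimal directive-based sketch of the same vector-scaling computation shown in the CUDA example above, written with OpenACC-style syntax that only appears later in the talk; the function name and clause choices are illustrative assumptions, not taken from the slides. The explicit allocation, transfers and grid configuration all disappear:

/* Hedged sketch: directive-based version of the CUDA example above.
 * The function name and clauses are illustrative assumptions. */
void scale(int n, const float *a, float *b, float c)
{
    /* One directive expresses the offload: data movement and the
     * kernel/grid configuration are left to the compiler and runtime. */
    #pragma acc kernels loop copyin(a[0:n]) copyout(b[0:n])
    for (int i = 0; i < n; i++)
        b[i] = a[i] * c;
}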
14 / 83
Overview of Our Work
We can’t solve problems by using the same kind of thinking we used when we created them.
Albert Einstein
The field is undergoing rapid changes: we have to adapt to them
1. Hybrid MPI+OpenMP (2008)
   → Usage of directives in cluster environments
2. OpenMP extensions (2009)
   → Extensions of OpenMP/La Laguna C (llc) for heterogeneous architectures
3. Directives for accelerators (2011)
   → Specific accelerator-oriented directives
   → OpenACC (December 2011)
15 / 83
Outline
Hybrid MPI+OpenMP
  llc and llCoMP
  Hybrid llCoMP
  Computational Results
  Technical Drawbacks
OpenMP-to-GPU
Directives for Accelerators
Conclusions
Future Work and Final Remarks
La Laguna C: llc
What is llc?
I Directive-based approach to distributed memory environments
I OpenMP compatible
I Additional set of extensions to address particular features
I Implemented FORALL loops, Pipelines, Farms . . .
Reference
[48] Dorta, A. J. Extension del modelo de OpenMP a memoria distribuida. PhD Thesis, Universidad de La Laguna, December 2008.
17 / 83
Chronological Perspective (Late 2008)
Cores per Socket - System Share Accelerator - System Share
18 / 83
A Hybrid OpenMP+MPI Implementation
Same llc code, extended llCoMP implementation
I Directives are replaced by a set of parallel patterns
I Improved performance on multicore systems
  → Better usage of inter-core memories (i.e. cache)
  → Lower memory requirements compared to replicated memory on MPI
Translation
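As a rough illustration of this translation (a minimal sketch, not the actual llCoMP output; the block distribution and names are assumptions), a parallel loop like the molecular dynamics update shown earlier could be lowered to an MPI block distribution of the iterations, with the original OpenMP worksharing kept inside each process:

#include <mpi.h>

/* Hedged sketch of the hybrid pattern: the iteration space is block-distributed
 * among MPI processes and each process runs its block with OpenMP threads. */
void update(int np, int nd, double **pos, double **vel, double **a,
            double **f, double rmass, double dt)
{
    int rank, size, i, j;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block distribution of the np iterations among the MPI processes */
    int chunk = (np + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < np) ? lo + chunk : np;

    /* Each process exploits its cores with the original OpenMP worksharing */
    #pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
    for (i = lo; i < hi; i++)
        for (j = 0; j < nd; j++) {
            pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
            vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
            a[i][j]   = f[i][j] * rmass;
        }

    /* The owned blocks would then be exchanged among processes
       (e.g. with MPI_Allgatherv); omitted here for brevity. */
}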
19 / 83
llc Code Example
llc Implementation of the Mandelbrot Set Computation
...
#pragma omp parallel for default(shared) reduction(+:numoutside) \
        private(i, j, ztemp, z) shared(nt, c)
#pragma llc reduction_type (int)
for (i = 0; i < npoints; i++) {
    z.creal = c[i].creal; z.cimag = c[i].cimag;
    for (j = 0; j < MAXITER; j++) {
        ztemp = (z.creal*z.creal) - (z.cimag*z.cimag) + c[i].creal;
        z.cimag = z.creal * z.cimag * 2 + c[i].cimag;
        z.creal = ztemp;
        if (z.creal * z.creal + z.cimag * z.cimag > THRESHOLD) {
            numoutside++;
            break;
        }
    }
...
20 / 83
Hybrid MPI+OpenMP performance
21 / 83
Technical Drawbacks
llCoMP
I The original source-to-source (StS) design of llCoMP was not flexible enough
I Traditional two-pass compiler
I Excessive effort to implement new features
I Need more advanced features to implement GPU code generation
22 / 83
Back to the Drawing Board
23 / 83
Outline
Hybrid MPI+OpenMP
OpenMP-to-GPU
  Related Work
  Yet Another Compiler Framework (YaCF)
  Computational Results
  Technical Drawbacks
Directives for Accelerators
Conclusions
Future Work and Final Remarks
Chronological Perspective (Late 2009)
Cores per Socket - System Share Accelerator - System Share
25 / 83
Related Work
Other OpenMP-to-GPU translators: OpenMPC
[82] Lee, S., and Eigenmann, R. OpenMPC: Extended OpenMP programming and tuning for GPUs. In SC'10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, pp. 1-11.
Other compiler frameworks: Cetus, LLVM
[84] Lee, S., Johnson, T. A., and Eigenmann, R. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, volume 2958 of LNCS (2003), pp. 539-553.
[81] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO'04. IEEE Computer Society, pp. 75-86.
26 / 83
YaCF: Yet Another Compiler Framework
Application programmer writes llc code
I Focus on data and algorithm
I Architecture independent
I Only needs to specify where the parallelism is
System engineer writes template code
I Focus on non-functional code
I Can reuse code from different patterns (i.e. via inheritance)
27 / 83
YaCF Software Architecture
28 / 83
Main Software Design Patterns
Implementing search and replacement in the IR
I Filter: looks for a specific pattern in the IR
  → E.g. looks for a pragma omp parallel construct
I Mutator: looks for a node and transforms the IR
  → E.g. applies loop transformations (nesting, flattening, . . . )
  → E.g. replaces a pragma omp for by a CUDA kernel call
I Filters and mutators can be composed to solve more complex problems
29 / 83
Dynamic Language and Tools
Key Idea: Features Should Require Only a Few Lines of Code
30 / 83
Template Patterns
Ease back-end implementation
<%def name="initialization(var_list, prefix = '', suffix = '')">
%for var in var_list:
  cudaMalloc((void **) (&${prefix}${var.name}${suffix}),
             ${var.numelems} * sizeof(${var.type}));
  cudaMemcpy(${prefix}${var.name}${suffix}, ${var.name},
             ${var.numelems} * sizeof(${var.type}),
             cudaMemcpyHostToDevice);
%endfor
</%def>
31 / 83
CUDA Back-end
Generates a CUDA kernel and memory transfers from the information obtained during the analysis
Supported syntax
I parallel, for and their condensed form implemented
I New directives to support manual optimizations (e.g. interchange)
I Syntax taken from an OpenMP proposal by BSC, UJI and others (#pragma omp target)
I copy_in, copy_out enable users to provide memory transfer information
I Generated code is human-readable
32 / 83
Example
Update Loop from the Molecular Dynamics Code
...
#pragma omp target device(cuda) copy(pos, vel, f) copy_out(a)
#pragma omp parallel for default(shared) private(i, j) firstprivate(rmass, dt)
for (i = 0; i < np; i++) {
    for (j = 0; j < nd; j++) {
        pos[i][j] = pos[i][j] + vel[i][j]*dt + 0.5*dt*dt*a[i][j];
        vel[i][j] = vel[i][j] + 0.5*dt*(f[i][j]*rmass + a[i][j]);
        a[i][j] = f[i][j] * rmass;
    }
}
33 / 83
Translation process
34 / 83
The Jacobi Iterative Method
error = 0.0;

{
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];

    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);
35 / 83
Jacobi OpenMP Source
error = 0.0;

#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);
36 / 83
Jacobi llCoMP v1
error = 0.0;
#pragma omp target device(cuda)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);
37 / 83
Jacobi llCoMP v2
error = 0.0;
#pragma omp target device(cuda) copy_in(u, f) copy_out(f)
#pragma omp parallel shared(uold, u, ...) private(i, j, resid)
{
    #pragma omp for
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            uold[i][j] = u[i][j];
    #pragma omp for reduction(+:error)
    for (i = 0; i < (m - 2); i++) {
        for (j = 0; j < (n - 2); j++) {
            resid = ...
            error += resid * resid;
        }
    }
}
k++;
error = sqrt(error) / (double) (n * m);
38 / 83
Jacobi Iterative Method
39 / 83
Technical Drawbacks
Limited to Compile-time Optimizations
I Some features require runtime information
  → Kernel grid configuration
I Orphaned directives were not possible
  → Would require an inter-procedural analysis module
I Some templates were too complex
  → And would need to be replicated to support OpenCL
40 / 83
Back to the Drawing Board
41 / 83
Outline
Hybrid MPI+OpenMP
OpenMP-to-GPU
Directives for Accelerators
  Related Work
  OpenACC
  Accelerator ULL (accULL)
Results
Conclusions
Future Work and Final Remarks
Chronological Perspective (2011)
Cores per Socket - System Share Accelerator - System Share
43 / 83
Related Work (I)
hiCUDA
I Translates each directive into a CUDA call
I It is able to use the GPU Shared Memory
I Only works with NVIDIA devices
I The programmer still needs to know hardware details
Code Example:
...
#pragma hicuda global alloc c[*][*] copyin

#pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
#pragma hicuda loop_partition over_tblock over_thread
for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
    for (j = 0; j < N; j++) {
        double sum = 0.0;
        ...

44 / 83
Related Work (II)
PGI Accelerator Model
I Higher level (directive-based) approach
I Fortran and C are supported
Code Example:
#pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
{
    #pragma acc region
    for (j = 0; j < n; j++)
        for (i = 0; i < l; i++) {
            double sum = 0.0;
            for (k = 0; k < m; k++)
                sum += b[i + k * l] * c[k + j * m];
            a[i + j * l] = sum;
        }
}
45 / 83
Our Ongoing Work at that Time: llcl
I Extending llc with support for heterogeneous platforms
I Compiler + Runtime implementation
  → The compiler generates runtime code
  → The runtime handles memory coherence and drives execution
I Compiler optimizations directed by an XML file
I More generic/higher level approach - not tied to GPUs
46 / 83
llcl: Directives
double *a, *b, *c;
...
#pragma llc context name("mxm") copy_in(a[n * l], b[l * m], \
        c[m * n], l, m, n) copy_out(a[n * l])
{
    int i, j, k;
    #pragma llc for shared(a, b, c, l, m, n) private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...
47 / 83
llcl: XML Platform Description File
<xml>
  <platform name="default">
    <region name="compute">
      <element name="compute_1" class="loop">
        <mutator name="Loop.LoopInterchange"/>
        <target device="cuda"/>
        <target device="opencl"/>
      </element>
    </region>
  </platform>
</xml>
48 / 83
OpenACC Announcement
49 / 83
OpenACC Announcement
50 / 83
OpenACC: Directives
double *a, *b, *c;
...
#pragma acc data copy_in(a[n * l], b[l * m], c[m * n], l, m, n) copy_out(a[n * l])
{
    int i, j, k;
    #pragma acc kernels loop private(i, j, k)
    for (i = 0; i < l; i++)
        for (j = 0; j < n; j++) {
            a[i + j * l] = 0.0;
            for (k = 0; k < m; k++)
                a[i + j * l] = a[i + j * l] + b[i + k * l] * c[k + j * m];
        }
}
...
51 / 83
Related Work
OpenACC Implementations (After Announcement)
I PGI - Released on February 2012
I CAPS - Released on March 2012
I Cray - To be released
  → Access to beta release available
We had a first experimental implementation in January 2012
52 / 83
accULL: Our OpenACC Implementation
accULL = YaCF + Frangollo
It is a two-layer implementation: Compiler + Runtime Library
53 / 83
Frangollo: the Runtime
Implementation
I Lightweight
I Standard C++ and STL code
I CUDA component written using the CUDA Driver API
I OpenCL component written using the C OpenCL interface
I Experimental features can be enabled/disabled at compile time
Handles
1. Device discovery, initialization, . . .
2. Memory coherence (registered variables)
3. Kernel execution (including grid shape)
54 / 83
Frangollo Layered Structure
55 / 83
Memory Management
// Creates a context to handle memory coherence
ctxt_id = FRG__createContext("name", ...);
...
// Register a variable within the context
FRG__registerVar(ctxt_id, &ptr, offset, size, constraints, ...);
...
// Execute the kernel
FRG__kernelLaunch(ctxt_id, "kernel", param_list, ...);
...
// Finish the context and reconcile variables
FRG__destroyContext(ctxt_id);
56 / 83
Kernel Execution
Loading the kernel
I Context may have from zero to N named kernels associated
I Runtime loads different versions of the kernel for each device
I Kernel is loaded depending on the platform where it is executed
Grid shape
I Grid shape is estimated using the compute intensity (CI): CI = Nmem / (Cost × Nflops)
  → E.g. Fermi: 512 GFlop/s DP peak, 144 GB/s memory bandwidth, Cost 3.5 (see the sketch after this list)
I Low CI → favors memory accesses
I High CI → favors computation
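A minimal sketch of how such an estimate could be used (an illustration of the heuristic described above, not Frangollo's actual code; the example kernel figures and the block-size decision are assumptions):

#include <stdio.h>

/* CI = Nmem / (Cost * Nflops), as defined above */
static double compute_intensity(double n_mem, double n_flops, double cost)
{
    return n_mem / (cost * n_flops);
}

int main(void)
{
    /* Figures quoted above for a Fermi C2050 */
    const double peak_gflops = 512.0;  /* double-precision peak, GFlop/s */
    const double bandwidth   = 144.0;  /* memory bandwidth, GB/s */
    const double cost        = 3.5;    /* platform-dependent cost factor */

    /* Hypothetical kernel: ~3 memory accesses and 2 flops per element */
    double ci = compute_intensity(3.0, 2.0, cost);

    /* Illustrative decision only: a low CI biases the grid shape towards
       hiding memory accesses, a high CI towards keeping the ALUs busy. */
    int threads_per_block = (ci < bandwidth / peak_gflops) ? 256 : 128;

    printf("CI = %.3f -> %d threads per block\n", ci, threads_per_block);
    return 0;
}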
57 / 83
Implementing OpenACC
Putting it all together
1. The compiler driver generates Frangollo interface calls from OpenACC directives (see the sketch after this list)
   → Converts data region directives into context creation
   → Generates host and device synchronization
2. Extracts the kernel code
3. Frangollo implements the OpenACC API calls
   → acc_init, acc_malloc/acc_free
4. Implements some optimizations
   → Compiler: loop invariant, skewing, strip-mining, interchange
   → Kernel extraction: divergence reduction, data-dependency analysis (basic)
   → Runtime: grid shape estimation, optimized reduction kernels
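A rough sketch of this mapping, reusing the runtime calls shown earlier (argument lists are simplified with "..." as in the previous snippet; the constraint constants and kernel name are hypothetical, not the exact generated code):

/* Annotated source */
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
    #pragma acc kernels loop
    for (i = 0; i < n; i++)
        b[i] = a[i] * c;
}

/* Roughly the host code the driver emits (sketch): the data region becomes a
   context, each data clause a registered variable, and the annotated loop an
   extracted, named kernel. */
ctxt_id = FRG__createContext("data_region_1", ...);
FRG__registerVar(ctxt_id, &a, 0, n * sizeof(float), FRG_COPY_IN, ...);   /* hypothetical constraint value */
FRG__registerVar(ctxt_id, &b, 0, n * sizeof(float), FRG_COPY_OUT, ...);  /* hypothetical constraint value */
FRG__kernelLaunch(ctxt_id, "kernels_loop_1", param_list, ...);
FRG__destroyContext(ctxt_id);  /* copy-back of b and cleanup */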
58 / 83
Building an OpenACC Code with accULL
59 / 83
Compliance with OpenACC Standard
Table: Compliance with the OpenACC 1.0 standard (directives)
Construct                          Supported by
kernels                            PGI, HMPP, accULL
loop                               PGI, HMPP, accULL
kernels loop                       PGI, HMPP, accULL
parallel                           PGI, HMPP
update                             Implemented
copy, copyin, copyout, . . .       PGI, HMPP, accULL
pcopy, pcopyin, pcopyout, . . .    PGI, HMPP, accULL
async                              PGI
deviceptr clause                   PGI
host                               accULL
collapse                           accULL
Table: Compliance with the OpenACC 1.0 standard (API)
API Call          Supported by
acc_init          PGI, HMPP, accULL
acc_set_device    PGI, HMPP, accULL (no effect)
acc_get_device    PGI, HMPP, accULL
60 / 83
Experimental Platforms
Garoe: A Desktop computer
I Intel Core i7 930 processor (2.80 GHz), 4 GB RAM
I 2 GPU devices attached:
  I Tesla C1060
  I Tesla C2050 (Fermi)
Peco: A cluster node
I Peco: 2 quad-core Intel Xeon E5410 (2.25 GHz) processors, 24 GB RAM
I Attached a Tesla C2050 (Fermi)
Drago: A shared memory system
I 4 Intel Xeon E7 4850 CPUs, 6 GB RAM
I Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

61 / 83
Software
Compiler versions (Pre-OpenACC)
I PGI Compiler Toolkit 12.2 with the PGI Accelerator Programming Model 1.3
I hiCUDA: 0.9
Compiler versions (OpenACC)
I PGI Compiler Toolkit 12.6
I CAPS HMPP: 3.2.3
62 / 83
Matrix Multiplication (M ×M) (I)
#pragma acc data name("mxm") copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i, j) collapse(2)
    for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop collapse(2) private(i, j, k)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}
63 / 83
Floating Point Performance for M×M in Peco
64 / 83
M×M (II)
#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N])
{
    #pragma acc kernels loop private(i)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(i)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(j)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i * L + j] += (b[i * L + k] * c[k * M + j]);
            }
}
65 / 83
M×M (III)
#pragma acc data copy(a[L*N]) copyin(b[L*M], c[M*N] ...)
{
    #pragma acc kernels loop private(i) gang(32)
    for (i = 0; i < L; i++)
        #pragma acc loop private(j) worker(32)
        for (j = 0; j < N; j++)
            a[i * L + j] = 0.0;
    /* Iterates over blocks */
    for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
            for (kk = 0; kk < M; kk += tile_size) {
                /* Iterates inside a block */
                #pragma acc kernels loop private(i) gang(32)
                for (j = jj; j < min(N, jj + tile_size); j++)
                    #pragma acc loop private(j) worker(32)
                    for (i = ii; i < min(L, ii + tile_size); i++)
                        for (k = kk; k < min(M, kk + tile_size); k++)
                            a[i*L+j] += (b[i*L+k] * c[k*M+j]);
            }
}
66 / 83
About Grid Shape and Loop Scheduling Clauses
Optimal gang/worker values (i.e., the grid shape) vary
I Among OpenACC implementations
I Among Platforms (Fermi vs Kepler?, NVIDIA vs ATI?)
I What happens if we implement a non-GPU accelerator?
I Our implementation ignores gang/worker and leaves the decision to the runtime
  → User can influence the decision with an environment variable
I It is possible to enable the gang/worker clauses in our implementation
  → Gang/worker feeds a strip-mining transformation forcing blocks/threads (WIP); see the sketch below
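A minimal sketch of the strip-mining such clauses would feed (illustrative only, not accULL's generated code; the strip size of 32 corresponds to the worker(32) clause):

/* Sketch: the annotated loop
 *     #pragma acc kernels loop gang(32) worker(32)
 *     for (i = 0; i < n; i++) b[i] = a[i] * c;
 * is strip-mined so the outer loop can be mapped to gangs/blocks and the
 * inner loop to workers/threads. Names are illustrative. */
void scale_strip_mined(int n, const float *a, float *b, float c)
{
    for (int is = 0; is < n; is += 32) {       /* one strip per gang/block        */
        int ie = (is + 32 < n) ? is + 32 : n;
        for (int i = is; i < ie; i++)          /* one iteration per worker/thread */
            b[i] = a[i] * c;
    }
}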
67 / 83
Effect of Varying Gang/Worker
68 / 83
OpenMP vs Frangollo+OpenCL in Drago
69 / 83
Needleman-Wunsch (NW)
I NW is a nonlinear global optimization method for DNA sequence alignments
I The potential pairs of sequences are organized in a 2D matrix
I The method uses Dynamic Programming to find the optimum alignment (see the sketch below)
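A minimal sketch of the dynamic-programming sweep and how it can be offloaded with a directive (the array and penalty names are illustrative, not taken from the benchmark source; initialization of the first row and column is omitted):

#define GAP_PENALTY 1   /* illustrative gap penalty */

/* Hedged sketch of the NW score-matrix fill: cells on the same anti-diagonal
 * are independent, so each diagonal can be computed in parallel. */
void nw_fill(int n, int score[n][n], const int sim[n][n])
{
    for (int d = 2; d <= 2 * (n - 1); d++) {            /* anti-diagonal index i+j */
        int ilo = (d - (n - 1) > 1) ? d - (n - 1) : 1;
        int ihi = (d - 1 < n - 1) ? d - 1 : n - 1;
        #pragma acc kernels loop independent             /* diagonal cells are independent */
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            int diag = score[i - 1][j - 1] + sim[i][j];
            int up   = score[i - 1][j] - GAP_PENALTY;
            int left = score[i][j - 1] - GAP_PENALTY;
            int best = diag > up ? diag : up;
            score[i][j] = best > left ? best : left;
        }
    }
}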
70 / 83
Performance Comparison of NW in Garoe
71 / 83
Overall Comparison
72 / 83
Outline
Hybrid MPI+OpenMP
OpenMP-to-GPU
Directives for Accelerators
Conclusions
Future Work and Final Remarks
Directive-based Programming
I Support for accelerators in the OpenMP standard may be added in the future
  → In the meantime, OpenACC can be used to port codes to GPUs
  → It is possible to combine OpenACC with OpenMP (see the sketch after this list)
I Generated code does not always match native-code performance
  → But it leverages the development effort while providing enough performance
I accULL is an interesting research-oriented implementation of OpenACC
  → First non-commercial OpenACC implementation
  → It is a flexible framework to explore optimizations, new platforms, . . .
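A minimal sketch of such a combination (illustrative only; names are assumptions): the streaming loop is offloaded with OpenACC while the remaining host work keeps using OpenMP threads.

/* Hedged sketch: OpenACC for the offloaded part, OpenMP for the host part. */
double scale_and_sum(int n, const float *a, float *b, float c)
{
    /* Accelerator part: directive-based offload of the data-parallel loop */
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; i++)
            b[i] = a[i] * c;
    }

    /* Host part: the follow-up reduction stays on the CPU cores with OpenMP */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += b[i];
    return sum;
}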
74 / 83
Outline
Hybrid MPI+OpenMP
OpenMP-to-GPU
Directives for Accelerators
Conclusions
Future Work and Final Remarks
Back to the Drawing Board?
76 / 83
accULL Still Has Some Opportunities
I Study support for multiple devices (either transparently or in OpenACC)
I Design an MPI component for the runtime
I Integration with other projects
I Improve the performance of the generated code (e.g. using polyhedral models)
I Enhance the support for Extrae/Paraver (experimental tracing already built-in)
77 / 83
Re-use our Know-how
Integrate OpenACC and OmpSs?
I The current OmpSs implementation does not automatically generate kernel code
I Integrating OpenACC syntax within tasks would enable automatic code generation (see the sketch after this list)
I Improve portability across accelerator platforms
I Leverage development effort
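A purely illustrative sketch of the idea (this does not exist in any implementation; the device clause and task syntax are assumptions modelled on OmpSs-style annotations):

/* Hypothetical combination: an OmpSs-style task whose body carries OpenACC
 * syntax, so the accelerator kernel could be generated automatically. */
#pragma omp target device(acc)            /* hypothetical accelerator target */
#pragma omp task in(a[0:n]) out(b[0:n])   /* OmpSs-style data dependences */
void scale_task(int n, const float *a, float *b, float c)
{
    #pragma acc kernels loop              /* OpenACC syntax inside the task */
    for (int i = 0; i < n; i++)
        b[i] = a[i] * c;
}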
78 / 83
Contributions
I Reyes, R. and de Sande, F. Automatic code generation for GPUs in llc. The Journal of Supercomputing 58, 3 (Mar. 2011), pp. 349-356.
I Reyes, R. and de Sande, F. Optimization strategies in different CUDA architectures using llCoMP. Microprocessors and Microsystems - Embedded Hardware Design 36, 2 (Mar. 2012), pp. 78-87.
I Reyes, R., Fumero, J. J., Lopez, I. and de Sande, F. accULL: an OpenACC implementation with CUDA and OpenCL support. In Euro-Par 2012 Parallel Processing - 18th International Conference, vol. 7484 of LNCS, pp. 871-882.
I Reyes, R., Fumero, J. J., Lopez, I. and de Sande, F. A Preliminary Evaluation of OpenACC Implementations. The Journal of Supercomputing (In Press).
79 / 83
Other contributions
I accULL has been released as an Open Source Project
  → http://cap.pcg.ull.es/accull
I accULL is currently being evaluated by Vector Fabrics
I Provided feedback to CAPS, which seems to be used in their current version
I Contacted by members of the OpenACC committee
I Two HPC-Europa2 visits by master students of our team
80 / 83
Acknowledgements
I Spanish MEC, Plan Nacional de I+D+i, contracts TIN2008-06570-C04-03 and TIN2011-24598
I Canary Islands Government, ACIISI, contract SolSubC200801000285
I TEXT Project (FP7-261580)
I HPC-EUROPA2 (project number: 228398)
I Universitat Jaume I de Castellon
I Universidad de La Laguna
I All members of GCAP
81 / 83
Thank you for your attention!
82 / 83
Directive-based approach to heterogeneous computing
Ruyman Reyes Castro
High Performance Computing Group, University of La Laguna
December 19, 2012