
VORPAL Optimizations for Petascale Systems

Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx, Stefan Muszala

Tech-X Corporation

Boyana Norris

Argonne National Lab

*[email protected]

*[email protected]

Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731


VORPAL: Introduction

• VORPAL plasma framework widely used on leadership-class systems
– Particle-in-cell (PIC) algorithm for kinetic plasma model
– Finite-Difference Time-Domain (FDTD) for Maxwell solver
– Access to various libraries (Trilinos, PETSc) for doing linear system solves (Electrostatic PIC)
– ADI methods for doing implicit Maxwell solve

• Self-consistent model for charged particles and electromagnetic field
– Electromagnetic field discretized on 3D Cartesian mesh
– Particles located anywhere in space
– Particles gather forces and scatter charges/currents to the field

• Parallelization via domain decomposition
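The gather and scatter steps mentioned above can be illustrated with a minimal 1D sketch. It assumes linear (cloud-in-cell) weighting on a periodic grid with positions measured in cell units; the function names `scatter_charge` and `gather_field` are hypothetical and are not taken from VORPAL, whose actual weighting scheme is not stated on the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Scatter: deposit charge q from a particle at position x (in cell units,
 * 0 <= x < n) onto the two nearest nodes of a periodic 1D grid, using
 * linear weights. Assumed scheme, for illustration only. */
void scatter_charge(double *rho, size_t n, double x, double q)
{
    assert(n > 0 && x >= 0.0 && x < (double)n);
    size_t cell = (size_t)x;             /* truncation is floor for x >= 0 */
    double w = x - (double)cell;         /* fractional distance from left node */
    rho[cell] += q * (1.0 - w);
    rho[(cell + 1) % n] += q * w;        /* periodic wrap at the right edge */
}

/* Gather: interpolate a node-centered field to the particle position
 * with the same linear weights used for the scatter. */
double gather_field(const double *field, size_t n, double x)
{
    assert(n > 0 && x >= 0.0 && x < (double)n);
    size_t cell = (size_t)x;
    double w = x - (double)cell;
    return field[cell] * (1.0 - w) + field[(cell + 1) % n] * w;
}
```

Using the same weights in both directions keeps the deposited charge equal to the particle charge and makes the interpolated force consistent with the deposit. A 3D mesh would apply the same idea with a product of weights per direction.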

VORPAL Optimization Challenges

• Petascale systems require strong scaling
– FDTD computationally cheap to start with
– Need good performance on small domain
⇒ Efficient messaging on small computational domains
⇒ Efficient computation and messaging for heterogeneous architectures

• If particles present, they are the main contribution to compute time
– Significant time savings via optimized particle-push algorithm
– Key challenge: finding good data layout

⇒ Optimize nearest neighbor messaging for small domain sizes
⇒ Optimize for heterogeneous architectures … GPUs

JumpShot Messaging Patterns in an FDTD Simulation

[Figure: two panels, "PETSc on BG/L" and "VORPAL on BG/P"]

Improving parallel efficiency with different messaging patterns

Conventional field messaging: Send and receive messages to/from all neighbors at once

Staged messaging: Send in one direction at a time, waiting for one direction to complete before starting the next

Reduces overall number of messages: 6 instead of 26
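A 3D sub-domain has 26 neighbors once faces (6), edges (12), and corners (8) are counted. The sketch below illustrates why staging needs only 6 messages: each stage forwards the ghost cells filled by the previous stages, so edge and corner data arrive indirectly. It is a single-process illustration under stated assumptions, not VORPAL's implementation: the domain is periodic and acts as its own neighbor, plain memory copies stand in for the sends and receives, and the names `N`, `G`, `idx`, `exchange_axis`, and `staged_exchange` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 4, G = N + 2 };   /* N interior cells per side, one ghost layer each side */

size_t idx(int i, int j, int k) { return ((size_t)i * G + j) * G + k; }

/* One stage: fill both ghost layers along `axis` (0 = x, 1 = y, 2 = z).
 * Axes exchanged in earlier stages are traversed over their full extent,
 * ghosts included, which is what forwards edge and corner data. */
int exchange_axis(double *f, int axis)
{
    assert(axis >= 0 && axis < 3);
    int lo[3], hi[3];
    for (int a = 0; a < 3; ++a) {
        lo[a] = (a < axis) ? 0 : 1;       /* include ghosts of finished axes */
        hi[a] = (a < axis) ? G - 1 : N;
    }
    int u = (axis + 1) % 3, v = (axis + 2) % 3;   /* the two transverse axes */
    for (int p = lo[u]; p <= hi[u]; ++p) {
        for (int q = lo[v]; q <= hi[v]; ++q) {
            int c[3];
            c[u] = p;
            c[v] = q;
            c[axis] = N;     double from_hi = f[idx(c[0], c[1], c[2])];
            c[axis] = 1;     double from_lo = f[idx(c[0], c[1], c[2])];
            c[axis] = 0;     f[idx(c[0], c[1], c[2])] = from_hi;
            c[axis] = G - 1; f[idx(c[0], c[1], c[2])] = from_lo;
        }
    }
    return 2;   /* a distributed run would send one message per direction */
}

/* Three stages, each completing before the next starts. */
int staged_exchange(double *f)
{
    int messages = 0;
    for (int axis = 0; axis < 3; ++axis)
        messages += exchange_axis(f, axis);
    return messages;
}
```

In a distributed run each stage would be a pair of point-to-point exchanges that must complete before the next stage begins. The trade-off is fewer but slightly larger messages, with some serialization between stages.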

Messaging results

• Staged messaging can be up to 5× faster for small domain sizes

• Similar performance on Cray XT4 and BG/P

[Figure: two panels, "Cray XT4" and "BG/P"]

Effect of E/B Field Memory Structure on FDTD Performance

Layout A: Ex(i,j,k), Ey(i,j,k), Ez(i,j,k), Ex(i,j,k+1), Ey(i,j,k+1), Ez(i,j,k+1), …

Layout B: Ex(i,j,k), Ex(i,j,k+1), …, Ey(i,j,k), Ey(i,j,k+1), …, Ez(i,j,k), Ez(i,j,k+1), …

Memory Layout is a key consideration for GPU optimization
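The difference between the two layouts comes down to index arithmetic. The sketch below assumes that Layout A interleaves the three components of each cell and that Layout B stores each component in its own contiguous run; that reading of the figure is an assumption, and the function names are hypothetical. `cell` is the flattened (i,j,k) cell index and `comp` is 0, 1, or 2 for x, y, z.

```c
#include <assert.h>
#include <stddef.h>

/* Layout A (interleaved): the three components of a cell are adjacent. */
size_t index_layout_a(size_t cell, size_t comp, size_t ncells)
{
    assert(comp < 3 && cell < ncells);
    return 3 * cell + comp;
}

/* Layout B (component-separated): all Ex values, then all Ey, then all Ez. */
size_t index_layout_b(size_t cell, size_t comp, size_t ncells)
{
    assert(comp < 3 && cell < ncells);
    return comp * ncells + cell;
}
```

With Layout B, neighboring cells of the same component sit at stride 1, so GPU threads that handle consecutive cells read consecutive addresses, the access pattern that global memory generally favors. With Layout A the same reads are at stride 3.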

Using GPUs for Accelerating FDTD

• FERMI: 8× improvement over Tesla-series GPUs for double precision (> 500 GFlops), likely 2× improvement in single precision (~2 TFlops)

• FERMI available in Spring 2010 ???

GPU Implementations of FDTD

Implementation 1: Generic 4 pt Stencil Kernels

    /* Each thread strides through the array; tid and nThreads come from the launch. */
    for (int i = tid; i < nx; i += nThreads) {
        float r = res[i];
        r += a1 * in1[i];
        r += a2 * in2[i];
        r += a3 * in3[i];
        r += a4 * in4[i];
        res[i] = r;   /* store the accumulated value (r already includes res[i]) */
    }

• Requires 6 calls to Generic Kernel to do the full FDTD update
• Makes no use of GPU memory hierarchy
• All memory accesses are global

Implementation 2: Yee Mesh Specific Kernels

    /* Faraday update: all three B components in one pass. */
    float ex, ey, ez;
    for (int i = tid; i < n; i += nThreads) {
        /* Load each E component once and reuse it in two updates. */
        ex = Ex[i];
        ey = Ey[i];
        ez = Ez[i];
        Bx[i] += dtOverDy * (ez - EzYp1[i]) + dtOverDz * (EyZp1[i] - ey);
        By[i] += dtOverDz * (ex - ExZp1[i]) + dtOverDx * (EzXp1[i] - ez);
        Bz[i] += dtOverDx * (ey - EyXp1[i]) + dtOverDy * (ExYp1[i] - ex);
    }

• Corresponding call to Ampere update
• Reuse 3 global memory accesses
• No use of shared memory or other trickery
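For checking a GPU kernel of this kind, a serial CPU reference is useful. The sketch below applies the same Faraday update on the host, under the assumption that names such as `EzYp1` refer to the same field array shifted by one cell in the named direction; here that shift is expressed as an index stride. The `YeeGrid` struct and `faraday_update` are hypothetical names for illustration and are not VORPAL or GPULib APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container: one flat array per field component,
 * with z as the fastest-varying index. */
typedef struct {
    size_t nx, ny, nz;        /* number of cells in each direction */
    float *Ex, *Ey, *Ez;
    float *Bx, *By, *Bz;
} YeeGrid;

/* Serial Faraday update over every cell that has a +1 neighbor in x, y, and z.
 * Cells on the upper boundaries are left untouched; boundary handling is
 * outside the scope of this sketch. */
void faraday_update(YeeGrid *g, float dtOverDx, float dtOverDy, float dtOverDz)
{
    assert(g->nx > 1 && g->ny > 1 && g->nz > 1);
    const size_t sz = 1, sy = g->nz, sx = g->ny * g->nz;   /* index strides */
    for (size_t ix = 0; ix + 1 < g->nx; ++ix) {
        for (size_t iy = 0; iy + 1 < g->ny; ++iy) {
            for (size_t iz = 0; iz + 1 < g->nz; ++iz) {
                size_t i = ix * sx + iy * sy + iz * sz;
                float ex = g->Ex[i], ey = g->Ey[i], ez = g->Ez[i];
                g->Bx[i] += dtOverDy * (ez - g->Ez[i + sy]) + dtOverDz * (g->Ey[i + sz] - ey);
                g->By[i] += dtOverDz * (ex - g->Ex[i + sz]) + dtOverDx * (g->Ez[i + sx] - ez);
                g->Bz[i] += dtOverDx * (ey - g->Ey[i + sx]) + dtOverDy * (g->Ex[i + sy] - ex);
            }
        }
    }
}
```

Running both versions on the same small grid and comparing the B arrays element by element is a simple way to catch sign or stride mistakes before tuning the kernel.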

Timings: FDTD Performance

[Figure: speedup plot]

Specifics

• Boundary Conditions/PMLs can be handled through a spatially dependent dielectric constant

• Dey-Mittra Cut Cell algorithms can be used through slight modifications to 4 pt Stencil Kernel.

GPULib (https://gpulib.txcorp.com)

• Collection of optimized vector routines that can perform all of the above-mentioned algorithms
– Accessible from within VORPAL or from high-level languages like IDL and MATLAB
– See Peter Messmer’s dinner-time presentation on GPULib at the NUG meeting (Wed. 10/7)

Future Work

• Fully optimize the FDTD algorithm

• Move to multiple GPUs so that VORPAL can take advantage of new heterogeneous systems

• VORPAL ADI algorithm on GPUs:
– GPULib has highly optimized tridiagonal and pentadiagonal solvers for “small” linear systems (<1000 unknowns) that are easily task-farmed out to GPU thread blocks. Potential for huge speedup vs. CPU implementation.

• Move Particle Push to GPU

• Vlasov Solver??
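The small tridiagonal systems mentioned for the ADI algorithm can be solved with the Thomas algorithm. The serial sketch below shows a single such solve; it is a textbook method, not GPULib's solver, and the name `solve_tridiagonal` is hypothetical. It assumes the matrix is diagonally dominant, since no pivoting is done.

```c
#include <assert.h>
#include <stddef.h>

/* Thomas algorithm for one tridiagonal system of n unknowns.
 *   a: sub-diagonal   (a[0] is unused)
 *   b: main diagonal  (overwritten)
 *   c: super-diagonal (c[n-1] is unused)
 *   d: right-hand side on input, solution on output
 * No pivoting: assumes a diagonally dominant matrix. */
void solve_tridiagonal(const double *a, double *b, const double *c,
                       double *d, size_t n)
{
    assert(n > 0);
    /* Forward elimination of the sub-diagonal. */
    for (size_t i = 1; i < n; ++i) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* Back substitution. */
    d[n - 1] /= b[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```

An ADI sweep produces one independent system per grid line, which is why many such solves can be distributed over GPU thread blocks, one or several systems per block.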
