Panini: A GPU Aware Array Class. Dr. Santosh Ansumali (JNCASR) & Priyanka Sah (NVIDIA)

Panini: A GPU Aware Array Class - GPU Technology


Page 1: Panini: A GPU Aware Array Class - GPU Technology

Panini: A GPU Aware Array Class. Dr. Santosh Ansumali (JNCASR) & Priyanka Sah (NVIDIA)

Page 2

Heterogeneous Computing

CPU

— Multicore

— Multiprocessor

— Cluster of Multicore

GPU

— CPU + GPU

MIC

Page 3

Background

Programming Efficiency

Performance

— Scalar

— Parallel

MATLAB / STL (C++)

Merit: easy to write code.

Demerit: performance issues; the code does not scale.

Array classes (C/C++): a smarter way of writing code.

Combine MATLAB-style object orientation with template metaprogramming.

Page 4

Background

Template metaprogramming

— Blitz++: fast, accurate numerical computing in C++

Advanced features (vector, array, matrix)

FORTRAN

MATLAB

Scalable

Disadvantage: large code size

POOMA: MPI-based

Page 5

Panini borrows from both MATLAB and Blitz.

Vector initialization, MATLAB style: a = 1, 2, 3

Expression evaluation: a = αb + βc

MATLAB way (temporaries):

— T1[i] = β*c[i]

— T2[i] = α*b[i]

— T3[i] = T1[i] + T2[i]

— a[i] = T3[i]

Blitz way (lazy evaluation, one fused loop):

— a[i] = α*b[i] + β*c[i]

— Small array (double a[20]): complete loop unrolling

— Large array (double *a): loop unrolling here would cause register spilling

For a = αb + βc + µd, naive operator overloading requires run-time type checking.

CRTP: avoids this type checking without resorting to virtual functions.

bsum(a,b) holds { &a[i], &b[i] }; a+b+c: bsum of X + X -> tuplesum
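The single fused loop sketched above can be demonstrated with a minimal host-side expression-template example. The class names (Vec, Add, Scale) are illustrative only, not Panini's API; expressions capture operands by reference and nothing is evaluated until assignment:

```cpp
#include <cstddef>
#include <vector>

// Expression node for elementwise addition; holds references only.
template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

// Expression node for scalar multiplication.
template <class E>
struct Scale {
    double s; const E& e;
    double operator[](std::size_t i) const { return s * e[i]; }
};

struct Vec {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0.0) : d(n, v) {}
    double operator[](std::size_t i) const { return d[i]; }
    // Assignment triggers the single fused loop: a[i] = expr[i],
    // with no temporaries materialized.
    template <class E>
    Vec& operator=(const E& expr) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = expr[i];
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <class E>
Scale<E> operator*(double s, const E& e) { return {s, e}; }
```

With this, `a = 2.0*b + 3.0*c;` builds a small expression tree and evaluates it in one pass, which is the essence of the "Blitz way" column above.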

Page 6

Vector Initialization

Type Checking

    template <class lhs, typename dataType>
    class scalarMultR : public baseET<dataType, scalarMultR<lhs, dataType> > {
    public:
        __device__ scalarMultR(const lhs& l, const dataType& r)
            : lhs_(l), rhs_(r) {}
        __device__ dataType value(int i) const {
            return lhs_.value(i) * rhs_;
        }
    private:
        const lhs& lhs_;
        const dataType rhs_;
    };

    template <class T, int len>
    class commaHelper {
    public:
        __device__ commaHelper() : vPtr(0) {}
        __device__ commaHelper(T* ptr) : vPtr(ptr) {}
        __device__ commaHelper& operator,(T val) {
            *vPtr++ = val;
            return *this;
        }
    private:
        T* vPtr;
    };

    __device__ commaHelper<dataType, N> operator=(dataType val) {
        for (int i = 0; i < numELEM; i++)
            data[i] = val;
        return commaHelper<dataType, N>(&data[1]);
    }
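The comma-initializer idiom above can be exercised on the host with a stripped-down sketch (tinyVec is an illustrative stand-in for the device-side class; as on the slide, operator= broadcasts its value and then hands back a helper pointing at element 1):

```cpp
template <typename T, int N>
class tinyVec {
public:
    class commaHelper {
    public:
        explicit commaHelper(T* ptr) : vPtr(ptr) {}
        // Each ",v" stores v and advances to the next slot.
        commaHelper& operator,(T val) { *vPtr++ = val; return *this; }
    private:
        T* vPtr;
    };

    // "a = v;" broadcasts v to every element; "a = v1, v2, v3;"
    // then overwrites the tail through the returned helper.
    commaHelper operator=(T val) {
        for (int i = 0; i < N; ++i) data[i] = val;
        return commaHelper(&data[1]);
    }
    T operator[](int i) const { return data[i]; }
private:
    T data[N];
};
```

Because `=` binds tighter than `,`, the statement `a = 1.5, 2.5, 3.5;` parses as `(a = 1.5), 2.5, 3.5`, which is exactly what makes the idiom work.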

Page 7

Objectives

Programmer productivity

— Rapidly develop complex applications

— Leverage parallel primitives

Encourage generic programming

High performance

— With minimal programmer effort

Interoperability

— Integrates with CUDA C/C++ code

Page 8

Panini Library

A generic parallel array class built on advanced generic-programming methodologies, where the details of parallelization are hidden inside the array class itself.

Allows the user to work with high-level physical abstractions for scientific computation.

Expression templates: the technique behind high-performance numerical libraries, in which abstract mathematical notation is expressed via operator overloading in C++.

Efficiently parallelizable for large-scale scientific code.

The expression-template mechanism is implemented using the "curiously recurring template pattern" (CRTP) in C++.

Page 9

What is Panini Library ?

C++ template library for CUDA

Supported data structures:

— 1D, 2D and complex vectors on CUDA

— 1D and 2D grids with multidimensional data on CUDA

SoA as well as AoS data structures

Template & operator overloading

Loop unrolling for small-size vectors

Lazy evaluation

Common subexpression elimination

Page 10

Containers/Objects Supported by Panini

Small-size vector on device

Large vector on device

Create complex arrays, 1D and 2D grids

    using namespace Panini;

    int main() {
        int nX = 200;
        int nY = 200;
        vectET<double> coordX(nX, 0.0);
        vectET<double> coordY(nY, 0.0);
        gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
        gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
        gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
        gridFlow2D<1> pressureM(nX, nY, 1, 1);
        gridFlow2D<1> pressureN(nX, nY, 1, 1);
        gridFlow2D<1> pressure(nX, nY, 1, 1);
        gridFlow2D<1> potentialO(nX, nY, 1, 1);
        // ...
    }

Page 11

Basic Feature of vectTiny/vectET Class

Direct Assignment

vectTiny <dataType, N> array

array = 1.2, 3.5, 5.6;

Binary operation

Scalar arithmetic operation

Math operation on vectTiny object

Type checking not required

Supports single- and double-precision floating-point values, complex numbers, Booleans, and 32-bit signed and unsigned integers.

Supports manipulating vectors, matrices, and N-dimensional arrays.

Page 12

Best Practice

Structure of Arrays

— Ensure memory coalescing

Array of Structure

Implicit Sequences

— Avoid explicitly storing and accessing regular patterns

— Eliminate memory accesses and storage

Page 13

DataType

vectTiny: small-size arrays where the user knows the size in advance.

— vectTiny<float, 100> a;

vectET: large arrays, or 1D to N-dimensional grids.

— Example: b is a grid of size 100 where every point holds a fixed array of size 3:

— vectET< vectTiny<myReal, 3> > b(100);

gridFlow2D: 2D or N-dimensional grids.

— gridFlow2D<T, FLOW_FIELD_2D>** myGrid;

Page 14

Allowed Operations

Three modes of initialization are provided:

— vectTiny<double, 3> a = 2;

— vectTiny<double, 3> a; a = 1, 2, 3;

— vectTiny<double, 3> b; b = a;

All math, binary and scalar operations:

— vectTiny<double, 3> a = 0.1, b, c; b = sin(a); c = a + b; c = 0.5*c;

All vector operations:

— vectTiny<double, 3> a, b, c, d; b = a + sin(c) + 0.3*cos(d);

Vector operations rely on the following optimizations: loop unrolling (by hand) and lazy evaluation.

References to the operand objects are kept until the final evaluation.
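The hand loop-unrolling mentioned above can be sketched with compile-time template recursion: the compiler expands the recursion, so no runtime loop counter remains for fixed-size vectors. This is a generic illustration, not Panini's actual vectTiny internals:

```cpp
// Unroll<N>::run(f) expands to f(0); f(1); ... f(N-1) at compile time.
template <int I>
struct Unroll {
    template <class F>
    static void run(F f) {
        Unroll<I - 1>::run(f);  // expand earlier iterations first
        f(I - 1);
    }
};

// Base case terminates the recursion.
template <>
struct Unroll<0> {
    template <class F>
    static void run(F) {}
};
```

For example, `Unroll<3>::run([&](int i) { a[i] = 2.0 * i; });` is fully unrolled, which is feasible for a small `double a[20]` but would spill registers for a large heap array, matching the trade-off on Page 5.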

Page 15

Allowed Operations…

Lazy Evaluation

— vectTiny<double, 3> a, b, c, d;

— b = sin(c) + 0.3*cos(d);

A typical operator-overloading + virtual-function approach would evaluate this in the following sequence:

— for(i = 0; i < N; i++) tmp1[i] = cos(d[i]);

— for(i = 0; i < N; i++) tmp2[i] = 0.3*tmp1[i];

— for(i = 0; i < N; i++) tmp3[i] = sin(c[i]);

— for(i = 0; i < N; i++) b[i] = tmp3[i] + tmp2[i];

Panini instead generates a single fused loop, like hand-optimized Fortran-style code:

— for(i = 0; i < N; i++)

— b[i] = sin(c[i]) + 0.3*cos(d[i]);

Page 16

Structure of Arrays

Coalescing improves memory efficiency

Accesses to arrays of arbitrary structures won’t coalesce

Reordering into structure of arrays ensures coalescing

— struct float3 { float x; float y; float z; }; float3 *aos; ... aos[i].x = 1.0f;

— struct float3_soa { float *x; float *y; float *z; }; float3_soa soa; ... soa.x[i] = 1.0f;
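The coalescing argument comes down to the stride between the x-components of consecutive elements: in AoS the whole struct sits between them, in SoA only one float. A small host-side sketch (illustrative names, not Panini's types) makes that stride measurable:

```cpp
#include <cstddef>

struct Float3 { float x, y, z; };   // array-of-structures element

struct Float3SoA {                  // structure of arrays
    float *x, *y, *z;
};

// Byte distance between x-components of neighbouring elements.
// AoS: a full struct apart; SoA: one float apart (coalescable).
inline std::ptrdiff_t aosStride(const Float3* a) {
    return reinterpret_cast<const char*>(&a[1].x)
         - reinterpret_cast<const char*>(&a[0].x);
}
inline std::ptrdiff_t soaStride(const float* x) {
    return reinterpret_cast<const char*>(&x[1])
         - reinterpret_cast<const char*>(&x[0]);
}
```

On the GPU, thread i reading `aos[i].x` therefore touches memory `sizeof(Float3)` bytes away from thread i+1's read, while `soa.x[i]` reads are contiguous and coalesce into few transactions.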

Page 17

Array of structures:

    struct Velocity
    {
        int ux;
        int uy;
    };

    Velocity<FLOW_FIELD_2D> obj_vel(nX, nY, 1, 1);

Structure of arrays (best practice):

    struct Pressure
    {
        float *pressure;
        float *pressureM;
        float *pressureN;
    };

    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);

Page 18

PlaceHolder Object

Implicit Sequences

— placeHolder IX(nX)

Often we need ranges following a sequential pattern

Constant ranges

[1, 1, 1, 1, ...]

Incrementing ranges

[0, 1, 2, 3, ...]
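Such implicit sequences can be modelled as objects that compute each element on demand instead of storing it, eliminating both the memory traffic and the storage. The class names below (CountingRange, ConstantRange) are illustrative, not Panini's placeHolder API:

```cpp
// Incrementing range [0, 1, 2, 3, ...]: the value IS the index,
// so nothing is stored beyond the logical length.
struct CountingRange {
    int n;                                   // logical length
    int operator[](int i) const { return i; }
};

// Constant range [c, c, c, ...]: one value serves every index.
struct ConstantRange {
    int n;                                   // logical length
    int c;
    int operator[](int) const { return c; }
};
```

Used inside an expression template, such a range behaves like any stored vector but costs no memory accesses, which is exactly the point of `placeHolder IX(nX)` on the slide.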

Page 19

How Panini Differs from ArrayFire

Static resolution.

The approach used in ArrayFire makes the library easier to develop, but carries a performance penalty.

Panini is at a very early stage.

Page 20

Navier-Stokes Example: How Easy It Is to Write Scientific Code Using Panini Data Structures

Initial Condition

    vectET<double> coordX(nX, 0.0);
    vectET<double> coordY(nY, 0.0);
    gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);
    gridFlow2D<1> potentialO(nX, nY, 1, 1);

    myGridN(iX,iY).value(UX) = -2.0*M_PI*kY*phi*cos(coordX[iX]*kX)*sin(coordY[iY]*kY);
    myGridN(iX,iY).value(UY) =  2.0*M_PI*kX*phi*sin(coordX[iX]*kX)*cos(coordY[iY]*kY);
    pressure(iX,iY) = -M_PI*M_PI*phi*phi*(kY*kY*cos(2.0*coordX[iX]*kX)
                                        + kX*kX*cos(2.0*coordY[iY]*kY));
    pressureM(iX,iY) -= 0.5*(myGridN(iX,iY).value(UX)*myGridN(iX,iY).value(UX)
                           + myGridN(iX,iY).value(UY)*myGridN(iX,iY).value(UY));

Page 21

Laplacian Equation

    template <int N>
    void getLaplacian(gridFlow2D<N> gridVar, gridFlow2D<N>& gridLap,
                      double c3, double c4, int iX, int iY)
    {
        gridLap(iX,iY) = gridVar(iX,iY)
            + c4*(gridVar(iX,iY+1) - 2.0*gridVar(iX,iY) + gridVar(iX,iY-1))
            + c3*(gridVar(iX+1,iY) - 2.0*gridVar(iX,iY) + gridVar(iX-1,iY));
    }
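A host-side analogue of the same stencil, on a flat std::vector instead of gridFlow2D, shows the discretization without the library types (the flat row-major indexing and the roles of c3/c4 as the x- and y-direction coefficients are assumptions based on the slide):

```cpp
#include <vector>

// 5-point stencil at one interior point (iX, iY) of an nX x nY grid
// stored row-major in a flat array; mirrors the slide's getLaplacian.
inline void getLaplacianHost(const std::vector<double>& in,
                             std::vector<double>& out,
                             int nX, int nY,
                             double c3, double c4,
                             int iX, int iY)
{
    (void)nX;  // kept for symmetry with the device signature
    auto at = [&](int x, int y) { return in[x * nY + y]; };
    out[iX * nY + iY] =
        at(iX, iY)
        + c4 * (at(iX, iY + 1) - 2.0 * at(iX, iY) + at(iX, iY - 1))
        + c3 * (at(iX + 1, iY) - 2.0 * at(iX, iY) + at(iX - 1, iY));
}
```

On a constant field the two difference terms vanish, so the output equals the input value, which is a quick sanity check for any stencil implementation.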

Page 22

Serial Code – CPU Timing

    No. of grid points   100 iterations (s)   200 iterations (s)
    1.00E+04             0.00126063           0.00126172
    4.00E+04             0.00625661           0.00624585
    1.60E+05             0.0439781            0.044118
    2.50E+05             0.0781446            0.0785149

Page 23

CPU Timing – MPI Version

    No. of processors   100 iterations (s)   200 iterations (s)
     1                  0.0743643            0.0755658
     2                  0.0580707            0.0579703
     4                  0.054078             0.0507001
     5                  0.0447405            0.0420167
    10                  0.0382128            0.0365341
    16                  0.0379704            0.0372657
    20                  0.0367649            0.0390902
    25                  0.0472415            0.0589682
    30                  0.0645379            0.0627601

MPI Version of Panini Code

Page 24

CPU Timing – MPI Version

    No. of processors   100 iterations (s)   200 iterations (s)
     1                  0.00741906           0.00716424
     2                  0.00583018           0.00561896
     4                  0.00725028           0.0067555
     5                  0.00634362           0.0071692
    10                  0.00856              0.010103
    16                  0.0102652            0.0102073
    20                  0.0114238            0.0103288
    25                  0.0104092            0.0103344
    30                  0.011299             0.0118635

MPI Version of Panini Code

Page 25

CPU Timing vs GPU Timing

    Grid size   CPU time, 100 iter. (s)   GPU time, 100 iter. (s)   Speed-up
    100 x 100   0.001260                  0.000441                  2.72x
    200 x 200   0.006256                  0.001279                  4.89x
    400 x 400   0.043978                  0.004311                  10.09x

Page 26

Page 27

Curiously recurring template pattern

    namespace Panini {

    template <typename dataType, class input>
    class baseET {
    public:
        typedef const input& inputRef;

        // Return a reference to the derived object.
        inline operator inputRef() const {
            return *static_cast<const input*>(this);
        }

        inputRef getInputRef() const {
            return static_cast<inputRef>(*this);
        }

        // Every derived class provides a member value(i).
        __device__ dataType value(const int i) const {
            return static_cast<inputRef>(*this).value(i);
        }
    };

    } // namespace Panini

This is the core of the vector design.

The basic idea is that every input class derives from this templated base class, with the derived class itself supplied as the template parameter.
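The static dispatch that baseET achieves can be shown with a minimal host-side CRTP sketch (BaseET and Ramp are illustrative names, not Panini's classes): the base casts itself to the derived type, so value(i) resolves at compile time, with no vtable or virtual call.

```cpp
// CRTP base: knows its derived type as a template parameter.
template <class Derived>
struct BaseET {
    double value(int i) const {
        // Static downcast to the derived class; resolved at compile time.
        return static_cast<const Derived&>(*this).value(i);
    }
};

// A concrete expression deriving from the base with itself as parameter.
struct Ramp : BaseET<Ramp> {
    double value(int i) const { return 1.5 * i; }
};

// Generic code written against the base still reaches Ramp::value
// without any virtual dispatch.
template <class D>
double third(const BaseET<D>& e) { return e.value(3); }
```

This is why the slides describe CRTP as avoiding type checking "without going to virtual functions": generic algorithms take `BaseET<D>&` and the compiler inlines the derived `value`, which is essential inside the inner loops of the expression templates.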