
Page 1: An Introduction to OpenACC Part II

 

An Introduction to OpenACC, Part II

Wei Feinstein

HPC User Services @ LSU

LONI Parallel Programming Workshop 2015, Louisiana State University

Page 2: An Introduction to OpenACC Part II

Roadmap

•  Recap of OpenACC
•  OpenACC with GPU-enabled libraries
•  Code profiling
•  Code tuning with performance tools
•  Programming on multiple GPUs

Page 3: An Introduction to OpenACC Part II

[Figure: pragma-based, high-level compiler-directive programming (OpenMP, OpenACC) targets multi-core CPUs, many-core processors, and GPUs for large-scale applications such as genomics, deep learning, and coastal storm prediction.]

Page 4: An Introduction to OpenACC Part II

Heterogeneous Programming on GPUs

[Figure: three ways to accelerate applications: libraries ("drop-in" acceleration), compiler directives (easily accelerate applications), and programming languages (maximum flexibility).]

Page 5: An Introduction to OpenACC Part II

OpenACC Execution Model

[Figure: application code is split between the CPU and the GPU. Compute-intensive functions enclosed in an accelerator region ($acc parallel ... $acc end parallel) are compiled into parallel code for the GPU, while the rest of the sequential code runs on the CPU.]

Page 6: An Introduction to OpenACC Part II

OpenACC Memory Model

■ Two separate memory spaces between host and accelerator
■ Data transfer is done by DMA; it is hidden from the programmer in OpenACC, so beware of:
   -  Latency
   -  Bandwidth
   -  Limited device memory size
■ Accelerator:
   -  No guarantee of memory coherence: beware of race conditions
   -  Cache management is done by the compiler; the user may give hints

Page 7: An Introduction to OpenACC Part II

Data Flow

[Figure: CPU and GPU memories connected by the PCIe bus.]

1. Copy input data from CPU memory to GPU memory

Page 8: An Introduction to OpenACC Part II

Data Flow

[Figure: CPU and GPU memories connected by the PCIe bus.]

1. Copy input data from CPU memory to GPU memory
2. Execute the GPU kernel

Page 9: An Introduction to OpenACC Part II

Data Flow

[Figure: CPU and GPU memories connected by the PCIe bus.]

1. Copy input data from CPU memory to GPU memory
2. Execute the GPU kernel
3. Copy results from GPU memory to CPU memory

Page 10: An Introduction to OpenACC Part II

Basic OpenACC Directives

C/C++:

    #pragma acc directive-name [clause [[,] clause]...]

Fortran:

    !$acc directive-name [clause [[,] clause]...]
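As an illustration (a minimal sketch, not from the slides), the directive name is followed by optional clauses; here a reduction clause and a data clause are attached to a combined parallel loop directive, with n and x assumed to be declared elsewhere:

    /* Illustrative only: directive name "parallel loop" plus two clauses. */
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i];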

Page 11: An Introduction to OpenACC Part II

"Kernels / Parallel" Constructs

•  Kernels

   C/C++:    #pragma acc kernels [clauses]
   Fortran:  !$acc kernels [clauses]

•  Parallel

   C/C++:    #pragma acc parallel loop [clauses]
   Fortran:  !$acc parallel loop [clauses]
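To make the difference concrete (an illustrative sketch, not taken from the slides): with kernels the compiler analyzes the enclosed loops and decides what is safe to parallelize, whereas parallel loop is an assertion by the programmer that the iterations are independent:

    /* Compiler decides whether (and how) to parallelize. */
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    /* Programmer asserts that the loop iterations are independent. */
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];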

Page 12: An Introduction to OpenACC Part II

"Data" Construct

Data: management of data transfer between the host and the device

C/C++:    #pragma acc data [clauses]
Fortran:  !$acc data [clauses]
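For example (a minimal sketch, not from the slides), a data region keeps arrays resident on the device across several compute regions, so they are not transferred for every loop; x, y, a, and n are assumed to be set up elsewhere:

    /* x is copied to the device once and y is copied back once;
       both compute regions reuse the device copies. */
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = 0.0f;

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }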

Page 13: An Introduction to OpenACC Part II

"host_data" Construct

•  Makes the address of device data available on the host
•  Specified variable addresses refer to device memory
•  Variables must be present on the device
•  Can only be used within a data region

C/C++:    #pragma acc host_data use_device(list)
Fortran:  !$acc host_data use_device(list)

Page 14: An Introduction to OpenACC Part II

OpenACC Compilers

OpenACC Standard

•  PGI compilers for C, C++ and Fortran
•  Cray CCE compilers for Cray systems
•  CAPS compilers

Page 15: An Introduction to OpenACC Part II

Page 16: An Introduction to OpenACC Part II

SAXPY

void saxpy(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Page 17: An Introduction to OpenACC Part II

void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    ...
    // Initialize vectors x, y
    #pragma acc kernels
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f; y[i] = 0.0f;
    }

    // Perform SAXPY
    saxpy_acc(n, a, x, y);
    ...
}

Saxpy_serial

Page 18: An Introduction to OpenACC Part II

void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    ...
    // Initialize vectors x, y
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f; y[i] = 0.0f;
    }

    // Perform SAXPY
    saxpy_acc(n, a, x, y);
    ...
}

Saxpy_openacc_v1: parallelize the loops

Page 19: An Introduction to OpenACC Part II

void saxpy_acc(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    ...
    // Initialize vectors x, y
    #pragma acc data create(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] = 1.0f; y[i] = 0.0f;
        }

        // Perform SAXPY
        saxpy_acc(n, a, x, y);
    }
    ...
}

Saxpy_openacc_v2: data management and parallelized loops

Page 20: An Introduction to OpenACC Part II

cublasSaxpy from the cuBLAS library

•  A function in the standard Basic Linear Algebra Subroutines (BLAS) library
•  cuBLAS: a GPU-accelerated drop-in library, ready to be used on GPUs

void cublasSaxpy(int          n,
                 const float *alpha,
                 const float *x,
                 int          incx,
                 float       *y,
                 int          incy)

Page 21: An Introduction to OpenACC Part II

Saxpy_openacc_v2 (as on page 19), shown side by side with Saxpy_cuBLAS:

extern void cublasSaxpy(int, float, float*, int, float*, int);

int main()
{
    ...
    // Initialize vectors x, y
    #pragma acc data create(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] = 1.0f; y[i] = 0.0f;
        }

        // Perform SAXPY
        #pragma acc host_data use_device(x, y)
        cublasSaxpy(n, 2.0, x, 1, y, 1);
    }
    ...
}

http://docs.nvidia.com/cuda

Page 22: An Introduction to OpenACC Part II

Saxpy_openacc_v2 (as on page 19), shown side by side with a variant of Saxpy_cuBLAS that uses deviceptr instead of host_data:

extern void cublasSaxpy(int, float, float*, int, float*, int);

int main()
{
    ...
    // Initialize vectors x, y
    #pragma acc data create(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] = 1.0f; y[i] = 0.0f;
        }

        // Perform SAXPY
        #pragma acc deviceptr(x, y)
        cublasSaxpy(n, 2.0, x, 1, y, 1);
    }
    ...
}

http://docs.nvidia.com/cuda

Page 23: An Introduction to OpenACC Part II

GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications

[Figure: library categories (Linear Algebra: FFT, BLAS, SPARSE, Matrix; Numerical & Math: RAND, Statistics; Data Structures & AI: Sort, Scan, Zero Sum; Visual Processing: Image & Video) and example GPU libraries: NVIDIA cuFFT, cuBLAS, cuSPARSE, NVIDIA Math Lib, NVIDIA cuRAND, NVIDIA NPP, NVIDIA Video Encode, GPU AI - Board Games, GPU AI - Path Finding.]

Page 24: An Introduction to OpenACC Part II

[Figure: the porting cycle: application profiling, parallelize loops with OpenACC, optimize data locality, optimize loop performance.]

Page 25: An Introduction to OpenACC Part II

The Himeno Code

●  3D Poisson equation solver
●  Iterative loop evaluating a 19-point stencil
●  Memory intensive, memory-bandwidth bound
●  Fortran and C implementations are available from http://accc.riken.jp/2467.htm
●  The scalar version is used here for simplicity
●  We will discuss the parallel version using OpenACC
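To give a feel for the kernel (an illustrative sketch only, not the actual Himeno source; the array names and bounds are made up), the core is a triply nested loop that updates each interior point from its neighbors, which is why the code is memory-bandwidth bound:

    /* Simplified stencil update; the real Himeno kernel uses a 19-point
       stencil with several coefficient arrays. Assumes an enclosing data
       region has already placed p and pn on the device. */
    #pragma acc parallel loop collapse(3) present(p, pn)
    for (int i = 1; i < imax - 1; ++i)
        for (int j = 1; j < jmax - 1; ++j)
            for (int k = 1; k < kmax - 1; ++k)
                pn[i][j][k] = ( p[i-1][j][k] + p[i+1][j][k]
                              + p[i][j-1][k] + p[i][j+1][k]
                              + p[i][j][k-1] + p[i][j][k+1] ) / 6.0f;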

 

Page 26: An Introduction to OpenACC Part II

[Figure: the porting cycle: application profiling, parallelize loops with OpenACC, optimize data locality, optimize loop performance.]

Page 27: An Introduction to OpenACC Part II

Application Profiling

•  pgprof - PGI performance profiler
•  gprof - GNU command-line profiler
•  nvprof - NVIDIA command-line profiler

pgprof:

    pgcc -Minfo=ccff -o yourcode_exe yourcode.c
    pgcollect yourcode_exe
    pgprof -exe yourcode_exe

gprof:

    gcc -pg -o yourcode_exe yourcode.c
    ./yourcode_exe
    gprof yourcode_exe gmon.out > yourcode_pro.output
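For nvprof, a typical invocation (assuming the executable was built with an OpenACC-enabled compiler) is simply:

    nvprof ./yourcode_exe
    nvprof --print-gpu-trace ./yourcode_exe    # per-kernel and per-transfer detail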

Page 28: An Introduction to OpenACC Part II

Page 29: An Introduction to OpenACC Part II

Page 30: An Introduction to OpenACC Part II

Application Profiling

•  pgprof - PGI visual profiler
•  gprof - GNU command-line profiler
•  nvprof - NVIDIA command-line profiler

pgprof:

    pgcc -Minfo=ccff -o yourcode_exe yourcode.c
    pgcollect yourcode_exe
    pgprof -exe yourcode_exe

gprof:

    gcc -pg -o yourcode_exe yourcode.c
    ./yourcode_exe
    gprof yourcode_exe gmon.out > yourcode_pro.output

Page 31: An Introduction to OpenACC Part II

Page 32: An Introduction to OpenACC Part II

Amdahl's Law
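The figure on this slide is not reproduced here. For reference, Amdahl's Law states that if a fraction p of the runtime is parallelized and that part is sped up by a factor s, the overall speedup is

    S = \frac{1}{(1 - p) + p/s}

so even with s arbitrarily large the speedup is bounded by 1/(1-p); for example, p = 0.9 caps the speedup at 10x.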

Page 33: An Introduction to OpenACC Part II

[Figure: the porting cycle: application profiling, parallelize loops with OpenACC, optimize data locality, optimize loop performance.]

Page 34: An Introduction to OpenACC Part II

Performance Profiling via NVVP

Page 35: An Introduction to OpenACC Part II

Performance Profiling via NVVP

Page 36: An Introduction to OpenACC Part II

Performance Profiling via NVVP

Page 37: An Introduction to OpenACC Part II

Performance

Why did OpenACC slow down here?

Page 38: An Introduction to OpenACC Part II

[Figure: the porting cycle: application profiling, parallelize loops with OpenACC, optimize data locality, optimize loop performance.]

Page 39: An Introduction to OpenACC Part II

Data Transfer

•  Data movement is expensive and can become a performance bottleneck
•  Minimize data movement
•  Data caching
   -  #pragma acc data copyin/copyout/copy
      Allocates memory on the device and copies data from host to device, back, or both
   -  #pragma acc data present
      Declares that the data is already present on the device (see the sketch below)
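As a sketch of the present clause (illustrative, not from the slides; scale is a made-up routine), data placed on the device by an enclosing data region can be reused without further transfers:

    void scale(int n, float a, float *x)
    {
        /* Assert that x is already on the device (put there by the
           caller's data region); no transfer happens here. */
        #pragma acc parallel loop present(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }

    ...
    #pragma acc data copy(x[0:n])
    {
        scale(n, 2.0f, x);
        scale(n, 0.5f, x);
    }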

Page 40: An Introduction to OpenACC Part II

Improved performance with better data locality

Page 41: An Introduction to OpenACC Part II

Improved performance with better data locality

Page 42: An Introduction to OpenACC Part II

Improved performance with better data locality

Page 43: An Introduction to OpenACC Part II

[Figure: the porting cycle: application profiling, parallelize loops with OpenACC, optimize data locality, optimize loop performance.]

Page 44: An Introduction to OpenACC Part II

Three Levels of Parallelism

OpenACC provides more detailed control over parallelization via the gang, worker, and vector clauses:

•  Gang:
   -  Share iterations across the gangs (grids) of a parallel region
•  Worker:
   -  Share iterations across the workers (warps) of a gang
•  Vector:
   -  Execute the iterations in SIMD mode

 

Page 45: An Introduction to OpenACC Part II

CUDA Kernels: Parallel Threads

•  A kernel is a function executed on the GPU as an array of threads in parallel
•  All threads execute the same code, but can take different paths
•  Each thread has an ID, used to:
   -  Select input/output data
   -  Control decisions

    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;

Page 46: An Introduction to OpenACC Part II

CUDA Kernels: Subdivide into Blocks

•  Threads are grouped into blocks

Page 47: An Introduction to OpenACC Part II

CUDA Kernels: Subdivide into Blocks

•  Threads are grouped into blocks
•  Blocks are grouped into a grid

Page 48: An Introduction to OpenACC Part II

CUDA Kernels: Subdivide into Blocks

•  Threads are grouped into blocks
•  Blocks are grouped into a grid
•  A kernel is executed as a grid of blocks of threads

Page 49: An Introduction to OpenACC Part II

MAPPING OPENACC TO CUDA

Page 50: An Introduction to OpenACC Part II

OpenACC Execution Model on CUDA

•  The OpenACC execution model has three levels: gang, worker, and vector
•  For GPUs, the mapping is implementation-dependent. One possibility:
   -  gang == block, worker == warp, and vector == threads of a warp
•  The mapping depends on what the compiler thinks is best for the problem

Page 51: An Introduction to OpenACC Part II

OpenACC Execution Model on CUDA

•  The OpenACC execution model has three levels: gang, worker, and vector
•  For GPUs, the mapping is implementation-dependent
•  Explicitly specifying that a given loop should map to gangs, workers, and/or vectors is optional
•  Further specifying the number of gangs/workers/vectors is also optional
•  So why do it? To tune the code to fit a particular target architecture in a straightforward and easily re-tuned way

Page 52: An Introduction to OpenACC Part II

Three Levels of Parallelism

On the parallel construct:

C/C++:    #pragma acc parallel [num_gangs() / num_workers() / vector_length()]
Fortran:  !$acc parallel [num_gangs() / num_workers() / vector_length()]

On the loop construct:

C/C++:    #pragma acc loop [gang(num_gangs) / worker(num_workers) / vector(vector_length)]
Fortran:  !$acc loop [gang(num_gangs) / worker(num_workers) / vector(vector_length)]
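As a concrete illustration (a sketch, not from the slides; the numbers are arbitrary tuning knobs), the clauses on the parallel construct set the sizes, and the clauses on the loop construct say which levels a loop uses:

    #pragma acc parallel num_gangs(256) vector_length(128)
    {
        /* Distribute iterations across gangs and vector lanes. */
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }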

Page 53: An Introduction to OpenACC Part II

Multiple GPU cards on a single node?

Page 54: An Introduction to OpenACC Part II

Device Management

•  Internal control variables (ICVs):
   -  acc-device-type-var → controls which type of accelerator is used
   -  acc-device-num-var → controls which accelerator device is used
•  Setting ICVs by API calls:
   -  acc_set_device_type(), acc_set_device_num()
•  Querying ICVs:
   -  acc_get_device_type(), acc_get_device_num()

Page 55: An Introduction to OpenACC Part II

Device Management: acc_get_num_devices

•  Returns the number of accelerator devices attached to the host; the argument specifies the type of devices to count

C:        int acc_get_num_devices(acc_device_t)
Fortran:  integer function acc_get_num_devices(devicetype)
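For example (a one-line sketch using the standard OpenACC runtime header):

    #include <openacc.h>

    /* Number of NVIDIA GPUs visible to the OpenACC runtime. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);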

Page 56: An Introduction to OpenACC Part II

Device Management: acc_set_device_num

•  Sets the ICV acc-device-num-var (ACC_DEVICE_NUM)
•  Specifies which device of the given type to use for the next region
•  Cannot be called inside a parallel, kernels, or data region

C:        void acc_set_device_num(int, acc_device_t)
Fortran:  subroutine acc_set_device_num(devicenum, devicetype)

Page 57: An Introduction to OpenACC Part II

Device Management: acc_get_device_num

•  Returns the value of the ICV acc-device-num-var (ACC_DEVICE_NUM), i.e. which device of the given type will be used for the next region
•  Cannot be called inside a parallel, kernels, or data region

C:        int acc_get_device_num(acc_device_t)
Fortran:  integer function acc_get_device_num(devicetype)

Page 58: An Introduction to OpenACC Part II

Directive-based programming on a single node with multi-GPU cards

[Code figure: SAXPY code with // initialization and // calculation sections.]
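The code itself did not survive extraction; the sketch below is a minimal illustration (not the workshop's source) of the hybrid OpenMP + OpenACC approach described on page 60, in which each OpenMP thread binds to one GPU with acc_set_device_num() and offloads its own slice of the SAXPY vectors:

    #include <openacc.h>
    #include <omp.h>

    void saxpy_multi(int n, float a, float *x, float *y)
    {
        int ngpus = acc_get_num_devices(acc_device_nvidia);
        if (ngpus < 1) ngpus = 1;   /* fall back to a single device */

        #pragma omp parallel num_threads(ngpus)
        {
            int gpu   = omp_get_thread_num();
            int chunk = (n + ngpus - 1) / ngpus;
            int start = gpu * chunk;
            int end   = (start + chunk < n) ? start + chunk : n;

            /* Bind this OpenMP thread to one GPU, then offload its slice. */
            acc_set_device_num(gpu, acc_device_nvidia);

            #pragma acc parallel loop copyin(x[start:end-start]) copy(y[start:end-start])
            for (int i = start; i < end; ++i)
                y[i] = a * x[i] + y[i];
        }
    }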

Page 59: An Introduction to OpenACC Part II

Directive-based programming on a single node with multi-GPU cards

Page 60: An Introduction to OpenACC Part II

Directive-based programming on a single node with multi-GPU cards

•  OpenACC itself targets one GPU at a time
•  Hybrid model:
   -  OpenACC + OpenMP to support multi-GPU parallel programming
•  Limitations:
   -  Lack of direct device-to-device communication

Page 61: An Introduction to OpenACC Part II

Conclusions

•  OpenACC is a powerful programming model based on compiler directives
•  Progressive, productive code porting
•  Portable and easy to maintain
•  Interoperability (e.g., with GPU-enabled libraries)
•  Advanced features provide deeper control

Page 62: An Introduction to OpenACC Part II

Introduction to OpenACC Part II: Lab

Page 63: An Introduction to OpenACC Part II

Getting Started

Connect to the shelob cluster:

    ssh [email protected]

Extract the lab to your account:

    tar xzvf /home/user/himeno.tar.gz

Change to the lab directory:

    cd himeno

Request an interactive node:

    qsub -I -A allocation -l walltime=2:00:00 -l nodes=1:ppn=16

Log in to the interactive node:

    ssh -X shelobxxx

Page 64: An Introduction to OpenACC Part II

Exercise 1

Goal: profile code to identify targets for parallelization (using your own code would be great).

    cd Practical1

pgprof (PGI visual profiler):

    pgcc -Minfo=ccff mycode.c -o mycode
    pgcollect mycode
    pgprof -exe mycode

Page 65: An Introduction to OpenACC Part II

Exercise 1 (continued)

Goal: profile code to identify targets for parallelization (using your own code would be great).

gprof (GNU profiler):

    gcc mycode.c -o mycode -pg
    ./mycode
    gprof mycode gmon.out > mycode_profile.output

Page 66: An Introduction to OpenACC Part II

Exercise 2

Goal: identify hot spots in your code to improve performance.

    cd Practical2

Compile the source code (e.g. version 1):

    make ver=01

Check performance at the command line (-Minfo=accel is turned on):

    ./himeno.01

Use the nvvp visual profiler by typing:

    nvvp

Page 67: An Introduction to OpenACC Part II

Exercise 3

Goal: use nvvp to fine-tune your code for better performance by adding more OpenACC directives.

    cd Practical3

Compile the source code (e.g. version 1):

    make ver=01

Use the nvvp visual profiler by typing:

    nvvp

Page 68: An Introduction to OpenACC Part II

Exercise 4

Goal: use OpenACC with a GPU-enabled library.

    cd Practical4

pgprof (PGI visual profiler):

    pgcc -Minfo=ccff mycode.c -o mycode
    pgcollect mycode
    pgprof -exe mycode

Page 69: An Introduction to OpenACC Part II

Exercise 5

Goal: use multiple GPU cards on a single node.

    cd Practical5