GRIFFON GPU PROGRAMMING API FOR SCIENTIFIC AND GENERAL PURPOSE PISIT MAKPAISIT 4909611727 SUPERVISOR : DR. WORAWAN DIAZ CARBALLO DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCE AND TECHNOLOGY, THAMMASAT UNIVERSITY

Griffon Topic2 Presentation (Tia)

Page 1: Griffon Topic2 Presentation (Tia)

GRIFFON GPU PROGRAMMING API FOR SCIENTIFIC AND GENERAL PURPOSE

PISIT MAKPAISIT 4909611727
SUPERVISOR: DR. WORAWAN DIAZ CARBALLO

DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCE AND TECHNOLOGY, THAMMASAT UNIVERSITY

Page 2: Griffon Topic2 Presentation (Tia)

04/08/2023

Griffon - GPU Programming API for Scientific and General Purpose

Motivation

• GPU-CPU performance gap
• GPGPU
• GPU programming model complexity

Page 3: Griffon Topic2 Presentation (Tia)

GPU-CPU performance gap

We all have a graphics card in our PC.
The processing unit on the graphics card is called the “GPU”.
Therefore, every PC has a GPU.
GPU performance is now pulling away from that of traditional processors.

http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf

Page 4: Griffon Topic2 Presentation (Tia)

GPGPU

General-Purpose computation on Graphics Processing Units

Very high computation and data throughput

Scalability

Page 5: Griffon Topic2 Presentation (Tia)

GPGPU Applications

• Simulation
• Finance
• Fluid Dynamics
• Medical Imaging
• Visualization
• Signal Processing
• Image Processing
• Optical Flow
• Differential Equation
• Linear Algebra
• Finite Element
• Fast Fourier Transform
• etc.

Page 6: Griffon Topic2 Presentation (Tia)

Vector Addition

Vector A:    1 5 6 8  9 1 2  3  6 5
Vector B:  + 5 4 1 1  5 6 5  8  9 2
Vector C:  = 6 9 7 9 14 7 7 11 15 7

Page 7: Griffon Topic2 Presentation (Tia)

Vector Addition (Sequential Code)

#include <stdio.h>
#include <stdlib.h>

#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *A, *B, *C;

    /* Memory allocate (filling A and B with input data is not shown) */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    /* Function call */
    VecAdd(A,B,C);

    /* Memory de-allocate */
    free(A);
    free(B);
    free(C);
    return 0;
}

Page 8: Griffon Topic2 Presentation (Tia)

Vector Addition (Sequential Code)

Vector A:    1 5 6 8  9 1 2  3  6 5
Vector B:  + 5 4 1 1  5 6 5  8  9 2
Vector C:  = 6 9 7 9 14 7 7 11 15 7

(Figure: the elements of Vector C are computed one addition at a time, from left to right.)

Page 9: Griffon Topic2 Presentation (Tia)

Improve Performance

We can speed up vector addition with parallel computing

Data parallelism – add all elements simultaneously

1st choice: multicore on CPU (OpenMP)

2nd choice: multicore on GPU (CUDA)

Page 10: Griffon Topic2 Presentation (Tia)

Vector Addition (OpenMP)

1. Sequential Code
2. Add Compiler Directive
3. Finish

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma omp parallel for
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    VecAdd(A,B,C);

    free(A);
    free(B);
    free(C);
    return 0;
}

Page 11: Griffon Topic2 Presentation (Tia)

Vector Addition (OpenMP)

Vector A:    1 5 6 8  9 1 2  3  6 5
Vector B:  + 5 4 1 1  5 6 5  8  9 2
Vector C:  = 6 9 7 9 14 7 7 11 15 7

(Figure: the element additions are divided among the CPU threads.)

Page 12: Griffon Topic2 Presentation (Tia)

Speed Up (Amdahl’s Law)

Execution time (Sequential): Vector Addition ~ 80%

Execution time (Parallel on CPU): Vector Addition New Exec. Time = Exec. Time / Core = 80% / 2
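To make the arithmetic on this slide concrete, the sketch below evaluates Amdahl's law for a given parallel fraction and core count. The function name `amdahl_speedup` and its parameters are illustrative only (they are not part of OpenMP, CUDA, or Griffon), and the sketch assumes the parallel portion scales perfectly with the number of cores.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `parallel_fraction` of the
 * sequential run time is spread over `cores` cores and the rest stays
 * sequential. Illustrative helper, not part of any library. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    double new_time;
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    /* Sequential part is unchanged; parallel part is divided by the cores. */
    new_time = (1.0 - parallel_fraction) + parallel_fraction / cores;
    return 1.0 / new_time;
}
```

With 80% of the run time spent in vector addition, two cores give a new run time of 20% + 80%/2 = 60% of the original, a speedup of about 1.67. Sixteen cores give 20% + 80%/16 = 25%, a speedup of 4.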

Page 13: Griffon Topic2 Presentation (Tia)

OpenMP

Easy, automatic thread management

Few threads on CPU

Page 14: Griffon Topic2 Presentation (Tia)

Vector Addition (GPU - CUDA)

Vector A on CPU:    1 5 6 8  9 1 2  3  6 5
Vector B on CPU:  + 5 4 1 1  5 6 5  8  9 2
Vector C on CPU:  = 6 9 7 9 14 7 7 11 15 7

(Figure: Vectors A and B are copied from CPU memory to GPU memory, the additions run on the GPU, and Vector C is copied from GPU memory back to CPU memory.)

Page 15: Griffon Topic2 Presentation (Tia)

Parallel Vector Addition on GPU (CUDA)

#include <stdio.h>
#include <stdlib.h>

#define SIZE 500

/* Declare kernel function */
__global__ void VecAdd(float* A, float* B, float* C){
    int idx = threadIdx.x;
    if(idx < SIZE)
        C[idx] = A[idx] + B[idx];
}

int main(){
    /* Declare variables */
    int size = SIZE * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    /* CPU memory allocate */
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    /* GPU memory allocate */
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    /* main() continues on the next slide */

Page 16: Griffon Topic2 Presentation (Tia)

Parallel Vector Addition on GPU (CUDA)

    /* Data transfer from CPU to GPU */
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    /* Kernel call */
    VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);

    /* Data transfer from GPU to CPU */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    /* CPU memory de-allocate */
    free(h_A);
    free(h_B);
    free(h_C);

    /* GPU memory de-allocate */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Page 17: Griffon Topic2 Presentation (Tia)

Speed Up (Amdahl’s Law)

Execution time (Sequential): Vector Addition ~ 80%

Execution time (Parallel on GPU): Vector Addition New Exec. Time = Exec. Time / Core = 80% / 16

Page 18: Griffon Topic2 Presentation (Tia)

CUDA

Faster, but takes more programming effort and time

Many threads on GPU

Page 19: Griffon Topic2 Presentation (Tia)

CUDA Memory Model

Global Memory – off-chip, large, shared by all threads, slow; the host can read and write it

Local Memory – private to one thread; note that it resides in off-chip device memory, so it is not faster than Global Memory (registers are the fast per-thread storage)

Shared Memory – shared by all threads in a block, faster than Global Memory

Page 20: Griffon Topic2 Presentation (Tia)

Griffon

Simple programming model (OpenMP)
+ Computing performance (GPU - CUDA)
= Easy and efficient (Griffon)

Page 21: Griffon Topic2 Presentation (Tia)

Parallel Vector Addition on GPU (Griffon)

1. Sequential Code
2. Add Compiler Directive
3. Finish

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma gfn parallel for
    for(i=0;i<SIZE;i++)
        C[i] = A[i] + B[i];
}

int main(){
    int size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    VecAdd(A,B,C);

    free(A);
    free(B);
    free(C);
    return 0;
}

So Easy !!

Page 22: Griffon Topic2 Presentation (Tia)

Griffon

Compiler directives for the C language

Source-to-source compiler

Automatic data management

Optimization

Page 23: Griffon Topic2 Presentation (Tia)

Objectives

Page 24: Griffon Topic2 Presentation (Tia)

Objectives (1/2)

To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises a) compiler directives and b) a source-to-source compiler.

Simple – The number of compiler directives does not exceed 20. The grammar of Griffon directives is similar to that of OpenMP, a standard shared-memory API.

Thread safety – The code generated by Griffon behaves correctly, i.e. equivalently to the sequential code.

Page 25: Griffon Topic2 Presentation (Tia)

Objectives (2/2)

To demonstrate that Griffon-generated code can gain reasonable performance over sequential code on two example applications: Pi calculation using numerical integration and Pi calculation using the Monte Carlo method.

Automatic – GPU memory management in the generated code is done automatically by Griffon.

Efficient – Code generated by Griffon should achieve the speedup predicted by Amdahl’s law, or differ from it by less than 20%.

Page 26: Griffon Topic2 Presentation (Tia)

Project Constraint

Page 27: Griffon Topic2 Presentation (Tia)

Project Constraint

Griffon is a C-language API that supports both Windows and Linux environments.

The generated executable program can run only on NVIDIA graphics cards.

Users can use Griffon together with OpenMP.

Page 28: Griffon Topic2 Presentation (Tia)

Related Works

Page 29: Griffon Topic2 Presentation (Tia)

Brook+ & CUDA

General-purpose computation on GPU

Manual kernel creation, data transfer, and management of the various GPU memories

Vendor dependent

Page 30: Griffon Topic2 Presentation (Tia)

OpenCL (Open Computing Language)

Cross-platform and vendor neutral

Approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors)

Data and task parallelism

Page 31: Griffon Topic2 Presentation (Tia)

OpenMP to GPGPU

Translates OpenMP applications into CUDA-based GPGPU applications

GPU optimization techniques – Parallel Loop-Swap and Loop-Collapsing, to enhance inter-thread locality

Page 32: Griffon Topic2 Presentation (Tia)

hiCUDA

Directive-based GPU programming language

Computation model for identifying the code regions that are executed on the GPU

Data model for allocating and de-allocating memory on the GPU and for data transfer

Page 33: Griffon Topic2 Presentation (Tia)

Methodology

• Software Architecture
• Directives
• Griffon Compilation Process
• Optimization Techniques

Page 34: Griffon Topic2 Presentation (Tia)

Software Architecture

NVCC is part of the Griffon toolchain.

The Griffon source-to-source compiler comprises a compile-time memory allocator and an optimizer.

(Figure: compilation flow)
Griffon C Application
  -> Griffon Compiler (Compile-time Memory Allocator, Optimizer)
  -> CUDA C Application
  -> NVCC (NVIDIA CUDA Compiler)
       PTX code -> PTX compiler -> GPU object code
       C code -> GCC (Linux), CL (MS Windows) -> CPU object code
  -> Executable

Page 35: Griffon Topic2 Presentation (Tia)

Directives

Page 36: Griffon Topic2 Presentation (Tia)

Griffon Directives

Parallel Region: define a parallel region

Control Flow: specify the kernel work flow

GPU/CPU Overlap Compute: define a region where the CPU computes while the GPU is working

Synchronous: define a synchronization point

Page 37: Griffon Topic2 Presentation (Tia)

Directives

General Form

#pragma gfn directive-name [clause[ [,] clause]...] new-line

Parallel Region

#pragma gfn parallel for [clause[ [,] clause]...] new-line
for-loops

Clause:
    kernelname(name)
    waitfor(kernelname-list)
    private(var-list)
    accurate([low,high])
    reduction(operator:var-list)

Page 38: Griffon Topic2 Presentation (Tia)

Parallel Region

Sequential code:

for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}

With the Griffon directive:

#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}

Page 39: Griffon Topic2 Presentation (Tia)

Kernel Flow Control

#pragma gfn parallel for kernelname( A )
#pragma gfn parallel for kernelname( B ) waitfor( A )
#pragma gfn parallel for kernelname( C ) waitfor( A )
#pragma gfn parallel for kernelname( D ) waitfor( B,C )

(Figure: dependency graph A -> B, A -> C, B -> D, C -> D.)

Kernels B and C can be computed in parallel

Page 40: Griffon Topic2 Presentation (Tia)

Synchronization

Synchronous Point

#pragma gfn barrier new-line

Atomic

#pragma gfn atomic new-line
assignment-statement

Parallel Reduction

#pragma gfn parallel for reduction(operator:var-list)

(Figure: processes P0 to P3 all wait at the barrier before any of them continues.)

Page 41: Griffon Topic2 Presentation (Tia)

Synchronization

Sequential code:

for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}

Option 1 (one parallel loop with a barrier):

#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Option 2 (two parallel loops):

#pragma gfn parallel for
for(i=1;i<N-1;i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i=1;i<N-1;i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Page 42: Griffon Topic2 Presentation (Tia)

Synchronization

Sequential code:

for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}

With the Griffon directive:

#pragma gfn parallel for \
    private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
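For context, the loop on this slide is the inner loop of the trapezoidal rule. The sketch below wraps it in a complete sequential function; the integrand 4 / (1 + x^2), whose integral over [0, 1] is pi, and the function name `trapezoid` are assumptions chosen for illustration, not taken from the Griffon sources. The directive appears only as a comment so that the block compiles with an ordinary C compiler.

```c
#include <assert.h>

/* Assumed integrand: the integral of 4 / (1 + x^2) over [0, 1] equals pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Trapezoidal rule over [a, b] with n intervals. The loop has the same
 * shape as the one on the slide: x is private to each iteration and
 * integral is a sum reduction. */
double trapezoid(double a, double b, int n)
{
    double h, integral, x;
    int i;
    assert(n >= 1);
    h = (b - a) / n;
    integral = (f(a) + f(b)) / 2.0;   /* end points count half */

    /* With Griffon, this loop would carry the directive from the slide:
     *   #pragma gfn parallel for private(x) reduction(+:integral)      */
    for (i = 1; i <= n - 1; i++) {
        x = a + (i * h);
        integral = integral + f(x);
    }
    return integral * h;
}
```

Calling `trapezoid(0.0, 1.0, 1000)` gives an estimate of pi that agrees with the true value to about six decimal places.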

Page 43: Griffon Topic2 Presentation (Tia)

GPU/CPU Overlap compute

#pragma gfn overlapcompute(kernelname) new-line
structure-block

(Figure: the CPU function runs in parallel with the many threads on the GPU until the GPU/CPU synchronization point.)

Page 44: Griffon Topic2 Presentation (Tia)

GPU/CPU Overlap compute

Sequential code:

for(i=0;i<N;i++){
    …
}
independenceCpuFunction();

With the Griffon directives:

#pragma gfn parallel for kernelname( calA )
for(i=0;i<N;i++){
    …
}
#pragma gfn overlapcompute( calA )
independenceCpuFunction();

Page 45: Griffon Topic2 Presentation (Tia)

Accurate Level

#pragma gfn parallel for accurate( [low, high] )

Use low when speed is important

Use high when precision is important

Default is high

Page 46: Griffon Topic2 Presentation (Tia)

Griffon Compilation Process

Page 47: Griffon Topic2 Presentation (Tia)

Create Kernel

Griffon source:

int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated code:

__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid [* 1 + 0] ;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    // Insert kernel call
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(..., (N - 1 - 0) / 1 + 1);
    return 0;
}

Page 48: Griffon Topic2 Presentation (Tia)

For-Loop Format and Thread Mapping

The for-loop must be in the format

for( index = min ; index <= max ; index += increment ){
    …
}

or

for( index = max ; index >= min ; index -= increment ){
    …
} // This case will be transformed to the first case

The number of threads can be calculated by the formula (as it appears in the generated kernel calls)

number of threads = (max - min) / increment + 1

Iterative index and thread mapping

__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
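The mapping can be checked on the CPU with plain integer arithmetic. The sketch below is a minimal illustration under two assumptions: a block size of 512 threads, as used in the generated kernel calls elsewhere in the deck, and a thread-count formula of (max - min) / increment + 1, read off those same calls. The helper names are hypothetical and are not part of Griffon.

```c
#include <assert.h>

#define THREADS_PER_BLOCK 512  /* block size seen in the generated kernel calls */

/* Number of iterations, and hence threads, for
 * for (index = min; index <= max; index += increment). */
int thread_count(int min, int max, int increment)
{
    assert(increment > 0 && max >= min);
    return (max - min) / increment + 1;
}

/* Number of thread blocks needed to cover `threads` threads (rounds up). */
int block_count(int threads)
{
    assert(threads >= 1);
    return (threads - 1 + THREADS_PER_BLOCK) / THREADS_PER_BLOCK;
}

/* Loop index handled by the thread with global id `tid`. */
int loop_index(int tid, int min, int increment)
{
    return tid * increment + min;
}
```

For a loop from 0 to 999 in steps of 5, this gives 200 threads in a single block, and the last thread (id 199) handles index 995.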

Page 49: Griffon Topic2 Presentation (Tia)

Private and shared variable management

Shared variables must be passed to the kernel function

Private variables must be declared in the kernel function

Declare GPU device variables for the shared variables

Size to allocate
    Static: the size given at declaration, e.g. int A[500];
    Dynamic: the size given to the allocation function (malloc, calloc, realloc)

Page 50: Griffon Topic2 Presentation (Tia)

Private and shared variable management

Griffon source:

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    #pragma gfn parallel for \
        private(x, y) reduction(+:sum)
    for(i=0;i<N;i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated code:

__global__ void __kernel_0(int * A, int * B, int * C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid [* 1 + 0] ;
    int x, y;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N] ;
    int * __d_A ,* __d_B ,* __d_C ;
    cudaMalloc((void**)&__d_C,sizeof(int) * N);
    cudaMalloc((void**)&__d_B,sizeof(int) * N);
    cudaMalloc((void**)&__d_A,sizeof(int) * N);

    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);

    cudaFree(__d_C);
    cudaFree(__d_B);
    cudaFree(__d_A);
    return 0;
}

Page 51: Griffon Topic2 Presentation (Tia)

Reduction variable management

Griffon source:

int main(){
    …
    #pragma gfn parallel for \
        reduction(+:sum)
    for(i=0;i<MAX;i++){
        ...
        sum += A[i];
        ...
    }
    ...
}

Generated code:

__global__ void __kernel_0(float *A, float * global___sum_add){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid ;
    int __rtid = threadIdx.x ;
    __shared__ int __sum_add[512] ;
    int sum = 0 ;
    __sum_add[__rtid] = 0;
    if( __tid < __N ){
        …
        sum += A[i];
        __sum_add[__rtid] = sum;
        __syncthreads();
        if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
        __syncthreads();
        if(__rtid < 128) __sum_add[__rtid] += __sum_add[__rtid + 128];
        __syncthreads();
        if(__rtid < 64) __sum_add[__rtid] += __sum_add[__rtid + 64];
        __syncthreads();
        if(__rtid < 32) __sum_add[__rtid] += __sum_add[__rtid + 32];
        __syncthreads();
        if(__rtid < 16) __sum_add[__rtid] += __sum_add[__rtid + 16];
        if(__rtid < 8) __sum_add[__rtid] += __sum_add[__rtid + 8];
        if(__rtid < 4) __sum_add[__rtid] += __sum_add[__rtid + 4];
        if(__rtid < 2) __sum_add[__rtid] += __sum_add[__rtid + 2];
        if(__rtid < 1) __sum_add[__rtid] += __sum_add[__rtid + 1];
    }
    if(__rtid == 0)
        atomicAdd(global___sum_add, __sum_add[0]);
}

The generated code is complex because it uses an optimized parallel reduction implementation
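The halving pattern in the generated kernel is easier to follow as a loop. The sketch below is a sequential CPU simulation of the in-block tree reduction, not the code Griffon emits: the array `partial` stands in for the shared-memory array, the inner loop stands in for the threads of one block, and the function name `block_tree_sum` is hypothetical.

```c
#include <assert.h>

#define BLOCK_SIZE 512  /* matches the size of the shared array on the slide */

/* Sequentially simulates the in-block tree reduction. On return,
 * partial[0] holds the sum of all BLOCK_SIZE entries. */
int block_tree_sum(int partial[BLOCK_SIZE])
{
    int stride, rtid;
    assert(partial != 0);
    for (stride = BLOCK_SIZE / 2; stride >= 1; stride /= 2) {
        /* On the GPU, every thread with rtid < stride performs this add in
         * parallel, and the threads synchronize before the next stride.
         * Reads touch indices >= stride and writes touch indices < stride,
         * so the sequential order here gives the same result. */
        for (rtid = 0; rtid < stride; rtid++)
            partial[rtid] += partial[rtid + stride];
    }
    return partial[0];
}
```

Each pass halves the number of active entries, so 512 partial sums are combined in 9 passes instead of 511 sequential additions per block.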

Page 52: Griffon Topic2 Presentation (Tia)

Replace math functions & GPU functions

Griffon source:

int f1(int a){
    return ++a;
}
int f0(int a){
    return f1(a) + 5;
}

#pragma gfn parallel for
for(i=0;i<N;i++){
    A[i] = f0(A[i]) + sin(B[i]);
}

Generated code:

__device__ int __device_f1(int a){
    return ++a;
}
__device__ int __device_f0(int a){
    return __device_f1(a) + 5;
}

__global__ void __kernel_1(int *A, int *B, int N){
    …
    A[i] = __device_f0(A[i]) + __sinf(B[i]);
}

Page 53: Griffon Topic2 Presentation (Tia)

Barrier and Atomic

Kernel body with Griffon directives:

__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        #pragma gfn barrier
        A[i] = B[i];
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Generated code:

__global__ void __kernel_A(…){
    if(tid<__N){
        B[i] = A[i-1] + A[i] + A[i+1];
        __threadfence();
        A[i] = B[i];
        atomicAdd(&C[i], x / 5);
    }
}

Page 54: Griffon Topic2 Presentation (Tia)

Kernel call and data transfer sort

Details in the optimization section

__kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
__kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00,512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
cudaMemcpy(&sum,global___sum_add,sizeof(int), cudaMemcpyDeviceToHost );
cudaMemcpy(A,__d_A,sizeof(int) * N, cudaMemcpyDeviceToHost );
cudaMemcpy(D,__d_D,sizeof(int) * N, cudaMemcpyDeviceToHost );

Page 55: Griffon Topic2 Presentation (Tia)

Automatic cache with shared memory

Details in the optimization section

Griffon source:

#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

Generated code:

__global__ void __kernel_0 (int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid * 1 + 1 ;
    __shared__ int sa[514] ;
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}

Page 56: Griffon Topic2 Presentation (Tia)

Optimization Techniques

• Maximum thread on GPU
• Reduce data transfer with analysis control flow
• Reduce data transfer with kernel control flow
• Overlapping kernel and data transfer and asynchronous data transfer
• Automatic cache with shared memory

Page 57: Griffon Topic2 Presentation (Tia)

Reduce data transfer with analysis control flow

#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}

A and B are transferred from CPU to GPU
C is transferred from GPU to CPU
D is transferred in both directions

(Figure: the code above with each variable marked as a used variable or a defined variable.)
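The decision on this slide amounts to a small rule on two flags per array. The sketch below is one way to express it, under the assumption that "used" means the kernel reads a value produced on the host (a read that is not preceded by a write in the same iteration) and "defined" means the kernel writes the array. The enum and function names are hypothetical, not Griffon's internals.

```c
#include <assert.h>

/* Transfer directions a shared array may need around a kernel. */
enum transfer {
    TRANSFER_NONE           = 0,
    TRANSFER_HOST_TO_DEVICE = 1,  /* copy in before the kernel runs */
    TRANSFER_DEVICE_TO_HOST = 2,  /* copy out after the kernel runs */
    TRANSFER_BOTH           = 3
};

/* used:    1 if the kernel reads a host-produced value of the array
 * defined: 1 if the kernel writes the array */
enum transfer classify_transfer(int used, int defined)
{
    int result = TRANSFER_NONE;
    assert((used == 0 || used == 1) && (defined == 0 || defined == 1));
    if (used)
        result |= TRANSFER_HOST_TO_DEVICE;
    if (defined)
        result |= TRANSFER_DEVICE_TO_HOST;
    return (enum transfer)result;
}
```

In the example loop, A and B are only used, so they are copied to the GPU. C is written before it is read, so it counts as defined only and is copied back. D is read and then written, so it needs both copies.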

Page 58: Griffon Topic2 Presentation (Tia)

Reduce data transfer with kernel control flow

Memcpy host to device for variables that are used in the kernel
Memcpy device to host for variables that are defined in the kernel

Griffon source:

#pragma gfn parallel for
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}

Generated transfers and kernel call:

cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice );
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice );
Kernel <<< … , … >>> ( … )
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);

(Figure: kernel K1 with input transfers A and B and output transfer C.)

Page 59: Griffon Topic2 Presentation (Tia)

Reduce data transfer with kernel control flow

Use the graph defined by the kernelname and waitfor constructs

#pragma gfn parallel for \
    kernelname(k1)
for(i=0;i<N;i++){
    C[i] = A[i] + B[i];
}
#pragma gfn parallel for \
    kernelname(k2) waitfor(k1)
for(i=0;i<N;i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}

(Figure: K1 has input transfers A and B and output transfer C; K2 has input transfers A, C, and D and output transfers E and C.)

Page 60: Griffon Topic2 Presentation (Tia)

Reduce data transfer with kernel control flow

(Figure: the K1 and K2 transfer graph with transfer nodes A, B, C, D, and E.)

If there is a path from k1 to k2:
1. If an invar of k1 is the same as an invar of k2, delete the invar of k2
2. If an outvar of k1 is the same as an outvar of k2, delete the outvar of k1
3. If an outvar of k1 is the same as an invar of k2, delete the invar of k2
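The three rules can be tried out with bit sets. The sketch below is a minimal illustration, not Griffon's implementation: each array is one bit, each kernel has an invar set and an outvar set, and the hypothetical function `reduce_transfers` applies the rules to a pair of kernels that are assumed to be connected by a path from k1 to k2.

```c
#include <assert.h>

/* One bit per shared array; the letters follow the k1/k2 example. */
enum { VAR_A = 1, VAR_B = 2, VAR_C = 4, VAR_D = 8, VAR_E = 16 };

/* Transfer sets of one kernel: invars are copied host to device before it,
 * outvars are copied device to host after it. */
struct kernel_io {
    unsigned invars;
    unsigned outvars;
};

/* Applies the three rules, assuming a path k1 -> k2 exists in the
 * kernelname/waitfor graph. */
void reduce_transfers(struct kernel_io *k1, struct kernel_io *k2)
{
    unsigned k1_in, k1_out;
    assert(k1 != 0 && k2 != 0);
    k1_in = k1->invars;
    k1_out = k1->outvars;          /* remembered before rule 2 changes it */

    k2->invars  &= ~k1_in;         /* rule 1: already copied in for k1     */
    k1->outvars &= ~k2->outvars;   /* rule 2: k2 produces the final value  */
    k2->invars  &= ~k1_out;        /* rule 3: k1 left the value on the GPU */
}
```

For the example, k1 starts with invars {A, B} and outvar {C}, and k2 with invars {A, C, D} and outvars {C, E}. After the rules, k1 keeps only its two input copies, and k2 copies in only D and copies out C and E. Rule 2 assumes the host does not need C between the two kernels.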

Page 61: Griffon Topic2 Presentation (Tia)

Schedule Kernel and Memcpy for Maximum overlap

(Figure: the transfer-node graph, after reducing transfers, with kernels K1, K2, K3 and transfers A, B, C, D, E.)

How to schedule?

Page 62: Griffon Topic2 Presentation (Tia)

Schedule for synchronous function

(Figure: timeline in which kernels K1, K2, K3 and transfers A, B, C, D, E execute one after another.)

Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E)

Newer versions of the CUDA API provide asynchronous data transfer functions

Page 63: Griffon Topic2 Presentation (Tia)

Schedule Kernel and Memcpy for Maximum overlap

Memcpy and kernel execution can be overlapped

Maximum is 3-way overlap
    MemcpyHostToDevice
    Kernel
    MemcpyDeviceToHost

4-way overlap if CPU computation is included via the overlapcompute directive

(Figure: the graph nodes K1, K2, K3, A, B, C, D, E arranged in Level 1 to Level 4; within each level the nodes are numbered by stream, with at most three streams per level.)

Page 64: Griffon Topic2 Presentation (Tia)

Schedule Kernel and Memcpy for Maximum overlap

(Figure: the graph nodes K1, K2, K3, A, B, C, D, E arranged in Level 1 to Level 4; within each level the nodes are numbered by stream, with at most three streams per level.)

1. Set queue to empty
2. Until all nodes are deleted
    2.1. Set level = 1 and stream_num = 1
    2.2. Find a kernel node with incoming degree 0, delete the node and its links, create a transfer command with stream_num
        2.2.1. If found in 2.2, stream_num += 1
    2.3. Find a GPU to CPU node with incoming degree 0, delete the node and its links, create a transfer command with stream_num
        2.3.1. If found in 2.3, stream_num += 1
    2.4. Find a CPU to GPU node with incoming degree 0, delete the node and its links, create a transfer command with stream_num
        2.4.1. If found in 2.4, stream_num += 1
    2.5. If nothing is found in 2.2-2.4, find a kernel node with incoming degree 0 and create a transfer command for its CPU to GPU node
    2.6. Insert synchronous function
    2.7. Collect max stream_num
    2.8. level += 1
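The level-by-level idea can be sketched as a variant of topological sorting. The code below is a simplified illustration under stated assumptions, not Griffon's scheduler: it picks at most one ready node of each type per level (kernel, device-to-host copy, host-to-device copy, in that order), numbers the picks as streams, and omits the fallback step for the case where nothing is found. The `struct graph` layout and the function names are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 16

enum node_type { KERNEL = 0, DEVICE_TO_HOST = 1, HOST_TO_DEVICE = 2 };

struct graph {
    int n;                           /* number of nodes */
    enum node_type type[MAX_NODES];
    int edge[MAX_NODES][MAX_NODES];  /* edge[u][v] != 0: u must finish before v */
    int level[MAX_NODES];            /* output: 1-based level, 0 = unscheduled */
    int stream[MAX_NODES];           /* output: 1-based stream within the level */
};

/* In-degree of node v, counting only predecessors not yet scheduled. */
static int pending_inputs(const struct graph *g, int v)
{
    int u, count = 0;
    for (u = 0; u < g->n; u++)
        if (g->edge[u][v] && g->level[u] == 0)
            count++;
    return count;
}

/* Assigns a level and a stream to every node; returns the number of levels.
 * The graph must be acyclic. */
int schedule(struct graph *g)
{
    int remaining, level = 0, v, t;
    assert(g != 0 && g->n >= 0 && g->n <= MAX_NODES);
    remaining = g->n;
    for (v = 0; v < g->n; v++)
        g->level[v] = 0;

    while (remaining > 0) {
        int picked[3] = { -1, -1, -1 };
        int stream_num = 1;
        level++;
        /* Choose all candidates first, so that a node picked in this level
         * cannot unlock another node of the same level. */
        for (t = 0; t < 3; t++)
            for (v = 0; v < g->n && picked[t] < 0; v++)
                if (g->level[v] == 0 && (int)g->type[v] == t &&
                    pending_inputs(g, v) == 0)
                    picked[t] = v;
        /* A cycle would leave nothing to pick; stop instead of looping. */
        assert(picked[0] >= 0 || picked[1] >= 0 || picked[2] >= 0);
        for (t = 0; t < 3; t++) {
            if (picked[t] < 0)
                continue;
            g->level[picked[t]] = level;
            g->stream[picked[t]] = stream_num++;
            remaining--;
        }
        /* A synchronization call would be emitted here, after each level. */
    }
    return level;
}
```

Nodes that end up in the same level are the ones whose copy and kernel commands could be issued on different streams and overlapped.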

Page 65: Griffon Topic2 Presentation (Tia)

Automatic cache with shared memory

When a “linear access” pattern is detected in a kernel, automatic caching is applied

#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

(Figure: thread blocks 1 to n each copy their part of global memory into their own shared memory.)

Page 66: Griffon Topic2 Presentation (Tia)

Automatic cache with shared memory

Griffon source:

#pragma gfn parallel for
for(i=1;i<(MAX-1);i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

Generated code:

__global__ void __kernel_0 (int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x ;
    int i = __tid * 1 + 1 ;
    __shared__ int sa[514] ;
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
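The tile indexing can be verified on the CPU. The sketch below sequentially simulates one thread block of the stencil: a local array `sa` plays the role of the shared-memory tile of 512 + 2 elements, a first loop stands in for the load phase and a second loop for the compute phase. The function name `stencil_block` is hypothetical, and the extra bounds checks on the loads are an addition of this sketch so that it never reads past the end of `A`.

```c
#include <assert.h>

#define BLOCK 512            /* threads per block, as in the generated kernel */
#define TILE  (BLOCK + 2)    /* block elements plus one halo element per side */

/* Simulates block number `block` of the stencil kernel on arrays of length n. */
void stencil_block(const int *A, int *B, int n, int block)
{
    int sa[TILE];
    int t, i;
    assert(A != 0 && B != 0 && n >= 3 && block >= 0);

    /* Load phase: thread t loads A[i - 1]; the first two threads also load
     * the two extra elements at the end of the tile. */
    for (t = 0; t < BLOCK; t++) {
        i = block * BLOCK + t + 1;           /* loop index, starts at 1 */
        if (i - 1 < n)
            sa[t] = A[i - 1];
        if (t + BLOCK < TILE && i + BLOCK - 1 < n)
            sa[t + BLOCK] = A[i + BLOCK - 1];
    }
    /* On the GPU, __syncthreads() separates the two phases. */

    /* Compute phase: each thread reads its three neighbours from the tile. */
    for (t = 0; t < BLOCK; t++) {
        i = block * BLOCK + t + 1;
        if (i < n - 1)
            B[i] = sa[t] + sa[t + 1] + sa[t + 2];
    }
}
```

Each element of `A` is fetched from global memory roughly once per block instead of three times, which is the saving the shared-memory cache aims for.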

Page 67: Griffon Topic2 Presentation (Tia)

DEMO

Page 68: Griffon Topic2 Presentation (Tia)

Evaluation

• Compiler Directives
• Compiler Performance

Page 69: Griffon Topic2 Presentation (Tia)

Compiler Directives

(Bar chart: time in minutes, y-axis from 0 to 30, for Program 1, Program 2, and Program 3; series: Griffon and CUDA.)

5 undergraduate students who have studied the concepts of CUDA

Only 1.5 hours of demonstration

Page 70: Griffon Topic2 Presentation (Tia)

Compiler Directives

(Bar chart: lines of code, y-axis from 0 to 120, per application; series: Sequential, CUDA, and Griffon.)

PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element

Page 71: Griffon Topic2 Presentation (Tia)

Compiler Performance

(Bar chart: speed up, y-axis from 0 to 25, per application; series: Sequential and Parallel (Griffon), with the expected speed up marked.)

PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element

Page 72: Griffon Topic2 Presentation (Tia)

Conclusion

Page 73: Griffon Topic2 Presentation (Tia)

Griffon Instruction

Total number of instructions (directives + clauses): 9

The remaining problem is the performance of parallel programs with a high degree of communication

Improve the directives for describing the algorithm in a program (divide and conquer, partial summation, etc.)

New optimization techniques such as caching with shared memory and choosing an appropriate number of threads

Page 74: Griffon Topic2 Presentation (Tia)

Performance factor and speed up

Application                                     Parallelism  Data Transfer  Computation Density  Speed Up
Calculation of Pi Using Numerical Integration   High         Very Low       Low                  1.76
Calculation of Pi Using the Monte Carlo Method  High         Average        High                 7.36
Trapezoidal Rule                                High         Very Low       High                 19.28
Vector Normalization                            High         High           Low                  1.21
Calculate Sine of Vector’s Element              Very High    High           High                 3.78

Computation density has the greatest effect on performance

Page 75: Griffon Topic2 Presentation (Tia)

Building S2S Compiler

Source-to-source compilers aren’t popular

A compiler that transforms Griffon code directly into GPU object code (PTX): although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.

Page 76: Griffon Topic2 Presentation (Tia)

Future Work

Optimization Techniques
• Data structure
• Loop transformation

Directives
• More support for OpenMP
• CPU/GPU parallel region
• Support OpenCL

Compiler
• Support C++ and other languages
• Support popular IDEs

Page 77: Griffon Topic2 Presentation (Tia)

Reference

Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes, Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphic Hardware, http://gpgpu.org
Ilias Leontiadis, George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic. PPoPP ’09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han, Tarek S. Abdelrahman. hiCUDA: A High-level Directive-based Language for GPU Programming. GPGPU '09
Wolfe, M. (1996). High Performance Compilers for Parallel Computing. Addison-Wesley