
__global__ void proc(float *arr, float *brr) { float v; __shared__ float shared[L]; shared[threadIdx.x] = brr[threadIdx.x]; __syncthreads();

if (threadIdx.x != 0) { v = arr[threadIdx.x]; v += shared[threadIdx.x]; … } else { … } … }

Modularity for HPC -WootinJ-

GPGPU in HPC
Many supercomputers support GPGPU: TSUBAME2, Dawning Nebulae, …

Many non-functional concerns:
Optimization
Hardware-aware
Fail-safe

Masayuki Ioki, Shumpei Hozumi, Shigeru Chiba

Tokyo Institute of Technology

WootinJ
A runtime converter from Java to CUDA: it generates CUDA code using runtime context.

WootinJ removes some OOP overheads:
Devirtualization
Flattening the structure of an object to remove field-access chains

Motivating example
A GPU has several types of memory:
Global memory: large but slow
Shared memory: fast but small

__global__ void proc(float *arr, float *brr) { float v; if (threadIdx.x != 0) { v = arr[threadIdx.x]; v += brr[threadIdx.x]; … } else { … } … }
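As a reference point, a compilable version of this non-shared baseline; the branch bodies and the hypothetical out array stand in for the work the slide elides:

// Baseline: every access goes through large-but-slow global memory.
__global__ void proc(const float *arr, const float *brr, float *out) {
    float v;
    if (threadIdx.x != 0) {
        v = arr[threadIdx.x];     // global-memory load
        v += brr[threadIdx.x];    // second global-memory load
        out[threadIdx.x] = v;     // stand-in for the elided work
    } else {
        out[0] = arr[0] + brr[0]; // stand-in for the elided thread-0 path
    }
}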

(Figure: GPU memory hierarchy. A GPU consists of streaming multiprocessors (SMs); each SM contains streaming processors (SPs) and its own shared memory, and all SMs share the global memory.)

Non-shared-memory version

__global__ void proc(float *arr, float *brr) { float v; __shared__ float shared[L]; shared[threadIdx.x] = arr[threadIdx.x]; __syncthreads();

if (threadIdx.x != 0) { v = shared[threadIdx.x]; v += brr[threadIdx.x]; … } else { … } … }

__global__ void proc(float *arr, float *brr) { float v; __shared__ float shared_a[L]; __shared__ float shared_b[L]; shared_a[threadIdx.x] = arr[threadIdx.x]; shared_b[threadIdx.x] = brr[threadIdx.x]; __syncthreads();

if (threadIdx.x != 0) { v = shared_a[threadIdx.x]; v += shared_b[threadIdx.x]; } else { … } … }

Variants: brr -> shared memory; arr -> shared memory; both -> shared memory
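For concreteness, a minimal self-contained sketch of the "both -> shared memory" variant, assuming L = 256, a single thread block, and a hypothetical output array out standing in for the work the slides elide:

#include <cstdio>

#define L 256  // block size; assumed for this sketch

// "both -> shared memory": stage both inputs into shared memory once,
// then let every thread read the fast on-chip copies.
__global__ void proc(const float *arr, const float *brr, float *out) {
    __shared__ float shared_a[L];
    __shared__ float shared_b[L];
    shared_a[threadIdx.x] = arr[threadIdx.x];
    shared_b[threadIdx.x] = brr[threadIdx.x];
    __syncthreads();  // all staging stores must finish before any thread reads

    float v = shared_a[threadIdx.x] + shared_b[threadIdx.x];
    out[threadIdx.x] = v;  // stand-in for the elided per-thread work
}

int main() {
    float h_a[L], h_b[L], h_out[L];
    for (int i = 0; i < L; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, L * sizeof(float));
    cudaMalloc(&d_b, L * sizeof(float));
    cudaMalloc(&d_out, L * sizeof(float));
    cudaMemcpy(d_a, h_a, L * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, L * sizeof(float), cudaMemcpyHostToDevice);

    proc<<<1, L>>>(d_a, d_b, d_out);  // one block of L threads
    cudaMemcpy(h_out, d_out, L * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[1] = %f\n", h_out[1]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}

Note that staging pays off only when each element is reused by several threads; a single read per element, as in this toy kernel, gains nothing over the non-shared version.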

HPC programmers hate OOP.
OOP has rich modularity, but it also has many overheads:
Dynamic method dispatch
Field-access chains

class Calc {
    Memory memA, memB, …;
    @global void proc(float[] arr, float[] brr) { … }
}

Calc calc = new Calc();
float[] arr = …, brr = …;  // 256-element input arrays
Dim3s dim3s = new Dim3s();
dim3s.threadDim = new Dim3(256);
CUDAKicker.run(dim3s, calc, "proc", arr, brr);

Pipeline: Java bytecode -> Java AST -> CUDA code -> run on GPUs

WootinJ Sample Code

Memory memA = new SimpleSharedMem(256);
Memory memB = new Memory();

memA.set(arr, threadIdx.x);
memB.set(brr, threadIdx.x);

void set_memA(float[] arr, int i) { /* SimpleSharedMem method body */ }
void set_memB(float[] arr, int i) { /* Memory method body */ }
…
set_memA(arr, threadIdx.x);
set_memB(brr, threadIdx.x);

Devirtualization
Dynamic method dispatch becomes static dispatch: WootinJ finds the actual types of all the given objects.
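A compilable sketch of what the devirtualized calls might look like on the CUDA side. The set_memA/set_memB names follow the slide above; the bodies, the mem output array, and the fixed block size are assumptions:

// One concrete device function per resolved receiver type; because the calls
// below are direct, the CUDA compiler can inline them.
__device__ void set_memA(float *shared, const float *arr, int i) {
    shared[i] = arr[i];   // assumed SimpleSharedMem.set: stage into shared memory
}
__device__ void set_memB(float *mem, const float *brr, int i) {
    mem[i] = brr[i];      // assumed Memory.set: plain global-memory store
}

__global__ void proc(const float *arr, const float *brr, float *mem) {
    __shared__ float shared[256];  // assumed block size
    // Devirtualized call sites: the receiver types were resolved at
    // conversion time, so no dispatch table is needed on the GPU.
    set_memA(shared, arr, (int)threadIdx.x);
    set_memB(mem, brr, (int)threadIdx.x);
    __syncthreads();
    mem[threadIdx.x] += shared[threadIdx.x];  // stand-in for the elided work
}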

Micro benchmark: matrix product

WootinJ's overhead is about 2 seconds: JVM start-up plus CUDA code generation and compilation.

TSUBAME2 supercomputer
CPUs: 2 × Intel Xeon 2.93 GHz
GPUs: 3 × NVIDIA Tesla M2050
Memory: 54 GB

(Chart: performance in GFLOPS vs. number of GPUs.)

Compile-time check for devirtualization
Inside a @global method:
Assignment expression (a = b): both types must be strict-final.
Method call (obj.m();): the return type of the method must be strict-final.

Strict-final:
1. Primitive types are strict-final.
2. A final class whose fields are all strict-final is strict-final.
3. An array whose element type is strict-final is strict-final.
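A strict-final type has a layout that is fixed at compile time, which is what lets WootinJ flatten an object into scalars instead of keeping references on the GPU. A hypothetical sketch of the effect in generated-CUDA style (Vec3, scale, and all names are invented for illustration):

// Suppose a strict-final Java class: final class Vec3 { final float x, y, z; }
// Without flattening, reading obj.pos.x on the GPU would mean chasing
// references; after flattening, the fields arrive as plain scalars.
__global__ void scale(float pos_x, float pos_y, float pos_z,
                      float factor, float *out) {
    if (threadIdx.x == 0) {
        out[0] = pos_x * factor;  // was obj.pos.x
        out[1] = pos_y * factor;  // was obj.pos.y
        out[2] = pos_z * factor;  // was obj.pos.z
    }
}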

@global is an annotation for CUDA functions.

My name is Wootin!
