PT-4057: Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?

Presentation by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13), November 11-13, 2013.

Automated CUDA-to-OpenCL Translation with CU2CL: What's Next?

Wu Feng and Mark Gardner

Virginia Tech

2013-11-12



Why OpenCL?

[Product images, not reproduced: NVIDIA GeForce GTX Titan, AMD hardware, Intel Xeon Phi, an AMD APU, Intel Core i7.]

Source code lasts longer than platforms


The Goal


To take advantage of OpenCL's portability... without sacrificing man-years of existing code.


CUDA and OpenCL APIs

CUDA Module    OpenCL Module
Thread         Contexts & Command Queues
Device         Platforms & Devices
Stream         Command Queues
Event          Events
Memory         Memory Objects
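To make the correspondence concrete, here is a minimal host-side sketch (not from the slides) of how a CUDA stream and its events might map onto an OpenCL command queue and cl_event objects. The surrounding handles (context, device, kernel, work-size arrays) are assumed to have been set up as on the initialization slides, and the profiling query is illustrative.

#include <CL/cl.h>

/* CUDA (runtime API):
 *   cudaStream_t s;          cudaStreamCreate(&s);
 *   cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop);
 * OpenCL sketch: a command queue plays the role of the stream, and the
 * cl_event returned by an enqueue call plays the role of the event. */
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);
cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       globalWorkSize, localWorkSize, 0, NULL, &ev);
clWaitForEvents(1, &ev);                      /* ~ cudaEventSynchronize() */

cl_ulong t_start, t_end;                      /* ~ cudaEventElapsedTime() */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);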


CUDA and OpenCL Data

CUDA                                              OpenCL
Vector types (e.g., float4)                       Host: cl_float4; Kernel: float4
dim3                                              size_t[3]
cudaStream_t                                      cl_command_queue
cudaEvent_t                                       cl_event
Device pointers (e.g., float* from cudaMalloc)    cl_mem created through clCreateBuffer
cudaChannelFormat                                 cl_image_format
textureReference                                  cl_mem created through clCreateImage
cudaDeviceProp                                    No direct equivalent
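As an illustration of the device-pointer row, a minimal sketch (not from the slides) contrasting a CUDA allocation and copy with their OpenCL counterparts; the context, queue, host array h_A, and length N are assumed to exist.

#include <CL/cl.h>

/* CUDA:
 *   float *d_A;
 *   cudaMalloc((void **)&d_A, N * sizeof(float));
 *   cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
 * OpenCL: the typed device pointer becomes an opaque cl_mem handle. */
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            N * sizeof(float), NULL, NULL);
clEnqueueWriteBuffer(queue, d_A, CL_TRUE, 0,
                     N * sizeof(float), h_A, 0, NULL, NULL);
/* ...pass d_A to a kernel via clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A)... */
clReleaseMemObject(d_A);                      /* ~ cudaFree(d_A) */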


CUDA and OpenCL Execution and Memory Models

[Diagram comparing the CUDA and OpenCL execution and memory models; not reproduced in this extraction.]


The Problem

CUDA Source Code → CU2CL → OpenCL Source Code

Manual translation: weeks to months
Automatic translation (CU2CL): seconds

(Comic: xkcd.com)

Forecast

• Observations about Translating

• Examples: CUDA and OpenCL constructs

• CU2CL Architecture

• Current State of CU2CL: Robustness and Performance

• Future Directions



Translation Is Easy ...

…when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)

• High-level language → low-level representation, e.g., C → LLVM

x * y + z →

%tmp = mul i32 %x, %y

%tmp2 = add i32 %tmp, %z

• Between languages, e.g., CUDA → OpenCL

__powf(x[threadIdx.x], y[threadIdx.y]) →

native_pow(x[get_local_id(0)], y[get_local_id(1)])


Translation is more difficult

…when there IS ambiguity (or lack of a direct mapping) in the translation between languages

• Idiomatic Expressions

– “Putting all your eggs in one basket” → ?? in Spanish

– CUDA __threadfence() → OpenCL ??

• Dialects

– Latin American Spanish vs. Castilian Spanish → English

– CUDA Runtime API vs. CUDA Driver API → OpenCL


CUDA and OpenCL


CUDA Initialization Code

None (implicit)

Dialect: CUDA runtime API
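For contrast with the explicit OpenCL setup on the next slide, a minimal sketch (not from the deck) of what "implicit" initialization means with the CUDA runtime API: the first runtime call silently creates a context on the default device. h_A and N are illustrative.

#include <cuda_runtime.h>

/* No explicit platform/context/queue setup is written by the programmer:
 * the first runtime call (here cudaMalloc) implicitly initializes a
 * context on device 0. */
float *d_A = NULL;
cudaMalloc((void **)&d_A, N * sizeof(float));
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);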


OpenCL Initialization Code

// get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
__cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
__cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);

// read kernel source from disk
FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
fseek(f, 0, SEEK_END);
size_t progLen = (size_t) ftell(f);
const char *progSrc = (const char *) malloc(sizeof(char) * progLen);
rewind(f);
fread((void *) progSrc, progLen, 1, f);
fclose(f);

// build device program and kernel
__cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
free((void *) progSrc);
clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
__cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);

Explicit


CUDA Kernel Invocation

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiHC / threads.y);

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
}


OpenCL Kernel Invocation

// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
    localWorkSize[0] = threads[0];
    localWorkSize[1] = threads[1];
    localWorkSize[2] = threads[2];
    globalWorkSize[0] = grid[0] * localWorkSize[0];
    globalWorkSize[1] = grid[1] * localWorkSize[1];
    globalWorkSize[2] = grid[2] * localWorkSize[2];
    clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL,
                           globalWorkSize, localWorkSize, 0, NULL, NULL);
}


Kernel Code for Vector Add

CUDA:

// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

OpenCL:

// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}


CU2CL Architecture



Compilation Process

Martinez, Gardner, and Feng, “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures,” IEEE ICPADS 2011

Source Code → [Preprocessor] → Preprocessed Code → [Lexer] → Tokenized Code → [Parser] → Parse Tree → [Semantic Analyzer / Code Generator] → Intermediate Representation → [LLVM] → Binary

(Clang provides the front-end stages; LLVM handles the back end.)


AST-driven, String-based Rewriting

CUDA:   __powf(x[threadIdx.x], y[threadIdx.y])
OpenCL: native_pow(x[get_local_id(0)], y[get_local_id(1)])

CU2CL parses the CUDA expression into an AST (function call → arguments → struct accesses → fields), rewrites the source text at each relevant node (threadIdx.x → get_local_id(0), threadIdx.y → get_local_id(1), __powf → native_pow), and then writes the modified source back out.

Advantage: formatting remains intact → maintainable
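A minimal, illustrative sketch (not CU2CL's actual source) of this AST-driven, string-based rewriting style using Clang's Rewriter; the visitor class name, the header paths, and the single threadIdx.x case handled here are assumptions made for the example.

#include <string>
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Rewrite/Core/Rewriter.h"

// Walk the AST; wherever the source text of a member expression reads
// "threadIdx.x", replace exactly that text with "get_local_id(0)".
// Only the matched characters change, so surrounding formatting and
// comments are preserved (the "maintainable" advantage noted above).
class ThreadIdxRewriter : public clang::RecursiveASTVisitor<ThreadIdxRewriter> {
public:
  explicit ThreadIdxRewriter(clang::Rewriter &R) : Rewrite(R) {}

  bool VisitMemberExpr(clang::MemberExpr *E) {
    std::string Text = Rewrite.getRewrittenText(E->getSourceRange());
    if (Text == "threadIdx.x")
      Rewrite.ReplaceText(E->getSourceRange(), "get_local_id(0)");
    return true;  // continue traversing the rest of the tree
  }

private:
  clang::Rewriter &Rewrite;
};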


Complex Semantic Conversions

1. Literal Parameters to Kernels

– CUDA pass-by-value invocations vs. OpenCL pass-by-reference

CUDA Kernel Launch:

kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);

Naive OpenCL Translation (invalid: the address of a temporary expression or a literal cannot be taken):

clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);

Correct OpenCL Translation (materialize each non-lvalue argument into a temporary first):

clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);


Complex Semantic Conversions

2. Device Identification

– CUDA uses an int; OpenCL uses an opaque cl_device_id

– To change devices in CUDA, use cudaSetDevice(int id)

– To change devices in OpenCL, use...

– Implement our own handler to emulate and encapsulate:

// scan all devices
// save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);             // get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
// load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);
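A fuller, illustrative sketch of such a handler (assumptions: a pre-enumerated allDevices array, hypothetical global handles named in the style of CU2CL's generated boilerplate, and program/kernel rebuilding omitted):

#include <CL/cl.h>

extern cl_device_id     allDevices[];        // hypothetical: filled by an earlier device scan
extern cl_device_id     __cu2cl_Device;
extern cl_context       __cu2cl_Context;
extern cl_command_queue __cu2cl_CommandQueue;

// Emulate cudaSetDevice(id): release the old context and queue, then
// rebuild them around the newly selected device. Programs and kernels
// would also need to be rebuilt against the new context (omitted).
void __cu2cl_setDevice(int id) {
    clReleaseCommandQueue(__cu2cl_CommandQueue);
    clReleaseContext(__cu2cl_Context);

    __cu2cl_Device  = allDevices[id];
    __cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
    __cu2cl_CommandQueue =
        clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, 0, NULL);
}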


CU2CL Evaluation


Test Code

• 79 CUDA SDK Samples

• 17 Rodinia Samples

• Applications

– GEM – Molecular Modeling

– IZ PS – Neural Network

– Fen Zi – Molecular Dynamics

• 100k+ SLOC in total


Translator Coverage

Application              CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated

SDK Samples:
asyncAPI                     135              5                      96.3
bandwidthTest                891              5                      98.9
BlackScholes                 347             14                      96.0
FastWalshTransform           327             30                      90.8
matrixMul                    351              9                      97.4
scalarProd                   251             18                      92.8
vectorAdd                    147              0                     100

Rodinia:
Back Propagation             313             24                      92.3
Breadth-First Search         306             35                      88.6
Gaussian                     390             26                      93.3
Hotspot                      328              2                      99.4
Needleman-Wunsch             430              3                      99.3

Applications:
Fen Zi                     17768           1786                      89.9
GEM                          524             15                      97.1
IZ PS                       8402            166                      98.0


Translation Challenges

Profiled:

Challenge                    CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                   54.4                     29.4
Literal Parameters                   19.0                     23.5
Separate Compilation                 54.4                     29.4
CUDA Libraries                       10.1                      0
Kernel Templates                     21.5                      0
Texture Memory                       27.8                     23.5
Graphics Interoperability            24.1                      0
Constant Memory                      17.7                     29.4
Shared Memory                        46.8                     70.6

Identified: kernel function pointer invocations, preprocessor effects, warp-level synchronization, device intrinsic functions, device buffer cl_mem type propagation, #defined function definitions, device buffers as struct members, arrays of device buffers, implicitly-defined kernel functions, device-side classes, constructors, & destructors, struct alignment attributes, __threadfence()

Sathre, Gardner, Feng: "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation." ICPP Workshops 2012: 89-96.
Gardner, Feng, Sathre, Martinez: "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator." ParCo Special Issue 2013, to appear.


Translator Performance

[Two scatter plots, not reproduced: CU2CL translation time (microseconds) and total translation time (seconds) versus source lines for SDK samples, Rodinia samples, and large applications, with linear fits for the SDK and Rodinia samples (R² values of 0.61 and 0.95 as labeled).]

Experimental setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04


Translated Application Performance

[Bar chart, not reproduced: runtime (s) of the original CUDA and the CU2CL-translated OpenCL versions of asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd, backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch (SDK and Rodinia samples), and GEM. Lower is better.]

Note: all runs on the same NVIDIA GPU for fair comparison purposes.


CU2CL Reliability

[Stacked bar chart, not reproduced: percentage of CUDA SDK and Rodinia samples whose translation failed, was partial, or was complete, before and after upgrades (Clang 3.2, main() method handling, template handling, OpenGL, #defined function handling, separately declared and defined function handling, kernel pointer invocation handling).]

Increased reliability in translating samples after the latest round of improvements.


CU2CL Roadmap & Future Work

CU2CL Alpha (2011): well-designed scaffold

CU2CL Beta (2013): improved robustness, CUDA coverage, and reliability; analysis and profiling of difficult-to-translate CUDA structures

CU2CL with Functional Portability: expand CUDA coverage (shared, constant, and texture memory; Driver API; OpenGL); handle unmapped CUDA structs/behaviors (e.g., warp sync)

CU2CL with Performance Portability: automatic de-optimization; device-agnostic optimization; device-specific optimization

What about CUDA to HSA?

Related Work

• Swan – High-level abstraction API; links to either an OpenCL or a CUDA implementation

• Ocelot & Caracal – Translate NVIDIA PTX IR to other device IRs

• CUDAtoOpenCL – Source-to-source translator, based on Cetus


CU2CL Conclusions

• Status
  – What used to take months by hand now takes seconds
    • 90%+ successful translation
    • Negligible difference in performance

• Challenges
  – CUDA functionality missing in OpenCL
    • __threadfence()
  – Equivalent libraries needed in OpenCL
    • cuFFT, MAGMA, cuBLAS
  – Implicit semantics
    • Implicit synchronization across warps

• What's Next?
  – Improved functional portability
  – Support for performance portability


Acknowledgements

Students: Gabriel Martinez, Paul Sathre

This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).