HiPC 2010
AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS
Wenjing Ma, Gagan Agrawal
The Ohio State University
GPGPU
- General-purpose programming on GPUs (accelerators)
- High performance/price ratio
- High-level language support: CUDA
- Performance vs. productivity: hard to program; memory hierarchy to manage...
- Automatic code generation
- Device memory access is expensive; remedies:
  - Using shared memory
  - Using texture and constant memory
  - Coalescing device memory accesses...

Get High Performance from GPU
And Make the Programming Simple!
FEATURES OF SHARED MEMORY
- Small and fast, like a cache
- 16KB on each multiprocessor (no more than 48KB even on the latest GPUs)
- Read-write
- Software controlled: __shared__ float data[n][n];
- Allocating shared memory: similar to register allocation
Problem Formulation for Shared Memory Arrangement
- Consider variables and basic blocks in a function
  - A variable can be an element of an array, a whole array, or a section of an array
- Each variable can have several live ranges in the function
  - Access feature of a live range: read, write, read-write, or temp
- Determine in which basic block each variable is allocated to shared memory
  - assign_point[i][k]: variable i, basic block k
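The assign_point decisions above can be sketched as a flat 0-1 vector, which is the form an ILP solver consumes. The names below (variables, blocks, index, decode) are illustrative, not from the paper:

```python
# Hypothetical instance: three variables (array sections) and two basic blocks.
variables = ["A", "B", "C"]
blocks = [0, 1]

# assign_point[i][k] = 1 means variable i is placed in shared memory at
# basic block k; flatten (i, k) pairs into positions of one 0-1 vector x.
index = {(i, k): i * len(blocks) + k
         for i in range(len(variables)) for k in range(len(blocks))}

def decode(x):
    """Recover the assign_point matrix from a flat 0-1 solution vector."""
    return [[x[index[(i, k)]] for k in range(len(blocks))]
            for i in range(len(variables))]
```

For instance, the solution shown later in the deck (assign_point[0][1] = assign_point[1][0] = assign_point[2][0] = 1) corresponds to the vector [0, 1, 1, 0, 1, 0].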
Integer Programming Problem
- Integer Linear Programming
  - Objective function: maximize z = C^T x
  - Constraints
  - Solution: values of x
- Special case of linear programming: all the unknown variables are integers (0-1 in our case)
- Solvable for problems of reasonable size
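A minimal sketch of such a 0-1 program: since the instances are small, brute-force enumeration over all 2^n assignments already works. The instance data here is a toy example, not from the paper:

```python
from itertools import product

def solve_01_ilp(c, constraints):
    """Maximize z = c^T x over x in {0,1}^n, subject to rows a^T x <= b.

    Brute force over every 0-1 assignment; adequate for the small
    problem sizes the slides describe as solvable.
    """
    n = len(c)
    best_x, best_z = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if all(sum(a_j * x_j for a_j, x_j in zip(a, x)) <= b
               for a, b in constraints):
            z = sum(c_j * x_j for c_j, x_j in zip(c, x))
            if z > best_z:
                best_x, best_z = x, z
    return best_x, best_z

# Toy instance: maximize 5x0 + 4x1 + 3x2 with 4x0 + 3x1 + 2x2 <= 6.
x, z = solve_01_ilp(c=[5, 4, 3], constraints=[([4, 3, 2], 6)])
```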
Integer Programming for Shared Memory Arrangement
Objective Function:
- Maximize shared memory usage
- Minimize data transfer between memory hierarchies
Integer Programming for Shared Memory Arrangement (cnt’d)
Objective Function
An Example to Show size_alloc
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
Integer Programming for Shared Memory Arrangement (cnt’d)
Constraints:
- Total allocation does not exceed the shared memory limit at any time
- At most one assign_point is 1 in each live range
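These two constraint families can be sketched as a feasibility check. The live_range encoding and the simplified capacity rule (every assigned variable counts at each block its live range covers) are assumptions for illustration; the paper's model is finer-grained:

```python
SHARED_MEM_LIMIT = 16 * 1024  # bytes per multiprocessor, per the earlier slide

def feasible(assign_point, size, live_range, n_blocks):
    """assign_point[i][k]: 0/1 placement decisions;
    size[i]: bytes variable i needs in shared memory;
    live_range[i]: (first_block, last_block) of variable i's live range."""
    # At most one assign_point is 1 in each live range.
    if any(sum(row) > 1 for row in assign_point):
        return False
    # Total allocation stays within the limit at every basic block:
    # count each assigned variable whose live range covers block k.
    for k in range(n_blocks):
        used = sum(size[i] for i, row in enumerate(assign_point)
                   if sum(row) == 1
                   and live_range[i][0] <= k <= live_range[i][1])
        if used > SHARED_MEM_LIMIT:
            return False
    return True
```

Two overlapping 10KB variables can therefore not both be assigned, while either one alone is fine.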
Integer Programming for Shared Memory Arrangement (cnt’d)
Obtaining parameters:
- Using the LLVM compiler framework
- Pass 1: get access features (read, write, read-write, temp)
- Pass 2: get live ranges, loop information, indices, and all other parameters
Code Generation
- According to the shared memory arrangement obtained from the integer programming model
- Built on the framework from previous work
- Move data to cover the gap caused by data evicted from shared memory
An Example
Input:
    A: n*r, B: m*r, C: r
    n = 2048, m = 3, r = 3, NUM_THREADS = 256

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Output of the integer programming solver:
    assign_point[0][1] = 1;
    assign_point[1][0] = 1;
    assign_point[2][0] = 1;
    /* all other elements of assign_point are 0 */
An Example (cnt’d)
Generated Code:

__shared__ float s_B[m][r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
for (int i = 0; i < m*r; i++)
    ((float *)s_B)[i] = B[i];   /* flattened copy of the whole B array */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid*r+j] = A[tid+i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
    ......
}
/* Synchronize and combine the per-thread copies of C */
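As a quick sanity check on this arrangement, the three shared arrays fit comfortably within the 16KB limit (assuming 4-byte floats and the sizes from the example above):

```python
# Sizes from the example slide; floats are 4 bytes.
m, r, NUM_THREADS = 3, 3, 256
FLOAT_BYTES = 4

s_B = m * r * FLOAT_BYTES            # whole B array: 36 bytes
s_A = r * NUM_THREADS * FLOAT_BYTES  # r elements per thread: 3072 bytes
s_C = r * NUM_THREADS * FLOAT_BYTES  # per-thread copy of C: 3072 bytes

total = s_A + s_B + s_C              # 6180 bytes, well under 16KB
```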
Suggesting Loop Transformation

Before:
    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

After:
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
Experiments
- Effectiveness of using shared memory
  - Compared with the intuitive approach in previous work
  - Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory up to the shared memory limit
- Effectiveness of the loop transformation suggested by the integer programming model
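The greedy baseline described above can be sketched as follows (a minimal reconstruction; the function name and representation are mine, not from the paper):

```python
SHARED_MEM_LIMIT = 16 * 1024  # bytes per multiprocessor

def greedy_allocate(sizes):
    """Sort variables in increasing order of size and allocate them to
    shared memory until the limit is reached; returns chosen indices."""
    chosen, used = [], 0
    for i in sorted(range(len(sizes)), key=lambda i: sizes[i]):
        if used + sizes[i] <= SHARED_MEM_LIMIT:
            chosen.append(i)
            used += sizes[i]
    return sorted(chosen)

# With variables of 12KB, 2KB and 8KB, greedy keeps the 2KB and 8KB ones,
# even though 12KB + 2KB would use shared memory more fully -- the kind of
# suboptimality the integer programming model avoids.
```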
Experiment Results

[Figure: execution times of K-means (left) and EM (right). X-axis: Configuration (threads_per_block * blocks), 256*4 through 256*256; Y-axis: Time (seconds). Series: "no shared memory", "basic", "Int-solved" for K-means; "basic", "Int-solved" for EM.]
Experiment Results (cnt’d)

[Figure: execution times of PCA (left) and Co-clustering (right). X-axis: Configuration (threads_per_block * blocks), 128*1 through 128*64; Y-axis: Time (seconds). Series: "no shared memory", "Int-solved" for PCA; "basic", "Int-solved" for Co-clustering.]
Effect of Loop Transformation

[Figure: execution times of PCA and Co-clustering with and without the transformation. X-axis: Configuration (threads_per_block * blocks), 128*1 through 128*64; Y-axis: Time (seconds). Series: "non-transformed", "transformed".]
Conclusion and Future Work
- Proposed an integer programming model for shared memory arrangement on GPUs
- Considers numeric variables, arrays, and sections of arrays
- Suggested loop transformations for further optimization
- Obtained better results than the intuitive method
- Future work: automate the code generation and loop transformation selection
THANK YOU!
Questions?