HiPC 2010
AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS
Wenjing Ma, Gagan Agrawal
The Ohio State University
GPGPU
- General-purpose programming on GPUs (accelerators)
- High performance/price ratio
- High-level language support: CUDA
- Performance vs. productivity: hard to program; memory hierarchy to manage...
- Automatic code generation
- Device memory access is expensive; remedies:
  - Using shared memory
  - Using texture and constant memory
  - Coalescing device memory accesses...

Get High Performance from GPU
And Make the Programming Simple!
FEATURES OF SHARED MEMORY
- Small and fast, like a cache
- 16KB on each multiprocessor (no more than 48KB even on the latest GPUs)
- Read-write
- Software controlled: __shared__ float data[n][n];
- Allocating shared memory: similar to register allocation
Problem Formulation for Shared Memory Arrangement
- Consider variables and basic blocks in a function
  - A variable can be an element of an array, a whole array, or a section of an array
- Each variable can have several live ranges in the function
  - Access feature of a live range: read, write, read-write, or temp
- Determine in which basic block each variable is allocated to shared memory
  - assign_point[i][k]: variable i, basic block k
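The assign_point decisions above can be sketched as a flat 0-1 vector, which is the form an ILP solver consumes. The names below (variables, blocks, index, decode) are illustrative, not from the paper:

```python
# Hypothetical instance: three variables (array sections) and two basic blocks.
variables = ["A", "B", "C"]
blocks = [0, 1]

# assign_point[i][k] = 1 means variable i is placed in shared memory at
# basic block k; flatten (i, k) pairs into positions of one 0-1 vector x.
index = {(i, k): i * len(blocks) + k
         for i in range(len(variables)) for k in range(len(blocks))}

def decode(x):
    """Recover the assign_point matrix from a flat 0-1 solution vector."""
    return [[x[index[(i, k)]] for k in range(len(blocks))]
            for i in range(len(variables))]
```

For instance, the solution shown later in the deck (assign_point[0][1] = assign_point[1][0] = assign_point[2][0] = 1) corresponds to the vector [0, 1, 1, 0, 1, 0].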
Integer Programming Problem
- Integer Linear Programming
  - Objective function: maximize z = C^T x
  - Constraints
  - Solution: values of x
- Special case of linear programming: all the unknown variables are integers (0-1 in our case)
- Solvable for problems of reasonable size
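A minimal sketch of such a 0-1 program: since the instances are small, brute-force enumeration over all 2^n assignments already works. The instance data here is a toy example, not from the paper:

```python
from itertools import product

def solve_01_ilp(c, constraints):
    """Maximize z = c^T x over x in {0,1}^n, subject to rows a^T x <= b.

    Brute force over every 0-1 assignment; adequate for the small
    problem sizes the slides describe as solvable.
    """
    n = len(c)
    best_x, best_z = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if all(sum(a_j * x_j for a_j, x_j in zip(a, x)) <= b
               for a, b in constraints):
            z = sum(c_j * x_j for c_j, x_j in zip(c, x))
            if z > best_z:
                best_x, best_z = x, z
    return best_x, best_z

# Toy instance: maximize 5x0 + 4x1 + 3x2 with 4x0 + 3x1 + 2x2 <= 6.
x, z = solve_01_ilp(c=[5, 4, 3], constraints=[([4, 3, 2], 6)])
```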
Integer Programming for Shared Memory Arrangement
Objective Function:
- Maximize shared memory usage
- Minimize data transfer between memory hierarchies
Integer Programming for Shared Memory Arrangement (cnt’d)
Objective Function
An Example to Show size_alloc
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
Integer Programming for Shared Memory Arrangement (cnt’d)
Constraints:
- Total allocation does not exceed the shared memory limit at any time
- At most one assign_point is 1 in each live range
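These two constraint families can be sketched as a feasibility check. The live_range encoding and the simplified capacity rule (every assigned variable counts at each block its live range covers) are assumptions for illustration; the paper's model is finer-grained:

```python
SHARED_MEM_LIMIT = 16 * 1024  # bytes per multiprocessor, per the earlier slide

def feasible(assign_point, size, live_range, n_blocks):
    """assign_point[i][k]: 0/1 placement decisions;
    size[i]: bytes variable i needs in shared memory;
    live_range[i]: (first_block, last_block) of variable i's live range."""
    # At most one assign_point is 1 in each live range.
    if any(sum(row) > 1 for row in assign_point):
        return False
    # Total allocation stays within the limit at every basic block:
    # count each assigned variable whose live range covers block k.
    for k in range(n_blocks):
        used = sum(size[i] for i, row in enumerate(assign_point)
                   if sum(row) == 1
                   and live_range[i][0] <= k <= live_range[i][1])
        if used > SHARED_MEM_LIMIT:
            return False
    return True
```

Two overlapping 10KB variables can therefore not both be assigned, while either one alone is fine.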
Integer Programming for Shared Memory Arrangement (cnt’d)
Obtaining parameters:
- Using the LLVM compiler framework
- Pass 1: get access features (read, write, read-write, temp)
- Pass 2: get live ranges, loop information, indices, and all other parameters
Code Generation
- According to the shared memory arrangement obtained from the integer programming model
- Built on the framework from previous work
- Move data to cover the gap caused by data evicted from shared memory
An Example
Input:
    A: n*r, B: m*r, C: r
    n = 2048, m = 3, r = 3, NUM_THREADS = 256

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

Output of the integer programming solver:
    assign_point[0][1] = 1;
    assign_point[1][0] = 1;
    assign_point[2][0] = 1;
    /* all other elements of assign_point are 0 */
An Example (cnt’d)
Generated Code:

__shared__ float s_B[m][r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
for (int i = 0; i < m*r; i++)
    ((float *)s_B)[i] = B[i];   /* flattened copy of the whole B array */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid*r+j] = A[tid+i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
    ......
}
/* Synchronize and combine the per-thread copies of C */
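As a quick sanity check on this arrangement, the three shared arrays fit comfortably within the 16KB limit (assuming 4-byte floats and the sizes from the example above):

```python
# Sizes from the example slide; floats are 4 bytes.
m, r, NUM_THREADS = 3, 3, 256
FLOAT_BYTES = 4

s_B = m * r * FLOAT_BYTES            # whole B array: 36 bytes
s_A = r * NUM_THREADS * FLOAT_BYTES  # r elements per thread: 3072 bytes
s_C = r * NUM_THREADS * FLOAT_BYTES  # per-thread copy of C: 3072 bytes

total = s_A + s_B + s_C              # 6180 bytes, well under 16KB
```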
Suggesting Loop Transformation

Before:
    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

After:
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
Experiments
- Effectiveness of using shared memory
  - Compared with the intuitive approach in previous work
  - Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory up to the shared memory limit
- Effectiveness of the loop transformation suggested by the integer programming model
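The greedy baseline described above can be sketched as follows (a minimal reconstruction; the function name and representation are mine, not from the paper):

```python
SHARED_MEM_LIMIT = 16 * 1024  # bytes per multiprocessor

def greedy_allocate(sizes):
    """Sort variables in increasing order of size and allocate them to
    shared memory until the limit is reached; returns chosen indices."""
    chosen, used = [], 0
    for i in sorted(range(len(sizes)), key=lambda i: sizes[i]):
        if used + sizes[i] <= SHARED_MEM_LIMIT:
            chosen.append(i)
            used += sizes[i]
    return sorted(chosen)

# With variables of 12KB, 2KB and 8KB, greedy keeps the 2KB and 8KB ones,
# even though 12KB + 2KB would use shared memory more fully -- the kind of
# suboptimality the integer programming model avoids.
```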
Experiment Results

[Figure: execution times of K-means (left) and EM (right). X-axis: Configuration (threads_per_block * blocks), 256*4 through 256*256; Y-axis: Time (seconds). Series: "no shared memory", "basic", "Int-solved" for K-means; "basic", "Int-solved" for EM.]
Experiment Results (cnt’d)

[Figure: execution times of PCA (left) and Co-clustering (right). X-axis: Configuration (threads_per_block * blocks), 128*1 through 128*64; Y-axis: Time (seconds). Series: "no shared memory", "Int-solved" for PCA; "basic", "Int-solved" for Co-clustering.]
Effect of Loop Transformation

[Figure: execution times of PCA and Co-clustering with and without the transformation. X-axis: Configuration (threads_per_block * blocks), 128*1 through 128*64; Y-axis: Time (seconds). Series: "non-transformed", "transformed".]
Conclusion and Future Work
- Proposed an integer programming model for shared memory arrangement on GPUs
- Considers numeric variables, arrays, and sections of arrays
- Suggested loop transformations for further optimization
- Obtained better results than the intuitive method
- Future work: automate the code generation and loop transformation selection
THANK YOU!
Questions?