Auto-tuning a High-level Language Targeted to GPU Codes
Scott Grauer-Gray, Robert Searles, Lifan Xu, Sudheer Ayalasomayajula, John Cavazos
Dept. of Computer & Information Sciences, University of Delaware
Optimizing GPU Code
• Constantly “tweaking” GPU code
  – Lots of low-level details
• Resulting code is brittle
  – Optimizations are application- (and input-) and device-specific!
High-level Languages for GPUs
• High-level languages
  – Good productivity, but low performance
• Can a high-level language match the performance of manual code?
Solution: Autotuning + HLL + GPUs
• Goal of project: high-level languages with low-level performance
[Diagram: each HLL program is compiled into many optimized GPU programs; autotuning selects the best-optimized GPU program]
HMPP WORKBENCH
• High-level language for GPUs
• Similar to OpenMP, but for GPUs
  – Modify code through directives
• Generates CUDA/OpenCL kernels
HMPP WORKBENCH (cont’d)
• Initiative to make an open standard
  – OpenHMPP (also OpenACC)
• Commercial product available here:
  www.caps-entreprise.com/hmpp.html
HMPP WORKBENCH (cont’d)
• Directives also drive GPU optimizations
  – Permutation
  – Tiling/unrolling
  – Fusion/fission
• But there is no tool that helps the programmer decide which optimizations to use
  – Hard problem!
HMPP Unroll Pragma
• Unroll makes a copy of the loop body
  – Pragma specifies “contiguous” unroll with factor 2
PolyBench
• Collection of scientific kernels
  – Available at http://www.cse.ohio-state.edu/~pouchet/software/polybench/
  – Converted 14 of these kernels to CUDA, OpenCL, and HMPP
PolyBench (cont’d)
• Kernels converted to CUDA/OpenCL
  – Linear algebra: 2mm, 3mm, atax, bicg, gemm, gesummv, matmul, mvt, syr2k, syrk
  – Linear algebra solvers: gramschmidt
  – Data mining: correlation, covariance
  – Stencils: fdtd-2d
Optimization Search Space

Pragma     Description                    Parameter Values
Permute    Re-orders loops in loop nest   Depends on kernel; different orderings of loops
Unroll     Unrolls loop at given factor   Unroll factors 1 through 8
Tile       Tiles loop at given factor     Tiling factors 1 through 8
Blocksize  Thread block dimensions        Kept fixed for these experiments
Best Optimized GEMM Version
• Original permutation
• No unrolling/tiling on “i” and “j” loops
• Unrolling with “contiguous” option on innermost “k” loop
PolyBench Experiments
• Experiments performed on C2050 GPU (Fermi)
  – 448 CUDA cores
• Autotuned HMPP versions of PolyBench
  – Generated optimized versions of CUDA and OpenCL
• Compared against hand-coded CUDA and OpenCL
Number of Optimized Versions

Program       Optimized Versions
2mm           97
3mm           118
atax          67
bicg          161
correlation   153
covariance    448
fdtd          141
gemm          168
gesummv       631
matmul        337
gramschmidt   727
mvt           108
syr2k         97
syrk          281
Autotuning HMPP / Manual CUDA
[Bar chart: per-benchmark speedup over default HMPP CUDA for best-optimized HMPP CUDA vs. manual CUDA; y-axis 0 to 2.5, with off-scale bars labeled 32.8, 19.1, 2.51, and 3.04]
Autotuning HMPP / Manual CUDA (cont’d)
[Same chart, highlighting that autotuning benefits 6 HMPP programs]
Autotuning HMPP / Manual CUDA (cont’d)
[Same chart, highlighting cases where manual CUDA is better than the best-optimized HMPP]
Autotuning HMPP / Manual CUDA (cont’d)
[Same chart, highlighting that autotuning did not help in some cases]
Autotuning HMPP / Manual OpenCL
[Bar chart: per-benchmark speedup over default HMPP OpenCL for best-optimized HMPP OpenCL vs. manual OpenCL; y-axis 0 to 2.5, with off-scale bars labeled 37.9, 46.4, 5.51, 5.51, 3.94, 2.82, and 2.78. Autotuning benefits 4 HMPP programs targeted to OpenCL]
Autotuning HMPP / Manual OpenCL (cont’d)
[Same chart, highlighting that 7 manual codes performed better than the best-optimized HMPP]
Autotuning HMPP / Manual OpenCL (cont’d)
[Same chart, highlighting that 6 manual OpenCL programs performed poorly]
Summary Results

Geometric-mean speedup over default HMPP:
           Best HMPP   Manual
CUDA       1.46        0.70
OpenCL     1.42        1.43

On average, the best autotuned versions meet or exceed manual performance!
Best Optimizations Found
• MVT: CUDA tiles all four loops using factor 2; OpenCL tiles the first and third loops using factor 2
• 3MM: CUDA unrolls the 3rd, 6th, and 9th loops using the “split” option with factor 3; OpenCL unrolls the same loops using the “contiguous” option with factor 6
• GEMM: both CUDA and OpenCL unroll the innermost loop using the “contiguous” option with factor 64
• SYRK: both CUDA and OpenCL unroll the 3rd loop using the “split” option with factor 2
Predictive Modeling
• Autotuning can be expensive
• A model can predict the best optimization to apply
  – Input: program characterization + optimization sequence
  – Output: predicted performance
Optimizing Belief Propagation
[Bar chart: speedup over CPU, 0 to 20, for the CUDA and OpenCL implementations, comparing default HMPP, optimized HMPP, and manual CUDA]
Conclusions
• Achieved low-level performance using a high-level language (HMPP)
  – Autotuned HMPP is comparable to hand-optimized GPU code
  – Best optimizations differ between CUDA and OpenCL
Stereo Vision
• Two cameras take pictures
  – Separated by a distance
• Algorithm compares the two images by shifting one over the other
  – The shift amount is the disparity
HMPP Belief Propagation
• Iterative algorithm
• Applied to stereo vision
  – Input: stereo set of two images
  – Output: disparity map between images
[Figure: image from the Tsukuba stereo set and its ground-truth disparity map]
BP Message-Passing Function
• Disparity values computed for each pixel and passed to neighbors
  – Runtime dominated by this “message-passing” step
[Diagram: each pixel sends messages mu, md, ml, mr to its up, down, left, and right neighbors, and receives the corresponding messages from them]