University of Michigan, Electrical Engineering and Computer Science
Adaptive Input-aware Compilation for Graphics Engines
Mehrzad Samadi (1), Amir Hormati (2), Mojtaba Mehrara (3), Janghaeng Lee (1) and Scott Mahlke (1)
(1) University of Michigan - Ann Arbor
(2) Microsoft Research
(3) NVIDIA Research
GPU Performance Gap
• High performance at low cost
• Peak performance is difficult to achieve
[Figure: theoretical GFLOPS (0 to 3500) of NVIDIA GPUs and CPUs from 2003 to 2013. Labeled GPUs: GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 590, GeForce GTX 680. The chart also carries an "In Practice" annotation.]
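TMV Performance on Various Inputs
[Figure: GFLOPS (0 to 20) of transposed matrix-vector multiplication (TMV) plotted against the matrix aspect ratio. The x-axis is labeled "Rectangular Matrix" at both ends and "Square Matrix" in the middle; regions of the curve are labeled "Low Utilization", "Efficient Execution", and "High Overhead".]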
GPU Execution Model
[Figure: Grid 1 mapped onto streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, and SM 7 are drawn). Each SM contains eight cores numbered 0 to 7, shared memory, and registers; a core executes a thread.]
Transposed Matrix Vector Multiplication (4 x 1M)
[Figure: eight SMs (SM 0 to SM 7), each with cores 0 to 7, registers, and shared memory. The figure shows four thread blocks (Block 0 to Block 3) with threads labeled 0 ~ 15, and SMs marked IDLE.]
Transposed Matrix Vector Multiplication (1M x 4)
[Figure: eight SMs (SM 0 to SM 7), each with cores 0 to 7, registers, and shared memory. Blocks are assigned across the SMs in groups (Block 0 ~ 7, Block 8 ~ 15, and so on up to Block 1,000,000), giving 125,000 blocks / SM.]
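The contrast between the two TMV slides can be checked with a little arithmetic. The sketch below is a CPU-side model in Python, not GPU code; it assumes one thread block per output element and a simple round-robin assignment of blocks to the eight SMs drawn in the figures. The function names `blocks_per_sm` and `idle_sms` are hypothetical helpers, and real hardware schedulers are more involved.

```python
def blocks_per_sm(num_blocks, num_sms=8):
    """Distribute thread blocks over SMs round-robin (a simplification)."""
    base, extra = divmod(num_blocks, num_sms)
    return [base + (1 if sm < extra else 0) for sm in range(num_sms)]


def idle_sms(num_blocks, num_sms=8):
    """Count SMs that receive no block at all."""
    return sum(1 for n in blocks_per_sm(num_blocks, num_sms) if n == 0)


# 4 x 1M input: only 4 blocks exist, so half of the 8 SMs stay idle.
print(blocks_per_sm(4))              # [1, 1, 1, 1, 0, 0, 0, 0]
print(idle_sms(4))                   # 4

# 1M x 4 input: 1,000,000 blocks, i.e. 125,000 blocks queued per SM.
print(blocks_per_sm(1_000_000)[0])   # 125000
```

Under these assumptions the first shape under-uses the device, while the second keeps every SM busy but pays for scheduling a very large number of small blocks.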
GPU Programming Challenge - Portability
Columns: GPU Architecture, Input Matrix Size, Source Code
• GTX 285 (2008, Cores : 240)
– 4 x 1M: GTX285_MV_4_1M.cu
– 128 x 32K: GTX285_MV_128_32K.cu
– 32K x 128: GTX285_MV_32K_128.cu
– 1M x 4: GTX285_MV_1M_4.cu
• GTX 580 (2011, Cores : 512)
– 4 x 1M: GTX580_MV_4_1M.cu
– 128 x 32K: GTX580_MV_128_32K.cu
– 32K x 128: GTX580_MV_32K_128.cu
– 1M x 4: GTX580_MV_1M_4.cu
• GTX 680 (2012, Cores : 1536)
– 4 x 1M: GTX680_MV_4_1M.cu
– 128 x 32K: GTX680_MV_128_32K.cu
– 32K x 128: GTX680_MV_32K_128.cu
– 1M x 4: GTX680_MV_1M_4.cu
• Goal: the fastest matrix-vector multiplication for any GPU, for any input size
Adaptic
• Adaptive Input-aware Compilation for GPUs
– Device-portable
– Input-portable
– Programmers can focus on the algorithms without worrying about low-level details
• Streaming language
– Higher level of abstraction
– Separates memory access from the algorithm
– e.g., StreamIt
StreamIt
• Higher level of abstraction
• Decoupling computation and memory accesses
• Coarse-grain exposed parallelism, exposed communication
• Streaming actors use buffers to communicate
• A lot of recent work on extending portability of streaming applications
[Figure: example stream graph. Actor 1 feeds a Splitter; Actors 2, 3, 4, and 5 sit between the Splitter and a Joiner; the Joiner feeds Actor 6.]
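To make the actor, splitter, and joiner vocabulary concrete, here is a minimal Python model of a graph shaped like the one in the figure. This is not StreamIt syntax. It assumes a round-robin splitter with four parallel branches of equal length, and the work done by each actor is invented purely for illustration; `roundrobin_split`, `roundrobin_join`, and `run_graph` are hypothetical names.

```python
def roundrobin_split(items, n):
    """Splitter: deal items to n branches in round-robin order."""
    return [items[i::n] for i in range(n)]


def roundrobin_join(branches):
    """Joiner: interleave equally long branch outputs into one stream."""
    out = []
    for group in zip(*branches):
        out.extend(group)
    return out


def run_graph(stream):
    stream = [x + 1 for x in stream]            # Actor 1 (hypothetical work)
    branches = roundrobin_split(stream, 4)      # Splitter
    # Actors 2 to 5: each branch has its own actor (hypothetical work).
    scaled = [[x * (i + 1) for x in b] for i, b in enumerate(branches)]
    joined = roundrobin_join(scaled)            # Joiner
    return [x - 1 for x in joined]              # Actor 6 (hypothetical work)


print(run_graph(list(range(8))))
```

Each list passed between stages plays the role of a communication buffer; because actors only touch their own input and output buffers, a compiler is free to decide where those buffers live and how they are laid out.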
Compilation Flow in Adaptic
• Inputs: StreamIt Code, Target GPU, Input Range
• Offline Compilation, guided by a Performance Model
– Input-unaware Optimization
– Input-aware Optimization
• Memory Access Optimization
– Why? Global memory accesses have large access latency
– Optimizations: Memory Restructuring (Coalesced Access), Neighboring Access (Data Reuse)
• Actor Segmentation
– Splits actors: more blocks will be generated, which alleviates resource under-utilization
– Optimizations: Stream Reduction, Intra-actor Parallelization
• Actor Integration
– Integrates actors: merges several actors into one, which alleviates high resource contention
– Optimizations: Vertical Integration, Horizontal Integration
• Executable
– Several CUDA kernels for various input ranges (Kernel 0 to Kernel 3), spanning the smallest to the largest input
– At run time, the input size determines which kernel is launched
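The run-time half of this flow, checking the input size and launching the matching kernel, can be sketched as a table lookup. The Python below is only a model of that dispatch step: the boundary values in `BOUNDS` and the kernel names are made up for illustration, since the slides do not give the actual sub-ranges, and `select_kernel` is a hypothetical helper.

```python
import bisect

# Hypothetical exclusive upper bounds that cut the input range into 4 sub-ranges.
BOUNDS = [1 << 10, 1 << 16, 1 << 20]
# One pre-compiled kernel variant per sub-range (names are placeholders).
KERNELS = ["kernel_0", "kernel_1", "kernel_2", "kernel_3"]


def select_kernel(input_size, bounds=BOUNDS, kernels=KERNELS):
    """Return the kernel variant whose input sub-range contains input_size."""
    return kernels[bisect.bisect_right(bounds, input_size)]


print(select_kernel(100))        # kernel_0
print(select_kernel(70_000))     # kernel_2
print(select_kernel(1 << 20))    # kernel_3
```

Because all variants are generated offline, the only run-time cost in this model is the lookup itself.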
Memory Optimization
• Global memory: large access latency
• Threads do not access the words in sequence
• No coalescing
A[i, j]: actor A has i pops and j pushes
[Figure: four threads (Thread 0 to Thread 3), each running actor A[4,4], read elements 0 to 15 from global memory and write elements 0 to 15 back to global memory. The elements are also drawn in the regrouped order 0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15.]
Memory Optimization
[Figure: a further view of the four A[4,4] threads (Thread 0 to Thread 3) and their global-memory input and output buffers, with elements 0 to 15 drawn both in sequential order and in the regrouped order 0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15.]
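The regrouped order in the figure (0 4 8 12, 1 5 9 13, ...) is what a transposed buffer layout produces. The Python sketch below models the index arithmetic only, assuming four threads that each pop four consecutive elements as in the A[4,4] example; the function names are hypothetical and this is not the compiler's actual transformation code.

```python
def naive_index(thread, k, pops=4):
    """Address of the k-th pop of a thread when each thread owns a contiguous chunk."""
    return thread * pops + k


def restructured_index(thread, k, num_threads=4):
    """Address of the same element after restructuring: step-k items sit side by side."""
    return k * num_threads + thread


def restructure(buffer, num_threads=4, pops=4):
    """Copy the buffer into the restructured layout."""
    out = [None] * len(buffer)
    for t in range(num_threads):
        for k in range(pops):
            out[restructured_index(t, k, num_threads)] = buffer[naive_index(t, k, pops)]
    return out


print(restructure(list(range(16))))
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]

# Addresses touched by threads 0..3 on their first pop (k = 0):
print([naive_index(t, 0) for t in range(4)])          # [0, 4, 8, 12]  strided
print([restructured_index(t, 0) for t in range(4)])   # [0, 1, 2, 3]   contiguous
```

In the original layout, neighboring threads touch addresses four words apart at every step; after restructuring they touch adjacent words, which is the access pattern that GPU hardware can coalesce into fewer memory transactions.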
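Actor Segmentation
• Example: 4 x 1M Transposed Matrix-Vector Multiplication
[Figure: Actor 0 is segmented into Actor 0, Actor 1, Actor 2, and Actor 3. The four original blocks (Block 0 to Block 3) are replaced by groups of blocks labeled Block 0 ~ Block 31, Block 32, Block 64, and Block 96.]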
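One way to read the segmentation figure is that each long dot product is cut into independent partial sums that are reduced afterwards. The Python sketch below models that idea on the CPU. The choice of 32 segments per row is an assumption read off the block labels (groups starting at 0, 32, 64, 96), and `segmented_dot` is a hypothetical helper, not the generated kernel.

```python
def segmented_dot(row, vec, num_segments):
    """Split one long dot product into independent partial sums, then reduce them.

    Each partial sum stands for the work of one thread block.
    """
    n = len(row)
    seg = (n + num_segments - 1) // num_segments      # ceiling division
    partials = [
        sum(r * v for r, v in zip(row[s:s + seg], vec[s:s + seg]))
        for s in range(0, n, seg)
    ]
    return sum(partials), len(partials)


# Small stand-in for the 4 x 1M case: 4 rows of 64 elements each.
rows = [[i + 1] * 64 for i in range(4)]
vec = list(range(64))

total_blocks = 0
for row in rows:
    value, blocks = segmented_dot(row, vec, num_segments=32)
    assert value == sum(r * v for r, v in zip(row, vec))   # same result as unsegmented
    total_blocks += blocks

print(total_blocks)   # 128 blocks instead of 4
```

The result is unchanged, but the number of independent work items grows from 4 to 128 in this model, which is what lets a device with many SMs stay occupied on a very wide, short matrix.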
Actor Integration
• Merges several actors or threads to balance threads' workloads
• Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory
• Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions
[Figure: a stream graph with Actors 1 to 8, a Splitter, and a Joiner, shown before and after integration. The integrated graph contains Fused Actor 0 and Fused Actor 1 alongside Actor 1 and Actor 6.]
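The benefit of vertical integration can be illustrated by counting buffer writes. The Python sketch below is a CPU-side model under simple assumptions: every actor is a per-element function, and each separately run actor writes a full intermediate buffer (standing in for off-chip memory traffic). `run_separately` and `fuse_vertically` are hypothetical names, and the two example actors are invented.

```python
def run_separately(actors, stream):
    """Run actors one after another; each writes a whole new buffer."""
    buffer_writes = 0
    for actor in actors:
        stream = [actor(x) for x in stream]
        buffer_writes += len(stream)
    return stream, buffer_writes


def fuse_vertically(actors):
    """Merge a producer-consumer chain into a single actor."""
    def fused(x):
        for actor in actors:
            x = actor(x)      # intermediate value stays local to the fused actor
        return x
    return fused


def double(x):
    return x * 2


def increment(x):
    return x + 1


data = [1, 2, 3]
print(run_separately([double, increment], data))                     # ([3, 5, 7], 6)
print(run_separately([fuse_vertically([double, increment])], data))  # ([3, 5, 7], 3)
```

Both versions produce the same output stream, but the fused version writes only the final buffer; on a GPU the intermediate values would live in registers or shared memory rather than global memory.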
Experimental Setup
• CPU: Intel Xeon X5650
• GPU
– NVIDIA Tesla C2050 (3GB GDDR 5)
– NVIDIA GTX 285 (2GB GDDR 2)
• Benchmarks
– CUBLAS Library 3.2
– NVIDIA SDK 3.1
Results (Matrix-Vector Multiplication)
[Figure: GFLOPS (0 to 45) of Adaptic and CUBLAS versus input size, in three groups by total element count.
1M numbers: 4x256K, 16x64K, 64x16K, 256x4K, 1Kx1K, 4Kx256, 16Kx64, 64Kx16, 256Kx4
4M numbers: 4x1M, 16x256K, 64x64K, 256x16K, 1Kx4K, 4Kx1K, 16Kx256, 64Kx64, 256Kx16, 1Mx4
16M numbers: 16x1M, 64x256K, 256x64K, 1Kx16K, 4Kx4K, 16Kx1K, 64Kx256, 256Kx64, 1Mx16]
Results (Speedup)
[Figure: speedup (X), 0 to 6, versus input size.
CUBLAS benchmarks: Isamax/Isamin, Snrm2, Sasum, Sdot
SDK benchmarks: Scalar Product, MonteCarlo, Ocean FFT, Convolution Separable
Input-size series on the x-axis: 4M, 1M, 256K, 64K, 16K, 4K, 1K (shown four times); 2x4M, 4x2M, 8x1M, 16x512K, 32x256K, 64x128K (shown twice); 256x16K, 512x8K, 1Kx4K, 2Kx2K, 4Kx1K, 8Kx512, 16Kx256 (shown twice)]
Results (BiCGSTAB)
[Figure: speedup over CUBLAS (0 to 10) on C2050 and GTX285 for input sizes 512x512, 1024x1024, 2048x2048, 4096x4096, and 8192x8192. Bar series: Baseline, Actor Segmentation, Memory Optimizations, Actor Integration. The chart is annotated with "Input unaware" labels.]
Summary
• Performance of a GPU is affected by
– GPU model / input
• CUDA / OpenCL programming model
– Lacks architecture and input portability
• Scientific applications use irregular input
– Hard to get optimized performance
• Proposed Adaptic
– Architecture- and input-portable with a streaming language
– Showed speedup over CUBLAS / SDK across various input ranges
Q & A