Adaptive Input-aware Compilation for Graphics Engines

University of Michigan, Electrical Engineering and Computer Science


Mehrzad Samadi¹, Amir Hormati², Mojtaba Mehrara³, Janghaeng Lee¹ and Scott Mahlke¹

¹ University of Michigan - Ann Arbor
² Microsoft Research
³ NVIDIA Research


GPU Performance Gap
• High performance at low cost
• Peak performance is difficult to achieve

[Figure: theoretical GFLOPS (y-axis, 0 to 3500) of NVIDIA GPUs versus CPUs, 2003 to 2013. Labeled GPUs: GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 590, GeForce GTX 680. Annotation: "In Practice".]


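TMV Performance on Various Inputs

[Figure: GFLOPS (y-axis, 0 to 20) of transposed matrix-vector multiplication (TMV) versus matrix aspect ratio (x-axis). Annotated regions: Low Utilization, Efficient Execution, High Overhead. Matrix shapes labeled along the axis: Rectangular Matrix, Square Matrix, Rectangular Matrix.]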


GPU Execution Model

[Figure: Grid 1 is mapped onto streaming multiprocessors SM 0 through SM 7. Each SM contains shared memory, registers, and eight execution units numbered 0 to 7; one unit is labeled "Executes Thread".]


Transposed Matrix Vector Multiplication (4 x 1M)

[Figure: SM 0 through SM 7, each with shared memory, registers, and execution units 0 to 7. Only Block 0 to Block 3 are launched, each labeled "Thread 0 ~ 15"; the unused part of the GPU is marked IDLE.]


Transposed Matrix Vector Multiplication (1M x 4)

[Figure: SM 0 through SM 7, each with shared memory, registers, and execution units 0 to 7. Blocks are queued in groups (Block 0 ~ 7, Block 8 ~ 15, ..., Block 1,000,000), giving 125,000 blocks / SM.]
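The block counts on the two preceding slides follow from simple arithmetic. The sketch below is only an illustration: it assumes, as the figures suggest, that the number of blocks equals the first matrix dimension and that blocks are spread evenly over 8 SMs. The function name `blocks_per_sm` is hypothetical.

```python
def blocks_per_sm(num_blocks: int, num_sms: int = 8):
    """Return (busy_sms, idle_sms, max_blocks_on_one_sm) for an even distribution."""
    busy = min(num_blocks, num_sms)
    idle = num_sms - busy
    # Ceiling division: the most heavily loaded SM receives this many blocks.
    max_per_sm = -(-num_blocks // num_sms)
    return busy, idle, max_per_sm

# 4 x 1M input: only 4 blocks, so 4 of the 8 SMs have nothing to run.
print(blocks_per_sm(4))          # (4, 4, 1)
# 1M x 4 input: 1,000,000 blocks, i.e. 125,000 blocks queued per SM.
print(blocks_per_sm(1_000_000))  # (8, 0, 125000)
```

The first case illustrates low utilization, the second the per-block scheduling overhead of a very large grid.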


GPU Programming Challenge - Portability

GPU Architecture      Input Matrix Size   Source Code
Cores: 240 (2008)     4 x 1M              GTX285_MV_4_1M.cu
                      128 x 32K           GTX285_MV_128_32K.cu
                      32K x 128           GTX285_MV_32K_128.cu
                      1M x 4              GTX285_MV_1M_4.cu
Cores: 512 (2011)     128 x 32K           GTX580_MV_128_32K.cu
                      32K x 128           GTX580_MV_32K_128.cu
Cores: 1536 (2012)    128 x 32K           GTX680_MV_128_32K.cu
                      32K x 128           GTX680_MV_32K_128.cu

Goal: the fastest matrix-vector multiplication for any GPU, for any input size


Adaptic
• Adaptive Input-aware Compilation for GPUs
  – Device-portable
  – Input-portable
  – Programmers can focus on the algorithms without worrying about low-level details
• Streaming language
  – Higher level of abstraction
  – Separates memory access from the algorithm
  – e.g., StreamIt


StreamIt
• Higher level of abstraction
• Decouples computation and memory accesses
• Coarse-grain exposed parallelism, exposed communication
• Streaming actors use buffers to communicate
• Much recent work on extending the portability of streaming applications

[Figure: example stream graph with Actor 1, a Splitter, Actors 2 to 5, a Joiner, and Actor 6.]
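The idea that streaming actors communicate only through buffers can be sketched in a few lines. This is a toy model, not StreamIt syntax or the Adaptic runtime: the `Actor` class, its `fire` method, and the `run` helper are hypothetical names, and a `deque` stands in for a communication buffer.

```python
from collections import deque

class Actor:
    """Hypothetical actor: pops `pops` items, applies `work`, pushes the results."""
    def __init__(self, pops, work):
        self.pops = pops
        self.work = work  # function: list of popped items -> list of pushed items

    def fire(self, in_buf: deque, out_buf: deque) -> bool:
        if len(in_buf) < self.pops:
            return False  # not enough input buffered yet
        items = [in_buf.popleft() for _ in range(self.pops)]
        out_buf.extend(self.work(items))
        return True

def run(actor, in_buf, out_buf):
    # Fire the actor until its input buffer can no longer satisfy a firing.
    while actor.fire(in_buf, out_buf):
        pass

# Two-actor pipeline: double each item, then sum adjacent pairs.
src = deque(range(8))
mid, dst = deque(), deque()
run(Actor(1, lambda xs: [2 * xs[0]]), src, mid)
run(Actor(2, lambda xs: [xs[0] + xs[1]]), mid, dst)
print(list(dst))  # [2, 10, 18, 26]
```

Because each actor declares how many items it pops per firing and touches nothing but its buffers, a compiler is free to decide how the buffers are laid out in memory and how firings are mapped to threads.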


Compilation Flow in Adaptic

[Figure: offline compilation takes StreamIt code, the target GPU, and the input range, and applies input-unaware and input-aware optimizations guided by a performance model. The optimization stages are Memory Access Optimization, Actor Segmentation, and Actor Integration. The resulting executable holds several CUDA kernels for various input ranges (Kernel 0 to Kernel 3, covering smallest, small, large, and largest inputs); at run time the input size is checked and the matching kernel is launched.]

Memory Access Optimization
• Why?
  – Global memory accesses
  – Large access latency
• Optimizations
  – Memory restructuring
  – Coalesced access
  – Neighboring access
  – Data reuse

Actor Segmentation
• Splits actors
  – More blocks will be generated
  – Alleviates resource under-utilization
• Optimizations
  – Stream reduction
  – Intra-actor parallelization

Actor Integration
• Integrates actors
  – Merges several actors into one
  – Alleviates high resource contention
• Optimizations
  – Vertical integration
  – Horizontal integration
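The run-time half of this flow, picking one of several pre-compiled kernels from the input size, might look like the sketch below. The range boundaries and the kernel names are invented for illustration; the slides do not state how Adaptic chooses them.

```python
from bisect import bisect_left

# Hypothetical upper bounds (in elements) of the input ranges served by
# Kernel 0, Kernel 1 and Kernel 2; Kernel 3 handles everything larger.
RANGE_BOUNDS = [1 << 10, 1 << 16, 1 << 20]
KERNELS = ["kernel_0", "kernel_1", "kernel_2", "kernel_3"]

def select_kernel(input_size: int) -> str:
    """Pick the kernel variant whose input range contains input_size."""
    return KERNELS[bisect_left(RANGE_BOUNDS, input_size)]

print(select_kernel(512))      # kernel_0
print(select_kernel(4096))     # kernel_1
print(select_kernel(1 << 22))  # kernel_3
```

The lookup is a binary search over a handful of boundaries, so the dispatch cost is negligible next to a kernel launch.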


Memory Optimization
• Global memory has a large access latency
• Threads do not access the words in sequence
• No coalescing

A[i, j]: actor A has i pops and j pushes

[Figure: Thread 0 to Thread 3 each execute A[4,4] on words 0 to 15 in global memory, so the words read at the same step are 4 apart. After restructuring, the words are grouped as (0 4 8 12), (1 5 9 13), (2 6 10 14), (3 7 11 15), and the output in global memory is arranged the same way.]

Memory Optimization

[Figure: restructured layout for A[4,4] on Thread 0 to Thread 3. Words 0 to 15 in global memory are regrouped as (0 4 8 12), (1 5 9 13), (2 6 10 14), (3 7 11 15) on both the input and the output side.]
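The regrouping shown in the figure can be checked with a small index calculation. The sketch below assumes four threads that each fire A[4,4], with thread t reading logical elements 4t to 4t+3 one per step; the helper names are hypothetical, and "coalesced" is simplified to mean that the four threads touch consecutive addresses in the same step.

```python
THREADS, POPS = 4, 4  # four threads, each firing A[4,4]: 4 pops per firing

def addresses_per_step(layout):
    """layout[e] is the address of logical element e. Returns, for each step,
    the addresses touched by threads 0..3 at that step."""
    return [[layout[t * POPS + s] for t in range(THREADS)] for s in range(POPS)]

def is_coalesced(addrs):
    # Simplified criterion: neighbouring threads read neighbouring words.
    return all(b - a == 1 for a, b in zip(addrs, addrs[1:]))

original = list(range(THREADS * POPS))  # element e stored at address e
# Restructured: elements read in the same step are stored next to each other.
restructured = [(e % POPS) * THREADS + e // POPS for e in original]

print(addresses_per_step(original)[0])      # [0, 4, 8, 12]
print(addresses_per_step(restructured)[0])  # [0, 1, 2, 3]
```

In the restructured layout the memory order of the logical elements is 0 4 8 12 1 5 9 13 and so on, which matches the grouping in the figure.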


Actor Segmentation

[Figure: for the 4 x 1M transposed matrix-vector multiplication, Actor 0 is split into Actor 0 to Actor 3. Block labels before segmentation: Block 0 to Block 3. Block labels after segmentation: Block 0 ~ Block 31, Block 32, Block 64, Block 96.]
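Segmenting a long reduction into independent partial reductions can be sketched as follows. This is a sequential illustration, not Adaptic's generated code: the function names are hypothetical, and the choice of 32 segments per row is an assumption read off the block labels in the figure.

```python
def row_dot(row, vec):
    """Unsegmented actor: one long reduction per row (one block per row)."""
    return sum(a * b for a, b in zip(row, vec))

def segmented_row_dot(row, vec, segments=32):
    """Segmented version: each segment is an independent partial reduction
    (a candidate for its own block), followed by a small final reduction."""
    n = len(row)
    step = -(-n // segments)  # ceiling division
    partials = [row_dot(row[i:i + step], vec[i:i + step]) for i in range(0, n, step)]
    return sum(partials), len(partials)

row = list(range(1024))
vec = [1] * 1024
total, blocks = segmented_row_dot(row, vec)
print(total == row_dot(row, vec), blocks)  # True 32
```

With 4 rows and 32 segments per row, the same computation exposes 128 independent pieces of work instead of 4, which is what lets the idle SMs from the 4 x 1M case contribute.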


Actor Integration
• Merges several actors or threads to balance threads' workloads
• Vertical integration: reduces off-chip memory traffic by storing intermediate results in shared memory
• Horizontal integration: reduces synchronization overhead and lets the merged actors share instructions

[Figure: stream graph before integration (Actor 1, Actor 2, Actor 3, a Splitter, Actors 4 to 7, a Joiner, Actor 8) and after integration (labels: Actor 1, Fused Actor 0, Fused Actor 1, Actor 6).]
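Vertical integration can be illustrated by fusing a producer and a consumer so that the intermediate value never leaves the fused actor. The sketch is a sequential analogy only: the two "actors" (squaring, then adding one) are made up, the intermediate list stands in for a global-memory buffer, and the local variable stands in for on-chip storage.

```python
def unfused(data):
    """Two separate actors; the intermediate list stands in for a global-memory buffer."""
    intermediate = [x * x for x in data]   # Actor A writes every result out
    return [y + 1 for y in intermediate]   # Actor B reads it all back

def fused(data):
    """Vertically integrated actor: each intermediate value stays in a local
    variable (standing in for on-chip storage) and is consumed immediately."""
    out = []
    for x in data:
        y = x * x          # Actor A's work
        out.append(y + 1)  # Actor B's work, no intermediate buffer
    return out

print(fused([1, 2, 3]) == unfused([1, 2, 3]))  # True
```

Both versions compute the same result; the fused one simply avoids one full write and one full read of the intermediate data.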


Experimental Setup
• CPU - Intel Xeon X5650
• GPU
  – NVIDIA Tesla C2050
    • 3 GB GDDR5
  – NVIDIA GTX 285
    • 2 GB GDDR2
• Benchmarks
  – CUBLAS Library 3.2
  – NVIDIA SDK 3.1


Results (Matrix-Vector Multiplication)

[Figure: GFLOPS (y-axis, 0 to 45) of Adaptic and CUBLAS versus input size (x-axis), in three groups: matrices holding 1M numbers (4x256K to 256Kx4), 4M numbers (4x1M to 1Mx4), and 16M numbers (16x1M to 1Mx16).]


Results (Speedup)

[Figure: speedup (y-axis, 0 to 6x) versus input size (x-axis). CUBLAS benchmarks: Isamax/Isamin, Snrm2, Sasum, Sdot. SDK benchmarks: Scalar Product, MonteCarlo, Ocean FFT, Convolution Separable. Input sizes on the x-axis include 1K to 4M elements, 2x4M to 64x128K, and 256x16K to 16Kx256.]


Results (BiCGSTAB)

[Figure: speedup over CUBLAS (y-axis, 0 to 10) on the C2050 and GTX285 for input sizes 512x512, 1024x1024, 2048x2048, 4096x4096, and 8192x8192. Bars are broken down into Baseline, Actor Segmentation, Memory Optimizations, and Actor Integration. Annotation: "Input unaware".]


Summary
• Performance of a GPU is affected by
  – GPU model / input
• CUDA / OpenCL programming model
  – Lacks architecture and input portability
• Scientific applications use irregular input
  – Hard to get optimized performance
• Proposed Adaptic
  – Architecture- and input-portable with a streaming language
  – Showed speedup over CUBLAS / SDK across various input ranges



Q & A
