IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros)


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from NVIDIA.


CUDA Tricks and Computational Physics

Kipton Barros

In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz

Boston University

High energy physics: huge computational needs

27 km

Large Hadron Collider, CERN

A disclaimer:

I’m not a high energy physicist

A request:

Please question/comment freely during the talk

View of the CMS detector at the end of 2007. (Maximilien Brice, © CERN)

View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)

15 Petabytes to be processed annually

The “Standard Model” of Particle Physics

I’ll discuss Quantum ChromoDynamics

Although it’s “standard”, these equations are hard to solve

Big questions: Why do quarks appear in groups? What was the physics during the big bang?

Quantum Chromodynamics: the theory of nuclear interactions

Extremely difficult:

Must work at the level of fields, not particles

The calculation is quantum mechanical

(bound by “gluons”)

Lattice QCD: Solving Quantum Chromodynamics by Computer

Discretize space and time (place the quarks and gluons on a 4D lattice)

Spacetime = 3+1 dimensions

32^4 ≈ 10^6 lattice sites

Quarks live on sites (24 floats each)

Gluons live on links (18 floats each)

Total system size = (bytes per float) × (lattice sites) × (floats per site)
≈ 4 × 32^4 × (24 + 4 × 18) bytes ≈ 384 MB

Lattice QCD: The inner loop requires repeatedly solving a linear equation

D_W is a sparse matrix with only nearest-neighbor couplings

[Equation figure, labeling the quark and gluon fields]

Applying D_W needs to be fast!

Operation of D_W:

1 output quark site (24 floats)

2x4 input quark sites (24x8 floats)

2x4 input gluon links (18x8 floats)

1.4 KB of local storage required per quark update?

CUDA parallelization: Must process many quark updates simultaneously

Odd/even sites processed separately

© NVIDIA Corporation 2006

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; Grid 1 consists of Blocks (0,0) through (2,1), and each block, e.g. Block (1,1), consists of a 2D array of Threads (0,0) through (4,2).]
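A minimal sketch of this model (hypothetical kernel and sizes, not from the talk): each thread computes a global index from its block and thread coordinates, threads in a block share data through __shared__ memory, and __syncthreads() is the block-wide barrier.

    // Hypothetical example kernel illustrating the grid/block/thread hierarchy.
    __global__ void scale(float *out, const float *in, float alpha, int n)
    {
        __shared__ float tile[256];                     // visible to one thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) tile[threadIdx.x] = in[i];           // cooperative load into shared memory
        __syncthreads();                                // synchronize the block
        if (i < n) out[i] = alpha * tile[threadIdx.x];
    }

    // Host side: launch as a grid of thread blocks (256 threads per block here).
    // scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);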

Threading


Parallelization:

Each thread processes 1 site

No communication required between threads!

All threads in a warp execute the same code

Operation of D_W, one neighbor at a time:

Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into the output
Step 4: Read neighbor site
Step 5: Read neighbor link
Step 6: Accumulate into the output
...
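As a rough sketch of the thread-per-site pattern (names, data layout, and the neighbor table are hypothetical, and the real spin/color algebra is omitted), one thread accumulates contributions from its 2x4 neighbors into its own output site:

    // Hypothetical sketch: one thread updates one lattice site.
    // 'quarks' holds 24 floats per site; 'nbr' is a precomputed neighbor table.
    __global__ void dw_sketch(float *out, const float *quarks, const int *nbr,
                              int n_sites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per site
        if (site >= n_sites) return;

        float acc[24];                                       // output accumulator
        for (int c = 0; c < 24; ++c) acc[c] = 0.0f;

        for (int dir = 0; dir < 8; ++dir) {                  // +/- x, y, z, t neighbors
            const float *q = &quarks[24 * nbr[8 * site + dir]];
            for (int c = 0; c < 24; ++c)
                acc[c] += q[c];                              // stand-in for the real link*quark product
        }
        for (int c = 0; c < 24; ++c)
            out[24 * site + c] = acc[c];
    }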


!""#$%&"'

()*+%,-.&/0*#"0.1&/-%*+-+2+"#0+,-/+3#+&0.%44'5-/1-

+2+"#0.&6-10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+-

4%0+&".+/-%&,-8++$-0)+-)%*,7%*+-9#/'

!""#$%&"' :-;#<9+*-1=-7%*$/-*#&&.&6-

"1&"#**+&04'-1&-%-<#40.$*1"+//1*-,.>.,+,-9'-

<%2.<#<-&#<9+*-1=-7%*$/-0)%0-"%&-*#&-

"1&"#**+&04'

?.<.0+,-9'-*+/1#*"+-#/%6+@

A+6./0+*/

B)%*+,-<+<1*'


!"#$%$&$'()#*+,-./)",+)01234

5*22/,)#*+,-./)",+)01234)-/)-)%61#$"1,)27)8-+")/$&,

9:2$.)8-/#$'()32%"6#-#$2')2')6'.,+;"2"61-#,.)8-+"/

<2+,)#*+,-./)",+)01234)==)0,##,+)%,%2+>)1-#,'3>)

*$.$'(

?6#@)%2+,)#*+,-./)",+)01234)==)7,8,+)+,($/#,+/)",+)

#*+,-.

A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,.

B,6+$/#$3/

<$'$%6%C)DE)#*+,-./)",+)01234

!'1>)$7)%61#$"1,)32'36++,'#)01234/)

FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3,

J/6-11>)/#$11),'26(*)+,(/)#2)32%"$1,)-'.)$':24,)/633,//7611>

K*$/)-11).,",'./)2')>26+)32%"6#-#$2'@)/2),L"+$%,'#M
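These heuristics translate directly into the launch configuration. A sketch (reusing the hypothetical dw_sketch kernel and device pointers from the earlier sketch, with 192 threads per block as one of the suggested values):

    // Hypothetical launch configuration following the heuristics above:
    // 192 threads per block, grid sized to cover all n_sites lattice sites.
    int threads = 192;
    int blocks  = (n_sites + threads - 1) / threads;
    dw_sketch<<<blocks, threads>>>(d_out, d_quarks, d_nbr, n_sites);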


Reminder -- each multiprocessor has:

16 KB shared memory

16 k registers

1024 active threads (max)

High occupancy needed for maximum performance (roughly 25% or so)

D_W: does it fit onto the GPU?

Each thread requires 0.2 KB (reduced from 1.4 KB) of fast local memory:
12 floats (was 24) + 18 floats + 24 floats

MP has 16 KB shared mem

Threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only)

MP occupancy = 64/1024 = 6%

6% occupancy sounds pretty bad!

Andreas Kuehn / Getty

Reminder -- each multiprocessor has:

16 KB shared memory

16 k registers = 64 KB of memory

1024 active threads (max)

Each thread requires 0.2 KB of fast local memory

How can we get better occupancy?

Occupancy > 25%

Registers as data (possible because there is no inter-thread communication)

Instead of shared memory, registers are allocated to hold the working data

Registers can't be indexed: all loops must be EXPLICITLY expanded

Code sample

(approx. 1000 LOC automatically generated)
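The code itself is not reproduced in this preview. As a rough illustration of the "registers as data" idea (hypothetical and heavily simplified): the per-site arrays become individually named scalars and every loop over them is written out (or forced open with #pragma unroll on compile-time bounds) so the compiler can keep the values in registers rather than spilling an indexed array to local memory.

    // Hypothetical illustration of registers-as-data with explicit unrolling.
    __global__ void dw_registers(float *out, const float *quarks, const int *nbr,
                                 int n_sites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= n_sites) return;

        // 24 named scalars instead of an indexed array (only 3 shown here):
        float a0 = 0.f, a1 = 0.f, a2 = 0.f /* ... a23 */;

        #pragma unroll
        for (int dir = 0; dir < 8; ++dir) {
            const float *q = &quarks[24 * nbr[8 * site + dir]];
            a0 += q[0];  a1 += q[1];  a2 += q[2];  /* ... */
        }
        out[24 * site + 0] = a0;  out[24 * site + 1] = a1;  /* ... */
    }

Writing out all 24 components for all 8 directions is what pushes the generated kernel toward the ~1000 lines mentioned above, hence the code generator.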

Performance Results:

82 Gigabytes/sec (GTX 280)

44 Gigabytes/sec (Tesla C870)

(completely bandwidth limited)

For comparison:

twice as fast as Cell impl. (arXiv:0804.3654)

20 times faster than CPU implementations

(90 Gflop/s)

[Charts: sustained GB/s vs. occupancy. Tesla C870: y-axis 0 to 45 GB/s, occupancy bins ≥ 25%, 17%, 8%, 0%. GTX 280: y-axis 0 to 85 GB/s, occupancy bins ≥ 19%, 13%, 6%, 0%.]

Surprise! Performance is very robust to low occupancy

Device memory is the bottleneck. Coalesced memory accesses are crucial.

Data reordering

Naive layout: Quark 1 = (q1_1, q1_2, ..., q1_24), Quark 2 = (q2_1, q2_2, ..., q2_24), Quark 3 = (q3_1, q3_2, ..., q3_24), ...

Reordered layout: q1_1, q2_1, q3_1, ..., then q1_2, q2_2, q3_2, ..., so that thread 0, thread 1, thread 2, ... read consecutive words

Memory coalescing: store even/odd lattices separately
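A hedged sketch of the reordering (field names hypothetical): with the naive array-of-structures layout, consecutive threads read addresses 24 floats apart; with the structure-of-arrays layout, the same component of consecutive quarks is contiguous, so a warp's reads coalesce.

    // Array-of-structures:  q[24*i + c]       -> adjacent threads read addresses 24 floats apart (uncoalesced)
    // Structure-of-arrays:  q[c*n_sites + i]  -> adjacent threads read consecutive words (coalesced)
    __global__ void read_soa(float *out, const float *q_soa, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 24; ++c)
            sum += q_soa[c * n_sites + i];   // coalesced read across the warp
        out[i] = sum;
    }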

When memory access isn't perfectly coalesced:

Sometimes float4 arrays can hide latency

A float4 global memory read corresponds to a single CUDA instruction (one 16-byte load per thread: thread 0, thread 1, thread 2, ...)

In case of a coalesce miss, at least 4x the data is transferred per transaction
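For example (a sketch, assuming the structure-of-arrays layout above), packing four components into a float4 turns four separate loads into one 16-byte load instruction per thread:

    // Hypothetical sketch: q4 views the quark field as float4 (6 x float4 = 24 floats per site).
    __global__ void read_float4(float *out, const float4 *q4, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 6; ++c) {
            float4 v = q4[c * n_sites + i];   // one 16-byte load instruction
            sum += v.x + v.y + v.z + v.w;
        }
        out[i] = sum;
    }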

When memory access isn't perfectly coalesced:

Binding to textures can help

This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses

Each texture fetch corresponds to a single CUDA instruction

Regarding textures, there are two kinds of memory:

Linear array:
  Can be modified in the kernel
  Can only be bound to a 1D texture

"CUDA array":
  Can't be modified in the kernel
  Gets reordered for 2D, 3D locality
  Allows various hardware features
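A sketch of binding a linear array to a 1D texture using the legacy (CUDA 2.x-era) texture reference API; the names are illustrative:

    // Legacy 1D texture reference bound to linear device memory.
    texture<float4, 1, cudaReadModeElementType> tex_quarks;

    __global__ void read_via_texture(float *out, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 6; ++c) {
            float4 v = tex1Dfetch(tex_quarks, c * n_sites + i);  // read through the texture cache
            sum += v.x + v.y + v.z + v.w;
        }
        out[i] = sum;
    }

    // Host side (illustrative):
    //   cudaBindTexture(0, tex_quarks, d_quarks4, 6 * n_sites * sizeof(float4));
    //   read_via_texture<<<blocks, threads>>>(d_out, n_sites);
    //   cudaUnbindTexture(tex_quarks);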

When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve

[Z-order curve illustration (Wikipedia image)]

This gives 2D locality

Warnings:

The effectiveness of float4 and textures depends on the CUDA hardware and driver (!)

Certain “magic” access patterns are many times faster than others

Testing appears to be necessary

Memory bandwidth test

Simple kernel; memory access completely coalesced; should be optimal

Bandwidth: 54 Gigabytes/sec (GTX 280, 140 GB/s theoretical!)
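A sketch of such a test (array size and timing scaffolding are illustrative): a plain copy kernel with fully coalesced reads and writes, timed with CUDA events; bandwidth = bytes moved / elapsed time.

    // Simple copy kernel: fully coalesced reads and writes.
    __global__ void copy(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Host side (illustrative): time with CUDA events and report GB/s.
    //   cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);
    //   cudaEventRecord(t0);
    //   copy<<<(n + 255) / 256, 256>>>(d_dst, d_src, n);
    //   cudaEventRecord(t1);  cudaEventSynchronize(t1);
    //   float ms;  cudaEventElapsedTime(&ms, t0, t1);
    //   double gbps = 2.0 * n * sizeof(float) / (ms * 1e6);   // read + write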

So why are NVIDIA samples so fast?

NVIDIA actually uses a modified access pattern:

102 Gigabytes/sec instead of 54 Gigabytes/sec (GTX 280, 140 GB/s theoretical)

Naive access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1, Step 2, ...]

Modified access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1, Step 2, ..., rearranged]

(much more efficient)

CUDA Compiler

CUDA C code -> PTX code -> CUDA machine code

Use the unofficial CUDA disassembler to view CUDA machine code

CUDA disassembly

(LOTS of optimization here)

CUDA Disassembler (decuda)

Compile and save cubin file

foo.cu

Disassemble

Look how CUDA implements integer division!

CUDA provides fast (but imperfect) trigonometry in hardware!
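For example, the __sinf() intrinsic maps to the special-function hardware and is much faster than sinf(), at reduced accuracy (nvcc's --use_fast_math flag makes this substitution globally):

    __global__ void trig(float *out, const float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __sinf(x[i]);   // hardware special-function unit: fast, lower accuracy
        // out[i] = sinf(x[i]);      // software implementation: slower, more accurate
    }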

The compiler optimizes very aggressively. It will group memory loads together to minimize latency

Notice: each thread reads 20 floats!

(snippet from LQCD)

Recommended