IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros)


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from NVIDIA.


CUDA Tricks and Computational Physics

Kipton Barros

In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz

Boston University

High energy physics: huge computational needs

27 km

Large Hadron Collider, CERN

A disclaimer:

I’m not a high energy physicist

A request:

Please question/comment freely during the talk

View of the CMS detector at the end of 2007. (Maximilien Brice, © CERN)

View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)

15 Petabytes to be processed annually

The “Standard Model” of Particle Physics

I’ll discuss Quantum ChromoDynamics

Although it’s “standard”, these equations are hard to solve

Big questions: Why do quarks appear in groups? What was the physics during the big bang?

Quantum Chromodynamics: the theory of nuclear interactions

Extremely difficult:

Must work at the level of fields, not particles

The calculation is quantum mechanical

(bound by “gluons”)

Lattice QCD: Solving Quantum Chromodynamics by Computer

Discretize space and time (place the quarks and gluons on a 4D lattice)

Spacetime = 3+1 dimensions

32^4 ≈ 10^6 lattice sites

Quarks live on sites (24 floats each)

Gluons live on links (18 floats each)

Total system size = (bytes per float) × (lattice sites) × (floats per site)
≈ 4 × 32^4 × (24 + 4 × 18) bytes ≈ 384 MB

Lattice QCD: The inner loop requires repeatedly solving a linear equation

D_W is a sparse matrix with only nearest-neighbor couplings

[Equation figure, labeling the quark and gluon fields]

Applying D_W needs to be fast!

Operation of D_W:

1 output quark site (24 floats)

2x4 input quark sites (24x8 floats)

2x4 input gluon links (18x8 floats)

1.4 KB of local storage required per quark update?

CUDA parallelization: Must process many quark updates simultaneously

Odd/even sites processed separately

© NVIDIA Corporation 2006

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; Grid 1 consists of Blocks (0,0) through (2,1), and each block, e.g. Block (1,1), consists of a 2D array of Threads (0,0) through (4,2).]
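A minimal sketch of this model (hypothetical kernel and sizes, not from the talk): each thread computes a global index from its block and thread coordinates, threads in a block share data through __shared__ memory, and __syncthreads() is the block-wide barrier.

    // Hypothetical example kernel illustrating the grid/block/thread hierarchy.
    __global__ void scale(float *out, const float *in, float alpha, int n)
    {
        __shared__ float tile[256];                     // visible to one thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) tile[threadIdx.x] = in[i];           // cooperative load into shared memory
        __syncthreads();                                // synchronize the block
        if (i < n) out[i] = alpha * tile[threadIdx.x];
    }

    // Host side: launch as a grid of thread blocks (256 threads per block here).
    // scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);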

Threading


Parallelization:

Each thread processes 1 site

No communication required between threads!

All threads in a warp execute the same code

Operation of D_W, one neighbor at a time:

Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into the output
Step 4: Read neighbor site
Step 5: Read neighbor link
Step 6: Accumulate into the output
...
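As a rough sketch of the thread-per-site pattern (names, data layout, and the neighbor table are hypothetical, and the real spin/color algebra is omitted), one thread accumulates contributions from its 2x4 neighbors into its own output site:

    // Hypothetical sketch: one thread updates one lattice site.
    // 'quarks' holds 24 floats per site; 'nbr' is a precomputed neighbor table.
    __global__ void dw_sketch(float *out, const float *quarks, const int *nbr,
                              int n_sites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per site
        if (site >= n_sites) return;

        float acc[24];                                       // output accumulator
        for (int c = 0; c < 24; ++c) acc[c] = 0.0f;

        for (int dir = 0; dir < 8; ++dir) {                  // +/- x, y, z, t neighbors
            const float *q = &quarks[24 * nbr[8 * site + dir]];
            for (int c = 0; c < 24; ++c)
                acc[c] += q[c];                              // stand-in for the real link*quark product
        }
        for (int c = 0; c < 24; ++c)
            out[24 * site + c] = acc[c];
    }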


!""#$%&"'

()*+%,-.&/0*#"0.1&/-%*+-+2+"#0+,-/+3#+&0.%44'5-/1-

+2+"#0.&6-10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+-

4%0+&".+/-%&,-8++$-0)+-)%*,7%*+-9#/'

!""#$%&"' :-;#<9+*-1=-7%*$/-*#&&.&6-

"1&"#**+&04'-1&-%-<#40.$*1"+//1*-,.>.,+,-9'-

<%2.<#<-&#<9+*-1=-7%*$/-0)%0-"%&-*#&-

"1&"#**+&04'

?.<.0+,-9'-*+/1#*"+-#/%6+@

A+6./0+*/

B)%*+,-<+<1*'


!"#$%$&$'()#*+,-./)",+)01234

5*22/,)#*+,-./)",+)01234)-/)-)%61#$"1,)27)8-+")/$&,

9:2$.)8-/#$'()32%"6#-#$2')2')6'.,+;"2"61-#,.)8-+"/

<2+,)#*+,-./)",+)01234)==)0,##,+)%,%2+>)1-#,'3>)

*$.$'(

?6#@)%2+,)#*+,-./)",+)01234)==)7,8,+)+,($/#,+/)",+)

#*+,-.

A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,.

B,6+$/#$3/

<$'$%6%C)DE)#*+,-./)",+)01234

!'1>)$7)%61#$"1,)32'36++,'#)01234/)

FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3,

J/6-11>)/#$11),'26(*)+,(/)#2)32%"$1,)-'.)$':24,)/633,//7611>

K*$/)-11).,",'./)2')>26+)32%"6#-#$2'@)/2),L"+$%,'#M
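These heuristics translate directly into the launch configuration. A sketch (reusing the hypothetical dw_sketch kernel and device pointers from the earlier sketch, with 192 threads per block as one of the suggested values):

    // Hypothetical launch configuration following the heuristics above:
    // 192 threads per block, grid sized to cover all n_sites lattice sites.
    int threads = 192;
    int blocks  = (n_sites + threads - 1) / threads;
    dw_sketch<<<blocks, threads>>>(d_out, d_quarks, d_nbr, n_sites);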


Reminder -- each multiprocessor has:

16 KB shared memory

16 k registers

1024 active threads (max)

High occupancy needed for maximum performance (roughly 25% or so)

D_W: does it fit onto the GPU?

Each thread requires 0.2 KB (reduced from 1.4 KB) of fast local memory:
12 floats (was 24) + 18 floats + 24 floats

MP has 16 KB shared mem

Threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only)

MP occupancy = 64/1024 = 6%

6% occupancy sounds pretty bad!

Andreas Kuehn / Getty

Reminder -- each multiprocessor has:

16 KB shared memory

16 k registers = 64 KB of memory

1024 active threads (max)

Each thread requires 0.2 KB of fast local memory

How can we get better occupancy?

Occupancy > 25%

Registers as data (possible because there is no inter-thread communication)

Instead of shared memory, registers are allocated to hold the working data

Registers can't be indexed: all loops must be EXPLICITLY expanded

Code sample

(approx. 1000 LOC automatically generated)
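The code itself is not reproduced in this preview. As a rough illustration of the "registers as data" idea (hypothetical and heavily simplified): the per-site arrays become individually named scalars and every loop over them is written out (or forced open with #pragma unroll on compile-time bounds) so the compiler can keep the values in registers rather than spilling an indexed array to local memory.

    // Hypothetical illustration of registers-as-data with explicit unrolling.
    __global__ void dw_registers(float *out, const float *quarks, const int *nbr,
                                 int n_sites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= n_sites) return;

        // 24 named scalars instead of an indexed array (only 3 shown here):
        float a0 = 0.f, a1 = 0.f, a2 = 0.f /* ... a23 */;

        #pragma unroll
        for (int dir = 0; dir < 8; ++dir) {
            const float *q = &quarks[24 * nbr[8 * site + dir]];
            a0 += q[0];  a1 += q[1];  a2 += q[2];  /* ... */
        }
        out[24 * site + 0] = a0;  out[24 * site + 1] = a1;  /* ... */
    }

Writing out all 24 components for all 8 directions is what pushes the generated kernel toward the ~1000 lines mentioned above, hence the code generator.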

Performance Results:

82 Gigabytes/sec (GTX 280)

44 Gigabytes/sec (Tesla C870)

(completely bandwidth limited)

For comparison:

twice as fast as Cell impl. (arXiv:0804.3654)

20 times faster than CPU implementations

(90 Gflop/s)

[Charts: sustained GB/s vs. occupancy. Tesla C870: y-axis 0 to 45 GB/s, occupancy bins ≥ 25%, 17%, 8%, 0%. GTX 280: y-axis 0 to 85 GB/s, occupancy bins ≥ 19%, 13%, 6%, 0%.]

Surprise! Performance is very robust to low occupancy

Device memory is the bottleneck. Coalesced memory accesses are crucial.

Data reordering

Naive layout: Quark 1 = (q1_1, q1_2, ..., q1_24), Quark 2 = (q2_1, q2_2, ..., q2_24), Quark 3 = (q3_1, q3_2, ..., q3_24), ...

Reordered layout: q1_1, q2_1, q3_1, ..., then q1_2, q2_2, q3_2, ..., so that thread 0, thread 1, thread 2, ... read consecutive words

Memory coalescing: store even/odd lattices separately
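A hedged sketch of the reordering (field names hypothetical): with the naive array-of-structures layout, consecutive threads read addresses 24 floats apart; with the structure-of-arrays layout, the same component of consecutive quarks is contiguous, so a warp's reads coalesce.

    // Array-of-structures:  q[24*i + c]       -> adjacent threads read addresses 24 floats apart (uncoalesced)
    // Structure-of-arrays:  q[c*n_sites + i]  -> adjacent threads read consecutive words (coalesced)
    __global__ void read_soa(float *out, const float *q_soa, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 24; ++c)
            sum += q_soa[c * n_sites + i];   // coalesced read across the warp
        out[i] = sum;
    }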

When memory access isn't perfectly coalesced:

Sometimes float4 arrays can hide latency

A float4 global memory read corresponds to a single CUDA instruction (one 16-byte load per thread: thread 0, thread 1, thread 2, ...)

In case of a coalesce miss, at least 4x the data is transferred per transaction
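For example (a sketch, assuming the structure-of-arrays layout above), packing four components into a float4 turns four separate loads into one 16-byte load instruction per thread:

    // Hypothetical sketch: q4 views the quark field as float4 (6 x float4 = 24 floats per site).
    __global__ void read_float4(float *out, const float4 *q4, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 6; ++c) {
            float4 v = q4[c * n_sites + i];   // one 16-byte load instruction
            sum += v.x + v.y + v.z + v.w;
        }
        out[i] = sum;
    }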

When memory access isn't perfectly coalesced:

Binding to textures can help

This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses

Each texture fetch corresponds to a single CUDA instruction

Regarding textures, there are two kinds of memory:

Linear array:
  Can be modified in the kernel
  Can only be bound to a 1D texture

"CUDA array":
  Can't be modified in the kernel
  Gets reordered for 2D, 3D locality
  Allows various hardware features
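A sketch of binding a linear array to a 1D texture using the legacy (CUDA 2.x-era) texture reference API; the names are illustrative:

    // Legacy 1D texture reference bound to linear device memory.
    texture<float4, 1, cudaReadModeElementType> tex_quarks;

    __global__ void read_via_texture(float *out, int n_sites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_sites) return;
        float sum = 0.0f;
        for (int c = 0; c < 6; ++c) {
            float4 v = tex1Dfetch(tex_quarks, c * n_sites + i);  // read through the texture cache
            sum += v.x + v.y + v.z + v.w;
        }
        out[i] = sum;
    }

    // Host side (illustrative):
    //   cudaBindTexture(0, tex_quarks, d_quarks4, 6 * n_sites * sizeof(float4));
    //   read_via_texture<<<blocks, threads>>>(d_out, n_sites);
    //   cudaUnbindTexture(tex_quarks);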

When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve

[Z-order curve illustration (Wikipedia image)]

This gives 2D locality

Warnings:

The effectiveness of float4 and textures depends on the CUDA hardware and driver (!)

Certain “magic” access patterns are many times faster than others

Testing appears to be necessary

Memory bandwidth test

Simple kernel; memory access completely coalesced; should be optimal

Bandwidth: 54 Gigabytes/sec (GTX 280, 140 GB/s theoretical!)
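A sketch of such a test (array size and timing scaffolding are illustrative): a plain copy kernel with fully coalesced reads and writes, timed with CUDA events; bandwidth = bytes moved / elapsed time.

    // Simple copy kernel: fully coalesced reads and writes.
    __global__ void copy(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Host side (illustrative): time with CUDA events and report GB/s.
    //   cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);
    //   cudaEventRecord(t0);
    //   copy<<<(n + 255) / 256, 256>>>(d_dst, d_src, n);
    //   cudaEventRecord(t1);  cudaEventSynchronize(t1);
    //   float ms;  cudaEventElapsedTime(&ms, t0, t1);
    //   double gbps = 2.0 * n * sizeof(float) / (ms * 1e6);   // read + write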

So why are NVIDIA samples so fast?

NVIDIA actually uses a modified access pattern:

102 Gigabytes/sec instead of 54 Gigabytes/sec (GTX 280, 140 GB/s theoretical)

Naive access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1, Step 2, ...]

Modified access pattern

[Diagram: memory regions read by Block 1 and Block 2 at Step 1, Step 2, ..., rearranged]

(much more efficient)

CUDA Compiler

CUDA C code -> PTX code -> CUDA machine code

Use the unofficial CUDA disassembler to view CUDA machine code

CUDA disassembly

(LOTS of optimization here)

CUDA Disassembler (decuda)

Compile and save cubin file

foo.cu

Disassemble

Look how CUDA implements integer division!

CUDA provides fast (but imperfect) trigonometry in hardware!
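For example, the __sinf() intrinsic maps to the special-function hardware and is much faster than sinf(), at reduced accuracy (nvcc's --use_fast_math flag makes this substitution globally):

    __global__ void trig(float *out, const float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __sinf(x[i]);   // hardware special-function unit: fast, lower accuracy
        // out[i] = sinf(x[i]);      // software implementation: slower, more accurate
    }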

The compiler optimizes very aggressively. It will group memory loads together to minimize latency

Notice: each thread reads 20 floats!

(snippet from LQCD)

Recommended