
Graz University of Technology

Introduction The GPU: CUDA Architecture Real-World Example Tricks (?) End

CUDA - A Very Short Intro

Manuel Werlberger

Institute for Computer Graphics and Vision, Graz University of Technology

Freiburg, July 22, 2011

Manuel (ICG, TU-Graz) CUDA 22.7.2011 1 / 47


Why GPUs?


Resources / Credits

• ‘Best’ introduction: CUDA, Supercomputing for the Masses [Dr. Dobb’s Journal]

• [GP-GPU course @ETHZ]

• NVIDIA Developer Zone [http://developer.nvidia.com]

• NVIDIA CUDA Toolkit includes some PDFs (programming guide, reference guide, best practices guide, ...)

• NVIDIA Guides [http://developer.nvidia.com/nvidia-gpu-computing-documentation]

• Books
  • CUDA by Example: An Introduction to General-Purpose GPU Programming (Sanders et al.)
  • Programming Massively Parallel Processors: A Hands-On Approach (Kirk et al.) [course slides]

• Webinars


History (with NVIDIA subtitles)

2007: CUDA 1.0 (Researcher)

2008: CUDA 2.0 (Scientists and HPC applications)

2009: CUDA 3.0 (Applications)

2011: CUDA 4.0 (‘For the masses’)


Be aware ...

NOT EVERYTHING YOU CAN DO WITH A GPU IS GOOD!


Outline

1 Introduction

2 The GPU: CUDA Architecture
    GPU Architecture
    Memory Architecture
    Program Structure

3 Real-World Example

4 Tricks (?)


CPU / GPU Architecture

[Figure: excerpt from the NVIDIA CUDA C Programming Guide, Version 4.0, including Figure 1-2, "The GPU Devotes More Transistors to Data Processing". The excerpt explains that the GPU is specialized for compute-intensive, highly parallel computation and therefore devotes more transistors to data processing than to data caching and flow control.]


Threads, Blocks and Grids

[Figure: Memory Hierarchy (Figure 2-2, NVIDIA CUDA C Programming Guide, Version 4.0): per-thread local memory, per-block shared memory, and global memory shared by all grids.]

Thread: Execution of a kernel with a certain index. The index is used to determine the position within an array.


Block: A group of threads. There is essentially no guarantee about the order in which they are executed. Synchronization within a block is possible.

[Figure: Grid of Thread Blocks (Figure 2-1, NVIDIA CUDA C Programming Guide, Version 4.0).]

Grid: Group of blocks.
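The index arithmetic behind these three terms can be sketched on the CPU. The following plain C++ snippet is only an illustration: `Dim`, `globalCoord`, and `linearIndex` are hypothetical names standing in for CUDA's built-in `blockIdx`, `blockDim`, and `threadIdx` variables, and a row-major array is assumed.

```cpp
#include <cassert>

// Plain C++ stand-in for CUDA's built-in index variables (hypothetical type).
struct Dim { unsigned x, y; };

// Global 2D coordinate of one thread, as a kernel would compute it:
//   x = blockIdx.x * blockDim.x + threadIdx.x   (same for y)
inline Dim globalCoord(Dim blockIdx, Dim blockDim, Dim threadIdx) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y);
    return Dim{ blockIdx.x * blockDim.x + threadIdx.x,
                blockIdx.y * blockDim.y + threadIdx.y };
}

// Position of that coordinate in a row-major array with `stride` elements per row.
inline unsigned linearIndex(Dim coord, unsigned stride) {
    return coord.y * stride + coord.x;
}
```

For example, thread (3, 5) of block (2, 1) in a grid of 16 x 16 blocks handles the array element at column 2 * 16 + 3 and row 1 * 16 + 5.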


Where are those executed?

Grid: GPU

Block: Multiprocessor (MP) – The GPU is a collection of MPs.

Thread: Stream Processor (SP) – Each MP is divided into SPs.


Different GPUs:

[Excerpt: NVIDIA CUDA C Programming Guide, Version 4.0, Appendix A, "CUDA-Enabled GPUs"]

Table A-1 lists all CUDA-enabled devices with their compute capability, number of multiprocessors, and number of CUDA cores. These, as well as the clock frequency and the total amount of device memory, can be queried using the runtime or driver API (see reference manual).

Table A-1. CUDA-Enabled Devices with Compute Capability, Number of Multiprocessors, and Number of CUDA Cores

Device | Compute Capability | Number of Multiprocessors | Number of CUDA Cores
GeForce GTX 560 Ti | 2.1 | 8 | 384
GeForce GTX 460 | 2.1 | 7 | 336
GeForce GTX 470M | 2.1 | 6 | 288
GeForce GTS 450, GTX 460M | 2.1 | 4 | 192
GeForce GT 445M | 2.1 | 3 | 144
GeForce GT 435M, GT 425M, GT 420M | 2.1 | 2 | 96
GeForce GT 415M | 2.1 | 1 | 48
GeForce GTX 580 | 2.0 | 16 | 512
GeForce GTX 570, GTX 480 | 2.0 | 15 | 480
GeForce GTX 470 | 2.0 | 14 | 448
GeForce GTX 465, GTX 480M | 2.0 | 11 | 352
GeForce GTX 295 | 1.3 | 2x30 | 2x240
GeForce GTX 285, GTX 280, GTX 275 | 1.3 | 30 | 240
GeForce GTX 260 | 1.3 | 24 | 192
GeForce 9800 GX2 | 1.1 | 2x16 | 2x128
GeForce GTS 250, GTS 150, 9800 GTX, 9800 GTX+, 8800 GTS 512, GTX 285M, GTX 280M | 1.1 | 16 | 128
GeForce 8800 Ultra, 8800 GTX | 1.0 | 16 | 128
GeForce 9800 GT, 8800 GT, ... | 1.1 | 14 | 112


Memory?

[Figure: Memory Hierarchy (Figure 2-2, NVIDIA CUDA C Programming Guide, Version 4.0): per-thread local memory, per-block shared memory, and global memory shared by all grids.]

Local Memory: Registers. Only accessible by the thread that owns them.

Shared Memory: Located on the MP and shared among the threads of a block. Any thread of the block has read/write access.


Global Memory (Device Memory): SDRAM chip. Any thread can read from and write to any location in device memory.


Memory? - Not enough?

Constant Memory: Read-only memory within each MP.

Texture Memory: Global memory can be ‘bound’ to a texture.

• Acts as a cache.
• Cost-free linear interpolation.
• You can set the read mode (e.g. bind a uchar and read a normalized float).
• Can be tricky, but very effective.
• You can write back to the bound memory (unsupported).


Memory Spaces

Each thread:

• Read/Write per-thread registers

• Read/Write per-thread local mem.

• Read/Write per-block shared mem.

• Read/Write per-grid global memory

• Read-only per-grid constant memory

• Read-only per-grid texture memory

• Read/Write per-grid surface mem.


CUDA Memory Types

• Global Memory (read and write)
  • Slow
  • Compute capability ≥ 2.0: cached.
  • Requires sequential & aligned 16/32 byte reads and writes to be fast (coalesced read/write).

• Texture Memory (read only – write back works but is unsupported)
  • Cache optimized for 2D spatial access pattern.

• Constant Memory
  • Constants and kernel arguments.
  • Slow (device memory), cached.

• Shared Memory (16/48 KB per MP)
  • Fast. (Take care of bank conflicts → read/write pattern)
  • Exchange data within a block.
  • Comment: Border handling is not always straightforward.

• Local Memory (everything that does not fit into registers)
  • Slow. Compute capability ≥ 2.0: cached.

• Registers
  • Fastest. Scope is thread local.


Lifetime and Scope

[Excerpt: NVIDIA CUDA C Best Practices Guide, Section 3.2, "Device Memory Spaces", including Figure 3.2, "Memory Spaces on a CUDA Device"]

Table 3.1. Salient Features of Device Memory

Memory | Location on/off chip | Cached | Access | Scope | Lifetime
Register | On | n/a | R/W | 1 thread | Thread
Local | Off | † | R/W | 1 thread | Thread
Shared | On | n/a | R/W | All threads in block | Block
Global | Off | † | R/W | All threads + host | Host allocation
Constant | Off | Yes | R | All threads + host | Host allocation
Texture | Off | Yes | R | All threads + host | Host allocation

† Cached only on devices of compute capability 2.x.


Memory Access

[Figure: decision diagram for variable qualifiers. Where is a variable accessed? Can the host access it (e.g. via cudaMemcpy)? Yes: declared outside of any function (__global__, __constant__). No: declared in the kernel (register (automatic), __shared__, __local__).]


An Example Scenario

• Load data from global memory.

• Do thread local computations.

• Store results to global memory.


Simple Example

    // kernel: simple sqr example
    __global__ void cuSqrKernel(float* device_memory,
                                const size_t stride, const int width, const int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        // guard: the grid may contain more threads than the image has pixels
        if (x >= width || y >= height)
            return;

        size_t c = y * stride + x;  // stride: row length in elements

        float reg = device_memory[c];
        float result = reg * reg;
        device_memory[c] = result;
    }

    // wrapper: simple sqr example
    IuStatus cuSqr(iu::ImageGpu_32f_C1* image)
    {
        // fragmentation
        unsigned int block_size = 32;  // 32 x 32 = 1024 threads per block (needs compute capability >= 2.0)
        dim3 dimBlock(block_size, block_size);
        // round up so that image sizes that are not a multiple of block_size are fully covered
        dim3 dimGrid((image->width()  + dimBlock.x - 1) / dimBlock.x,
                     (image->height() + dimBlock.y - 1) / dimBlock.y);

        float* device_memory = image->data();

        cuSqrKernel<<<dimGrid, dimBlock>>>(
            device_memory,
            image->stride(), image->width(), image->height());
        // (returning an IuStatus value is not shown on the slide)
    }
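To see why the grid size has to be rounded up and why threads outside the image must be skipped, the launch can be emulated on the CPU. This is a rough sketch in plain C++, not the author's code: `divUp` and `emulateSqrLaunch` are hypothetical helpers, the four nested loops stand in for the blocks and threads the GPU would run in parallel, and `stride` is assumed to be the row length in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks of size b needed to cover n elements (rounds up).
inline unsigned divUp(unsigned n, unsigned b) {
    assert(b > 0);
    return (n + b - 1) / b;
}

// CPU emulation of launching the squaring kernel with (block x block) threads
// per block. `stride` is the row length in elements (stride >= width).
inline void emulateSqrLaunch(std::vector<float>& data, std::size_t stride,
                             int width, int height, unsigned block) {
    const unsigned gridX = divUp(static_cast<unsigned>(width), block);
    const unsigned gridY = divUp(static_cast<unsigned>(height), block);
    for (unsigned by = 0; by < gridY; ++by)
        for (unsigned bx = 0; bx < gridX; ++bx)
            for (unsigned ty = 0; ty < block; ++ty)
                for (unsigned tx = 0; tx < block; ++tx) {
                    const int x = static_cast<int>(bx * block + tx);
                    const int y = static_cast<int>(by * block + ty);
                    if (x >= width || y >= height)
                        continue;  // same guard as in the kernel
                    const std::size_t c = static_cast<std::size_t>(y) * stride + x;
                    data[c] = data[c] * data[c];
                }
}
```

With a plain integer division instead of `divUp`, a 100-pixel-wide image and 32-thread-wide blocks would give 3 blocks and leave the last 4 columns unprocessed.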

Manuel (ICG, TU-Graz) CUDA 22.7.2011 23 / 47

Page 29: CUDA - A Very Short Intro - Werlberger · Graz University of Technology Introduction The GPU: CUDA Architecture Real-World Example Tricks (?)End CUDA - A Very Short Intro Manuel Werlberger

Graz University of Technology

Introduction The GPU: CUDA Architecture Real-World Example Tricks (?) End

Simple Example - Memory?

    // alloc
    cudaError_t status;
    float* buffer = 0;
    size_t pitch = 0;  // length of one (padded) row in bytes
    status = cudaMallocPitch((void**)&buffer, &pitch,
                             width * sizeof(float), height);

    if (status == cudaSuccess)
        printf("hurray everything worked\n");
    else
        printf("problem.... out of memory?\n");

    // free
    status = cudaFree((void*)buffer);
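Because the pitch is a row length in bytes, element addresses in such a buffer have to be computed on a byte pointer. The sketch below shows that arithmetic in plain C++; `pitchedElement` and `strideInElements` are hypothetical helper names, and the code assumes the pitch is a multiple of `sizeof(float)`.

```cpp
#include <cassert>
#include <cstddef>

// Address of element (x, y) in a pitched 2D float buffer.
// `pitch` is the row length in BYTES, so the row offset is applied to a char pointer.
inline float* pitchedElement(float* base, std::size_t pitch, int x, int y) {
    assert(pitch % sizeof(float) == 0);  // assumption of this sketch
    char* row = reinterpret_cast<char*>(base) + static_cast<std::size_t>(y) * pitch;
    return reinterpret_cast<float*>(row) + x;
}

// Row length in elements, usable as `stride` in an index of the form y * stride + x.
inline std::size_t strideInElements(std::size_t pitch) {
    return pitch / sizeof(float);
}
```

Under that assumption, the value returned by `strideInElements` is what a kernel indexing with `y * stride + x` would expect as its stride argument.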


Simple Example - How to get data onto GPU?

    // copy host to device
    /*
       PREREQUISITES:
       float* host_memory
       and
       float* device_memory
       of size width, height
       and row length pitch [bytes]
     */

    // copy host -> device
    cudaError_t status;
    status = cudaMemcpy2D(device_memory, dev_pitch,
                          host_memory, host_pitch,
                          width * sizeof(float), height,
                          cudaMemcpyHostToDevice);
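What a pitched 2D copy does with its two pitches and the width in bytes can be modelled on the CPU. The following plain C++ function is a sketch of the semantics only, not of how the CUDA runtime implements the transfer; `copy2D` is a hypothetical name, and its parameter order simply mirrors the `cudaMemcpy2D` call above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// CPU model of a pitched 2D copy: `height` rows of `widthBytes` bytes each,
// where consecutive rows start `srcPitch` / `dstPitch` bytes apart.
inline void copy2D(void* dst, std::size_t dstPitch,
                   const void* src, std::size_t srcPitch,
                   std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(d + y * dstPitch, s + y * srcPitch, widthBytes);
}
```

Only the first `widthBytes` bytes of each row are transferred, so the padding at the end of each destination row is left untouched.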


(Self-Promotion)

If you just want to use CUDA and don’t want to bother too much with all of this?
→ ImageUtilities:

• Simplifies a lot of things.

• Modular (core, math, filter, I/O, GUI, Matlab/IPP connector, ...).

• Open source (LGPL).

• Bad: documentation (coming next).

• BUT: If you want to use CUDA really efficiently, it is very important to know how each memory type works and what it is for.

• Currently: http://gitorious.org/imageutilities
  → will soon move to Google Code (with install instructions and some documentation).


ImageUtilities Example (e.g. I/O Module)

iu::ImageGpu_32f_C1* image = iu::imread_cu32f_C1("im.png");
iu::ImageGpu_32f_C1* edges =
    new iu::ImageGpu_32f_C1(image->size());

iu::filterEdge(image, edges, image->roi());

iu::imshow(image, "show input device image");
iu::imshow(edges, "show edge image");
iu::imsave(edges, "edges.png");

// call kernel?
cuSqrKernel<<<dimGrid, dimBlock>>>(
    edges->data(), edges->stride(),
    edges->width(), edges->height());
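The launch above needs dimGrid and dimBlock, which the slide does not define. A common convention (an assumption here, not taken from the slides) is to pick a fixed block size and round the grid up so that it covers the whole image. Dim2, divUp and gridFor are hypothetical names; Dim2 merely stands in for CUDA's dim3 so the sketch compiles as plain C++.

```cpp
// Hypothetical 2D extent, standing in for CUDA's dim3 in this host-only sketch.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up: number of blocks of size `block` needed to cover n.
constexpr unsigned divUp(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}

// Grid size that covers a width x height image with blocks of the given size.
constexpr Dim2 gridFor(unsigned width, unsigned height, Dim2 block)
{
    return Dim2{divUp(width, block.x), divUp(height, block.y)};
}
```

Because the grid is rounded up, some threads fall outside the image when the size is not a multiple of the block size; the kernel has to check x < width and y < height before touching memory.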

CUDA Tools

For CUDA ≥ 4.0:

• Thrust: STL for CUDA (actually, nvcc is (was?) a bit picky about real STL stuff)

• NPP: Some basic image processing functionality.

How can we make use of that?

ROF Denoising

TV-L2 (ROF) Model

$$\min_u \int_\Omega |\nabla u| + \frac{\lambda}{2}(u - f)^2 \, dx$$

Primal-dual formulation

$$\min_u \max_p \; -\langle u, \operatorname{div} p \rangle + \frac{\lambda}{2}\|u - f\|_2^2 - \delta_P(p)$$

Update equations:

• $p^{n+1} = \Pi_B\left[p^n + \sigma \nabla(2u^n - u^{n-1})\right]$

• $u^{n+1} = \left(u^n + \tau(\operatorname{div} p^{n+1} + \lambda f)\right) / (1 + \tau\lambda)$

Matlab

% u_ holds the leading point 2*u^n - u^(n-1); initialise u_ = u before the loop
for i = 1:num_iter
  % dual update (uses the leading point u_, as in the update equations)
  p(:,:,1) = p(:,:,1) + sigma*dxp(u_);
  p(:,:,2) = p(:,:,2) + sigma*dyp(u_);

  reprojection = max(1.0, sqrt(p(:,:,1).^2 + p(:,:,2).^2));
  p(:,:,1) = p(:,:,1)./reprojection;
  p(:,:,2) = p(:,:,2)./reprojection;

  % primal update
  u_ = u;
  div = dxm(p(:,:,1)) + dym(p(:,:,2));
  u = (u + tau*(div + lambda*f))./(1 + tau*lambda);

  % leading-point step
  u_ = 2*u - u_;
end
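Before porting the loop to a kernel, a plain CPU reference is handy for comparing results. The following C++ sketch mirrors the Matlab loop under stated assumptions: dxp/dyp are taken to be forward differences that are zero at the far border, and dxm/dym the matching backward differences (the slides do not define them). Image and rofDenoise are hypothetical names, and the dual step uses the leading point as in the update equations.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal dense float image (hypothetical helper type).
struct Image {
    int w, h;
    std::vector<float> v;
    Image(int w_, int h_, float init = 0.0f)
        : w(w_), h(h_), v(static_cast<std::size_t>(w_) * h_, init) {}
    float& at(int x, int y) { return v[static_cast<std::size_t>(y) * w + x]; }
    float at(int x, int y) const { return v[static_cast<std::size_t>(y) * w + x]; }
};

// CPU reference of the primal-dual ROF iteration; returns the denoised image.
// Step sizes are commonly chosen such that tau * sigma * 8 <= 1.
Image rofDenoise(const Image& f, float lambda, float tau, float sigma, int num_iter)
{
    Image u = f, u_bar = f, px(f.w, f.h), py(f.w, f.h);
    for (int it = 0; it < num_iter; ++it) {
        // Dual update: ascent step on p, then reprojection onto |p| <= 1.
        for (int y = 0; y < f.h; ++y) {
            for (int x = 0; x < f.w; ++x) {
                const float gx = (x + 1 < f.w) ? u_bar.at(x + 1, y) - u_bar.at(x, y) : 0.0f;
                const float gy = (y + 1 < f.h) ? u_bar.at(x, y + 1) - u_bar.at(x, y) : 0.0f;
                const float qx = px.at(x, y) + sigma * gx;
                const float qy = py.at(x, y) + sigma * gy;
                const float norm = std::max(1.0f, std::sqrt(qx * qx + qy * qy));
                px.at(x, y) = qx / norm;
                py.at(x, y) = qy / norm;
            }
        }
        // Primal update: divergence via backward differences.
        const Image u_old = u;
        for (int y = 0; y < f.h; ++y) {
            for (int x = 0; x < f.w; ++x) {
                float div = 0.0f;
                if (x + 1 < f.w) div += px.at(x, y);
                if (x > 0)       div -= px.at(x - 1, y);
                if (y + 1 < f.h) div += py.at(x, y);
                if (y > 0)       div -= py.at(x, y - 1);
                u.at(x, y) = (u.at(x, y) + tau * (div + lambda * f.at(x, y)))
                             / (1.0f + tau * lambda);
            }
        }
        // Leading-point step: u_bar = 2 * u_new - u_old.
        for (std::size_t i = 0; i < u.v.size(); ++i)
            u_bar.v[i] = 2.0f * u.v[i] - u_old.v[i];
    }
    return u;
}
```

Every pixel is updated independently within each of the three passes, which is what makes the scheme map so naturally onto one thread per pixel.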

CUDA

Tricks? . . .

. . . might be only some practice that we got used to . . .

Memory again . . .

• Reduce reads from and writes to global memory as much as you can.

• Use aligned memory:
  • Recall: reads and writes need to be sequential and aligned (16/32 bytes) to be fast.
  • → coalesced read/write
  • Memory allocated with cudaMallocPitch is aligned correctly.
  • Pitched memory might be a bit confusing at the beginning → one gets used to it.
  • Tip: differentiate between the number of elements in a row (stride) and the number of bytes in a row (pitch), and stick to that notation.

Memory again . . .

• Reduce memory transfer between host and device.

• Sometimes it is better to keep memory on the device and compute everything there (even if some routines are no faster than on the CPU, because the memory transfer would slow everything down).

• Keep the memory on the GPU!

• In the early days of CUDA: every new GPU generation → twice as many processors → programs ran twice as fast as before. (Now memory transfers are often the bottleneck.)

Multiple GPUs?

• Note for multi-GPU users:
  • Until CUDA 4.0, memory transfer from one GPU to the other was done via host memory.
  • Since CUDA 4.0 → unified address space (Unified Virtual Addressing).
  • All GPUs can access all the combined memory.
  • Still, a read/write to ‘off-card’ memory is slow because it goes over the PCIe bus.
  • (If you meet anyone working at NVIDIA: ask them every time why SLI is only used for gaming . . . ;-) )

Texture memory

• Correct/Supported: Read-only cudaArray is bound to texture.

• Correct/Supported: Read-only cudaArray is bound to surface.

• For CUDA ≥ 4.0: Write support for surfaces.

• BUT: still cudaArray is read only.

• Solution?
  • For CUDA ≥ 3.?: Possibility to bind pitched memory to texture.
  • Texture read-only.
  • BUT: one can write to pitched memory (that is bound to texture).
  • → working well (not officially supported)
  • You should know what you are doing . . .

Interpolation and Border Handling

• If you need any of the two:

• Use textures.

• Linear interpolation at no cost.

• This is actually something GPUs are built for.

• Hopefully we will see more features coming over from DirectX/OpenGL, such as tessellation.
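To see what the texture unit saves you from writing, here is a host-side C++ sketch of a bilinear lookup with clamp-to-edge border handling. It is only a reference for the idea, not the hardware's exact behaviour: the sketch assumes integer coordinates sit on pixel centres and uses full float precision, whereas the texture hardware uses its own coordinate convention and reduced-precision weights. clampIndex and sampleBilinear are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Clamp-to-edge border handling: out-of-range indices map to the nearest valid pixel.
inline int clampIndex(int i, int n)
{
    return std::min(std::max(i, 0), n - 1);
}

// Bilinear lookup at the continuous position (x, y) in a densely packed image.
// Assumption: integer coordinates coincide with pixel centres.
float sampleBilinear(const std::vector<float>& img, int width, int height,
                     float x, float y)
{
    const float fx = std::floor(x);
    const float fy = std::floor(y);
    const int x0 = static_cast<int>(fx);
    const int y0 = static_cast<int>(fy);
    const float ax = x - fx;  // horizontal interpolation weight in [0, 1)
    const float ay = y - fy;  // vertical interpolation weight in [0, 1)

    // Pixel fetch with border handling applied to both coordinates.
    auto px = [&](int xi, int yi) {
        return img[clampIndex(yi, height) * width + clampIndex(xi, width)];
    };

    const float top    = (1.0f - ax) * px(x0, y0)     + ax * px(x0 + 1, y0);
    const float bottom = (1.0f - ax) * px(x0, y0 + 1) + ax * px(x0 + 1, y0 + 1);
    return (1.0f - ay) * top + ay * bottom;
}
```

In a kernel, the four fetches, the clamping and the weighting collapse into a single texture read, which is why interpolation and border handling are close to free there.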

Program structure

• Keep it simple. (Déjà vu from any programming course?)

• Modularity is not that simple because CUDA lacks some important C++ features.

• If you are building simple libraries, take device memory as input and leave the memory management to the user.

Complex Kernels

• If you run out of registers, avoid local memory as much as you can.

• Use shared memory instead:

__shared__ float u_shared[BLOCKSIZE_X][BLOCKSIZE_Y];
u_shared[threadIdx.x][threadIdx.y] = u;  // u: this thread's value, to be shared within the block
__syncthreads();

• __syncthreads() acts as a thread barrier: execution stops until every thread within the current block reaches this position.

Border Handling for shared memory?

// Image coordinate
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Define arrays for shared memory (one extra row and column for the border);
// indexed as [ty][tx], so the row count comes first
__shared__ float u_shared[BLOCK_SIZE_Y + 1][BLOCK_SIZE_X + 1];

// load data into shared memory
// (c is the linear index of pixel (x, y) in u_global; its definition is not
//  shown on the slide, presumably c = y*p + x with p the row stride in elements)
u_shared[ty][tx] = u_global[c];

__syncthreads();

// right neighbour: replicate at the image border, fetch from global memory at the block border
if (x >= width - 1)
    u_shared[ty][tx + 1] = u_shared[ty][tx];
else if (tx == BLOCK_SIZE_X - 1)
    u_shared[ty][tx + 1] = u_global[c + 1];

// lower neighbour: same scheme, p is the row stride in elements
if (y >= height - 1)
    u_shared[ty + 1][tx] = u_shared[ty][tx];
else if (ty == BLOCK_SIZE_Y - 1)
    u_shared[ty + 1][tx] = u_global[c + p];

// forward differences
float u_x = u_shared[ty][tx + 1] - u_shared[ty][tx];
float u_y = u_shared[ty + 1][tx] - u_shared[ty][tx];
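Border code like this is easy to get subtly wrong, so it helps to compare the kernel output against a straightforward host reference. The C++ sketch below computes the same forward differences with a replicated border (so the difference is zero in the last column and last row). Gradient and forwardDifferences are hypothetical names, and the input is assumed to be a host copy of the image with a row stride given in elements.

```cpp
#include <cstddef>
#include <vector>

// Densely packed x and y forward differences (hypothetical result type).
struct Gradient {
    std::vector<float> ux;
    std::vector<float> uy;
};

// Host reference: forward differences with replicated border.
// `stride` is the number of elements per input row (may exceed width).
Gradient forwardDifferences(const std::vector<float>& u,
                            int width, int height, int stride)
{
    const std::size_t n = static_cast<std::size_t>(width) * height;
    Gradient g{std::vector<float>(n), std::vector<float>(n)};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int c = y * stride + x;  // linear index into the padded input
            // At the image border the neighbour is replaced by the pixel itself.
            const float right = (x >= width - 1)  ? u[c] : u[c + 1];
            const float below = (y >= height - 1) ? u[c] : u[c + stride];
            g.ux[static_cast<std::size_t>(y) * width + x] = right - u[c];
            g.uy[static_cast<std::size_t>(y) * width + x] = below - u[c];
        }
    }
    return g;
}
```

Copying u_x and u_y back from the device once and comparing them with this reference on a small image catches most indexing mistakes at block and image borders.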

CUDA Profiler

What about Debugging?

• cuda-gdb available for Linux and MacOS.

• cuda-memcheck.

• For Windows: NVIDIA Parallel Nsight.

• How do I debug? Display intermediate results and hope everything works fine.

• Restart from time to time. (This got much better.)

• Suggestion for heavy users: Use 2 GPUs (1 display, 1 computing card)

Thank you very much for your attention.

Discussion. . .
