GPU Architectures and Execution

Mark Greenstreet

CpSc 418 – October 30, 2020

Unless otherwise noted or cited, these slides are copyright 2020 by Mark Greenstreet and are made available under the terms of the Creative Commons Attribution 4.0 International license: http://creativecommons.org/licenses/by/4.0/

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 1 / 32


Outline

GPU architecture
- The big picture
- Multiple pipelines
- No pipeline bypasses
- Lots of threads
- Watch out for memory bottlenecks!

Coding examples (if we have time).
- Program structure: slide 17
- Memory: slide 19
- A simple example: slide 20
- Launching kernels: slide 27

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 2 / 32


A 2018 GPU

[Figure: GPU chip layout – dozens of SMs surrounding a shared L2 cache, with GDDR memory interfaces around the edge of the chip.]

Each SM is a SIMD pipeline as shown on the next slide.
- SIMD = Single-Instruction, Multiple-Data

A 2018 GPU (pretty similar today):
- 80 SMs
- 64 pipelines per SM
- HBM2 memory – 32 GBytes, 900 GBytes/sec

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 3 / 32


A “Streaming Multiprocessor”

[Figure: SM block diagram – a single front end (PC, I$, decode, control, and warp scheduler) issues instructions to many deep (20–30 stage) execution pipelines, each with lots of registers, connected to banks of shared memory through a crossbar switch.]

Each instruction is executed in parallel by all of the pipelines of the SM.
We'll review pipelined processors and then return to how pipelines are implemented on a GPU.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 4 / 32


Matrix Multiplication: Dependencies

for(k=0; k < L; k++)
    sum += a[i,k]*b[k,j];

[Figure: dependency graph for two consecutive iterations of the loop body – nodes ld a[i,k], ld b[k,j], *, +=, k++, and the loop test <, with RAW, WAR, and control edges between them.]

Matrix multiplication – inner loop.
Dependencies:
- →: RAW
- →: WAR
- →: CTRL
- Not showing dependencies that are dominated by the ones drawn.
  - E.g. not showing the WAW from k++ to k++ in the next iteration.

Maximum speed-up = 2
- Assuming a simple architecture: no renaming or branch prediction.
- This is the GPU approach: simple ⇒ less energy per operation ⇒ higher throughput per GPU.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 5 / 32


Matrix Multiplication on a RISC Pipeline

loop: ld a[i,k]
      ld b[k,j]
      k++
      *
      b< loop
      +=        # delay

[Figure: pipeline diagram (IF, DEC, ALU, MEM, WB) with the loop instructions in flight; bypass paths forward a and b from the loads, k from k++, and a*b from the multiply so that dependent instructions need not stall.]

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 6 / 32


Data Parallelism

for(j=0; j < M; j++)
    for(k=0; k < L; k++)
        sum += a[i,k]*b[k,j];

[Figure: the inner-loop dependency graph replicated for j=0, j=1, j=2 – the b[k,j] loads differ, but there are no dependencies between the copies.]

The iterations of the j-loop are independent.
Maximum speed-up = 2M.
Even more speed-up if we exploit the i-loop in the same way.
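As an illustration, here is a minimal CUDA sketch of exploiting both loops this way: each thread handles one (i, j) pair and runs the k-loop for that element. The kernel name, the row-major indexing, and the dimensions N, L, M are made up for illustration.

__global__ void matmul_kernel(uint N, uint L, uint M,
                              const float *a, const float *b, float *c) {
    uint i = blockIdx.y*blockDim.y + threadIdx.y;   // row: one copy of the i-loop body
    uint j = blockIdx.x*blockDim.x + threadIdx.x;   // column: one copy of the j-loop body
    if(i < N && j < M) {
        float sum = 0.0f;
        for(uint k = 0; k < L; k++)                 // the sequential k-loop stays inside the thread
            sum += a[i*L + k] * b[k*M + j];         // a is N x L, b is L x M, row-major
        c[i*M + j] = sum;                           // c is N x M, row-major
    }
}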

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 7 / 32

SIMD Execution

loop: ld a[i,k]
      ld b[k,j]
      k++
      *
      b< loop
      +=        # delay

[Figure: step-by-step animation of a SIMD pipeline – one shared IF and DEC stage feeding four execution lanes (j=0..3) through ALU, MEM, and WB; every loop instruction is fetched and decoded once, then executed by all four lanes in lockstep.]

SIMD = Single Instruction, Multiple Data
All execution pipelines (i.e. ALU through WB) share the same IF and DEC stages.
Bypasses require fast data routing ⇒ bypasses are energy hogs.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 8 / 32


Multithreading: Registers

[Figure: one large physical register file ($0–$f); threads 0–3 each own a slice of it holding their own copies of k, a, b, and sum.]

Large, physical register file.
Each thread uses a portion of the registers.
- E.g. the thread-id selects an offset into the register file.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 9 / 32

Multithreaded Execution on a RISC

loop: ld a[i,k]
      ld b[k,j]
      k++
      *
      b< loop
      +=

[Figure: step-by-step animation of the pipeline (IF, DEC, ALU, MEM, WB) – each cycle the IF stage fetches for a different thread (thread 0, 1, 2, 3, then 0 again); a.0, a.1, ... denote ld a[i,k] issued for threads 0, 1, ....]

No need for bypasses – just use more threads.
No need for delay slots or branch prediction – just use more threads.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 10 / 32


Registers on an SM

[Figure: the SM register file partitioned by SP (SP=0..3) and by warp (warp0..warp3); each thread again owns registers ($0–$f) holding its own copies of k, a, b, and sum.]

ThreadId = WarpId × (threads per warp) + ThreadPos
- ThreadPos is the thread's position within a warp.
- For simplicity, we will always assume threads per warp = #SPs per SM.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 11 / 32


What’s a Warp?

Warp is a term from weaving – a warp is an arrangement of threads.

Figure from http://commons.wikimedia.org/wiki/User:Ryj

In CUDA, a warp is the programmer's abstraction of the underlying architecture, where a group of threads execute in lockstep, in parallel, on the SPs.
The architects may perform optimizations so that the number of SPs in an SM is not necessarily the same as the number of threads in a warp.
However, if you think of warps as being threads that execute together across the SPs, you will have a very good model for writing CUDA code.
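For example (a small sketch with a made-up helper name), a thread in a 1-D block can compute which warp it belongs to, and its position within that warp, from the CUDA built-ins threadIdx and warpSize:

__device__ void warp_and_lane(int *warp, int *lane) {
    *warp = threadIdx.x / warpSize;   // which warp of the block this thread is in
    *lane = threadIdx.x % warpSize;   // its position within that warp (warpSize is 32 on current GPUs)
}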

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 12 / 32

Multithreaded+SIMD Execution on a GPU

loop: ld a[i,k]
      ld b[k,j]
      k++
      *
      b< loop
      +=

[Figure: step-by-step animation of the SM pipeline (IF, DEC, ALU, MEM, WB) with four SPs (SP=0..3) – each cycle the front end fetches the next instruction for some warp, and all four SPs execute it in lockstep; a.1, a.2, ... denote ld a[i,k] issued for warps 1, 2, ....]

Instruction a.0 means execute ld a[i,k] for warp 0.
Data parallelism:
    i = threadIdx.y;
    j = threadIdx.x;
    RawIdx = M*i + j;   // = WarpId + SP
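A small sketch of that index mapping in actual CUDA, assuming a single 2-D thread block with M columns and row-major data; the kernel and array names are made up:

__global__ void touch(uint M, float *data) {
    uint i = threadIdx.y;          // which row this thread handles
    uint j = threadIdx.x;          // which column this thread handles
    uint rawIdx = M*i + j;         // threads of one warp have consecutive j, hence consecutive rawIdx
    data[rawIdx] = (float)rawIdx;  // so they access consecutive memory locations
}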

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 13 / 32


A GPU “Streaming Multiprocessor”

[Figure: SM block diagram (repeated from slide 4) – shared PC, I$, decode, control, and warp scheduler feeding many deep (20–30 stage) pipelines, each with lots of registers, connected to shared memory banks through a crossbar switch.]

No pipeline bypasses
- The pipelines are multi-threaded.
- A group of threads that execute one-per-pipeline is called a "warp".
- The warp scheduler determines which instruction to dispatch next.

Caches, branches, and other issues
- We'll handle them in upcoming lectures.
- It all comes back to keeping the architecture simple and using more threads.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 14 / 32


And there’s more

[Figure: the GPU chip layout from slide 3 – dozens of SMs surrounding a shared L2 cache, with memory interfaces around the edge of the chip.]

Each SM is a SIMD pipeline as shown on the previous slide.
- SIMD = Single-Instruction, Multiple-Data

A latest-and-greatest GPU today has:
- 80 SMs
- 64 pipelines per SM
- HBM2 memory – 32 GBytes, 900 GBytes/sec

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 15 / 32


CUDA – the programmer's view
Threads, warps, blocks, and CGMA – oh my!

How does the programmer cope with SIMD?
- Lots of threads – each thread runs on a separate pipeline.
- A group of threads that execute together, one on each pipeline of a SIMD core, is called "a warp".

How does the programmer cope with long pipeline latencies, ∼30 cycles?
- Lots of threads – interleave threads so that other threads dispatch instructions while waiting for the result of the current instruction.
- Note that the need for threads to use multiple pipelines and the need for threads to hide pipeline latency are multiplicative.
- CUDA programs have thousands of threads.

How does the programmer use many SIMD cores?
- Multiple blocks of threads.
- Why are threads partitioned into blocks?
  - Threads in the same block can synchronize and communicate easily – they are running on the same SIMD core (see the sketch after this list).
  - Threads in different blocks cannot communicate with each other.
  - There is some relaxation of this constraint in the latest GPUs.
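A minimal sketch of within-block communication, using __shared__ memory and __syncthreads(); the kernel and its block-reversal task are invented for illustration, and it assumes n is a multiple of the block size (256).

__global__ void reverse_each_block(float *d, uint n) {
    __shared__ float tile[256];                    // visible to all threads of this block
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n)
        tile[threadIdx.x] = d[i];                  // every thread writes its own element
    __syncthreads();                               // wait until the whole block has written
    if(i < n)
        d[i] = tile[blockDim.x - 1 - threadIdx.x]; // then read an element written by another thread
}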

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 16 / 32


CUDA Program Structure

A CUDA program consists of three kinds of functions:
- Host functions:
  - callable from code running on the host, but not the GPU;
  - run on the host CPU;
  - in CUDA C, these look like normal functions – they can be preceded by the __host__ qualifier.
- Device functions:
  - callable from code running on the GPU, but not the host;
  - run on the GPU;
  - in CUDA C, these are declared with a __device__ qualifier.
- Global functions:
  - called by code running on the host CPU;
  - they execute on the GPU;
  - in CUDA C, these are declared with a __global__ qualifier.
(A short example follows this list.)
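A short illustration of the three kinds of functions; the function names are arbitrary:

__host__ void host_only(void) { }                    // ordinary CPU function (the qualifier is optional)

__device__ float square(float x) { return x*x; }     // callable only from code running on the GPU

__global__ void apply_square(float *v, uint n) {     // called from the host, executes on the GPU
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n)
        v[i] = square(v[i]);
}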

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 17 / 32


Structure of a simple CUDA program

A __global__ function, called by the host program, to execute on the GPU.
- There may be one or more __device__ functions as well.

One or more host functions, including main, to run on the host CPU.
- Allocate device memory.
- Copy data from host memory to device memory.
- "Launch" the device kernel by calling the __global__ function.
- Copy the result from device memory to host memory.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 18 / 32


Execution Model: Memory

[Figure: host side – CPU, caches, and DDR host memory; device side – GPU and its off-chip GDDR "global" memory.]

Host memory: DRAM and the CPU's caches
- Accessible to the host CPU but not to the GPU.

Device memory: GDDR DRAM on the graphics card.
- Accessible by the GPU.
- The host can initiate transfers between host memory and device memory.

The CUDA library includes functions to (sketched below):
- Allocate and free device memory.
- Copy blocks between host and device memory.
- BUT host code can't read or write the device memory directly.
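A minimal host-side sketch of those library calls; the buffer size and names are arbitrary:

#include <cuda_runtime.h>

void round_trip(void) {
    float host_buf[1024];                                         // host memory: visible to the CPU only
    float *dev_buf;                                               // will point into device ("global") memory
    size_t size = sizeof(host_buf);
    cudaMalloc((void**)&dev_buf, size);                           // allocate device memory
    cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);  // host -> device copy
    /* ... launch kernels that read and write dev_buf ... */
    cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);  // device -> host copy
    cudaFree(dev_buf);                                            // free device memory
}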

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 19 / 32


Example: saxpy

saxpy = "Scalar a times x plus y".
- The device code.
- The host code.
- Running saxpy.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 20 / 32


saxpy: device code

__global__ void saxpy(uint n, float a, float *x, float *y) {
    uint i = blockIdx.x*blockDim.x + threadIdx.x; // nvcc built-ins
    if(i < n)
        y[i] = a*x[i] + y[i];
}

Each thread has x, y, and z indices.
- We'll just use x for this simple example.

Note that we are creating one thread per vector element:
- Exploits GPU hardware support for multithreading.
- We need to keep in mind that there is a large, but limited, number of threads available (a variant that handles this is sketched below).
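If n could exceed the number of threads we can launch, a common alternative (not shown in these slides) is a grid-stride loop; a sketch with the same signature:

__global__ void saxpy_strided(uint n, float a, float *x, float *y) {
    uint stride = blockDim.x * gridDim.x;               // total number of threads in the launch
    for(uint i = blockIdx.x*blockDim.x + threadIdx.x;   // each thread starts at its global index ...
        i < n; i += stride)                             // ... and strides over the rest of the vector
        y[i] = a*x[i] + y[i];
}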

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 21 / 32


saxpy: host code (part 1 of 5)

int main(int argc, char **argv) {
    uint n = atoi(argv[1]);
    float *x, *y, *yy;
    float *dev_x, *dev_y;
    int size = n*sizeof(float);
    x  = (float *)malloc(size);
    y  = (float *)malloc(size);
    yy = (float *)malloc(size);
    for(int i = 0; i < n; i++) {
        x[i] = i;
        y[i] = i*i;
    }
    ...
}

Declare variables for the arrays on the host and device.
Allocate and initialize values in the host arrays.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 22 / 32


saxpy: host code (part 2 of 5)

int main(void) {
    ...
    cudaMalloc((void**)(&dev_x), size);
    cudaMalloc((void**)(&dev_y), size);
    cudaMemcpy(dev_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, size, cudaMemcpyHostToDevice);
    ...
}

Allocate arrays on the device.
Copy data from host to device.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 23 / 32


saxpy: host code (part 3 of 5)

int main(void) {
    ...
    float a = 3.0;
    saxpy<<<ceil(n/256.0),256>>>(n, a, dev_x, dev_y);
    cudaMemcpy(yy, dev_y, size, cudaMemcpyDeviceToHost);
    ...
}

Invoke the code on the GPU:
- saxpy<<<ceil(n/256.0),256>>>(...) says to create ⌈n/256⌉ blocks of threads.
- Each block consists of 256 threads.
- See slide 28 for an explanation of threads and blocks.
- The pointers to the arrays (in device memory) and the values of n and a are passed to the threads.

Copy the result back to the host.
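For example, with the n = 1000 used on slide 29, ceil(1000/256.0) = 4, so the launch creates 4 blocks × 256 threads = 1024 threads; the last 24 threads have i ≥ n and do nothing because of the if(i < n) test in the kernel.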

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 24 / 32


saxpy: host code (part 4 of 5)

    ...
    for(int i = 0; i < n; i++) {  // check the result
        if(yy[i] != a*x[i] + y[i]) {
            fprintf(stderr, "ERROR: i=%d, x[i]=%f, y[i]=%f, yy[i]=%f\n",
                    i, x[i], y[i], yy[i]);
            exit(-1);
        }
    }
    printf("The results match!\n");
    ...
}

Check the results.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 25 / 32


saxpy: host code (part 5 of 5)

int main(void) {
    ...
    free(x);
    free(y);
    free(yy);
    cudaFree(dev_x);
    cudaFree(dev_y);
    exit(0);
}

Clean up. We're done.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 26 / 32


Launching Kernels

Terminology
- Data-parallel code that runs on the GPU is called a kernel.
- Invoking a GPU kernel is called launching the kernel.

How to launch a kernel
- The host CPU invokes a __global__ function.
- The invocation needs to specify how many threads to create.
- Example: saxpy<<<ceil(n/256.0),256>>>(...) creates ⌈n/256⌉ blocks with 256 threads each (a 2-D variant is sketched below).
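For 2-D problems, the launch configuration can also be given as dim3 values; a hypothetical sketch, reusing the matmul_kernel sketched earlier (the dimensions and device pointers are assumed):

void launch_matmul(uint N, uint L, uint M, float *dev_a, float *dev_b, float *dev_c) {
    dim3 threadsPerBlock(16, 16);                                   // 256 threads per block, arranged 16 x 16
    dim3 numBlocks((M + threadsPerBlock.x - 1)/threadsPerBlock.x,   // enough blocks to cover all M columns
                   (N + threadsPerBlock.y - 1)/threadsPerBlock.y);  // ... and all N rows
    matmul_kernel<<<numBlocks, threadsPerBlock>>>(N, L, M, dev_a, dev_b, dev_c);
}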

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 27 / 32


Threads and Blocks
The GPU hardware combines threads into warps.

- Warps are an aspect of the hardware.
- All of the threads of a warp execute together – this is the SIMD part.
- The functionality of a program doesn't depend on the warp details.
- But understanding warps is critical for getting good performance.

Each warp has a "next instruction" pending execution.
- If the dependencies for the next instruction are resolved, it can execute for all threads of the warp.
- The hardware in each streaming multiprocessor dispatches an instruction each clock cycle if a ready instruction is available.
- The GPU in lin25 supports 32 such warps of 32 threads each in a "thread block."

What if our application needs more threads?
- Threads are grouped into "thread blocks".
- Each thread block has up to 1024 threads (the HW limit – these limits can be queried from host code, see the sketch below).
- The GPU dispatches thread blocks to SMs when an SM has sufficient resources.
  - This is GPU system software that we don't see as user-level programmers.
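A small host-side sketch (not from the slides) that queries these limits with the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                             // properties of GPU 0
    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}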

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 28 / 32


Compiling and running
Do this on a linXX.students.cs.ubc.ca machine:

lin25$ nvcc saxpy.cu -o saxpy
lin25$ ./saxpy 1000
The results match!

Where is nvcc?
- /cs/local/lib/pkg/cudatoolkit-11.0.3/bin/nvcc
- Other CUDA tools are there as well. You'll probably want to add that directory to your $PATH shell variable.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 29 / 32


But is it fast?

For the saxpy example as written here, not really.
- Execution time is dominated by the memory copies.

But it shows the main pieces of a CUDA program.

To get good performance:
- We need to perform many operations for each value copied between memories.
- We need to perform many operations in the GPU for each access to global memory.
- We need enough threads to keep the GPU cores busy.
- We need to watch out for thread divergence (sketched below):
  - If different threads execute different paths on an if-then-else,
  - then the else-threads stall while the then-threads execute, and vice-versa.
- And many other constraints.

GPUs are great if your problem matches the architecture.
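A tiny made-up kernel illustrating the divergence issue (the even/odd split is arbitrary):

__global__ void divergent(float *v, uint n) {
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        if(i % 2 == 0)            // even- and odd-numbered threads of the same warp take different paths,
            v[i] = 2.0f * v[i];   // so the hardware runs the two paths one after the other,
        else                      // not at the same time
            v[i] = v[i] + 1.0f;
    }
}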

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 30 / 32


Preview

So far, we’ve covered Kirk & Hwu chapters 1-2.

Coming up:
Nov. 2: GPUs and Multithreading
- Reading: Kirk & Hwu, Chapter 3.
- Homework: Homework 3 goes out (finally!)
Nov. 4: Threads, Blocks, and Grids
Nov. 6: GPU Memory (part 1)
- Reading: Kirk & Hwu, Chapter 4
Nov. 9: GPU Memory (part 2)

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 31 / 32


Review

What is SIMD execution?
Why don't GPUs have pipeline bypasses?
Why do CUDA programs use so many threads?
Think of a modification to the saxpy program and try it.
- You'll probably find you're missing programming features for many things you'd like to try.
- What do you need?
- Stay tuned for upcoming lectures.

Greenstreet GPU arch & exec CpSc 418 – Oct. 30, 2020 32 / 32