Lecture 5: HW1 Discussion, Intro to GPUs
G63.2011.002/G22.2945.001 · October 5, 2010
Outline

Discuss HW1

Intro to GPU Computing
Dense Matrix Multiply: Blocking vs Scalar
We provided a blocked example matrix multiplication code. Why is blocked matmul faster than un-blocked?

Key: Computational Intensity

Definition: flops per FPN (floating-point number) moved up the memory hierarchy. For example, a dot product performs 2N flops while moving 2N FPNs: intensity about 1.

Large intensity: good for deep memory hierarchies.
Computational Intensity for Scalar Matmul

Floating point operations: 2N^3

Assume: Size(L1) << N^2 FPNs

  N^2    read each row of A once
+ N^3    read each column of B, N times
+ 2N^2   read/write C

= N^3 + 3N^2 FPN-size cache misses
(neglecting cache lines, etc.)

Computational intensity: 2N^3 / (N^3 + 3N^2) ≈ 2
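For concreteness, a minimal sketch of the un-blocked triple loop analyzed above (row-major storage and the function shape are illustrative assumptions, not the provided HW1 code):

/* Scalar matmul sketch: C += A*B for N x N row-major matrices.
 * 2*N^3 flops; the inner loop walks B with stride N, which is what
 * drives the N^3 cache-miss term above. */
void matmul_scalar(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
        {
            double sum = 0;
            for (int k = 0; k < N; ++k)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] += sum;
        }
}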
Computational Intensity for Blocked Matmul

Floating point operations: still 2N^3

b: block size, n = ⌈N/b⌉

  b^2 n^3   load a b × b block of A on each of the n^3 block iterations
+ b^2 n^3   same for B
+ 2N^2      read/write C

= 2 b^2 n^3 + 2N^2 FPN-size cache misses

Rewrite: b^2 n^3 ≈ b^2 · N^3/b^3 = N^3/b

Computational intensity:
2N^3 / (2N^3/b + 2N^2) ≈ 2N^3 / (2N^3/b) = b

→ incentive to choose b as large as possible.

The power of assumptions: can we choose b = N?
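A minimal sketch of the blocked variant (assuming, for brevity, that the illustrative block size BLK divides N; fringe handling omitted):

/* Blocked matmul sketch: each (i,j,k) block iteration touches one
 * BLK x BLK block of A and one of B, giving the 2*b^2*n^3 miss count. */
#define BLK 32 /* block size b; illustrative value */

void matmul_blocked(int N, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i += BLK)
        for (int j = 0; j < N; j += BLK)
            for (int k = 0; k < N; k += BLK)
                /* multiply the current pair of blocks */
                for (int ii = i; ii < i + BLK; ++ii)
                    for (int jj = j; jj < j + BLK; ++jj)
                    {
                        double sum = C[ii*N + jj];
                        for (int kk = k; kk < k + BLK; ++kk)
                            sum += A[ii*N + kk] * B[kk*N + jj];
                        C[ii*N + jj] = sum;
                    }
}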
Hatching a Plan
Consider each level of the memory hierarchy.
How do we exploit. . .
• . . . L2: ignore; we're nearly L2-local at most sizes.
• . . . L1: 32 KiB = 4096 doubles (8-byte FPNs). Key: memory layout.
• . . . registers: 16 FP registers. Key: loop/operation ordering.
Optimizing for L1: Memory Layout

Memory layout of A: column-major.

Only use one entry of each cache line per fetch.

Better to store A in row-major order.

The input is row-major. If memory is available (not swap!), storing a transposed copy of A can be a good idea. (The copy takes O(N^2) time.)
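Such a transposed copy might look like the following sketch (O(N^2) work, done once before the O(N^3) multiply; the helper name is hypothetical):

/* Make a transposed copy of A so the kernel can walk it with unit
 * stride instead of touching one entry per cache line. */
void transpose_copy(int N, const double *A, double *At)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            At[j*N + i] = A[i*N + j];
}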
Optimizing for L1: Reuse Pattern, Block Size

Question: blocking is a good idea, but what is the optimal b_L1?

Follow-up question: how much needs to fit in L1?

One block of each of A, B, C? Refined answer: all of the current block of A, plus one column each of the B and C blocks.

32 KiB: 8 b_L1^2 + 2 · 8 b_L1 ≤ 32768 → b_L1 ≤ 60
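Spelling out the arithmetic behind that bound (8-byte doubles; this derivation is added here, not on the original slide):

8 b_{L1}^2 + 2 \cdot 8\, b_{L1} \le 32768
\;\Longleftrightarrow\; b_{L1}^2 + 2 b_{L1} \le 4096
\;\Longrightarrow\; b_{L1} \le \sqrt{4097} - 1 \approx 63

so the slide's b_L1 ≤ 60 is a safe, round choice.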
L1 Block Copy

Further concerns:

• Cache line boundaries
• SIMD
• Cache set conflicts

All solved by the small-block copy optimization:
copy b_L1-sized blocks of A, B, and C into contiguous buffers, operate on those, then copy the output back.
L1 Block Copy: The Plan

Basic plan (a sketch of one block copy follows below):

for each i:
    for each j:
        load block C[i,j]
        for each k:
            load block A[i,k]
            load block B[k,j]
            run ⌈b_L1/b_r⌉^3 register kernels: C += A·B
        store block C[i,j]

(Can be improved: many A, B loads.)

Aside: this also neatly deals with fringes.

So: how does this solve the problems above? Can you define "alignment"?
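A sketch of one such block copy (the helper name, signature, and column-major destination layout are illustrative assumptions, not the HW1 solution code):

/* Gather a b x b block of row-major A (leading dimension N) into a
 * contiguous buffer: no cache-line straddling, no set conflicts, and
 * the buffer can be allocated SIMD-aligned. */
void load_block(int N, int b, const double *A, int i0, int k0,
                double *Ablk)
{
    for (int k = 0; k < b; ++k)
        for (int i = 0; i < b; ++i)
            Ablk[i + k*b] = A[(i0 + i)*N + (k0 + k)];
}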
Alignment

A memory address a is n-byte aligned when n is a power of two and a is a multiple of n bytes. (See also the IBM devWorks article.)

#include <stdlib.h>

/* dynamic allocation */
double *var;
int error = posix_memalign((void **) &var, 64, array_size);
if (error)
    abort();

/* static allocation */
double __attribute__((aligned(64))) ary2[500];

Examples: cache-line-aligned, SIMD-aligned.

Code generation in the non-aligned case?
Register Kernel

Choose block size b_r = 2^k, with b_L1 mod b_r = 0.

for (int j = 0; j < b_r; ++j)
    for (int k = 0; k < b_r; ++k)
        for (int i = 0; i < b_r; ++i)
            C[i + j*b_l1] += A[i + k*b_l1] * B[k + j*b_l1];

For each A·b matvec: perform b_r scalar-times-vector updates.

• Vectorizable
• Pipeline-friendly (minimal data dependencies)
• Access to A, C is unit-stride
• Access to B is inner-loop invariant
• Unrolling, software pipelining: leave to the compiler
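To illustrate the "vectorizable" claim, here is one way the kernel could look with SSE2 intrinsics. This is a hedged sketch assuming b_r and b_l1 are even and all three buffers are 16-byte aligned; it is not the HW1 solution code:

#include <emmintrin.h>

/* b_r scalar-times-vector updates, two doubles at a time. */
void register_kernel_simd(int b_l1, int b_r, const double *A,
                          const double *B, double *C)
{
    for (int j = 0; j < b_r; ++j)
        for (int k = 0; k < b_r; ++k)
        {
            /* B entry is inner-loop invariant: broadcast it once */
            __m128d b = _mm_load1_pd(&B[k + j*b_l1]);
            for (int i = 0; i < b_r; i += 2)
            {
                __m128d a = _mm_load_pd(&A[i + k*b_l1]); /* unit stride */
                __m128d c = _mm_load_pd(&C[i + j*b_l1]);
                _mm_store_pd(&C[i + j*b_l1],
                             _mm_add_pd(c, _mm_mul_pd(a, b)));
            }
        }
}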
Psychoanalyzing the Compiler

Flags for Intel:
-O3 -fno-alias -funroll-loops
-std=c99 -D_XOPEN_SOURCE=500
-opt-streaming-stores auto -static
-fast -xHost

Flags for GCC:
-O3 -funroll-loops -march=native
-std=c99 -D_XOPEN_SOURCE=500
-ftree-vectorizer-verbose=2
-ffast-math

GCC 4.3 is sometimes better than GCC 4.4.

Self-study material:

• Compiler references: Intel, GNU
• C99 restrict keyword, aliasing
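A small example of what restrict buys: without it, the compiler must assume x and y may alias and cannot vectorize freely. (The function itself is illustrative; -fno-alias makes the same promise globally on icc.)

/* C99 restrict: promise that x and y point to disjoint storage. */
void axpy(int n, double alpha, const double *restrict x,
          double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i]; /* now safely vectorizable/unrollable */
}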
Profiling

OProfile: a sampling profiler. Uses performance counters. Linux only, needs root.

Many event types countable:

CPU_CLK_UNHALTED : clock cycles when not halted
L2_RQSTS : number of L2 cache requests
LLC_MISSES : L2 cache demand requests from this core that missed the L2
FLOPS : number of FP computational micro-ops executed
IDLE_DURING_DIV : cycles the divider is busy and all other execution units are idle
L1D_ALL_REF : all references to the L1 data cache
L1D_PEND_MISS : total number of outstanding L1 data cache misses at any cycle
IFU_MEM_STALL : cycles the instruction fetch pipe is stalled
INST_RETIRED : number of instructions retired
UOPS_RETIRED : number of uops retired
MACHINE_NUKES_SMC : number of pipeline flushing events
RAT_STALLS : partial register stall cycles
BR_INST_DECODED : number of branch instructions decoded

Annotated assembly (per-instruction sample counts and percentages for FLOPS and L1D_PEND_MISS):

FLOPS              L1D_PEND_MISS
     8 2.6e-04     18 0.7037   movsd 0x50(%rax),%xmm7
   187 0.0062       8 0.3127   movsd 0x58(%rax),%xmm5
     7 2.3e-04     24 0.9382   movsd 0x60(%rax),%xmm3
   470 0.0155      18 0.7037   movsd 0x68(%rax),%xmm4
    49 0.0016       9 0.3518   movsd 0x70(%rax),%xmm2
  2873 0.0950       7 0.2737   movsd 0x78(%rax),%xmm1
   434 0.0144       8 0.3127   xchg %ax,%ax
184312 6.0959      26 1.0164   movsd (%rdx),%xmm0
  2022 0.0669      14 0.5473   inc %esi
    19 6.3e-04      3 0.1173   mulsd (%rcx),%xmm0
  5294 0.1751     189 7.3886   addsd 0x30(%rsp),%xmm0
 31888 1.0547      68 2.6583   movsd %xmm0,(%rax)
 66032 2.1839      37 1.4464   movsd %xmm0,0x30(%rsp)
114001 3.7704      43 1.6810   movsd (%rcx),%xmm0
  1131 0.0374       3 0.1173   mulsd 0x8(%rdx),%xmm0
 11913 0.3940       2 0.0782   addsd %xmm0,%xmm14
 94565 3.1276      20 0.7819   movsd %xmm14,0x8(%rax)
108501 3.5885      25 0.9773   movsd (%rcx),%xmm0
     4 1.3e-04      1 0.0391   mulsd 0x10(%rdx),%xmm0
 76622 2.5342      81 3.1665   addsd %xmm0,%xmm15
 82075 2.7145      42 1.6419   movsd %xmm15,0x10(%rax)
119036 3.9370      36 1.4073   movsd (%rcx),%xmm0
     5 1.7e-04      0 0        mulsd 0x18(%rdx),%xmm0
  2700 0.0893       0 0        addsd %xmm0,%xmm12
 14861 0.4915      11 0.4300   movsd %xmm12,0x18(%rax)
Solution Performance

[Plot: MFlops/s (0 to 9000) vs. matrix dimension N (0 to 800) for the basic, tuned, and BLAS versions.]

git clone ssh://[email protected]:2234/hw1-solution.git
(Private; works if you signed up for an account.)

Great, but: most BLAS implementations lose out to triple loops for special-case matrices.

Want to see the code of a "real" BLAS? GotoBLAS2
Key Messages of HW1

In HPC:

• Very simple things quickly become rather complex.
• Need: ideas, careful analysis.
• Flexibility ↔ performance
• Run-time code generation can be useful.

This class helps by introducing

• known tricks
• helpful tools.

Matmul is a "microcosm" of single-processor optimization.

Do not worry if you did not figure out the tricks here on your own.
Questions?
Outline
Discuss HW1
Intro to GPU Computing
GPUs: System Context

[Figure: a mainboard with processor, memory, and expansion slots: PCI Express (x4, x16, x1, x16) and regular PCI.]

PCIe v2, x16 bandwidth: ~6 GB/s.

The GPU goes into one of the PCIe x16 slots.
GPU Computing?

• Design target for CPUs:
  • Make a single thread very fast
  • Take control away from the programmer

• GPU computing takes a different approach:
  • Throughput matters; single threads do not
  • Give explicit control to the programmer
"CPU-style" Cores

[Figure: a "CPU-style" core: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a (big) data cache.]

Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009: Beyond Programmable Shading, http://s09.idav.ucdavis.edu/
Slimming down

[Figure: the same core with only Fetch/Decode, ALU (Execute), and Execution Context remaining.]

Idea #1: Remove components that help a single instruction stream run fast.

Credit: Kayvon Fatahalian (Stanford)
More Space: Double the Number of Cores

[Figure: two cores processing two fragments in parallel, each running its own copy of the same shader:]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Credit: Kayvon Fatahalian (Stanford)
. . . again

[Figure: four cores, four fragments in parallel.]

Credit: Kayvon Fatahalian (Stanford)
. . . and again

[Figure: sixteen cores, sixteen fragments in parallel. 16 cores = 16 simultaneous instruction streams.]

→ 16 independent instruction streams

Reality: the instruction streams are not actually very different/independent.

Credit: Kayvon Fatahalian (Stanford)
Saving Yet More Space

[Figure: the simple core again; then one Fetch/Decode unit feeding eight ALUs (ALU 1-8), eight contexts (Ctx), and shared context data: SIMD processing.]

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.

→ SIMD

Credit: Kayvon Fatahalian (Stanford)
Gratuitous Amounts of Parallelism!

[Figure: 128 fragments in parallel; 16 cores = 128 ALUs = 16 simultaneous instruction streams.]

Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams.

Great if everybody in a group does the same thing.

But what if not?

What leads to divergent instruction streams?

Credit: Kayvon Fatahalian (Stanford)
Branches

[Figure: ALUs 1-8 over time. All lanes run the unconditional code together; at the branch, each lane evaluates the condition (mask T T T F F F F F), the "true" lanes execute the if-side while the others idle, then the roles flip for the else-side, and all lanes resume together:]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Not all ALUs do useful work! Worst case: 1/8 performance.

Credit: Kayvon Fatahalian (Stanford)
Remaining Problem: Slow Memory

Problem: memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that.

We've removed

• caches
• branch prediction
• out-of-order execution

So what now?

Idea #3: Even more parallelism + some extra memory = a solution!

[Figure: the SIMD core's context storage partitioned into four groups of fragments (1-8, 9-16, 17-24, 25-32) whose execution can be interleaved.]

Credit: Kayvon Fatahalian (Stanford)
Hiding Memory Latency

[Figure sequence: four groups of fragments (1-8, 9-16, 17-24, 25-32) over time. Each group runs until it stalls on a memory access; the core then switches to the next runnable group. By the time the last group stalls, the first group's data has arrived and it is runnable again. Each individual group takes longer to finish, but the ALUs stay busy: the run time of one group is traded for the maximum throughput of many groups.]

Credit: Kayvon Fatahalian (Stanford)
GPU Architecture Summary

Core ideas:

1. Many slimmed-down cores → lots of parallelism

2. More ALUs, fewer control units

3. Avoid memory stalls by interleaving the execution of SIMD groups

Credit: Kayvon Fatahalian (Stanford)
GPU-CPU Bird’s Eye Comparison
Floorplan: VIA Isaiah (2008). 65 nm, 4 SP ops at a time, 1 MiB L2.

Floorplan: AMD RV770 (2008). 55 nm, 800 SP ops at a time.
Nvidia GTX200

[Figure: 30 identical SIMD cores, each containing one Fetch/Decode unit, 8 ALUs ("4×"), one DP ALU, 32 KiB of private context, and 16 KiB of shared context; all connected to off-chip memory at 150 GB/s.]
GPU Architecture (e.g. Nvidia GT200)
• 1 GPU = 30 SIMD cores

• 1 SIMD core: 32 × 32 PCs, HW scheduler + 1 instruction decoder (1/4 clock) + 8 SP ALUs + 1 DP ALU + 16 KiB shared + 32 KiB registers

• Device ↔ RAM: 140 GB/s

• Device ↔ Host: 6 GB/s

• User manages the memory hierarchy
What is OpenCL?

"OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors." [OpenCL 1.1 spec]

• Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
• Vendor-neutral
• Comes with RTCG (run-time code generation)

Defines:

• Host-side programming interface (library)
• Device-side programming language (!)
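A minimal host-plus-kernel sketch, assuming an OpenCL 1.1 implementation is installed; error checking is mostly omitted and the kernel is deliberately trivial. Note the RTCG aspect: the device code is compiled from a string at run time.

#include <stdio.h>
#include <CL/cl.h>

/* device-side code, held as a string and built at run time */
const char *src =
    "__kernel void twice(__global float *a)      \n"
    "{ int gid = get_global_id(0); a[gid] *= 2; }\n";

int main(void)
{
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* run-time code generation happens here */
    cl_program prg = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prg, 1, &dev, NULL, NULL, NULL);
    cl_kernel knl = clCreateKernel(prg, "twice", NULL);

    float host[4] = {1, 2, 3, 4};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, NULL);
    clSetKernelArg(knl, 0, sizeof(buf), &buf);

    size_t gsize = 4; /* one work item per array entry */
    clEnqueueNDRangeKernel(q, knl, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host,
                        0, NULL, NULL);
    printf("%g %g %g %g\n", host[0], host[1], host[2], host[3]);
    return 0;
}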
Questions?
Image Credits
• Blocks: sxc.hu/Avolore
• Flag: sxc.hu/Ambrozjo
• Mainboard: Wikimedia Commons
• PCI Express slots: Wikimedia Commons
• Fighting chips: flickr.com/oskay
• Isaiah die shot: VIA Technologies
• RV770 die shot: AMD Corp.
• Nvidia Tesla architecture: Nvidia Corp.