
GPU Architecture: An OpenCL Programmer’s Introduction

Lee Howes | November 3, 2010

GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone

The aim of this webinar

To provide a general background to modern GPU architectures

To place the AMD GPU designs in context:

– With other types of architecture

– With other GPU designs

To give an idea of why certain optimizations become necessary on such architectures and why the architectures are designed in that way


Agenda

Talk about GPUs as graphics processing devices

– What they are designed for

– What this means architecturally

The implications of SIMD execution on application development

LDS and latency hiding

How the GPU fits in the “CPU” design space.

A description of the features of the AMD Radeon HD5870 GPU


What is a GPU?


In a nutshell

The GPU is a multicore processor optimized for graphics workloads

[Figure: block diagram of the GPU: eight shader cores, four texture units (Tex), a rasterizer, output blend, video decode, and a scheduler.]


Processing pixels

[Figure: a pixel on a surface, with the direction of light and the normal at the surface.]




Processing pixels

[Figure: the same scene with several pixels marked, shown next to the shader code.]

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }




SIMD execution and its implications


SIMD pixel execution

[Figure: four pixels, each with its own copy of the shader.]

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }


Branches that diverge

    sampler mySamp;
    Buffer<float> myTex;

    float diffuseShader(float threshold, float index)
    {
        float brightness = myTex[index];
        float output;
        if (brightness > threshold)
            output = threshold;
        else
            output = brightness;
        return output;
    }

[Figure, repeated over several build slides: copies of the shader, one per lane, above a row of four ALUs.]
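One way to picture the lane masking behind a divergent branch is the sketch below. It is a minimal Python model, not the hardware's actual mechanism: the function name `diffuse_shader_simd`, the four-lane batch and the sample values are assumptions chosen for illustration. Each side of the branch runs as a separate pass over the whole batch, and a mask selects the lanes that commit a result.

```python
def diffuse_shader_simd(brightness, threshold):
    """Run the clamp shader for one SIMD batch (one list element per lane)."""
    lanes = len(brightness)
    output = [None] * lanes
    passes = 0  # number of branch sides the whole batch had to execute

    # Every lane evaluates the condition; the result is the lane mask.
    take_if = [b > threshold for b in brightness]

    # 'if' side: executed when at least one lane needs it.
    if any(take_if):
        passes += 1
        for lane in range(lanes):
            if take_if[lane]:            # masked-off lanes do nothing
                output[lane] = threshold

    # 'else' side: executed with the inverted mask.
    if not all(take_if):
        passes += 1
        for lane in range(lanes):
            if not take_if[lane]:
                output[lane] = brightness[lane]

    return output, passes


result, passes = diffuse_shader_simd([0.2, 0.9, 0.4, 0.7], threshold=0.5)
print(result, passes)  # [0.2, 0.5, 0.4, 0.5] 2
```

With mixed brightness values the batch pays for both sides of the branch; a batch whose lanes all agree needs only one pass.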


SIMD execution: SIMD instructions?

Programming SIMD with SIMD instructions

– Most vector instruction sets: SSE, AVX

– Intel’s Larrabee and Knights Corner

Programming SIMD with “scalar” instructions

– GPU shader languages

– GPU intermediate languages

– OpenCL

Programmed masking

– OpenCL compiling to SSE

– Pixel shaders compiling to Larrabee

Hardware controlled masking

– Current generation GPUs


Why does this matter for Compute?

Graphics code traditionally has relatively short shaders on large triangles

– The level of branch divergence overall will not be high

With graphics code you cannot necessarily control divergence

– SIMD batches are constructed by the hardware depending on the scene properties.

For OpenCL code you are defining your execution space

– You choose what work is performed by which work item

– You choose how to structure your algorithm to avoid this divergence
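The effect of choosing which work item does which work can be sketched with a small count of branch passes. This is an illustrative Python model under stated assumptions: the function name `branch_passes`, the problem size of 64 work items and the batch width of 16 are hypothetical, and each batch is assumed to execute every branch side that at least one of its work items takes.

```python
def branch_passes(conditions, simd_width):
    """Total branch sides executed, summed over all SIMD batches."""
    total = 0
    for start in range(0, len(conditions), simd_width):
        batch = conditions[start:start + simd_width]
        total += len(set(batch))  # 1 if the batch is uniform, 2 if it diverges
    return total


n, width = 64, 16  # hypothetical problem size and SIMD batch width

# Neighbouring work items take different sides of the branch.
interleaved = [i % 2 == 0 for i in range(n)]
# Same amount of work on each side, but arranged so batches are uniform.
grouped = [i < n // 2 for i in range(n)]

print(branch_passes(interleaved, width))  # 8
print(branch_passes(grouped, width))      # 4
```

Both layouts do the same useful work, but in this model the grouped layout executes half as many passes because no batch diverges.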


Throughput execution and latency hiding


Covering pipeline latency

[Figure: timing diagram with labels: Lanes 0-3; Instruction 0; Instruction 1; Stall.]


Covering pipeline latency: logical vector

[Figure: timing diagram with labels: Lanes 0-3, 4-7, 8-11 and 12-15; Instruction 0; Instruction 1; Stall (four times).]


Covering pipeline latency: ALU operations

[Figure: timing diagram with labels: Lanes 0-3; Instruction 0.]


Covering memory latency

[Figure: timing diagram with labels: Lanes 0-3; Instruction 0; Instruction 1; Stall.]


Covering memory latency: we still stall

[Figure: timing diagram with labels: Lanes 0-3, 4-7, 8-11 and 12-15; Instruction 0; Instruction 1; Stall (four times).]


Covering memory latency: another wavefront

[Figure: timing diagram with labels: Lanes 0-3, 4-7, 8-11 and 12-15; Instruction 0 (twice); Instruction 1.]
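A rough way to reason about how many wavefronts are needed to cover memory latency is the back-of-the-envelope model below. It is a sketch under simplifying assumptions, not a description of the real scheduler: the function name `alu_utilization` and the cycle counts are hypothetical, every wavefront is assumed to alternate a fixed amount of ALU work with one memory access, and switching between wavefronts is assumed to be free.

```python
def alu_utilization(wavefronts, alu_cycles, memory_latency):
    """Fraction of cycles in which the SIMD engine has ALU work to issue.

    Each wavefront repeats: alu_cycles of ALU work, then one memory access
    of memory_latency cycles during which it cannot issue.
    """
    period = alu_cycles + memory_latency
    return min(1.0, wavefronts * alu_cycles / period)


# Hypothetical numbers: 20 cycles of ALU work per 300-cycle memory access.
for wavefronts in (1, 4, 16, 24):
    print(wavefronts, alu_utilization(wavefronts, 20, 300))
```

In this model a single wavefront leaves the ALUs idle most of the time, and utilization grows linearly with the number of resident wavefronts until the latency is fully covered.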


Latency hiding in the SIMD engine

[Figure: a batch of four pixels above rows of four ALUs.]


A throughput-oriented SIMD engine

[Figure: a SIMD engine with rows of four ALUs and multiple State blocks.]


Adding the memory hierarchy

Unlike most CPUs, GPUs do not have vast cache hierarchies.

– Caches on CPUs primarily provide lower access latency

Heavy multithreading reduces the latency requirement

– Latency is not an issue: we cover it with other threads

– Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces


Texture caches and local memory

Designed to support sharing between work items

– Reduce bandwidth, not latency

Global: high-latency loads, limited bandwidth.

Local: efficient random accesses, very high bandwidth.

[Figure: block diagram with labels: Global; Local; SIMD engine; ALU; State; Fetch; Decode; Execute; Local storage/Cache.]


The GPU shader cores


The design space


The AMD Phenom™ II X6

6 cores

One state set per core

4-wide SIMD (actually there are two pipes)


The Intel i7 6-core variants

6 cores

Two state sets per core (SMT/Hyperthreading)

4-wide SIMD

[Figure: design comparison with the Phenom II X6.]


Sun UltraSPARC T2

8 cores

Eight state sets per core

No SIMD

[Figure: design comparison with the Phenom II X6 and the Intel i7 6-core.]


The AMD HD5870 GPU

20 cores

Up to 24 state sets per core, each a 64-wide logical SIMD vector (the number depends on register requirements)

16-wide physical SIMD

[Figure: design comparison with the Phenom II X6, the Intel i7 6-core and the UltraSPARC T2; the HD5870 is drawn as 20 cores.]
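The four design points above can be compared with a little arithmetic. The sketch below is illustrative only: the helper name `in_flight` is hypothetical, "No SIMD" is treated as a SIMD width of 1, and the HD5870 row uses the "up to 24" state sets and the 64-wide logical vector quoted on its slide.

```python
# (cores, state sets per core, SIMD width) as quoted on the slides.
designs = {
    "AMD Phenom II X6": (6, 1, 4),
    "Intel i7 6-core": (6, 2, 4),
    "Sun UltraSPARC T2": (8, 8, 1),    # "No SIMD" modelled as width 1
    "AMD Radeon HD5870": (20, 24, 64),  # up to 24 logical 64-wide state sets
}


def in_flight(cores, state_sets, simd_width):
    """Return (resident state sets, data elements covered by those state sets)."""
    threads = cores * state_sets
    return threads, threads * simd_width


for name, shape in designs.items():
    threads, elements = in_flight(*shape)
    print(f"{name}: {threads} state sets, {elements} elements in flight")
```

The spread between the CPU rows and the GPU row gives a sense of how far the GPU sits toward the throughput end of the design space.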


The AMD Radeon HD5870 GPU


Features of the Radeon™ HD5870 architecture

Area                    334 mm²
Transistors             2.15 billion
Memory Bandwidth        153 GB/sec
L2-L1 Rd Bandwidth      512 bytes/clk
L1 Bandwidth            1280 bytes/clk
Vector GPR              5.24 MB
LDS Memory              640 kB
LDS Bandwidth           2560 bytes/clk
Concurrent Wavefronts   496
Shader (ALU units)      1600
Idle power              27 W
Max power               188 W

ATI Radeon™ HD 5870: 2.72 Teraflops architecture
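Several of the whole-chip figures above can be broken down per SIMD engine, which makes them easier to relate to the later slides. The following Python sketch only rearranges numbers quoted in this deck; the one outside assumption, flagged in the code, is that each ALU retires one multiply-add (two floating-point operations) per clock, from which an implied clock rate follows.

```python
SIMD_ENGINES = 20      # "20 cores" on the design-space slide
PHYSICAL_LANES = 16    # 16-wide physical SIMD

alu_units = 1600
lds_total_kb = 640
lds_bytes_per_clk = 2560
l1_bytes_per_clk = 1280
peak_flops = 2.72e12
memory_bytes_per_s = 153e9

alus_per_lane = alu_units // (SIMD_ENGINES * PHYSICAL_LANES)
lds_kb_per_engine = lds_total_kb // SIMD_ENGINES
lds_bits_per_clk = lds_bytes_per_clk // SIMD_ENGINES * 8
l1_bits_per_clk = l1_bytes_per_clk // SIMD_ENGINES * 8

# Assumption (not stated in the table): one multiply-add = 2 flops per ALU per clock.
implied_clock_mhz = peak_flops / (alu_units * 2) / 1e6

# Off-chip bytes available per floating-point operation at peak rate.
bytes_per_flop = memory_bytes_per_s / peak_flops

print(alus_per_lane, lds_kb_per_engine, lds_bits_per_clk, l1_bits_per_clk)
print(implied_clock_mhz, round(bytes_per_flop, 5))
```

The low ratio of memory bytes to peak flops is one way to see why total bandwidth remains a concern even when latency is hidden.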


High level view

[Figure: block diagram with labels: Command Processor/Group Generator; two Sequencers; GDS; SIMD Engines 0-9 and SIMD Engines 10-19, each group with an instruction cache; read cache crossbar; write crossbar; read L2 caches, 8kB/mem channel; write combine caches; R/W cache for global atomics; 8 channel memory controller; eight GDDR5 channels.]


Clause execution

[Figure: the high-level block diagram (Command Processor/Group Generator, instruction caches, SIMD Engines 0-9 and 10-19, read cache crossbar, write crossbar, read L2 caches, write combine caches) shown next to the following instruction stream.]

    08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
          15  x: SETGT_INT   R0.x, R2.x, R4.x
          16  x: PREDNE_INT  ____, R0.x, 0.0f…
    09 JUMP POP_CNT(1) ADDR(18)
    10 ALU: ADDR(64) CNT(9)
          17  x: SUB_INT     R5.x, R3.x, R2.x
              y: LSHL        ____, R3.x, (…).x
              z: LSHL        ____, R2.x, (…).x VEC_120
              t: MOV         R8.x, 0.0f
          18  x: SUB_INT     R6.x, PV17.y, PV17.z
              y: MOV         R8.y, 0.0f
              z: MOV         R8.z, 0.0f
              w: MOV         R8.w, 0.0f
    11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)

          19  x: LSHL        ____, R5.x, (…).x
              w: ADD_INT     ____, R6.x, R1.x VEC_120
              t: ADD_INT     R7.x, R1.x, (…).y
    13 TEX: ADDR(368) CNT(2)
          22  VFETCH R0.x___, R0.y, fc156 MEGA(4)
              FETCH_TYPE(NO_INDEX_OFFSET)
          23  VFETCH R1.x___, R0.z, fc156 MEGA(4)




The SIMD Engine

16 processing elements: physical vector

Executes two 64-element wavefronts over 4 cycles: logical vector

– Each lane executes a 5-way VLIW instruction

Lane masking and branching to support divergence

Hardware barrier support for up to 8 work groups per SIMD engine

[Figure: block diagram of the SIMD engine with labels: SEQ/Branch control; 32kB Local Data Share with 32 banks and integer atomic units; 8kB Read L1; four Filter & Format units; four Address & Load units; Exp/Ld/Store.]
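The relationship between the 16-wide physical vector and the 64-element logical vector can be sketched as a simple issue schedule. This is an illustrative Python model: the function name `issue_schedule` is hypothetical, and the assumption that work items are assigned to cycles in consecutive groups of 16 is made only for clarity, since the slide does not state the actual mapping.

```python
WAVEFRONT_SIZE = 64   # logical vector width
PHYSICAL_LANES = 16   # processing elements per SIMD engine


def issue_schedule(wavefront_size=WAVEFRONT_SIZE, lanes=PHYSICAL_LANES):
    """For each cycle, list the work items occupying the physical lanes.

    Assumption: consecutive work items are grouped together; the real
    hardware mapping may differ.
    """
    return [list(range(start, start + lanes))
            for start in range(0, wavefront_size, lanes)]


schedule = issue_schedule()
for cycle, work_items in enumerate(schedule):
    print(f"cycle {cycle}: work items {work_items[0]}-{work_items[-1]}")
```

One instruction for a whole wavefront therefore occupies the 16 processing elements for four cycles in this model.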




The Local Data Share

[Figure: block diagram of the Local Data Share with labels: SIMD Engine (SEQ, PE0 to PE15); conflict detection and scheduling control; source collectors and return data staging; input address crossbar; read data crossbar; write data crossbar; banks B0 to B31; integer atomic units; pre-op return value.]


LDS features

High bandwidth: twice the external bandwidth (1024 bits/clock compared with 512 bits/clock per SIMD engine).

– Fully coalesced reads, writes and atomics, with optimization for broadcast reads

Low latency access

– 0 latency direct reads

– 1 VLIW instruction latency for indirect ops (8 real cycles in the pipeline)

Bank conflicts are detected and serialized by hardware
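To get a feel for when bank conflicts arise, the sketch below estimates how many serialized steps a set of LDS accesses would need. It is a simplified Python model, not the hardware's conflict logic: the function name `conflict_degree` is hypothetical, the banks are assumed to be 4 bytes wide and interleaved by word address, and accesses to the same address are assumed to be served together as a broadcast.

```python
NUM_BANKS = 32          # from the slides: 32 banks
BANK_WIDTH_BYTES = 4    # assumption: one 32-bit word per bank


def conflict_degree(byte_addresses):
    """Largest number of distinct addresses that fall into a single bank."""
    per_bank = {}
    for address in set(byte_addresses):   # identical addresses: one broadcast
        bank = (address // BANK_WIDTH_BYTES) % NUM_BANKS
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())


work_items = range(32)
print(conflict_degree([4 * i for i in work_items]))    # stride of 1 word
print(conflict_degree([8 * i for i in work_items]))    # stride of 2 words
print(conflict_degree([128 * i for i in work_items]))  # stride of 32 words
print(conflict_degree([64 for _ in work_items]))       # everyone reads one word
```

In this model unit-stride and broadcast accesses complete in one step, while a stride equal to the number of banks sends every access to the same bank.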


The processing element (using OpenCL terms)

[Figure: block diagram of the processing element with labels: Operand Preparation; General Purpose Registers; Constants; Input Data; Fetch Return; Fetch Addr/Data; Export Addr/Data; LDS Requests (LDS op 0, LDS op 1).]

Operations listed in the figure, first block:

– 4 32b FP FMA or MulAdd

– 1 64b FMA or Mul

– 2 64b FP Add

– 4 24b Int Mul or MulAdd

– 4 32b Int Add, Logical or Special

Second block:

– 1 32b FP MulAdd

– 1 32b Integer

– 1 32b Special (log, exp, rcp, sin…)


Dependent operations

Co-issue of dependent operations in a single VLIW packet

– Full IEEE intermediate rounding & normalization

– Dot4 (A= A*B + C*D + E*F + G*H)

– Dual Dot2 (A = A*B + C*D; F = F*H + I*J)

– Dual dependent multiplies (A = B * C * D; F = G * H * I)

– Dual dependent adds (A = B + C + D; F = G + H + I)

– Dependent MulAdd (A = B * C + D + E * F)

24 bit integer

– Mul and MulAdd (4-way co-issue)

– Heavy use for workgroup address calculation
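The kind of address calculation that a 24-bit multiply-add serves can be sketched as follows. This is an illustrative Python model only: the helper names `mad24` and `global_id` are hypothetical (the first is loosely modelled on the OpenCL built-in of the same name), operands are treated as unsigned, and masking out-of-range operands to 24 bits is an assumption of this sketch rather than guaranteed behaviour.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mad24(a, b, c):
    """Multiply the low 24 bits of a and b, add c, keep a 32-bit result."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32


def global_id(group_id, local_size, local_id):
    """Index of a work item from its work-group index and local index."""
    return mad24(group_id, local_size, local_id)


print(global_id(group_id=3, local_size=64, local_id=5))  # 197
```

Indices of this kind usually fit comfortably in 24 bits, which is why a cheaper, narrower multiplier can be enough for them.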


The Global Data Share

[Figure: block diagram of the Global Data Share with labels: dual arrays of SIMD engines (SEQ, left bus, right bus); conflict detection and scheduling control; fast append counter control; left and right source collectors (4 work items/clock each) and return data staging; input address crossbar; read data crossbar; write data crossbar; banks B0 to B31; integer atomic units; pre-op return value.]


GDS features

Low latency access to a global shared memory

– 25 clocks latency

– Issued in separate clause in the same way as texture accesses

– Fully coalesced reads, writes and atomics, as with the LDS

8 work items can issue requests per clock

Driver allocation and initialization

Useful for low latency global reductions
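The pattern of a global reduction through a shared on-chip memory can be sketched as a two-level sum. This is a plain Python model under stated assumptions: the function name `two_level_sum` and the sizes are hypothetical, an ordinary variable stands in for a value held in a global shared memory, and nothing here reflects how the GDS is actually exposed to the programmer.

```python
def two_level_sum(values, group_size):
    """Sum values with one shared-memory update per work group."""
    global_total = 0      # stand-in for a value in a global shared memory
    global_updates = 0    # how many atomic adds reach that memory

    for start in range(0, len(values), group_size):
        # Level 1: each work group reduces its own slice locally.
        partial = sum(values[start:start + group_size])
        # Level 2: one atomic add per work group to the shared total.
        global_total += partial
        global_updates += 1

    return global_total, global_updates


total, updates = two_level_sum(list(range(1024)), group_size=64)
print(total, updates)  # 523776 16
```

Only one update per work group reaches the shared total, so the latency of that final step matters more than its bandwidth.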


Summary

We’ve:

– looked at the basic principles of the GPU architecture in the processor design space

– seen some of the tradeoffs that lead to GPU features

– gone over the basic features of the HD 5870 architecture that affect compute applications

Later talks will go into detail on optimizations that target these architectural features


Questions and Answers

Visit the OpenCL Zone on developer.amd.com

http://developer.amd.com/zones/OpenCLZone/

The OpenCL Programming Webinars page includes:

– Schedule of upcoming webinars

– On-demand versions of this and past webinars

– Slide decks of this and past webinars

Upcoming webinars include:

– OpenCL Programming In Detail

– Real World Application Example

– Optimization Techniques

– Device Fission Extensions for OpenCL


Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2009 Advanced Micro Devices, Inc. All rights reserved.
