GPU Architecture An OpenCL Programmer’s Introduction Lee Howes | November 3, 2010



| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone

The aim of this webinar

To provide a general background to modern GPU architectures

To place the AMD GPU designs in context:

– With other types of architecture

– With other GPU designs

To give an idea of why certain optimizations become necessary on such architectures and why the architectures are designed in that way


Agenda

Talk about GPUs as graphics processing devices

– What they are designed for

– What this means architecturally

The implications of SIMD execution on application development

LDS and latency hiding

How the GPU fits in the “CPU” design space.

A description of the features of the AMD Radeon HD5870 GPU


What is a GPU?


In a nutshell

The GPU is a multicore processor optimized for graphics workloads

[Figure: block diagram of a GPU with eight shader cores, four texture units (Tex), a rasterizer, output blend, video decode, and a scheduler]


Processing pixels

[Figure: a pixel on a surface, with the direction of light and the normal at the surface; later builds of the slide add more pixels]


sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
    float3 kd;
    kd = myTex.Sample(mySamp, uv);
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}


SIMD execution and its implications


SIMD pixel execution

[Figure: four pixels, each paired with an identical copy of the diffuseShader code]


Branches that diverge

sampler mySamp;
Buffer<float> myTex;

float diffuseShader(float threshold, float index)
{
    float brightness = myTex[index];
    float output;
    if( brightness > threshold )
        output = threshold;
    else
        output = brightness;
    return output;
}

[Figure: four copies of this shader, one per pixel, above four ALUs; the slide is repeated over several builds]
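The behaviour these slides step through can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not a description of how any particular GPU implements it: the helper name `simd_clamp_to_threshold` is invented for illustration, and a four-lane vector is assumed to match the four ALUs in the figure. All lanes evaluate the condition, then each side of the branch is issued once under a lane mask that decides which lanes keep the result.

```python
def simd_clamp_to_threshold(brightness, threshold):
    """Model one SIMD vector running the branch from the shader above.

    Hypothetical helper: every lane shares one instruction stream, so a
    divergent branch is handled by issuing each side under a lane mask.
    Returns the per-lane outputs and the number of branch sides issued.
    """
    lanes = len(brightness)
    mask = [b > threshold for b in brightness]   # condition, all lanes
    output = [None] * lanes
    passes = 0

    if any(mask):                  # "if" side: issued if any lane needs it
        passes += 1
        for lane in range(lanes):
            if mask[lane]:         # masked-off lanes discard the result
                output[lane] = threshold

    if not all(mask):              # "else" side: issued if any lane needs it
        passes += 1
        for lane in range(lanes):
            if not mask[lane]:
                output[lane] = brightness[lane]

    return output, passes


divergent = simd_clamp_to_threshold([0.2, 0.9, 0.4, 0.7], 0.5)
uniform = simd_clamp_to_threshold([0.1, 0.2, 0.3, 0.4], 0.5)
```

In this model the divergent vector issues both sides of the branch, while the uniform vector issues only one.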


SIMD execution: SIMD instructions?

Programming SIMD with SIMD instructions

– Most vector instruction sets: SSE, AVX

– Intel’s Larrabee and Knights Corner

Programming SIMD with “scalar” instructions

– GPU shader languages

– GPU intermediate languages

– OpenCL

Programmed masking

– OpenCL compiling to SSE

– Pixel shaders compiling to Larrabee

Hardware controlled masking

– Current generation GPUs


Why does this matter for Compute?

Graphics code traditionally has relatively short shaders on large triangles

– The level of branch divergence overall will not be high

With graphics code you cannot necessarily control it

– SIMD batches are constructed by the hardware depending on the scene properties.

For OpenCL code you are defining your execution space

– You choose what work is performed by which work item

– You choose how to structure your algorithm to avoid this divergence
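To make the last two points concrete, the sketch below counts how many branch sides are issued for the same data under two assignments of work to work-items. It is a simplified model: the function name `branch_sides_issued` is hypothetical, and the 64-wide wavefront is an assumption taken from the HD5870 figures later in this deck.

```python
WAVEFRONT = 64  # wavefront width assumed here; the HD5870 slides use 64

def branch_sides_issued(flags, width=WAVEFRONT):
    """Count branch sides issued across all wavefronts.

    flags[i] is True when work-item i takes the "if" side. A wavefront
    whose work-items all agree issues one side; a mixed one issues both.
    """
    total = 0
    for start in range(0, len(flags), width):
        chunk = flags[start:start + width]
        total += 2 if (any(chunk) and not all(chunk)) else 1
    return total


# 256 work-items, half of which take the branch.
interleaved = [i % 2 == 0 for i in range(256)]   # neighbours disagree
grouped = sorted(interleaved)                    # same data, regrouped

interleaved_cost = branch_sides_issued(interleaved)
grouped_cost = branch_sides_issued(grouped)
```

With the interleaved layout every wavefront diverges; with the grouped layout none does, so half as many branch sides are issued in this model.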


Throughput execution and latency hiding


Covering pipeline latency

[Figure: timeline for lanes 0-3 showing instruction 0, a stall, and instruction 1]


Covering pipeline latency: logical vector

[Figure: timeline for lanes 0-3, 4-7, 8-11 and 12-15 showing instruction 0, instruction 1, and stalls]


Covering pipeline latency: ALU operations

Lanes 0-3 Instruction 0

Lanes 4-7 Instruction 0

Lanes 8-11 Instruction 0

Lanes 12-15 Instruction 0

Lanes 0-3 Instruction 1

Lanes 4-7 Instruction 1

Lanes 8-11 Instruction 1

Lanes 12-15 Instruction 1
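The schedule above can be reproduced with a small model. This is a sketch under assumptions that are not in the slides: one group of four lanes issues per cycle, a dependent instruction can issue four cycles after the previous one, and `issue_schedule` is a made-up helper name. It compares a vector that is only as wide as the hardware with a logical vector four times wider.

```python
def issue_schedule(groups, instructions=2, latency=4):
    """Return (cycle, group, instruction) issue slots and the stall count.

    Assumed model: one group of lanes issues per cycle, and a group may
    issue its next (dependent) instruction `latency` cycles after the
    previous one. Both numbers are illustrative, not measured.
    """
    ready = [0] * groups          # earliest cycle each group may issue
    schedule = []
    cycle = 0
    stalls = 0
    for instr in range(instructions):
        for group in range(groups):
            if ready[group] > cycle:          # dependency not ready yet
                stalls += ready[group] - cycle
                cycle = ready[group]
            schedule.append((cycle, group, instr))
            ready[group] = cycle + latency
            cycle += 1
    return schedule, stalls


_, stalls_one_group = issue_schedule(groups=1)        # lanes 0-3 only
sched, stalls_four_groups = issue_schedule(groups=4)  # lanes 0-15
```

With a single group the pipeline stalls between the two instructions; with four groups, by the time lanes 12-15 have issued instruction 0, lanes 0-3 are ready for instruction 1.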


Covering memory latency

[Figure: timeline for lanes 0-3 showing instruction 0, instruction 1, and a stall]


Covering memory latency: we still stall

[Figure: timeline for lanes 0-3, 4-7, 8-11 and 12-15 showing instruction 0, instruction 1, and stalls]


Covering memory latency: another wavefront

[Figure: timeline for lanes 0-3, 4-7, 8-11 and 12-15 showing instruction 0 and instruction 1, together with instruction 0 of another wavefront]
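A rough way to reason about how many wavefronts are needed is the back-of-envelope model below. Everything in it is an assumption for illustration: the function names are invented, each wavefront is taken to alternate a fixed amount of ALU work with one memory access, and the 40-cycle and 400-cycle figures are placeholders rather than measured values.

```python
def alu_utilisation(wavefronts, alu_cycles=40, memory_latency=400):
    """Fraction of cycles the ALUs are busy (assumed, simplified model).

    Each wavefront repeats: `alu_cycles` of ALU work, then one memory
    access during which it waits `memory_latency` cycles. While one
    wavefront waits, the engine runs any other wavefront that is ready.
    """
    period = alu_cycles + memory_latency      # one wavefront's full loop
    return min(1.0, wavefronts * alu_cycles / period)


def wavefronts_to_hide(alu_cycles=40, memory_latency=400):
    """Smallest wavefront count that keeps the ALUs fully busy."""
    period = alu_cycles + memory_latency
    return -(-period // alu_cycles)           # ceiling division


single = alu_utilisation(1)
needed = wavefronts_to_hide()
covered = alu_utilisation(needed)
```

Under these placeholder numbers a single wavefront leaves the ALUs mostly idle, and the model needs eleven wavefronts in flight before the stall disappears.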


Latency hiding in the SIMD engine

[Figure: a group of four pixels on four ALUs; later builds of the slide show a second group of four pixels]


A throughput-oriented SIMD engine

[Figure: four ALUs shared by two groups of four pixels, each group with its own state]


Adding the memory hierarchy

Unlike most CPUs, GPUs do not have vast cache hierarchies.

– Caches on CPUs allow primarily for lower access latency

Heavy multithreading reduces the latency requirement

– Latency is not an issue: we cover it with other threads

– Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces


Texture caches and local memory

Designed to support sharing between work items

– Reduce bandwidth, not latency

[Figure: a SIMD engine with its local memory, connected to global memory]

Global: high-latency loads, limited bandwidth.

Local: efficient random accesses, very high bandwidth.
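As a hedged illustration of "reduce bandwidth, not latency", the sketch below counts global-memory loads for a one-dimensional stencil, with and without staging the data in local memory. The function names, the 64-item work-group and the 3-point stencil are all chosen for illustration, and the naive count ignores any help from hardware caches.

```python
def global_loads_naive(group_size, radius):
    """Each work-item loads its own full neighbourhood from global memory."""
    return group_size * (2 * radius + 1)


def global_loads_with_local(group_size, radius):
    """The work-group loads each needed element once into local memory,
    then every work-item reads its neighbourhood from there."""
    return group_size + 2 * radius            # tile plus halo on each side


GROUP = 64   # work-group size, chosen for illustration
RADIUS = 1   # 3-point stencil: left neighbour, self, right neighbour

naive = global_loads_naive(GROUP, RADIUS)
shared = global_loads_with_local(GROUP, RADIUS)
```

The latency of each remaining global load is unchanged; what falls is the number of loads, and with it the bandwidth the work-group demands.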


A throughput-oriented SIMD engine

[Figure: a SIMD engine with fetch, decode and execute stages, four ALUs, two groups of four pixels each with its own state, and local storage/cache]


The GPU shader cores


The design space


The AMD Phenom™ II X6

6 cores

One state set per core

4-wide SIMD (actually there are two pipes)


The Intel i7 6-core variants

6 cores

Two state sets per core (SMT/Hyperthreading)

4-wide SIMD

[Figure: compared with the Phenom II X6]


Sun UltraSPARC T2

8 cores

Eight state sets per core

No SIMD

[Figure: compared with the Phenom II X6 and the Intel i7 6-core]


The AMD HD5870 GPU

20 cores

Up to 24 logical 64-wide SIMD state sets per core (the number depends on register requirements)

16-wide physical SIMD

[Figure: 20 cores, compared with the Phenom II X6, the Intel i7 6-core and the UltraSPARC T2]
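One way to compare the four designs is to multiply out the figures given on these slides. The sketch below does only that arithmetic; the function name `elements_in_flight` is invented, "No SIMD" is treated as a width of 1, and the HD5870 entry uses the "up to 24" state sets and the 64-wide logical vector.

```python
# (cores, state sets per core, SIMD width) as listed on the slides.
designs = {
    "AMD Phenom II X6": (6, 1, 4),
    "Intel i7 6-core": (6, 2, 4),
    "Sun UltraSPARC T2": (8, 8, 1),     # "No SIMD" treated as width 1
    "AMD Radeon HD5870": (20, 24, 64),  # up to 24 sets, 64-wide logical
}

def elements_in_flight(cores, state_sets, simd_width):
    """Data elements that can have live state across the chip,
    using only the figures quoted on the slides."""
    return cores * state_sets * simd_width

in_flight = {name: elements_in_flight(*d) for name, d in designs.items()}
```

The result is a rough measure of how much parallel work each design keeps resident, not a performance figure.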


The AMD Radeon HD5870 GPU


Features of the Radeon™ HD5870 architecture

Area                   334 mm²
Transistors            2.15 billion
Memory bandwidth       153 GB/sec
L2-L1 read bandwidth   512 bytes/clk
L1 bandwidth           1280 bytes/clk
Vector GPR             5.24 MB
LDS memory             640 kB
LDS bandwidth          2560 bytes/clk
Concurrent wavefronts  496
Shader (ALU units)     1600
Idle power             27 W
Max power              188 W

ATI Radeon™ HD 5870

2.72 Teraflops architecture
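Several of these headline numbers follow from per-engine figures given elsewhere in the deck, which the sketch below multiplies out. Two inputs are assumptions that the slides do not state: an 850 MHz engine clock, and counting a multiply-add as two floating-point operations per ALU per clock.

```python
SIMD_ENGINES = 20        # "20 cores" on the design-space slide
LANES_PER_ENGINE = 16    # 16 processing elements per SIMD engine
VLIW_WIDTH = 5           # 5-way VLIW per lane
LDS_PER_ENGINE_KB = 32   # 32kB Local Data Share per SIMD engine

# Assumptions not stated in the slides: an 850 MHz engine clock and a
# multiply-add counted as two floating-point operations.
CLOCK_GHZ = 0.85
FLOPS_PER_ALU_PER_CLOCK = 2

alus = SIMD_ENGINES * LANES_PER_ENGINE * VLIW_WIDTH
lds_total_kb = SIMD_ENGINES * LDS_PER_ENGINE_KB
teraflops = alus * FLOPS_PER_ALU_PER_CLOCK * CLOCK_GHZ / 1000.0

# Arithmetic intensity needed to be compute-bound rather than
# bandwidth-bound, using the 153 GB/sec figure from the table.
flops_per_byte = teraflops * 1000.0 / 153.0
```

Under these assumptions a kernel needs well over ten floating-point operations per byte fetched from memory before the ALUs, rather than memory bandwidth, become the limit.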


High level view

[Figure: block diagram of the chip. Blocks shown: Command Processor/Group Generator; two Sequencers; GDS; Rd cache crossbar; two Ins caches; SIMD Engines 0-9 and SIMD Engines 10-19; Write crossbar; Read L2 caches (8kB per memory channel); two Write combine caches; two R/W caches for global atomics; 8 channel memory controller; eight GDDR5 interfaces]


Clause execution

[Figure: the same block diagram as the high level view, shown next to the following ISA listing]

08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
      15  x: SETGT_INT   R0.x, R2.x, R4.x
      16  x: PREDNE_INT  ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
      17  x: SUB_INT     R5.x, R3.x, R2.x
          y: LSHL        ____, R3.x, (…).x
          z: LSHL        ____, R2.x, (…).x VEC_120
          t: MOV         R8.x, 0.0f
      18  x: SUB_INT     R6.x, PV17.y, PV17.z
          y: MOV         R8.y, 0.0f
          z: MOV         R8.z, 0.0f
          w: MOV         R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
      19  x: LSHL        ____, R5.x, (…).x
          w: ADD_INT     ____, R6.x, R1.x VEC_120
          t: ADD_INT     R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
      22  VFETCH R0.x___, R0.y, fc156 MEGA(4)
          FETCH_TYPE(NO_INDEX_OFFSET)
      23  VFETCH R1.x___, R0.z, fc156 MEGA(4)
          FETCH_TYPE(NO_INDEX_OFFSET)


The SIMD Engine

16 processing elements: physical vector

Executes two 64-element wavefronts over 4 cycles: logical vector

– Each lane executes a 5-way VLIW instruction

Lane masking and branching to support divergence

Hardware barrier support for up to 8 work groups per SIMD engine
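The relationship between the 16-wide physical vector, the 64-element logical vector and lane masking can be sketched with a small model. This is a simplified illustration, not a description of the hardware pipeline: the function names are hypothetical, and the mapping of work item i to lane i % 16 in cycle i // 16 is an assumption made for the example.

```python
LANES = 16          # physical vector width, from the slide
WAVEFRONT = 64      # logical vector width, from the slide

def run_wavefront(op, data, mask):
    """Apply op to every active element; return (results, cycles used)."""
    assert len(data) == len(mask) == WAVEFRONT
    out = list(data)
    cycles = 0
    for start in range(0, WAVEFRONT, LANES):   # one quarter-wavefront per cycle
        for lane in range(LANES):
            i = start + lane
            if mask[i]:                        # masked-off lanes keep their old value
                out[i] = op(data[i])
        cycles += 1
    return out, cycles

def run_branch(data, cond, then_op, else_op):
    """if/else under divergence: each side is issued with its own lane mask."""
    taken = [cond(x) for x in data]
    total = 0
    if any(taken):
        data, c = run_wavefront(then_op, data, taken)
        total += c
    not_taken = [not t for t in taken]
    if any(not_taken):
        data, c = run_wavefront(else_op, data, not_taken)
        total += c
    return data, total

# Every element takes the same path: 4 cycles in this model.
uniform, uniform_cycles = run_branch(list(range(64)), lambda x: x >= 0,
                                     lambda x: x * 2, lambda x: x + 1)
# Even and odd elements diverge: both paths are issued, 8 cycles in this model.
divergent, divergent_cycles = run_branch(list(range(64)), lambda x: x % 2 == 0,
                                         lambda x: x * 2, lambda x: x + 1)
```

In this model a wavefront whose work items disagree on a branch pays for both sides, which is why divergence within a wavefront is worth avoiding.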

[Diagram of one SIMD Engine, components shown: SEQ/Branch control; 32kB Local Data Share (32 banks with integer atomic units); 8kB Read L1; four Filter & Format units; four Address & Load units; Exp/Ld/Store.]

The Local Data Share

[Diagram of the Local Data Share, components shown: SIMD Engine (SEQ, PE0 to PE15); conflict detection and scheduling control; source collectors and return data staging; input address crossbar; banks B0 to B31; integer atomic units; read data crossbar; write data crossbar; pre-op return value.]

LDS features

High bandwidth: twice external bandwidth (1024b/clock compared with 512b/clock per SIMD).

– Fully coalesced reads, writes and atomics with optimization for broadcast reads

Low latency access

– 0 latency direct reads

– 1 VLIW instruction latency for indirect ops (8 real cycles in the pipeline)

Bank conflicts are detected and serialized by hardware
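To see how an access pattern turns into serialized passes, the following sketch counts conflicts for one group of LDS accesses. It is a simplified model under stated assumptions: banks are taken to be 4 bytes wide and interleaved by word address (which is consistent with 32 banks delivering 1024b/clock), and any number of reads of the same word is treated as a single broadcast access. The function name is hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32      # from the slide
BANK_WIDTH = 4      # bytes per bank word; an assumption (32 banks x 32 bits = 1024 b/clock)

def conflict_passes(byte_addresses):
    """Number of serialized passes needed to service one group of LDS accesses."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    # Accesses to the same word share one access (broadcast); distinct words
    # in the same bank must be serviced one after another.
    return max((len(words) for words in words_per_bank.values()), default=0)

unit_stride = conflict_passes([4 * i for i in range(32)])         # one word per bank
stride_two = conflict_passes([8 * i for i in range(32)])          # two words per even bank
same_bank = conflict_passes([4 * 32 * i for i in range(32)])      # all in bank 0
broadcast = conflict_passes([0] * 32)                             # everyone reads one word
```

Under these assumptions unit-stride and broadcast accesses complete in one pass, a stride of two words takes two passes, and a stride of 32 words is fully serialized.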

The processing element (using OpenCL terms)

[Diagram of the processing element, blocks shown: Constants; Input Data; Fetch Return; Operand Preparation; General Purpose Registers; Fetch Addr/Data; Export Addr/Data; LDS Requests (LDS op 0, LDS op 1).]

First group of functional units:

4 32b FP FMA or MulAdd

1 64b FMA or Mul

2 64b FP Add

4 24b Int Mul or MulAdd

4 32b Int Add, Logical or Special

Second group of functional units:

1 32b FP MulAdd

1 32b Integer

1 32b Special (log, exp, rcp, sin…)
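Because each processing element issues up to five operations per VLIW instruction, how well the slots are filled depends on how many independent operations the kernel exposes. The sketch below is a deliberately naive, in-order packer written for illustration only: it ignores the differences between functional units listed above and the special dependent co-issue cases, and it is not how the AMD compiler schedules code. All names are hypothetical.

```python
SLOTS = 5   # 5-way VLIW, from the SIMD Engine slide

def pack(ops):
    """ops: list of (dest, sources) in program order. Returns a list of bundles."""
    bundles = []
    current, written = [], set()
    for dest, sources in ops:
        # Start a new bundle if this op reads or rewrites a value produced
        # in the current bundle, or if the bundle is already full.
        depends = any(s in written for s in sources) or dest in written
        if depends or len(current) == SLOTS:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

# Four independent component-wise multiplies fit in one bundle.
vector_ops = [("r.x", ("a.x", "b.x")), ("r.y", ("a.y", "b.y")),
              ("r.z", ("a.z", "b.z")), ("r.w", ("a.w", "b.w"))]
# A chain of dependent scalar operations needs one bundle per operation.
chain_ops = [("t0", ("a",)), ("t1", ("t0",)), ("t2", ("t1",))]

vector_bundles = pack(vector_ops)
chain_bundles = pack(chain_ops)
```

In this model the vector code uses four of five slots in a single instruction, while the dependent chain uses one slot in each of three instructions.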

Dependent operations

Co-issue of dependent operations in a single VLIW packet

– Full IEEE intermediate rounding & normalization

– Dot4 (A= A*B + C*D + E*F + G*H)

– Dual Dot2 (A = A*B + C*D; F = F*H + I*J)

– Dual dependent multiplies (A = B * C * D; F = G * H * I)

– Dual dependent adds (A = B + C + D; F = G + H + I)

– Dependent MulAdd (A = B * C + D + E * F)

24 bit integer

– Mul and MulAdd (4-way co-issue)

– Heavy use for workgroup address calculation
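The 24-bit integer path matters because index arithmetic such as group_id * local_size + local_id usually fits comfortably in 24 bits. OpenCL C exposes this through the `mul24` and `mad24` built-ins; the sketch below emulates them in Python to show the arithmetic. Treat it as one possible behaviour: the helper names are hypothetical, and in OpenCL the result is implementation-defined when an operand falls outside the signed 24-bit range, whereas this model simply discards the upper bits.

```python
def to_int24(x):
    """Keep the low 24 bits of x and interpret them as a signed value."""
    x &= 0xFFFFFF
    return x - 0x1000000 if x & 0x800000 else x

def to_int32(x):
    """Wrap x to a signed 32-bit result."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mul24(a, b):
    """Model of a 24-bit multiply producing a 32-bit result."""
    return to_int32(to_int24(a) * to_int24(b))

def mad24(a, b, c):
    """Model of a 24-bit multiply followed by a 32-bit add."""
    return to_int32(to_int24(a) * to_int24(b) + c)

def global_index(group_id, local_size, local_id):
    """Typical work-item address calculation expressed as one multiply-add."""
    return mad24(group_id, local_size, local_id)

index = global_index(3, 256, 17)
```

The design point is that all three inputs of a typical address calculation are small, so the four-way co-issued 24-bit units can be used instead of the slower full 32-bit multiply.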

The Global Data Share

[Diagram of the Global Data Share, components shown: dual arrays of SIMD Engines (SEQ, left bus, right bus); conflict detection and scheduling control; fast append counter control; left and right source collectors (4 work items/clock each) and return data staging; input address crossbar; banks B0 to B31; integer atomic units; read data crossbar; write data crossbar; pre-op return value.]

GDS features

Low latency access to a global shared memory

– 25 clocks latency

– Issued in separate clause in the same way as texture accesses

– Fully coalesced reads, writes and atomics, as in the LDS

Up to 8 work items can issue requests per clock

Driver allocation and initialization

Useful for low latency global reductions
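A global reduction of the kind mentioned above can be pictured as two levels: each work group reduces its own values locally, then contributes one atomic update to a shared location. The sketch below models that structure in plain Python. It is an illustration only: `SharedCounter` is a hypothetical stand-in for a word of globally shared memory, the code says nothing about how the GDS is allocated or programmed, and it assumes the group size is a power of two that divides the data length.

```python
def group_reduce(values):
    """Tree reduction within one work group, as it might be done in local memory."""
    scratch = list(values)
    stride = len(scratch) // 2
    while stride > 0:
        for i in range(stride):          # each i models one work item
            scratch[i] += scratch[i + stride]
        stride //= 2
    return scratch[0]

class SharedCounter:
    """Stand-in for a single word of globally shared memory with atomic add."""
    def __init__(self):
        self.value = 0

    def atomic_add(self, x):
        old = self.value
        self.value += x
        return old                        # pre-op return value

def global_reduce(data, group_size):
    """Reduce data with one local tree reduction and one atomic per work group."""
    counter = SharedCounter()
    for start in range(0, len(data), group_size):
        partial = group_reduce(data[start:start + group_size])
        counter.atomic_add(partial)
    return counter.value

total = global_reduce(list(range(1024)), 64)
```

With 1024 values and groups of 64, only 16 atomic updates reach the shared location, which is where a low-latency shared memory helps most.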

Summary

We’ve:

– looked at the basic principles of the GPU architecture in the processor design space

– seen some of the tradeoffs that lead to GPU features

– gone over the basic features of the HD 5870 architecture that affect compute applications

Later talks will go into detail on GPU optimizations that build on these architectural features

Questions and Answers

Visit the OpenCL Zone on developer.amd.com

http://developer.amd.com/zones/OpenCLZone/

The OpenCL Programming Webinars page includes:

Schedule of upcoming webinars

On-demand versions of this and past webinars

Slide decks of this and past webinars

– Upcoming webinars include

OpenCL Programming In Detail

Real World Application Example

Optimization Techniques

And Device Fission Extensions for OpenCL

Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2009 Advanced Micro Devices, Inc. All rights reserved.