HSA AND THE MODERN GPU

Lee Howes

AMD Heterogeneous System Software



Page 1:

HSA AND THE MODERN GPU

Lee Howes

AMD Heterogeneous System Software

Page 2:

2 | IEEE Workshop: Data Parallelism for Multi-Core Chips and GPU | 27 October 2012 | Public

HSA AND THE MODERN GPU

In this brief talk we will cover three topics:

– Changes to the shader core and memory system

– Changes to the use of pointers

– Architected definitions to use these new features

We'll look both at how the hardware is becoming more flexible and at how the changes will benefit OpenCL implementations.

Page 3:

THE HD7970

AND

GRAPHICS CORE NEXT

Page 4:

GPU EXECUTION AS WAS

We often view GPU programming as a set of independent threads, more reasonably known as “work items” in OpenCL:

kernel void blah(global float *input, global float *output) {
    output[get_global_id(0)] = input[get_global_id(0)];
}

We flatten this to an intermediate language known as AMD IL:

Note that AMD IL contains short vector instructions

mov r255, r1021.xyz0

mov r255, r255.x000

mov r256, l9.xxxx

ishl r255.x___, r255.xxxx, r256.xxxx

iadd r253.x___, r2.xxxx, r255.xxxx

mov r255, r1022.xyz0

mov r255, r255.x000

ishl r255.x___, r255.xxxx, r256.xxxx

iadd r254.x___, r1.xxxx, r255.xxxx

mov r1010.x___, r254.xxxx

uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx

mov r254.x___, r1011.xxxx

mov r1011.x___, r254.xxxx

mov r1010.x___, r253.xxxx

uav_raw_store_id(11) mem.x___, r1010.xxxx, r1011.xxxx

ret
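The work-item model the kernel relies on can be sketched in plain Python: conceptually, the runtime launches one instance of the kernel body per global id. This is a sketch of the semantics only, not of how hardware executes it; `enqueue_nd_range` is a made-up name standing in for the OpenCL runtime's dispatch.

```python
# Sketch of OpenCL work-item semantics: one kernel instance per global id.
def blah(input_buf, output_buf, global_id):
    # Body of the kernel above: each work item copies one element.
    output_buf[global_id] = input_buf[global_id]

def enqueue_nd_range(kernel, global_size, *buffers):
    # The runtime conceptually runs every work item independently.
    for gid in range(global_size):
        kernel(*buffers, gid)

src = [float(i) for i in range(8)]
dst = [0.0] * 8
enqueue_nd_range(blah, 8, src, dst)
# dst is now an element-for-element copy of src
```

The independence of work items in this model is what gives the hardware freedom to group and schedule them however it likes, which is the subject of the next slides.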

Page 5:

MAPPING TO THE HARDWARE

The GPU hardware of course does not execute those work items as threads

The reality is that high-end GPUs follow a SIMD architecture

– Each work item describes a lane of execution

– Multiple work items execute together in SIMD fashion with a single program counter

– Some clever automated stack management handles divergent control flow across the vector
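The lockstep execution the bullets describe can be modeled in a few lines of Python: one instruction stream, applied to every lane of the vector in turn. The SIMD width and the "instructions" below are purely illustrative.

```python
# Sketch: work items grouped into a SIMD vector share one instruction stream.
WIDTH = 4  # hypothetical SIMD width for illustration

def run_lockstep(program, lanes):
    # A single program counter: each instruction is applied to all lanes
    # before the next instruction issues.
    for instr in program:
        for lane in lanes:
            instr(lane)
    return lanes

# Each "instruction" transforms per-lane register state.
program = [
    lambda r: r.update(tmp=r["gid"] * 2),  # e.g. address computation
    lambda r: r.update(out=r["tmp"] + 1),  # e.g. a dependent ALU op
]
lanes = [{"gid": i} for i in range(WIDTH)]
run_lockstep(program, lanes)
# lane i ends with out == 2*i + 1
```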


Page 11:

IT WAS NEVER QUITE THAT SIMPLE

The HD6970 architecture and its predecessors were combined multicore SIMD/VLIW machines

– Data-parallel through hardware vectorization

– Instruction-parallel through both multiple cores and VLIW units

The HD6970 issued a 4-way VLIW instruction per work item

– Architecturally you could view that as a 4-way VLIW instruction issue per SIMD lane

– Alternatively you could view it as a 4-way VLIW issue of SIMD instructions
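The 4-way VLIW issue can be sketched as packets of up to four independent operations (the x, y, z, w slots) retiring one packet per cycle. The schedules below are hypothetical; the point is that dependent operations cannot share a packet, so serial code leaves slots empty.

```python
# Sketch of 4-way VLIW issue: the compiler statically packs up to four
# independent operations into one packet, and one packet issues per cycle.
def issue(packets):
    cycles = 0
    for packet in packets:
        assert len(packet) <= 4  # at most one op per slot (x, y, z, w)
        cycles += 1  # all ops in a packet execute in the same cycle
    return cycles

# Hypothetical schedules: a dependent chain vs. four independent ops.
serial = [["mul"], ["add"], ["shift"]]     # 3 cycles, 3 of 12 slots used
packed = [["mul", "add", "shift", "mov"]]  # 1 cycle, all 4 slots used
```

Whether the compiler can fill the slots depends entirely on how much instruction-level parallelism the kernel exposes, which is the inefficiency the following slides quantify.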


Page 16:

WHAT DOES THAT MEAN TO THE PROGRAMMER?

The IL we saw earlier ends up compiling to something like this:

; -------- Disassembly --------------------

00 ALU: ADDR(32) CNT(9) KCACHE0(CB1:0-15) KCACHE1(CB0:0-15)

0 w: LSHL ____, R0.x, 2

1 z: ADD_INT ____, KC0[0].x, PV0.w

2 y: LSHR R0.y, PV1.z, 2

3 x: MULLO_INT R1.x, R1.x, KC1[1].x

y: MULLO_INT ____, R1.x, KC1[1].x

z: MULLO_INT ____, R1.x, KC1[1].x

w: MULLO_INT ____, R1.x, KC1[1].x

01 TEX: ADDR(48) CNT(1)

4 VFETCH R2.x___, R0.y, fc153

FETCH_TYPE(NO_INDEX_OFFSET)

02 ALU: ADDR(41) CNT(7) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)

5 w: ADD_INT ____, R0.x, R1.x

6 z: ADD_INT ____, PV5.w, KC0[6].x

7 y: LSHL ____, PV6.z, 2

8 x: ADD_INT ____, KC1[1].x, PV7.y

9 x: LSHR R0.x, PV8.x, 2

03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0].x___, R2, ARRAY_SIZE(4) MARK VPM

04 END

END_OF_PROGRAM

Clause body: units of work dispatched by the shared scalar unit

Clause header: work executed by the shared scalar unit

VLIW instruction packet: compiler-generated instruction-level parallelism for the VLIW unit. Each instruction (x, y, z, w) is executed across the vector.

Notice the poor occupancy of VLIW slots
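That occupancy complaint can be put into numbers. Reading the first ALU clause above as four packets, three use a single slot each and one (the quadruplicated MULLO_INT) uses all four; treat the slot counts as illustrative rather than authoritative.

```python
# Count VLIW slot occupancy: each packet has 4 slots (x, y, z, w).
def occupancy(packets):
    used = sum(len(p) for p in packets)
    total = 4 * len(packets)
    return used / total

# Slots used by the four packets of ALU clause 00, as read off the listing.
clause_00 = [["w"], ["z"], ["y"], ["x", "y", "z", "w"]]
print(occupancy(clause_00))  # 7/16 = 0.4375
```

Less than half the issue slots do useful work even in this tiny kernel, which motivates the architectural change the talk turns to next.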

Page 17:

WHY DID WE SEE INEFFICIENCY?

The architecture was well suited to graphics workloads:

– VLIW was easily filled by the vector-heavy graphics kernels

– Minimal control flow meant that the monolithic, shared thread scheduler was relatively efficient

Unfortunately, workloads change with time.

So how did we change the architecture to improve the situation?


Page 20:

AMD RADEON HD7970 - GLOBALLY

Brand new – but at this level it doesn’t look too different

Two command processors

– Capable of processing two command queues concurrently

Full read/write L1 data caches

SIMD cores grouped in fours

– Scalar data and instruction cache per cluster

– L1, LDS and scalar processor per core

Up to 32 cores / compute units
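Multiplying those figures out gives the chip's peak lane count. This is a back-of-the-envelope sketch; the four 16-wide vector units per compute unit are described on a later slide.

```python
# Back-of-the-envelope lane count for a full HD7970-class chip.
compute_units = 32   # up to 32 cores / compute units
simds_per_cu = 4     # four vector units per compute unit
lanes_per_simd = 16  # each vector unit is 16 lanes wide
total_lanes = compute_units * simds_per_cu * lanes_per_simd
print(total_lanes)  # 2048
```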


Page 22:

THE SIMD CORE

The SIMD unit on the HD6970 architecture had branch control, but full scalar execution was performed globally.



Page 25:

THE SIMD CORE

On the HD7970 we have a full scalar processor and the L1 cache and LDS have been doubled in size

Then let us consider the VLIW ALUs


Page 27:

THE SIMD CORE

Remember we could view the architecture two ways:

– An array of VLIW units

– A VLIW cluster of vector units

Page 28:

THE SIMD CORE

Now that we have a scalar processor we can dynamically schedule instructions rather than relying on the compiler

No VLIW!

The heart of Graphics Core Next:

– A scalar processor with four 16-wide vector units

– Each lane of the vector, and hence each IL work item, is now scalar
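One way to picture the scalar processor feeding four vector units is a round-robin issue loop: each cycle, one SIMD gets an issue slot. This is a loose sketch of dynamic issue replacing static VLIW packing, not a description of the actual GCN scheduler; all names are illustrative.

```python
# Sketch: one scalar processor issuing to four vector units round-robin.
from collections import deque

def schedule(wavefront_queues, n_cycles):
    issued = []
    for cycle in range(n_cycles):
        simd = cycle % 4  # each SIMD gets an issue slot in turn
        queue = wavefront_queues[simd]
        if queue:
            issued.append((cycle, simd, queue.popleft()))
    return issued

# Four SIMDs, each with a short stream of pending vector instructions.
queues = [deque(["v_add", "v_mul"]) for _ in range(4)]
trace = schedule(queues, 8)
# trace interleaves the four streams, one issue per cycle
```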

Page 29:

THE NEW ISA AND BRANCHING

float fn0(float a, float b)
{
    if (a > b)
        return (a - b) * a;
    else
        return (b - a) * b;
}

//Registers: r0 contains "a", r1 contains "b"
//Value is returned in r2
v_cmp_gt_f32 r0,r1       //a > b, establish VCC
s_mov_b64 s0,exec        //Save current exec mask
s_and_b64 exec,vcc,exec  //Do "if"
s_cbranch_vccz label0    //Branch if all lanes fail
v_sub_f32 r2,r0,r1       //result = a - b
v_mul_f32 r2,r2,r0       //result = result * a
label0:
s_andn2_b64 exec,s0,exec //Do "else" (s0 & !exec)
s_cbranch_execz label1   //Branch if all lanes fail
v_sub_f32 r2,r1,r0       //result = b - a
v_mul_f32 r2,r2,r1       //result = result * b
label1:
s_mov_b64 exec,s0        //Restore exec mask

Simpler and more efficient

Instructions for both sets of execution units inline

No VLIW

– Fewer compiler-induced bubbles in the instruction schedule

Full support for exceptions, function calls and recursion
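The exec-mask sequence in the listing can be emulated per lane to check the semantics. This Python sketch mirrors the masking logic, not the hardware: both sides of the branch run, with the mask selecting which lanes commit results.

```python
# Emulate the if/else above with an execution mask across SIMD lanes.
def fn0_simd(a, b):
    n = len(a)
    r2 = [0.0] * n
    vcc = [a[i] > b[i] for i in range(n)]  # v_cmp_gt_f32: per-lane compare
    saved = [True] * n                      # s_mov_b64 s0, exec
    exec_mask = [saved[i] and vcc[i] for i in range(n)]  # "if" lanes
    if any(exec_mask):                      # s_cbranch_vccz: skip if all fail
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (a[i] - b[i]) * a[i]
    # s_andn2_b64: "else" lanes are saved & !if-mask
    exec_mask = [saved[i] and not exec_mask[i] for i in range(n)]
    if any(exec_mask):                      # s_cbranch_execz
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (b[i] - a[i]) * b[i]
    return r2                               # exec restored implicitly

print(fn0_simd([3.0, 1.0], [2.0, 5.0]))  # [3.0, 20.0]
```

Lane 0 takes the "if" path ((3-2)*3) and lane 1 the "else" path ((5-1)*5), matching what fn0 returns for each pair of scalar inputs.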



Page 35: HSA AND THE MODERN GPU - Lee Howes · HSA AND THE MODERN GPU In this brief talk we will cover three topics: –Changes to the shader core and memory system –Changes to the use of

35 | IEEE Workshop: Data Parallelism for Multi-Core Chips and GPU | 27 October 2012 | Public

64kB read-write L1 cache

Instruction decode etc

512kB read-write L2 cacheFMISC SSE

FMUL SSE

FADD SSEScalarunits

(3 scalar ALUs, branch control etc)

FAMILIAR?

If we add the frontend of the core…

64kB Local Data Share: 32 banks with integer atomic units

16kB read-write L1 cache

Scalarprocessor

Instruction decode etc

“Graphics Core Next” core

“Barcelona” core

v_cmp_gt_f32 r0,r1 //a > b, establish VCC

s_mov_b64 s0,exec //Save current mask

s_and_b64 exec,vcc,exec //Do “if”

s_cbranch_vccz label0 //Branch if all fail

v_sub_f32 r2,r0,r1 //result = a – b

v_mul_f32 r2,r2,r0 //result=result * a

How would an implicitly vectorized program look mapped onto here using SSE

instructions?

So different?
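The mapping the question gestures at can be sketched by executing the earlier copy kernel over work items in fixed-width chunks, the way implicit vectorization would pack work items into 4-wide SSE lanes (or 64-wide GCN lanes). A Python sketch; the helper name and the 4-wide default are illustrative:

```python
def run_kernel_vectorized(input_data, width=4):
    """Run the copy 'kernel' (output[gid] = input[gid]) over all
    work items, width lanes at a time, the way an implicitly
    vectorized mapping would issue one SIMD instruction per chunk."""
    output = [0.0] * len(input_data)
    for base in range(0, len(input_data), width):
        # one 'vector instruction' covers lanes base..base+width-1
        for lane in range(base, min(base + width, len(input_data))):
            output[lane] = input_data[lane]
    return output
```

The inner loop is what a single SSE (width=4) or GCN (width=64) instruction performs in one step; only the lane width differs between the two cores.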

Page 36

SHARED VIRTUAL MEMORY

Page 37

TAKING THE MEMORY SYSTEM CHANGES FURTHER AFIELD

GCN-based devices are more flexible

– Start to look like CPUs with few obvious shortcomings

The read/write cache is a good start at improving the memory system

– Improves efficiency on the device

– Provides a buffer for imperfectly written code

We needed to go a step further on an SoC

– Memory in those caches should be the same memory used by the “host” CPU

– In the long run, the CPU and GPU become peers, rather than having a host/slave relationship

Page 38

WHAT DOES THIS MEAN?

We can store x86 virtual pointers here

Data stored here is addressed in the same way as that on the CPU

We can map this into the same virtual address space

We can perform work on CPU data directly here

Page 43

USE CASES FOR THIS ARE FAIRLY OBVIOUS

Pointer chasing algorithms with mixed GPU/CPU use

Algorithms that construct data on the CPU, use it on the GPU

Allows for more fine-grained data use without explicit copies

Covers cases where explicit copies are difficult:

– Picture OS-allocated data that the OpenCL runtime doesn’t know about
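One way to see why explicit copies are difficult for pointer-chasing workloads: a linked structure has to be flattened before it can be copied to a discrete device, while shared virtual memory lets the device walk the host's nodes directly. A hedged Python sketch (class and helper names are invented for illustration):

```python
class Node:
    """A host-built linked-list node; 'next_node' is a host pointer."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def flatten_for_copy(head):
    """Copy-based offload: the linked structure must be serialized
    into a flat buffer, because pointers are meaningless after a
    copy into a separate device address space."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next_node
    return out

def sum_shared(head):
    """Shared-virtual-memory style: the 'device' follows the very
    same pointers the host built; no serialization or fix-up."""
    total = 0
    while head is not None:
        total += head.value
        head = head.next_node
    return total

head = Node(1, Node(2, Node(3)))
```

The serialization step in `flatten_for_copy` is exactly the work (and the extra memory traffic) that a shared address space removes.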

However, that wasn’t quite enough to achieve our goals…

Page 44

SO WHAT ELSE DO WE NEED?

We need a global view of the GPU, not just of the shader cores

Of course, we need to see the data here too

Most importantly! We need to be able to compute on the same data here.

Let’s look at how GPU work dispatch works currently

Page 49

TODAY’S COMMAND AND DISPATCH FLOW

For each application (A, B, C), command flow and data flow pass through the full software stack before work reaches the GPU:

Application -> Direct3D -> User Mode Driver -> Command Buffer -> Kernel Mode Driver -> Soft Queue -> DMA Buffer -> Hardware Queue -> GPU HARDWARE

– Data copied into kernel memory
– Queue packets explicitly DMAd to the device

Page 52

FUTURE COMMAND AND DISPATCH FLOW

Each application (A, B, C) owns its own hardware queue, optionally fed through a dispatch buffer, and the GPU hardware consumes packets from those queues directly:

Application -> Hardware Queue (via Optional Dispatch Buffer) -> GPU HARDWARE

– Queues are in user-mode virtual memory
– Command processor can read queues directly

No required APIs. No Soft Queues. No User Mode Drivers. No Kernel Mode Transitions. No Overhead!

– Application codes to the hardware
– User mode queuing
– Hardware scheduling
– Low dispatch times
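The user-mode queuing model above can be sketched with a queue owned by the application and a thread standing in for the command processor. This is illustrative only: the names are invented, and a real implementation would use a ring buffer in shared user-space memory rather than `queue.Queue`:

```python
import threading
import queue

def run_dispatch(num_packets):
    """Sketch of user-mode dispatch: the application writes packets
    straight into a queue in its own memory, and a 'command
    processor' thread consumes them directly; no driver call sits
    on the submission path."""
    user_queue = queue.Queue()   # stands in for a user-space ring buffer
    completed = []

    def command_processor():
        while True:
            packet = user_queue.get()   # hardware reads the queue directly
            if packet is None:          # sentinel: queue torn down
                break
            completed.append(packet["kernel"])

    t = threading.Thread(target=command_processor)
    t.start()
    for i in range(num_packets):        # submission is just a memory write
        user_queue.put({"kernel": f"kernel_{i}"})
    user_queue.put(None)
    t.join()
    return completed
```

The point of the sketch is the submission path: putting a packet is an ordinary store into application-visible memory, with no user-mode-driver or kernel-mode transition in between.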

Page 56

ARCHITECTED ACCESS

Page 57

HETEROGENEOUS SYSTEM ARCHITECTURE – AN OPEN PLATFORM

Open Architecture, published specifications

– HSAIL virtual ISA

– HSA memory model

– Architected Queuing Language

HSA system architecture

– Inviting partners to join us, in all areas

– Hardware companies

– Operating Systems

– Tools and Middleware

– Applications

HSA Foundation being formed

Page 58

ARCHITECTED INTERFACES

Standardize interfaces to features of the system
– The compute cores
– The memory hierarchy
– Work dispatch

Standardize access to the device
– Memory backed queues
– User space data structures

Page 59

HSA INTERMEDIATE LAYER - HSAIL

HSAIL is a virtual ISA for parallel programs
– Finalized to ISA by a runtime compiler or “Finalizer”

Explicitly parallel
– Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Syscall methods
– GPU code can call directly to system services, IO, printf, etc

Debugging support

Page 60

HSA MEMORY MODEL

Designed to be compatible with C++11, Java and .NET memory models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:
– Load.Acquire
– Store.Release
– Barriers
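The Load.Acquire / Store.Release pairing above is the classic message-passing pattern: the producer publishes data with a release store, and the consumer may only read it after an acquire load observes the publication. A small Python sketch, using `threading.Event` as a stand-in for the release/acquire pair since Python does not expose memory orderings directly:

```python
import threading

def message_pass():
    """Release/acquire message passing. Setting 'flag' plays the role
    of Store.Release (publish the data written before it); waiting on
    'flag' plays the role of Load.Acquire (everything published is
    visible after the wait returns)."""
    data = []
    flag = threading.Event()

    def producer():
        data.append(42)     # ordinary store
        flag.set()          # Store.Release: publishes the data

    def consumer(out):
        flag.wait()         # Load.Acquire: blocks until publication
        out.append(data[0]) # guaranteed to see the store above

    out = []
    t_cons = threading.Thread(target=consumer, args=(out,))
    t_prod = threading.Thread(target=producer)
    t_cons.start()
    t_prod.start()
    t_prod.join()
    t_cons.join()
    return out[0]
```

Under a relaxed model, without the acquire/release pair the consumer could legally observe the flag before the data; the pair is what makes this pattern safe.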

Page 61

ARCHITECTED QUEUING LANGUAGE

Defines dispatch characteristics in a small packet in memory
– Platform neutral work offload

Designed to be interpreted by the device
– Firmware implementations
– Or directly implemented in hardware
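As a rough illustration of "dispatch characteristics in a small packet in memory", the sketch below packs and unpacks a dispatch-style packet with Python's `struct` module. The field layout here is invented purely for illustration; the real AQL packet format is defined by the HSA specification, not by this sketch:

```python
import struct

# Hypothetical little-endian packet layout (NOT the real AQL format):
# type (u16), flags (u16), grid x/y/z (u32 each), kernel address (u64)
PACKET_FMT = "<HHIIIQ"

def encode_dispatch(grid, kernel_addr, packet_type=2, flags=0):
    """What an application does: write a dispatch packet into
    queue memory as a few plain stores."""
    gx, gy, gz = grid
    return struct.pack(PACKET_FMT, packet_type, flags, gx, gy, gz, kernel_addr)

def decode_dispatch(raw):
    """What a command processor (firmware or hardware) does:
    interpret the packet straight out of memory."""
    t, f, gx, gy, gz, addr = struct.unpack(PACKET_FMT, raw)
    return {"type": t, "grid": (gx, gy, gz), "kernel_addr": addr}
```

Because the packet is just bytes in memory with an architected layout, any agent that can read that memory, on any vendor's hardware, can interpret the dispatch; that is the platform-neutral part.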

Page 62

QUEUES

User space memory allows queues to span devices

Standardized packet format (AQL) enables flexible and portable use

Single consumer, multiple producer of work

– Enables support for task queuing runtimes and device->self enqueue

Queue
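The single-consumer, multiple-producer property can be sketched with several producer threads feeding one consumer. Illustrative Python only; a real AQL queue would live in user-space memory and be drained by the command processor:

```python
import threading
import queue

def multi_producer_dispatch(num_producers, packets_each):
    """Single-consumer / multiple-producer queue: several agents
    (CPU threads, a task-queuing runtime, or a kernel enqueueing
    to itself) push packets, and exactly one consumer drains them."""
    q = queue.Queue()
    drained = []

    def producer(pid):
        for i in range(packets_each):
            q.put((pid, i))             # any agent may write packets

    def consumer():
        for _ in range(num_producers * packets_each):
            drained.append(q.get())     # exactly one consumer drains

    c = threading.Thread(target=consumer)
    c.start()
    producers = [threading.Thread(target=producer, args=(p,))
                 for p in range(num_producers)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    c.join()
    return drained
```

Restricting each queue to one consumer keeps the hardware side simple, while leaving the producer side open is what enables task-queuing runtimes and device-to-self enqueue.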


Page 70

Hardware - APUs, CPUs, GPUs

Legend: AMD user mode component; AMD kernel mode component; all others contributed by third parties or AMD

Driver Stack (today):
– Apps
– Domain Libraries
– OpenCL™ 1.x, DX Runtimes, User Mode Drivers
– Graphics Kernel Mode Driver

HSA Software Stack:
– Apps
– Task Queuing Libraries, HSA Domain Libraries
– HSA Runtime, HSA JIT
– HSA Kernel Mode Driver

Page 71

OPENCL™ AND HSA

HSA is an optimized platform architecture for OpenCL™
– Not an alternative to OpenCL™

OpenCL™ on HSA will benefit from
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Pointers shared between CPU and GPU

HSA also exposes a lower level programming interface, for those that want the ultimate in control and performance
– Optimized libraries may choose the lower level interface

Page 72

QUESTIONS

Page 73

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

© 2011 Advanced Micro Devices, Inc.