ATI Stream Computing ATI Intermediate Language (IL) Micah Villmow May 30, 2008


| ATI Stream Computing Update | Confidential2 2 | ATI Stream Computing – ATI Intermediate Language (IL)

ATI IL – What is it?

• Device-agnostic, forward-compatible language

• Called Intermediate Language

• Portable ISA

• Can write for lowest common denominator

• First level to expose new ATI CAL features

• Allows finely-detailed optimizations

• Based on Microsoft® DirectX® 9.0 Shader Language


Outline

• Pipeline – A quick recap

• Instructions

– Setup and teardown

– ALU

– Texture units

– Memory access

– Functions

– Flow Control

• Examples

• Future additions


Pixel Pipeline

• IL instructions modify the state of the various stages of the pipeline

• Declarations instruct the setup engine how to set up the graphics card correctly

• ALU instructions instruct the stream processing units what to calculate

• TEX instructions instruct the texture units what data to fetch

• Global buffer accesses instruct the shader export path to get correct data

• Color buffer instructions send data through the render backends


Compute Pipeline

• ATI Radeon™ HD 4800 Series GPUs introduce compute shader

• Pipeline now includes the Local Data Share (LDS), Global Data Share (GDS), and Shared Registers (SR)

• Dedicated L1 per SIMD on ATI Radeon™ HD 4800 Series GPUs


ATI IL Instruction Syntax

The language used to write CAL shaders

A portable intermediate language for AMD GPUs

Resembles DirectX® assembly

ATI IL kernel follows basic pattern of:

1. Setup state

2. Read texture data

3. Compute results

4. Write results


ATI IL Instructions - Setup

Shader Type:

il_ps_2_0 – IL Pixel Shader version 2.0

il_cs_2_0 – IL Compute Shader version 2.0

Inputs:

dcl_input_position_interp(linear_noperspective)_centered vWinCoord0.xy__ - Interpolated X/Y float coordinates

dcl_input vObjIndex0 - Auto-indexed integer value

Outputs:

dcl_output_generic oN - Declare that color output buffer number N will be used, max N is 8 on R6XX based cards and 16 on R7XX

Constants:

dcl_cb cbN[X] – Declare that constant buffer N will be used of size X, N is between 0-14, max X is 4096

Literals:

dcl_literal lN, <NUM>, <NUM>, <NUM>, <NUM> - Declare that literal number N will be used with four values

Resources:

dcl_resource_id(N)_type([1d|2d],[unnorm|norm])_fmtx(TYPE)_fmty(TYPE)_fmtz(TYPE)_fmtw(TYPE)

Scratch Buffer:

dcl_indexed_temp_array N[X] – Declare that scratch buffer N will be used of size X, max size 4096

Compute Shader:

dcl_num_thread_per_group N - Declare that N threads will be working together in one group

Local Data Share:

dcl_lds_size_per_thread N – Declare that each thread will use N dwords of LDS, must be multiple of 4 and <= 64

dcl_lds_sharing_mode _wavefront[Rel|Abs] – Declare that sharing mode of LDS uses relative or absolute addressing

Global Shared Registers:

dcl_shared_temp srN – Declare that the kernel will use N shared registers.
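The numeric limits quoted in the declarations above can be collected into a small checker. This is only a sketch in Python, not part of any ATI tool: the function names are hypothetical, the 0-14 range is read as inclusive, and the lower bounds (at least one constant-buffer entry, a positive LDS size) are assumptions the slide does not state.

```python
def check_cb(n: int, size: int) -> bool:
    """True if `dcl_cb cbN[size]` fits the limits quoted on the slide."""
    # Slide: N is between 0-14 (read here as inclusive), max size is 4096.
    # Assumption: a buffer needs at least one entry.
    return 0 <= n <= 14 and 1 <= size <= 4096


def check_lds_size_per_thread(dwords: int) -> bool:
    """True if `dcl_lds_size_per_thread dwords` fits the limits on the slide."""
    # Slide: must be a multiple of 4 and <= 64. Assumption: must be positive.
    return dwords > 0 and dwords % 4 == 0 and dwords <= 64


print(check_cb(0, 2))                  # True
print(check_cb(15, 1))                 # False
print(check_lds_size_per_thread(4))    # True
print(check_lds_size_per_thread(6))    # False
```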


ATI IL Instructions - Registers

vObjIndex0.x – Integer register that stores the index of the thread within the domain

vWinCoord0.xy – Floating point register that stores the Euclidean coordinates of the thread within the domain

r# - General Purpose Registers that are 128 bits wide

x#[idx] – Scratch buffer register to read/write at offset idx

l# - Literal register that is 128 bits wide

cb#[idx] – constant buffer access to read from offset idx

g[idx] – Global buffer read/write register

o# - Output buffer register


ATI IL Instructions – Syntax/Opcodes

Instruction syntax:

<opcode>[_<ctrl>][_<ctrl(val)>] <= opcode with specifiers

[<dst>[_<mod>][.<write-mask>]] <= dst with modifier/mask

[, <src>[_<mod>][.<swizzle-mask>]] <= src with modifier/mask

Sample Opcodes

ALU:

– mad r0, r1, r2, r3 // r0 = r1 * r2 + r3

– dmul r0.xy, r1.xy, r2.xy // multiply on doubles: r0.xy = r1.xy * r2.xy

TEX:

– sample_resource(0)_sampler(0) r0, vWinCoord0.xy00

– sample_l_resource(0)_sampler(0) r0, vWinCoord0.xy00, r0.1000 // sample instruction required in loops

MEM:

– lds_read_vec r0, vTid0.x0 // read from the current thread's LDS space at offset 0

– lds_write_vec mem.xy__, vaTid0.xxxx // write the absolute thread id


ATI IL Instructions – Write Masks/Read Swizzles

Write Masks: Each destination can have a write mask

There are four possible combinations for each component

– Component – The original component position, which means write results

– ‘_’ – Do not write the results of this component to the register

– ‘0’ – Write the value 0.0f to the destination component

– ‘1’ – Write the value 1.0f to the destination component

Example: “mov r0.x10w, vWinCoord0.xy” copies the x element over, writes 1.0f to y and 0.0f to z, and places the y element in the w component of r0.

Read Swizzles: Each source register can have a read swizzle

The read swizzle reorders the way in which data is read

A short read swizzle is extended to four components by repeating its last component, i.e. r0.xy is equivalent to r0.xyyy

Each component takes one of 6 options (x, y, z, w, 0, or 1)

– Component – Each position in the 4-vector can have a component specified, i.e. {xyzw}, and there is no restriction on ordering

– ‘1’ – Use the value 1.0f as the source component

– ‘0’ – Use the value 0.0f as the source component

Example “mov r0, r0.wzyx” – Reverses the data in a register
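The write-mask and read-swizzle rules above can be modeled in a few lines of Python. This is a rough sketch rather than the hardware's behavior: registers are plain four-element lists, the helper names `read_swizzle` and `write_masked` are made up for illustration, and the write mask is assumed to be given in its full four-character form.

```python
COMPONENTS = "xyzw"


def read_swizzle(reg, swizzle="xyzw"):
    """Return the 4-vector read from `reg` through a swizzle such as "xy" or "wzyx"."""
    # A short swizzle is extended by repeating its last selector: "xy" -> "xyyy".
    swizzle = swizzle + swizzle[-1] * (4 - len(swizzle))
    out = []
    for sel in swizzle:
        if sel == "0":
            out.append(0.0)          # literal 0.0f as the source component
        elif sel == "1":
            out.append(1.0)          # literal 1.0f as the source component
        else:
            out.append(reg[COMPONENTS.index(sel)])
    return out


def write_masked(dst, mask, value):
    """Return the contents of `dst` after writing `value` through a 4-character mask."""
    result = list(dst)
    for i, m in enumerate(mask):
        if m == "_":
            continue                 # keep the old destination component
        elif m == "0":
            result[i] = 0.0
        elif m == "1":
            result[i] = 1.0
        else:
            result[i] = value[i]     # component letter in its own position
    return result


win_coord = [3.0, 7.0, 0.0, 0.0]     # stand-in for vWinCoord0
r0 = write_masked([9.0, 9.0, 9.0, 9.0], "x10w", read_swizzle(win_coord, "xy"))
print(r0)                                             # [3.0, 1.0, 0.0, 7.0]
print(read_swizzle([1.0, 2.0, 3.0, 4.0], "wzyx"))     # [4.0, 3.0, 2.0, 1.0]
```

The first print reproduces the “mov r0.x10w, vWinCoord0.xy” example: the source swizzle expands to xyyy, so the w component receives the y element.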


ATI IL Instructions – Functions

Functions are possible in IL following a few constraints:

1. Must begin with “func <integer>”

2. Must end with “endfunc”

3. Must use “ret” before “endfunc”

4. Use “ret_dyn” only for early return

5. Must be placed after main function

Main function must use “endmain” if functions are in use

To call a function, use “call <integer>” or the conditional versions.
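The layout rules above lend themselves to a mechanical check. The Python sketch below is a hypothetical linter, not an ATI tool: it only looks at rules 1, 2, 3, and 5 plus the “endmain” requirement, it reads rule 3 as “ret” on the line directly before “endfunc”, and both kernel strings are made up for illustration.

```python
def check_function_layout(source):
    """Return a list of violations of the function rules listed above."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    problems = []
    func_starts = [i for i, ln in enumerate(lines) if ln.split()[0] == "func"]
    if not func_starts:
        return problems  # no functions, nothing to check

    if "endmain" not in lines:
        problems.append("functions are in use but main has no endmain")
    elif func_starts[0] < lines.index("endmain"):
        problems.append("function placed before the end of main")

    for start in func_starts:
        parts = lines[start].split()
        if len(parts) != 2 or not parts[1].isdigit():
            problems.append(f"'{lines[start]}': expected 'func <integer>'")
        if "endfunc" not in lines[start:]:
            problems.append(f"'{lines[start]}': missing endfunc")
            continue
        end = lines.index("endfunc", start)
        if lines[end - 1] != "ret":
            problems.append(f"'{lines[start]}': no ret before endfunc")
    return problems


# Hypothetical kernels, written only to exercise the checker.
GOOD = """
il_ps_2_0
call 0
endmain
func 0
mov r0, r0.0000
ret
endfunc
end
"""

BAD = """
il_ps_2_0
call 0
func 0
mov r0, r0.0000
endfunc
end
"""

print(check_function_layout(GOOD))  # []
for problem in check_function_layout(BAD):
    print(problem)
```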


ATI IL Instructions - Example

il_ps_2_0

dcl_literal l0, 0x40800000, 0x3f800000, 0x40000000, 0x40400000

dcl_cb cb0[2]

dcl_input_position_interp(linear_noperspective)_centered vWinCoord0.xy__

dcl_output_generic o0

dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

mul r0, vWinCoord0.xy00, l0.z100

add r17, r0.xyxy, l0.00y0

mov r33, r0.xyxy

div_zeroop(infinity) r33, r0.1111, r33

sample_resource(1)_sampler(1) r101, r17.xy00

sample_resource(1)_sampler(1) r102, r17.zw00

mul r35, r35, r33

add r101, r101, r35_neg(xyzw)

mad r35, r101, r42.xxxx, r35

mul r36, r36, r33

add r102, r102, r36_neg(xyzw)

mad r36, r102, r42.xxxx, r36

mad r19.x, r0.y, cb0[0].z, r0.x

ftoi r21.x, r19.x

mov o0, r35

mov g[r21.x], r36

ret

end

;PS
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(7) KCACHE0(CB0:0-15)
    0 x: MOV*2 R0.x, R0.x
      y: MOV R0.y, R0.y
      z: MULADD R0.z, R0.x, (0x40000000, 2.0f).x, 1.0f
    1 z: MULADD R4.z, PV0.y, KC0[0].z, PV0.x
      w: MOV R0.w, PV0.y
      t: RCP_e R1.z, PV0.x
01 TEX: ADDR(64) CNT(2) VALID_PIX
    2 SAMPLE R2, R0.xyxx, t1, s1 UNNORM(XYZW)
    3 SAMPLE R3, R0.zwzz, t1, s1 UNNORM(XYZW)
02 ALU: ADDR(39) CNT(22)
    4 x: MUL T2.x, 0.0f, R1.z
      t: RCP_e ____, R0.y
    5 x: ADD T0.x, R2.z, -PV4.x
      y: ADD T0.y, R3.x, -PV4.x
      z: ADD ____, R2.x, -PV4.x VEC_120
      w: MUL T0.w, 0.0f, PS4
      t: ADD T1.x, R3.z, -PV4.x
    6 x: ADD ____, R3.y, -PV5.w
      y: ADD ____, R2.y, -PV5.w VEC_120
      z: ADD T0.z, R3.w, -PV5.w
      w: ADD ____, R2.w, -PV5.w VEC_120
      t: MULADD R0.x, PV5.z, 0.0f, T2.x VEC_021
    7 x: MULADD R2.x, T0.y, 0.0f, T2.x
      y: MULADD R0.y, PV6.y, 0.0f, T0.w
      z: MULADD R0.z, T0.x, 0.0f, T2.x
      w: MULADD R0.w, PV6.w, 0.0f, T0.w
      t: MULADD R2.y, PV6.x, 0.0f, T0.w VEC_021
    8 z: MULADD R2.z, T1.x, 0.0f, T2.x
      w: MULADD R2.w, T0.z, 0.0f, T0.w
      t: F_TO_I ____, R4.z
    9 t: MULLO_INT R3.x, PS8, (0x00000004, 5.605193857e-45f).x
03 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R3.x], R2, ELEM_SIZE(3)
04 EXP_DONE: PIX0, R0
END_OF_PROGRAM
GprPoolSize = 122
CodeLen = 544;Bytes
SQ_PGM_END_CF = 5; words(64 bit)
SQ_PGM_END_ALU = 61; words(64 bit)
SQ_PGM_END_FETCH = 68; words(64 bit)
;SQ_PGM_RESOURCES = 0x00000005
SQ_PGM_RESOURCES:NUM_GPRS = 5
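The hexadecimal values in the “dcl_literal l0” line of this example read as IEEE 754 single-precision bit patterns, which matches how the disassembly pairs 0x40000000 with 2.0f. A short Python sketch (the helper name `bits_to_float` is made up for illustration) decodes them:

```python
import struct


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# The four components of l0 from the example kernel.
l0 = [0x40800000, 0x3F800000, 0x40000000, 0x40400000]
print([bits_to_float(b) for b in l0])  # [4.0, 1.0, 2.0, 3.0]
```

With that decoding, “mul r0, vWinCoord0.xy00, l0.z100” scales x by 2.0 and y by 1.0, which lines up with the MOV*2 and plain MOV at the top of the disassembly.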


ATI IL Instructions – Flow Control

Flow control is based on the result of comparison instructions:

– 4 signed integer comparison instructions + negation

– 2 unsigned integer comparison instructions

– 4 floating point comparison instructions

– 4 double comparison instructions

Flow control consists of:

– if-else-endif or if-endif

– call-return in static and conditional versions

– switch-case1…n-endswitch

– whileloop-continue/break-endloop


Example Slide 1 – Output IL & Input IL

il_ps_2_0

dcl_output_generic o0

dcl_literal l0, 1.0, 0.5, 0.5, 0.5

mov o0, l0

end

il_ps_2_0

dcl_input_position_interp(linear_noperspective)_centered_center vWinCoord0.xy__

dcl_output_generic o0

dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

sample_resource(0)_sampler(0) o0, vWinCoord0.xy

end


Example Slide 2 - Bursting

il_cs_2_0
dcl_cb cb0[1]
dcl_num_thread_per_group 64
itof r0.z, vaTid0.x
div r0.y, r0.z, cb0[0].x
mod r0.x, r0.z, cb0[0].x
flr r0, r0
mul r0.x, r0.x, cb0[0].z
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
imul r0.w, vaTid0.x, cb0[0].w
sample_resource(0)_sampler(0) r1, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r2, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r3, r0.xy
add r0.x, r0.x, r0.1
sample_resource(0)_sampler(0) r4, r0.xy
add r0.x, r0.x, r0.1
mov g[r0.w + 0], r1
mov g[r0.w + 1], r2
mov g[r0.w + 2], r3
mov g[r0.w + 3], r4
end

export_burst_perf.exe -w 2048 -h 2048 -t -e -r -2

Burst 1 Perf: 88.73 GB/s
Burst 2 Perf: 104.98 GB/s
Burst 3 Perf: 111.39 GB/s
Burst 4 Perf: 114.49 GB/s

Export Instruction:

03 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3) BRSTCNT(0)
03 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3) BRSTCNT(1)
03 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3) BRSTCNT(2)
03 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3) BRSTCNT(3)

115.2GB/s Peak
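To put the burst figures on this slide in proportion, each can be expressed as a fraction of the quoted 115.2 GB/s peak. The Python sketch below uses only the numbers from the slide; the names `PEAK_GBPS`, `burst_gbps`, and `fraction_of_peak` are made up for illustration.

```python
PEAK_GBPS = 115.2  # peak figure quoted on the slide

# Measured export bandwidth per burst size, as quoted on the slide.
burst_gbps = {1: 88.73, 2: 104.98, 3: 111.39, 4: 114.49}


def fraction_of_peak(measured: float, peak: float = PEAK_GBPS) -> float:
    """Return measured bandwidth as a fraction of the peak."""
    return measured / peak


for burst, gbps in burst_gbps.items():
    print(f"burst {burst}: {gbps:6.2f} GB/s = {100 * fraction_of_peak(gbps):.1f}% of peak")
```

On these numbers a burst of 1 reaches about 77.0% of peak, a burst of 2 about 91.1%, a burst of 3 about 96.7%, and a burst of 4 about 99.4%.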


Example 3 – Scratch Buffer

il_ps_2_0

dcl_input_position_interp(linear_noperspective)_centered_center vWinCoord0.xy__

dcl_output_generic o0

dcl_indexed_temp_array x0[2]

dcl_cb cb0[1]

mov r6, r6.0000

flr r5, vWinCoord0.xy

ftoi r0.x, cb0[0].y

ftoi r2.x, cb0[0].z

mad r3, r5.y, cb0[0].x, r5.x

mad r4, r5.y, cb0[0].x, r5.x

mov x0[r0.x], r3

mov x0[r2.x], r4

add r0.x, r0.x, cb0[0].y

add r2.x, r2.x, cb0[0].y

add r5, x0[r0.x], x0[r2.x]

add r6, r5, r6

mov o0, r6

end
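The “flr” followed by “mad r3, r5.y, cb0[0].x, r5.x” in this example computes y * cb0[0].x + x from the floored window coordinates. If cb0[0].x holds the row width of the domain (an assumption; the slide does not say what the constant contains), that is the usual row-major linear index. A Python sketch, with a hypothetical function name:

```python
import math


def linear_index(win_x: float, win_y: float, width: int) -> int:
    """Row-major index of the thread whose window coordinate is (win_x, win_y)."""
    x = math.floor(win_x)   # flr: drop the fractional part of the coordinate
    y = math.floor(win_y)
    return y * width + x    # mad: y * width + x


# A coordinate of (2.5, 1.5) in a domain assumed to be 4 wide.
print(linear_index(2.5, 1.5, 4))  # 6
```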


Example 4 – LDS & Shared Registers

il_cs_2_0

dcl_cb cb0[1]

dcl_num_thread_per_group 64

dcl_lds_size_per_thread 4

dcl_lds_sharing_mode _wavefrontRel

dcl_literal l0, 64, 64, 64, 4

iadd r0, vTid0.x0, cb0[0].x0

mov r2, r2.0000

iadd r0.x, r0.x, cb0[0].y

iadd r0.y, r0.y, l0.w

and r0.x, r0.x, l0.x

lds_read_vec r1, r0.xy

fence_lds_threads

add r2, r2, r1

lds_write_vec mem, r2

end

il_cs_2_0

dcl_cb cb0[1]

dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

dcl_num_thread_per_group 64

dcl_shared_temp sr1

dcl_lds_size_per_thread 4

dcl_literal l0, 0, 0, 0, 0

dcl_literal l1, 0, 1, 41, 0x000000FF

mov r0, r0.0000

if_logicalz vTgroupid0.x

mov sr0.x, vaTid0.x

mov r0.x, sr0.x

else

ieq r1.w, vTgroupid0.x, l1.y

cmov_logical r0.x, r1.w, sr0.x, l1.z

endif

mov r0.z, vTgroupid0.x

mov g[vaTid0.x], r0

ret

endmain

end


Disclaimer & Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft and DirectX are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.