CPU SIMD Quo vadis?vgsem/data/leto2009/SIMD_on_CPU.pdf · 15/31 SSE2 - instructions New data type – two double-precision FP 144 new instructions: scalar and packed double-precision

CPU SIMDCPU SIMDQuo vadis? Quo vadis?

Technological OverviewTechnological [email protected]@gmail.com

mailto:[email protected]?subject=3D%20Gauss%20Filtering%20Using%20SSE

2/31

ContentContent

SIMDSIMD Existing SIMD technologiesExisting SIMD technologies Future SIMD technologiesFuture SIMD technologies ConclusionConclusion

3/31

SIMDSIMD One of 4 computer architectures (Flynn)One of 4 computer architectures (Flynn) Single Instruction Single DataSingle Instruction Single Data Single Instruction Multiple DataSingle Instruction Multiple Data

4/31

MMXMMX MMultiultiMMedia eedia eXXtensiontension Intel’s technologyIntel’s technology Introduced in P5 microarchitecture Introduced in P5 microarchitecture

(Pentium) in January 1997(Pentium) in January 1997 Later supported also by AMD (since K6-2)Later supported also by AMD (since K6-2) Only integer calculations or fixed-point Only integer calculations or fixed-point

floating point (FP)floating point (FP)

5/31

MMX - registersMMX - registers 8 new registers MM0 – MM7, 64 bits wide8 new registers MM0 – MM7, 64 bits wide Physically mirrored in FP stackPhysically mirrored in FP stack

6/31

MMX – data typesMMX – data types4 new data types (64 bits wide):4 new data types (64 bits wide):

1.1. 8 packed bytes8 packed bytes2.2. 4 packed words4 packed words3.3. 2 packed doublewords2 packed doublewords4.4. 1 quadword1 quadword

7/31

MMX - instructionsMMX - instructions57 new instructions:57 new instructions:1.1. arithmetic (padd, psub, pmul ...)arithmetic (padd, psub, pmul ...)2.2. comparison (pcmpeq, pcmpgt)comparison (pcmpeq, pcmpgt)3.3. conversion (pack, punpck)conversion (pack, punpck)4.4. logical (pand, pandn, por, pxor)logical (pand, pandn, por, pxor)5.5. shift (psll, psrl, psra)shift (psll, psrl, psra)6.6. data transfer (movq, movd)data transfer (movq, movd)7.7. state management (emms)state management (emms)

8/31

3DNow!3DNow! AMD’s technology (reply to MMX)AMD’s technology (reply to MMX) MMX extensionMMX extension Introduced in K6-2 microarchitecture in Introduced in K6-2 microarchitecture in

May 1998May 1998 Not supported by IntelNot supported by Intel Integer and single-precision FP Integer and single-precision FP

calculationscalculations New data type – two single-precision floats New data type – two single-precision floats

packed in mm registerpacked in mm register

9/31

3DNow! - instructions3DNow! - instructions21 new instructions:21 new instructions:1.1. arithmetic (pfadd, pfsub, pfmul, pfmin, ...)arithmetic (pfadd, pfsub, pfmul, pfmin, ...)2.2. horizontalhorizontal accumulation (pfacc) accumulation (pfacc)3.3. approximations (pfrcp, pfrsqrt+iterations)approximations (pfrcp, pfrsqrt+iterations)4.4. comparison (pfcmpeq, …)comparison (pfcmpeq, …)5.5. conversion float-integer(pi2fd, pf2id)conversion float-integer(pi2fd, pf2id)6.6. state management (femms)state management (femms)7.7. prefetching into L1prefetching into L1

10/31

3DNow!+3DNow!+ Extension of 3DNow! Extension of 3DNow! Introduced in June 1999 in K7 (Athlon)Introduced in June 1999 in K7 (Athlon) 5 new FP instructions (pf2iw, pi2fw, 5 new FP instructions (pf2iw, pi2fw,

pswapd, pfnacc, pfpnacc)pswapd, pfnacc, pfpnacc) 19 new MMX instructions:19 new MMX instructions:

mask moves (also non-temporal)mask moves (also non-temporal) extracting/inserting from/to regs, shufflingextracting/inserting from/to regs, shuffling prefetching into different cache levelsprefetching into different cache levels min, max, avg, absolute differencesmin, max, avg, absolute differences

11/31

SSESSE SStreaming treaming SSIMD IMD EExtensionxtension Intel’s reply to 3DNow!Intel’s reply to 3DNow! Introduced in P6 microarchitecture Introduced in P6 microarchitecture

(Pentium III) in February 1999(Pentium III) in February 1999 Later supported also by AMD (since Later supported also by AMD (since

October 2001 in Athlon XP)October 2001 in Athlon XP) Single-precision floating point calculationsSingle-precision floating point calculations

12/31

SSE – data types and registersSSE – data types and registers

8 physically new 128-bit XMM registers (16 in 8 physically new 128-bit XMM registers (16 in 64-bit OS)64-bit OS)

OS supportOS support new data types:new data types:

16 bytes16 bytes 8 words8 words 4 dwords/single FP4 dwords/single FP

13/31

SSE - instructionsSSE - instructions 70 new instructions:70 new instructions:

scalar and packed single-precision FP:scalar and packed single-precision FP:1.1. arithmetic (add/sub, mul/div, rcp, sqrt/rsqrt, min/max)arithmetic (add/sub, mul/div, rcp, sqrt/rsqrt, min/max)2.2. comparison (cmpss, …)comparison (cmpss, …)3.3. conversion float-int, truncation (cvtpi2ps, cvttps2pi, …)conversion float-int, truncation (cvtpi2ps, cvttps2pi, …)4.4. logical (andps, andnps, orps, xorps)logical (andps, andnps, orps, xorps)

integer instructions (pmin, pavg, psad, ...)integer instructions (pmin, pavg, psad, ...) data movement (insert, extract, various mov*s)data movement (insert, extract, various mov*s) data shuffle and unpacking (shufps, ...)data shuffle and unpacking (shufps, ...) state and cache memory managementstate and cache memory management

14/31

SSE2SSE2 22ndnd iteration of the SSE instruction set iteration of the SSE instruction set Intel’s technologyIntel’s technology Introduced in NetBurst microarchitecture Introduced in NetBurst microarchitecture

(Pentium 4) in December 2000(Pentium 4) in December 2000 Supported also by AMD since 2003 (in Supported also by AMD since 2003 (in

Athlon64)Athlon64) 128-bit integer (MMX) and double-128-bit integer (MMX) and double-

precision floating point calculationsprecision floating point calculations

15/31

SSE2 - instructionsSSE2 - instructions New data type – two double-precision FPNew data type – two double-precision FP 144 new instructions:144 new instructions:

scalar and packed double-precision FP:scalar and packed double-precision FP:1.1. arithmetic (add/sub, mul/div, rcp, sqrt/rsqrt, min/max)arithmetic (add/sub, mul/div, rcp, sqrt/rsqrt, min/max)2.2. comparison (cmpsd, …)comparison (cmpsd, …)3.3. conversion double-int, double-single FP, truncationconversion double-int, double-single FP, truncation4.4. logical (andpd, andnpd, orpd, xorpd)logical (andpd, andnpd, orpd, xorpd)

64-bit MMX instructions extended to 128-bit64-bit MMX instructions extended to 128-bit data movement, unaligned loadsdata movement, unaligned loads data shuffle and unpackingdata shuffle and unpacking cache memory managementcache memory management

16/31

SSE3SSE3 33rdrd iteration of the SSE instruction set iteration of the SSE instruction set Intel’s technologyIntel’s technology Introduced in Pentium 4 (Prescott) in Introduced in Pentium 4 (Prescott) in

February 2004February 2004 Supported also by AMD since 2004 (in Supported also by AMD since 2004 (in

Athlon64)Athlon64)

17/31

SSE3 - instructionsSSE3 - instructions 13 new instructions:13 new instructions:

Arithmetic (addsubps, addsubpd}Arithmetic (addsubps, addsubpd} AOS (haddps, haddpd, hsubps, hsubpd)AOS (haddps, haddpd, hsubps, hsubpd) Loading (lddqu, mov{d|sh|sl}dup)Loading (lddqu, mov{d|sh|sl}dup) Optimizing Hyper-ThreadingOptimizing Hyper-Threading

18/31

SSSE3SSSE3 SSupplemental upplemental SSE3SSE3 Intel’s revision of SSE3Intel’s revision of SSE3 Introduced in Core microarchitecture in Introduced in Core microarchitecture in

July 2006July 2006 Not supported by AMDNot supported by AMD 16x2 new discrete instructions (MMX and 16x2 new discrete instructions (MMX and

XMM)XMM)

19/31

SSSE3 - instructionsSSSE3 - instructions Arithmetic:Arithmetic:

psign, pabspsign, pabs pmaddsubsw (complex arithmetic)pmaddsubsw (complex arithmetic)

AOS:AOS: phadd, phsub – words, dwords, saturated phadd, phsub – words, dwords, saturated

signed wordssigned words Other:Other:

palignr, pshufbpalignr, pshufb

20/31

SSE4SSE4 44thth iteration of the SSE instruction set iteration of the SSE instruction set Subsets:Subsets:

SSE4.1SSE4.1 SSE4.2SSE4.2 SSE4aSSE4a

New instructions based on feedback from New instructions based on feedback from developersdevelopers

21/31

SSE4.1SSE4.1 Available in Penryn (and higher)Available in Penryn (and higher) Intel onlyIntel only 47 new instructions:47 new instructions:

Arithmetic instruction for various integer data Arithmetic instruction for various integer data types (pmul, pmin)types (pmul, pmin)

Dot product for floating-point (dpps, dppd)Dot product for floating-point (dpps, dppd) Rounding FPs via immediate argumentRounding FPs via immediate argument Blending, insert/extract instructionsBlending, insert/extract instructions Other instructions (movntdqa, …)Other instructions (movntdqa, …)

22/31

SSE4.2SSE4.2 Available in Core i7 (and higher)Available in Core i7 (and higher) Intel onlyIntel only Not multimedia instructionsNot multimedia instructions 7 new instructions:7 new instructions:

String comparison (pcmpestr, pcmpistr)String comparison (pcmpestr, pcmpistr) crc32crc32 popcnt (population count)popcnt (population count)

23/31

SSE4aSSE4a Available in K10 (and higher)Available in K10 (and higher) AMD onlyAMD only Not multimedia instructionsNot multimedia instructions 4+2 new instructions:4+2 new instructions:

lzcnt (leading zero count)lzcnt (leading zero count) popcnt (population count)popcnt (population count) extract/insertextract/insert movntss/movntsd (scalar streaming store)movntss/movntsd (scalar streaming store)

24/31

SSE4 SummarySSE4 Summary SSE4a = 4 instructions from SSE4.1 + 2 SSE4a = 4 instructions from SSE4.1 + 2

new instructionsnew instructions SSE4 supporting madness:SSE4 supporting madness:

SSE4.1 – Intel Penryn (and higher)SSE4.1 – Intel Penryn (and higher) SSE4.2 – Intel Core i7 (and higher)SSE4.2 – Intel Core i7 (and higher) SSE4a – AMD K10 (and higher)SSE4a – AMD K10 (and higher)

Programmers (and compilers) nightmare!Programmers (and compilers) nightmare!

25/31

SSE5SSE5 55thth iteration of the SSE instruction set iteration of the SSE instruction set AMD’s reply to SSE4.1AMD’s reply to SSE4.1 Should be introduced in Bulldozer in 2011Should be introduced in Bulldozer in 2011 Will NOT be supported by IntelWill NOT be supported by Intel 170 new instructions170 new instructions Not all SSE4 instructions includedNot all SSE4 instructions included Not superset but competitor to SSE4Not superset but competitor to SSE4

26/31

SSE5 – Key FeaturesSSE5 – Key Features Three and four operand syntaxThree and four operand syntax Floating-point fused multiply-accumulateFloating-point fused multiply-accumulate Integer multiply-accumulate with/without Integer multiply-accumulate with/without

saturationsaturation Permutations, rotations, shifts, conditional Permutations, rotations, shifts, conditional

movesmoves Precision control, rounding, conversionsPrecision control, rounding, conversions

27/31

SSE SummarySSE Summary 8 SSE revisions have 471 instructions8 SSE revisions have 471 instructions

28/31

AVXAVX Intel Intel AAdvanced dvanced VVector ector EExtensionsxtensions Intel’s competitor to SSE5Intel’s competitor to SSE5 Should be introduced in Sandy Bridge in Should be introduced in Sandy Bridge in

20102010 Will be supported by AMD? Will be supported by AMD? Flexible design for further extensions Flexible design for further extensions

(FMA)(FMA)

29/31

AVX – Key FeaturesAVX – Key Features New 256-bit registers (YMM)New 256-bit registers (YMM) Three and four operand instruction syntaxThree and four operand instruction syntax Upgrade of some SSEn instructions(>200)Upgrade of some SSEn instructions(>200) Promoting some 128-bit SSEn instructions Promoting some 128-bit SSEn instructions

to 256-bit instructions (<100)to 256-bit instructions (<100) New arithmetic and data processing New arithmetic and data processing

instructions (encryption, broadcast, instructions (encryption, broadcast, permute, fused-multiply-add)permute, fused-multiply-add)

30/31

ConclusionConclusion MMX, 3DNow! are oldMMX, 3DNow! are old Intel & AMD: SSE, SSE2, SSE3Intel & AMD: SSE, SSE2, SSE3 SSEn will be (soon) replaced by AVXSSEn will be (soon) replaced by AVX SSE5 is incompatible with AVXSSE5 is incompatible with AVX AVX has better design than SSE5AVX has better design than SSE5 AVX = Intel’s vengeance for AMD64AVX = Intel’s vengeance for AMD64

Thank you for your Thank you for your attention !attention !

[email protected]@gmail.com

mailto:[email protected]?subject=3D%20Gauss%20Filtering%20Using%20SSE

Documents

CPU SIMD Quo vadis?vgsem/data/leto2009/SIMD_on_CPU.pdf · 15/31 SSE2 - instructions New data type – two double-precision FP 144 new instructions: scalar and packed double-precision