SIMD Processing Using Compiler Intrinsics Richard Thomson [email protected] @LegalizeAdulthd github.com/LegalizeAdulthood


Page 1: SIMD Processing Using Compiler Intrinsics

SIMD Processing Using Compiler Intrinsics

Richard Thomson [email protected]

@LegalizeAdulthd
github.com/LegalizeAdulthood

Page 2: SIMD Processing Using Compiler Intrinsics

SIMD: Single Instruction, Multiple Data

Page 3: SIMD Processing Using Compiler Intrinsics

SIMD Exploits Data Parallelism
Image Processing
Array Processing
Scientific Computing
3D Graphics

Page 4: SIMD Processing Using Compiler Intrinsics

Brief History of CPU SIMD

Year  Extension  Register Size
1997  MMX        64 bits
1999  SSE        128 bits
2001  SSE2       128 bits
2004  SSE3       128 bits
2006  SSE4       128 bits
2008  AVX        256 bits
2015  AVX-512    512 bits

Page 5: SIMD Processing Using Compiler Intrinsics

Data Types
8-bit integers
16-bit integers
32-bit integers
64-bit integers

16-bit floats
32-bit floats
64-bit floats

Multiple smaller quantities are packed into registers ("multiple data")

Alignment requirements on data

Older extensions do not support all data types
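To make "multiple data" concrete, here is a minimal sketch that assumes an x86 target with SSE2 (the baseline on x86-64). The helper names add4 and lanes128 are hypothetical and exist only for illustration. One 128-bit register holds four 32-bit integers, and a single add instruction processes all four lanes.

```cpp
#include <emmintrin.h> // SSE2: __m128i and the integer intrinsics
#include <array>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one 128-bit register
// (hypothetical helper, not part of any library).
template <typename T>
constexpr std::size_t lanes128()
{
    return sizeof(__m128i) / sizeof(T);
}

// Adds four 32-bit integers lane by lane with one SSE2 add (hypothetical helper).
std::array<std::int32_t, 4> add4(const std::array<std::int32_t, 4>& a,
                                 const std::array<std::int32_t, 4>& b)
{
    // Unaligned loads: std::array<int32_t, 4> is only guaranteed 4-byte alignment.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    const __m128i sum = _mm_add_epi32(va, vb);

    std::array<std::int32_t, 4> result;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(result.data()), sum);
    return result;
}
```

The same register would hold sixteen 8-bit integers or two 64-bit integers; only the intrinsic suffix changes (_mm_add_epi8, _mm_add_epi64).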

Page 6: SIMD Processing Using Compiler Intrinsics

Alignment C++11

struct alignas(16) foo
{
    int i;                // 4 bytes
    int j;                // 4 bytes
    alignas(4) char s[3]; // 3 bytes
    short q;              // 2 bytes
};

// outputs 16:
std::cout << alignof(foo) << '\n';

Page 7: SIMD Processing Using Compiler Intrinsics

Alignment C++03

// pre-C++11
// MSVC:
struct __declspec(align(16)) foo
{
    // ...
};

// gcc: the attribute goes between the struct keyword and the name,
// or after the closing brace
struct __attribute__((aligned(16))) foo
{
    // ...
};

Page 8: SIMD Processing Using Compiler Intrinsics

Boost.Align
Handles heap allocation of aligned memory

Query the alignment requirements of a type

Declare alignment to the compiler portably
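As a rough illustration of aligned heap allocation, the sketch below deliberately avoids the Boost dependency and swaps in _mm_malloc and _mm_free, which ship with the x86 intrinsic headers. It is not Boost.Align code. The names MmFree, AlignedFloats, allocate_aligned_floats and is_aligned are hypothetical.

```cpp
#include <xmmintrin.h> // pulls in _mm_malloc and _mm_free
#include <cstddef>
#include <cstdint>
#include <memory>

// Deleter so that unique_ptr releases the block with _mm_free (hypothetical name).
struct MmFree
{
    void operator()(float* p) const { _mm_free(p); }
};

using AlignedFloats = std::unique_ptr<float[], MmFree>;

// Allocates count floats starting on a 16-byte boundary (hypothetical helper).
// Returns an empty pointer if the allocation fails.
AlignedFloats allocate_aligned_floats(std::size_t count)
{
    void* raw = _mm_malloc(count * sizeof(float), 16);
    return AlignedFloats(static_cast<float*>(raw));
}

// True if p is a multiple of alignment (hypothetical helper).
bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Memory obtained this way is safe to pass to the aligned load and store intrinsics; memory from plain new or malloc may not be.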

Page 9: SIMD Processing Using Compiler Intrinsics

Compiler Intrinsics
A function whose implementation is handled directly by the compiler.

SIMD registers exposed as data types: __m64, __m128, __m128d, __m128i, etc.

SIMD instructions exposed as intrinsic functions: _m_paddb, _m_paddd, _m_paddsb, etc.

Register allocation, instruction scheduling and addressing modes handled by the compiler

Proper alignment of operands is assumed
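A minimal sketch of what this looks like in practice, assuming SSE and 16-byte-aligned arrays whose length is a multiple of four. The function name add_floats is hypothetical. The code names no registers and no addressing modes; the compiler chooses them. The alignment assumption is made explicit with asserts, because _mm_load_ps and _mm_store_ps can fault on misaligned addresses.

```cpp
#include <xmmintrin.h> // SSE: __m128 and the _mm_*_ps intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// out[i] = a[i] + b[i], four floats per iteration (hypothetical helper).
// Preconditions: count is a multiple of 4 and every pointer is 16-byte aligned.
void add_floats(const float* a, const float* b, float* out, std::size_t count)
{
    assert(count % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 va = _mm_load_ps(a + i); // aligned load of 4 floats
        const __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb)); // aligned store of 4 sums
    }
}
```

If alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps accept any address.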

Page 10: SIMD Processing Using Compiler Intrinsics

Options Available

Option                   Pros                 Cons
Assembly                 + Direct control     - Hard to program
Intrinsics               + Pure C/C++         - Hard to program
Class Library            + Easier to program  - Less control
Automatic Vectorization                       - Very little control

Page 11: SIMD Processing Using Compiler Intrinsics

Proposed Boost.Simd
https://github.com/NumScale/boost.simd
Seems promising; easier to program without loss of control?
I had problems using it on Windows (issue #189)
Abstracts away the different sizes of registers as packs
Provides facilities to deal with alignment
Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
Single code base can target multiple extensions
Templates expand to calls to intrinsics
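To show how a class library can offer a+b syntax while still expanding to intrinsics, here is a toy wrapper. It is not the Boost.Simd API. Pack4f is a hypothetical type fixed to four floats and SSE, whereas a real library would select the register width through templates.

```cpp
#include <xmmintrin.h> // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical pack of four floats; illustration only, not Boost.Simd.
struct Pack4f
{
    __m128 value;

    // Unaligned load and store so the sketch works with any float array.
    static Pack4f load(const float* p) { return Pack4f{_mm_loadu_ps(p)}; }
    void store(float* p) const { _mm_storeu_ps(p, value); }
};

// Each operator is a thin inline wrapper around one intrinsic.
inline Pack4f operator+(Pack4f a, Pack4f b)
{
    return Pack4f{_mm_add_ps(a.value, b.value)};
}

inline Pack4f operator*(Pack4f a, Pack4f b)
{
    return Pack4f{_mm_mul_ps(a.value, b.value)};
}
```

With this in place, (Pack4f::load(x) + Pack4f::load(y)).store(out) reads like scalar arithmetic, and after inlining it should compile to the same instructions as the hand-written intrinsic calls.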

Page 12: SIMD Processing Using Compiler Intrinsics

Group Exercise
Convert BasicMandel to use intrinsics
AVX packs eight 32-bit floats into a single 256-bit register
AVX Intrinsics:

#include <immintrin.h>
__m256 _mm256_add_ps(__m256 a, __m256 b)
__m256 _mm256_mul_ps(__m256 a, __m256 b)
__m256 _mm256_sub_ps(__m256 a, __m256 b)
__m256 _mm256_load_ps(float const *c)
__m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
__m256i _mm256_castps_si256(__m256 a)

Intel Intrinsics Guide
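As a warm-up for the exercise, here is a sketch of the Mandelbrot iteration using the 128-bit SSE2 counterparts of the intrinsics above, so it handles four points at a time and compiles without AVX flags. It is not the BasicMandel code and not a solution to the exercise. The function name mandel4, its signature and the escape test are assumptions made for illustration.

```cpp
#include <emmintrin.h> // SSE2: 128-bit float and integer intrinsics
#include <array>

// Iterates z = z*z + c for four points at once and returns how many
// iterations each point stayed within radius 2 (hypothetical helper).
// cx and cy hold the real and imaginary parts of the four c values.
std::array<int, 4> mandel4(const std::array<float, 4>& cx,
                           const std::array<float, 4>& cy,
                           int maxIterations)
{
    const __m128 cr = _mm_loadu_ps(cx.data());
    const __m128 ci = _mm_loadu_ps(cy.data());
    const __m128 four = _mm_set1_ps(4.0f);
    __m128 zr = _mm_setzero_ps();
    __m128 zi = _mm_setzero_ps();
    __m128 active = _mm_castsi128_ps(_mm_set1_epi32(-1)); // all lanes active
    __m128i counts = _mm_setzero_si128();

    for (int i = 0; i < maxIterations; ++i)
    {
        const __m128 zr2 = _mm_mul_ps(zr, zr);
        const __m128 zi2 = _mm_mul_ps(zi, zi);
        // A lane stays active only while |z|^2 <= 4; once cleared it stays cleared.
        active = _mm_and_ps(active, _mm_cmple_ps(_mm_add_ps(zr2, zi2), four));
        if (_mm_movemask_ps(active) == 0)
            break; // every lane has escaped
        // Active lanes hold -1 as an integer, so subtracting the mask adds 1.
        counts = _mm_sub_epi32(counts, _mm_castps_si128(active));
        const __m128 zrzi = _mm_mul_ps(zr, zi);
        zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);   // re(z^2) + re(c)
        zi = _mm_add_ps(_mm_add_ps(zrzi, zrzi), ci); // im(z^2) + im(c)
    }

    std::array<int, 4> result;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(result.data()), counts);
    return result;
}
```

The compare result is used as a mask instead of a branch, so all lanes run the same instructions even though they escape at different iterations. An AVX version would widen the float work to __m256 and use _mm256_cmp_ps for the comparison; the integer counting would need a different approach on plain AVX, since 256-bit integer arithmetic arrived with AVX2.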