SIMD Processing Using Compiler Intrinsics


Richard [email protected]

@LegalizeAdulthd

github.com/LegalizeAdulthood

SIMD: Single Instruction, Multiple Data

SIMD Exploits Data Parallelism

Image Processing
Array Processing
Scientific Computing
3D Graphics

Brief History of CPU SIMD

Year  Extension  Register Size
1997  MMX        64 bits
1999  SSE        128 bits
2001  SSE2       128 bits
2004  SSE3       128 bits
2006  SSE4       128 bits
2008  AVX        256 bits
2015  AVX-512    512 bits

Data Types

8-bit integers
16-bit integers
32-bit integers
64-bit integers

16-bit floats
32-bit floats
64-bit floats

Multiple smaller quantities are packed into registers ("multiple data")

Alignment requirements on data

Older extensions do not support all data types
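As a rough illustration of "multiple data", the number of elements packed into one register is the register width divided by the element width. The helper name `lanes` below is hypothetical (it is not part of any intrinsics header), and the sketch assumes 8-bit bytes and IEEE 32-bit `float` / 64-bit `double`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many elements of type T fit in a SIMD register
// that is register_bits wide. Assumes 8-bit bytes.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits)
{
    return register_bits / (8 * sizeof(T));
}

// Register sizes taken from the history table: 64 (MMX), 128 (SSE), 256 (AVX).
static_assert(lanes<std::int8_t>(64) == 8, "MMX: eight 8-bit integers");
static_assert(lanes<float>(128) == 4, "SSE: four 32-bit floats");
static_assert(lanes<double>(128) == 2, "SSE2: two 64-bit floats");
static_assert(lanes<float>(256) == 8, "AVX: eight 32-bit floats");
```

The narrower the element, the more of them one instruction processes at once.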

Alignment C++11

struct alignas(16) foo
{
    int i;                 // 4 bytes
    int j;                 // 4 bytes
    alignas(4) char s[3];  // 3 bytes
    short q;               // 2 bytes
};

// outputs 16:
std::cout << alignof(foo) << '\n';

Alignment C++03

// pre-C++11
// MSVC:
struct __declspec(align(16)) foo
{
    // ...
};

// gcc (attribute goes right after the struct keyword or after the closing brace):
struct __attribute__((aligned(16))) foo
{
    // ...
};

Boost.Align

Handles heap allocation of aligned memory

Query the alignment requirements of a type

Declare alignment to the compiler portably

Compiler Intrinsics

A function whose implementation is handled directly by the compiler.

SIMD registers exposed as data types: __m64, __m128, __m128d, __m128i, etc.

SIMD instructions exposed as intrinsic functions: _m_paddb, _m_paddd, _m_paddsb, etc.

Register allocation, instruction scheduling and addressing modes handled by the compiler

Proper alignment of operands is assumed
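A minimal sketch of what this looks like in practice, using SSE rather than the MMX intrinsics named above because SSE is available on every x86-64 target. The function name `add_floats_sse` is made up for illustration; the intrinsics themselves (`_mm_load_ps`, `_mm_add_ps`, `_mm_store_ps`) come from `<xmmintrin.h>`. The asserts make the alignment assumption explicit, since the aligned load and store do not check it for you.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative helper: out[i] = a[i] + b[i], four floats per instruction.
// Assumes n is a multiple of 4 and all three pointers are 16-byte aligned.
void add_floats_sse(const float* a, const float* b, float* out, std::size_t n)
{
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    for (std::size_t i = 0; i < n; i += 4)
    {
        // __m128 is the data type standing in for a 128-bit SSE register.
        const __m128 va = _mm_load_ps(a + i);  // aligned load of 4 floats
        const __m128 vb = _mm_load_ps(b + i);
        // One packed add; the compiler picks registers and schedules it.
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

Declaring the arrays with `alignas(16)` is one way to satisfy the alignment assumption for stack or static data.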

Options Available

Assembly
+ Direct control
- Hard to program

Intrinsics
+ Pure C/C++
- Hard to program

Class Library
+ Easier to program
- Less control

Automatic Vectorization
- Very little control

Proposed Boost.Simd

https://github.com/NumScale/boost.simd

Seems promising; easier to program without loss of control?

I had problems using it on Windows (issue #189)

Abstracts away the different sizes of registers as packs

Provides facilities to deal with alignment

Provides natural syntax for manipulating packs, e.g. a+b adds two packs together

Single code base can target multiple extensions

Templates expand to calls to intrinsics


Group Exercise

Convert BasicMandel to use intrinsics

AVX packs eight 32-bit floats into a single 256-bit register

AVX Intrinsics:

#include <immintrin.h>
__m256 _mm256_add_ps(__m256 a, __m256 b)
__m256 _mm256_mul_ps(__m256 a, __m256 b)
__m256 _mm256_sub_ps(__m256 a, __m256 b)
__m256 _mm256_load_ps(float const *c)
__m256 _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
__m256i _mm256_castps_si256(__m256 a)
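One possible starting point for the exercise, offered as a sketch rather than the intended solution. BasicMandel's actual code is not shown here, so `mandel_scalar` is an assumed baseline (the usual escape-time loop), and `mandel_avx8` and the `SKETCH_AVX` macro are names invented for this example. Besides the intrinsics listed above it uses `_mm256_set1_ps`, `_mm256_setzero_ps`, `_mm256_and_ps`, `_mm256_testz_si256` and `_mm256_store_ps`, all from `<immintrin.h>`. Counts are kept as floats because plain AVX has no 256-bit integer add.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Convenience so the sketch builds with GCC/Clang without -mavx.
// With MSVC, compile with /arch:AVX instead. The CPU must support AVX.
#if defined(__GNUC__)
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

// Assumed scalar baseline: iterate z = z*z + c until |z|^2 > 4
// or max_iter is reached; returns the number of iterations performed.
int mandel_scalar(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0f)
    {
        const float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// The same iteration for 8 points at once. cr8, ci8 and counts8 must each
// point at 8 floats aligned to 32 bytes.
SKETCH_AVX
void mandel_avx8(const float* cr8, const float* ci8, int max_iter, float* counts8)
{
    assert(reinterpret_cast<std::uintptr_t>(cr8) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(ci8) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(counts8) % 32 == 0);

    const __m256 cr = _mm256_load_ps(cr8);
    const __m256 ci = _mm256_load_ps(ci8);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();

    for (int n = 0; n < max_iter; ++n)
    {
        const __m256 zr2 = _mm256_mul_ps(zr, zr);
        const __m256 zi2 = _mm256_mul_ps(zi, zi);
        // Per-lane mask: all ones where |z|^2 <= 4, all zeros where escaped.
        const __m256 active = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LE_OQ);
        const __m256i bits = _mm256_castps_si256(active);
        if (_mm256_testz_si256(bits, bits))
            break;  // every lane has escaped
        // Add 1.0 only in lanes that are still active.
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));
        const __m256 zrzi = _mm256_mul_ps(zr, zi);
        zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), ci);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), cr);
    }
    _mm256_store_ps(counts8, count);
}
```

Escaped lanes keep being computed alongside the active ones; the mask only stops their counts from increasing, which is the usual trade-off when lanes finish at different times.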

Intel Intrinsics Guide

https://software.intel.com/sites/landingpage/IntrinsicsGuide/


