SIMD Processing Using Compiler Intrinsics
Richard Thomson
[email protected]
@LegalizeAdulthd
github.com/LegalizeAdulthood

SIMD
  Single Instruction, Multiple Data

SIMD Exploits Data Parallelism
  Image Processing
  Array Processing
  Scientific Computing
  3D Graphics

Brief History of CPU SIMD
  Year  Extension  Register Size
  1997  MMX        64 bits
  1999  SSE        128 bits
  2001  SSE2       128 bits
  2004  SSE3       128 bits
  2006  SSE4       128 bits
  2008  AVX        256 bits
  2015  AVX-512    512 bits

Data Types
  Integers: 8-bit, 16-bit, 32-bit, 64-bit
  Floats: 16-bit, 32-bit, 64-bit
  Multiple smaller quantities are packed into registers ("multiple data")
  Alignment requirements on data
  Older extensions do not support all data types

Alignment C++11
    #include <iostream>

    struct alignas(16) foo
    {
        int i;                 // 4 bytes
        int j;                 // 4 bytes
        alignas(4) char s[3];  // 3 bytes
        short q;               // 2 bytes
    };

    int main()
    {
        // outputs 16:
        std::cout << alignof(foo) << '\n';
    }
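
Alignment C++03
    // pre-C++11
    // MSVC:
    struct __declspec(align(16)) foo
    {
        // ...
    };

    // gcc (the attribute goes right after the struct keyword
    // or after the closing brace, not after the tag name):
    struct __attribute__((aligned(16))) foo
    {
        // ...
    };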

Boost.Align
  Handles heap allocation of aligned memory
  Query the alignment requirements of a type
  Declare alignment to the compiler portably
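
To make "heap allocation of aligned memory" concrete, here is a minimal sketch that does not use Boost.Align at all. As a stand-in it uses `_mm_malloc` and `_mm_free`, which come with the intrinsics headers on GCC, Clang and MSVC. The names `MmFree`, `AlignedFloats`, `make_aligned_floats`, `is_aligned` and `sum4_aligned` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <xmmintrin.h>  // SSE intrinsics; also provides _mm_malloc / _mm_free

// Deleter so std::unique_ptr releases the buffer with _mm_free.
// (Hypothetical helper for this sketch.)
struct MmFree
{
    void operator()(float* p) const { _mm_free(p); }
};
using AlignedFloats = std::unique_ptr<float[], MmFree>;

// Allocate `count` floats on an `alignment`-byte boundary.
// `alignment` is assumed to be a power of two.
inline AlignedFloats make_aligned_floats(std::size_t count, std::size_t alignment = 16)
{
    void* raw = _mm_malloc(count * sizeof(float), alignment);
    return AlignedFloats(static_cast<float*>(raw));
}

// True when `p` sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Sum four floats through an aligned load; `p` must be 16-byte aligned.
inline float sum4_aligned(const float* p)
{
    __m128 v = _mm_load_ps(p);   // aligned load: faults on a misaligned pointer
    alignas(16) float out[4];
    _mm_store_ps(out, v);        // aligned store into a stack buffer
    return out[0] + out[1] + out[2] + out[3];
}
```

A caller would write something like `auto buf = make_aligned_floats(8, 32);`, fill `buf[0]` through `buf[7]`, and pass `buf.get()` to code that uses aligned loads. The smart pointer is there only so the matching `_mm_free` cannot be forgotten.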

Compiler Intrinsics
  An intrinsic is a function whose implementation is handled directly by the compiler.
  SIMD registers are exposed as data types: __m64, __m128, __m128d, __m128i, etc.
  SIMD instructions are exposed as intrinsic functions: _m_paddb, _m_paddd, _m_paddsb, etc.
  Register allocation, instruction scheduling and addressing modes are handled by the compiler.
  Proper alignment of operands is assumed.
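
As a small illustration of registers as types and instructions as functions, the sketch below adds two float arrays four elements at a time with SSE. The function name `add_arrays` is made up for this example. It uses the unaligned load/store variants (`_mm_loadu_ps`, `_mm_storeu_ps`) so that it works with any pointer; the aligned variants (`_mm_load_ps`, `_mm_store_ps`) are the ones that assume 16-byte alignment.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// out[i] = a[i] + b[i] for i in [0, n). Hypothetical helper for this sketch.
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: one __m128 holds four packed 32-bit floats.
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
    {
        out[i] = a[i] + b[i];
    }
}
```

Note that the code never names a register or an addressing mode. The compiler picks them, which is the main difference from writing the same loop in assembly.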

Options Available
  Option                   Pros                 Cons
  Assembly                 + Direct control     - Hard to program
  Intrinsics               + Pure C/C++         - Hard to program
  Class Library            + Easier to program  - Less control
  Automatic Vectorization                       - Very little control

Proposed Boost.Simd
  https://github.com/NumScale/boost.simd
  Seems promising; easier to program without loss of control?
  I had problems using it on Windows (issue #189)
  Abstracts away the different sizes of registers as packs
  Provides facilities to deal with alignment
  Provides natural syntax for manipulating packs, e.g. a+b adds two packs together
  Single code base can target multiple extensions
  Templates expand to calls to intrinsics
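
The idea of a pack with natural syntax can be sketched with a tiny hand-written wrapper. This is not Boost.Simd's API: `pack4f` and `axpy4` are hypothetical names, the width is fixed at four floats, and only SSE is targeted. It only shows how overloaded operators can forward to intrinsics.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical four-float pack; NOT the Boost.Simd pack type.
struct pack4f
{
    __m128 v;

    explicit pack4f(__m128 r) : v(r) {}

    // Unaligned load/store so any float pointer is acceptable.
    static pack4f load(const float* p) { return pack4f(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
};

// Operators forward directly to the matching intrinsic.
inline pack4f operator+(pack4f a, pack4f b) { return pack4f(_mm_add_ps(a.v, b.v)); }
inline pack4f operator*(pack4f a, pack4f b) { return pack4f(_mm_mul_ps(a.v, b.v)); }

// out[i] = a[i] * x[i] + y[i] for four elements, written with natural syntax.
inline void axpy4(const float* a, const float* x, const float* y, float* out)
{
    (pack4f::load(a) * pack4f::load(x) + pack4f::load(y)).store(out);
}
```

A real class library would additionally hide the register width behind a template parameter, so the same source could compile to SSE, AVX or AVX-512.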

Group Exercise
  Convert BasicMandel to use intrinsics
  AVX packs eight 32-bit floats into a single 256-bit register
  AVX Intrinsics:
    #include <immintrin.h>
    __m256  _mm256_add_ps(__m256 a, __m256 b)
    __m256  _mm256_mul_ps(__m256 a, __m256 b)
    __m256  _mm256_sub_ps(__m256 a, __m256 b)
    __m256  _mm256_load_ps(float const *c)
    __m256  _mm256_cmp_ps(__m256 a, __m256 b, const int compOp)
    __m256i _mm256_castps_si256(__m256 a)
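
The BasicMandel source is not reproduced in these slides, so the following is only one possible shape of a solution, under stated assumptions. `mandel_scalar` is a guess at what a basic per-point escape-time loop looks like, and `mandel_avx8` processes eight points at once. Both names, and the `TARGET_AVX` macro, are made up here. Besides the intrinsics listed above, the sketch uses `_mm256_loadu_ps`, `_mm256_storeu_ps`, `_mm256_set1_ps`, `_mm256_setzero_ps`, `_mm256_and_ps` and `_mm256_movemask_ps`. It keeps the iteration counts as floats rather than casting with `_mm256_castps_si256`, since AVX without AVX2 has no 256-bit integer arithmetic.

```cpp
#include <immintrin.h>  // AVX intrinsics

// Lets GCC/Clang compile the AVX function without -mavx on the command line.
// (Assumption: MSVC accepts AVX intrinsics without a per-function attribute.)
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical scalar reference: iterations until |z|^2 > 4, capped at max_iter.
inline int mandel_scalar(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0f)
    {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Eight points per call: cr, ci and out each refer to 8 elements.
TARGET_AVX
inline void mandel_avx8(const float* cr, const float* ci, int max_iter, int* out)
{
    const __m256 vcr = _mm256_loadu_ps(cr);   // unaligned load of 8 floats
    const __m256 vci = _mm256_loadu_ps(ci);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();       // per-lane iteration count, as floats

    for (int n = 0; n < max_iter; ++n)
    {
        __m256 zr2 = _mm256_mul_ps(zr, zr);
        __m256 zi2 = _mm256_mul_ps(zi, zi);

        // All-ones in lanes where |z|^2 <= 4; NaN lanes compare false.
        __m256 active = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LE_OQ);
        if (_mm256_movemask_ps(active) == 0)
            break;                            // every lane has escaped

        // Add 1.0 only in the lanes that are still active.
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));

        __m256 zrzi = _mm256_mul_ps(zr, zi);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), vcr);
        zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), vci);
    }

    float tmp[8];
    _mm256_storeu_ps(tmp, count);
    for (int k = 0; k < 8; ++k)
        out[k] = static_cast<int>(tmp[k]);
}
```

The loop runs until the slowest of the eight lanes escapes, and the mask keeps finished lanes from counting further. Swapping `_mm256_loadu_ps` for `_mm256_load_ps` would require the input arrays to be 32-byte aligned.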
Intel Intrinsics Guide