SIMD Programming
CS240A, 2017
Flynn* Taxonomy, 1966
• In 2013, SIMD and MIMD are the most common forms of parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  – A single program runs on all processors of a MIMD machine
  – Cross-processor execution is coordinated using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• A SIMD computer applies a single instruction stream to multiple data streams, for operations that can be naturally parallelized, e.g., the Intel SIMD instruction extensions or the NVIDIA Graphics Processing Unit (GPU)
SIMD: Single Instruction, Multiple Data
• Scalar processing
  – traditional mode
  – one operation produces one result
• SIMD processing
  – with Intel SSE/SSE2 (SSE = Streaming SIMD Extensions)
  – one operation produces multiple results
[Figure: scalar add computes X + Y, one result; packed SIMD add computes (x3,x2,x1,x0) + (y3,y2,y1,y0) = (x3+y3, x2+y2, x1+y1, x0+y0) in one operation]
Slide source: Alex Klimovitski & Dean Macri, Intel Corporation
What does this mean to you?
• In addition to SIMD extensions, the processor may have other special instructions
  – Fused Multiply-Add (FMA) instructions: x = y + c*z is so common that some processors execute the multiply/add as a single instruction, at the same rate (bandwidth) as + or * alone
• In theory, the compiler understands all of this
  – When compiling, it will rearrange instructions to get a good "schedule" that maximizes pipelining and uses FMAs and SIMD
  – It works with the mix of instructions inside an inner loop or other block of code
• But in practice the compiler may need your help
  – Choose a different compiler, optimization flags, etc.
  – Rearrange your code to make things more obvious
  – Use special functions ("intrinsics") or write in assembly
Intel SIMD Extensions
• MMX: 64-bit registers, reusing the floating-point registers [1992]
• SSE2/3/4: 8 new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
SSE/SSE2 SIMD on Intel
• SSE2 data types: anything that fits into 16 bytes, e.g.,
  – 16 x bytes, 4 x floats, 2 x doubles
• Instructions perform add, multiply, etc. on all the data in parallel
• Similar on GPUs and vector processors (but with many more simultaneous operations)
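A small sketch of "one operation produces multiple results" using SSE intrinsics: `_mm_add_ps` adds four packed floats with a single instruction. This assumes an x86 target with SSE; the function name `add4` is an illustrative choice.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out[i] = x[i] + y[i] for i = 0..3, one SSE add. */
void add4(const float *x, const float *y, float *out) {
    __m128 vx = _mm_loadu_ps(x);             /* load [x3|x2|x1|x0] */
    __m128 vy = _mm_loadu_ps(y);             /* load [y3|y2|y1|y0] */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* four sums at once */
}
```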
Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: one 128-bit register partitioned as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, or 2 x 64-bit elements]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: doubleword (32 bits)
  – Double-precision FP: quadword (64 bits)
Packed and Scalar Double-Precision Floating-Point Operations
[Figure: a packed operation acts on both 64-bit elements of the register; a scalar operation acts only on the low element]
SSE/SSE2 Floating Point Instructions
xmm: one operand is a 128-bit SSE2 register
mem/xmm: the other operand is in memory or an SSE2 register
{SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
{A} means the 128-bit operand is aligned in memory
{U} means the 128-bit operand is unaligned in memory
{H} means move the high half of the 128-bit operand
{L} means move the low half of the 128-bit operand
Move does both load and store
Example: SIMD Array Processing

Goal: for each f in array, f = sqrt(f)

Scalar style:
for each f in array {
    load f into a floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD style:
for each group of 4 members in array {
    load 4 members into the SSE register
    calculate 4 square roots in one operation
    store the 4 results from the register to memory
}
Data-Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops:
  for (i = 1000; i > 0; i = i-1)
      x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll the loop and adjust the iteration rate
Loop Unrolling in C
• Instead of the compiler doing loop unrolling, you could do it yourself in C:
  for (i = 1000; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1000; i > 0; i = i-4) {
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
Generalizing Loop Unrolling
• A loop of n iterations, unrolled with k copies of the body
• Assuming (n mod k) ≠ 0:
  – first run the loop with 1 copy of the body (n mod k) times
  – then run it with k copies of the body floor(n/k) times
General Loop Unrolling with a Head
• Handling loop iterations indivisible by the step size:
  for (i = 1003; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1003; i > 1000; i--)      // handle the head (1003 mod 4 = 3 iterations)
      x[i] = x[i] + s;
  for (i = 1000; i > 0; i = i-4) {   // handle the remaining iterations
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
Tail method for general loop unrolling
• Handling loop iterations indivisible by the step size:
  for (i = 1003; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1003; i > 1003 % 4; i = i-4) {
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
  for (i = 1003 % 4; i > 0; i--)   // special handling in the tail
      x[i] = x[i] + s;
Another loop unrolling example

Normal loop:
int x;
for (x = 0; x < 103; x++) {
    delete(x);
}

After loop unrolling:
int x;
for (x = 0; x < 103/5*5; x += 5) {
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}
/* Tail */
for (x = 103/5*5; x < 103; x++) {
    delete(x);
}
Intel SSE Intrinsics
• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
• Vector data type: __m128d
• Load and store operations:
  Intrinsic        Corresponding SSE instruction
  _mm_load_pd      MOVAPD / aligned, packed double
  _mm_store_pd     MOVAPD / aligned, packed double
  _mm_loadu_pd     MOVUPD / unaligned, packed double
  _mm_storeu_pd    MOVUPD / unaligned, packed double
• Load and broadcast across vector:
  _mm_load1_pd     MOVSD + shuffling/duplicating
• Arithmetic:
  _mm_add_pd       ADDPD / add, packed double
  _mm_mul_pd       MULPD / multiply, packed double
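A minimal use of the intrinsics listed above, combining a broadcast load with a packed multiply and add. This is a sketch assuming an x86 target with SSE2; the function name `fma2` is an illustrative choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* out[i] = x[i] * (*c) + y[i] for i = 0, 1, using packed doubles. */
void fma2(const double *x, const double *c, const double *y, double *out) {
    __m128d vx = _mm_loadu_pd(x);   /* [x1|x0] */
    __m128d vc = _mm_load1_pd(c);   /* [c|c], value broadcast to both halves */
    __m128d vy = _mm_loadu_pd(y);   /* [y1|y0] */
    _mm_storeu_pd(out, _mm_add_pd(_mm_mul_pd(vx, vc), vy));
}
```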
Example 1: Use of SSE SIMD instructions
• Scalar loop:
  for (i = 0; i < n; i++) sum = sum + a[i];
• SIMD version (sketch):
  Set a 128-bit vector temp = 0;
  for (i = 0; i < n/4*4; i = i+4) {
      Add 4 integers (128 bits) from &a[i] to temp;
  }
  Tail: copy out the 4 integers of temp and add them together into sum.
  for (i = n/4*4; i < n; i++) sum += a[i];
Related SSE SIMD instructions
__m128i _mm_setzero_si128()                    returns a 128-bit zero vector
__m128i _mm_loadu_si128(__m128i *p)            loads the data stored at pointer p in memory into a 128-bit vector and returns it
__m128i _mm_add_epi32(__m128i a, __m128i b)    returns the vector (a0+b0, a1+b1, a2+b2, a3+b3)
void _mm_storeu_si128(__m128i *p, __m128i a)   stores the content of the 128-bit vector "a" to memory starting at pointer p
Related SSE SIMD instructions
• Add 4 integers (128 bits) from &a[i] to the temp vector, with loop body temp = temp + a[i]
• Add 128 bits, then the next 128 bits, ...

__m128i temp = _mm_setzero_si128();
__m128i temp1 = _mm_loadu_si128((__m128i *)(a + i));
temp = _mm_add_epi32(temp, temp1);
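Putting the pieces of Example 1 together, a runnable sketch of the whole sum: keep 4 partial sums in an XMM register, then combine them and handle the tail in scalar code. Assumes an x86 target with SSE2; the function name `sum_array` is an illustrative choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum a[0..n-1] using 4-wide SIMD adds plus a scalar tail. */
int sum_array(const int *a, int n) {
    __m128i temp = _mm_setzero_si128();        /* 4 partial sums = 0 */
    int i;
    for (i = 0; i < n/4*4; i += 4) {           /* 4 ints per iteration */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        temp = _mm_add_epi32(temp, v);
    }
    int part[4];
    _mm_storeu_si128((__m128i *)part, temp);   /* copy out partial sums */
    int sum = part[0] + part[1] + part[2] + part[3];
    for (i = n/4*4; i < n; i++)                /* tail */
        sum += a[i];
    return sum;
}
```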
Example 2: 2x2 Matrix Multiply

Definition of Matrix Multiply:
C(i,j) = (A×B)(i,j) = Σ(k=1..2) A(i,k) × B(k,j)

[A1,1 A1,2]   [B1,1 B1,2]   [C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2]
[A2,1 A2,2] x [B2,1 B2,2] = [C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2]

[1 0]   [1 3]   [C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3]
[0 1] x [2 4] = [C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4]
Example: 2x2 Matrix Multiply
• Using the XMM registers
  – 64-bit/double precision, two doubles per XMM register
  – C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2] (C stored in memory in column order)
  – B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2] (each value duplicated in both halves)
  – A  = [A1,i | A2,i] (column i of A)
Example: 2x2 Matrix Multiply
• Initialization: C1 = [0 | 0], C2 = [0 | 0]
• i = 1:
  – A  = [A1,1 | A2,1]
    _mm_load_pd: loads 2 doubles into an XMM register (A is stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
    _mm_load1_pd: SSE instruction that loads a double word and stores it in both the high and low double words of the XMM register (duplicates the value in both halves)
Example: 2x2 Matrix Multiply
• First iteration intermediate result (i = 1):
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
  – A  = [A1,1 | A2,1] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
Example: 2x2 Matrix Multiply
• First iteration result, with second iteration loads (i = 2):
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]
  – A  = [A1,2 | A2,2] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
Example: 2x2 Matrix Multiply
• Second iteration intermediate result (i = 2):
  – C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
  – C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
  – A  = [A1,2 | A2,2] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
Example: 2x2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__((aligned(16)));
    double B[4] __attribute__((aligned(16)));
    double C[4] __attribute__((aligned(16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for the example
    /* A = (note column order!)
       1 0
       0 1 */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4 */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0 */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2x2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C + 0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C + 1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22] */
        a = _mm_load_pd(A + i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21] */
        b1 = _mm_load1_pd(B + i + 0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22] */
        b2 = _mm_load1_pd(B + i + 1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C + 0*lda, c1);
    _mm_store_pd(C + 1*lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
Conclusion
• Flynn Taxonomy
• Intel SSE SIMD Instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch that operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through the use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler