SIMD Programming
CS240A, 2017
Flynn* Taxonomy, 1966
• In 2013, SIMD and MIMD are the most common forms of parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  – A single program runs on all processors of a MIMD machine
  – Cross-processor execution is coordinated using synchronization primitives
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

*Prof. Michael Flynn, Stanford
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• A SIMD computer applies a single instruction stream to multiple data streams, for operations that can be naturally parallelized, e.g., the Intel SIMD instruction extensions or the NVIDIA Graphics Processing Unit (GPU)
SIMD: Single Instruction, Multiple Data
• Scalar processing
  – traditional mode
  – one operation produces one result
• SIMD processing
  – with Intel SSE/SSE2 (SSE = Streaming SIMD Extensions)
  – one operation produces multiple results
[Figure: scalar add computes X + Y, one result; packed SIMD add computes (x3,x2,x1,x0) + (y3,y2,y1,y0) = (x3+y3, x2+y2, x1+y1, x0+y0) in one operation]
Slide source: Alex Klimovitski & Dean Macri, Intel Corporation
What does this mean to you?
• In addition to SIMD extensions, the processor may have other special instructions
  – Fused Multiply-Add (FMA) instructions: x = y + c*z is so common that some processors execute the multiply/add as a single instruction, at the same rate (bandwidth) as + or * alone
• In theory, the compiler understands all of this
  – When compiling, it will rearrange instructions to get a good "schedule" that maximizes pipelining and uses FMAs and SIMD
  – It works with the mix of instructions inside an inner loop or other block of code
• But in practice the compiler may need your help
  – Choose a different compiler, optimization flags, etc.
  – Rearrange your code to make things more obvious
  – Use special functions ("intrinsics") or write in assembly
Intel SIMD Extensions
• MMX: 64-bit registers, reusing the floating-point registers [1992]
• SSE2/3/4: 8 new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
SSE/SSE2 SIMD on Intel
• SSE2 data types: anything that fits into 16 bytes, e.g.,
  – 16 x bytes, 4 x floats, 2 x doubles
• Instructions perform add, multiply, etc. on all the data in parallel
• Similar on GPUs and vector processors (but with many more simultaneous operations)
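A small sketch of "one operation produces multiple results" using SSE intrinsics: `_mm_add_ps` adds four packed floats with a single instruction. This assumes an x86 target with SSE; the function name `add4` is an illustrative choice.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out[i] = x[i] + y[i] for i = 0..3, one SSE add. */
void add4(const float *x, const float *y, float *out) {
    __m128 vx = _mm_loadu_ps(x);             /* load [x3|x2|x1|x0] */
    __m128 vy = _mm_loadu_ps(y);             /* load [y3|y2|y1|y0] */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* four sums at once */
}
```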
Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: one 128-bit register partitioned as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, or 2 x 64-bit elements]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: doubleword (32 bits)
  – Double-precision FP: quadword (64 bits)
Packed and Scalar Double-Precision Floating-Point Operations
[Figure: a packed operation acts on both 64-bit elements of the register; a scalar operation acts only on the low element]
SSE/SSE2 Floating Point Instructions
xmm: one operand is a 128-bit SSE2 register
mem/xmm: the other operand is in memory or an SSE2 register
{SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
{A} means the 128-bit operand is aligned in memory
{U} means the 128-bit operand is unaligned in memory
{H} means move the high half of the 128-bit operand
{L} means move the low half of the 128-bit operand
Move does both load and store
Example: SIMD Array Processing

Goal: for each f in array, f = sqrt(f)

Scalar style:
for each f in array {
    load f into a floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD style:
for each group of 4 members in array {
    load 4 members into the SSE register
    calculate 4 square roots in one operation
    store the 4 results from the register to memory
}
Data-Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops:
  for (i = 1000; i > 0; i = i-1)
      x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll the loop and adjust the iteration rate
Loop Unrolling in C
• Instead of the compiler doing loop unrolling, you could do it yourself in C:
  for (i = 1000; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1000; i > 0; i = i-4) {
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
Generalizing Loop Unrolling
• A loop of n iterations, unrolled with k copies of the body
• Assuming (n mod k) ≠ 0:
  – first run the loop with 1 copy of the body (n mod k) times
  – then run it with k copies of the body floor(n/k) times
General Loop Unrolling with a Head
• Handling loop iterations indivisible by the step size:
  for (i = 1003; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1003; i > 1000; i--)      // handle the head (1003 mod 4 = 3 iterations)
      x[i] = x[i] + s;
  for (i = 1000; i > 0; i = i-4) {   // handle the remaining iterations
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
Tail method for general loop unrolling
• Handling loop iterations indivisible by the step size:
  for (i = 1003; i > 0; i = i-1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1003; i > 1003 % 4; i = i-4) {
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
  }
  for (i = 1003 % 4; i > 0; i--)   // special handling in the tail
      x[i] = x[i] + s;
Another loop unrolling example

Normal loop:
int x;
for (x = 0; x < 103; x++) {
    delete(x);
}

After loop unrolling:
int x;
for (x = 0; x < 103/5*5; x += 5) {
    delete(x);
    delete(x+1);
    delete(x+2);
    delete(x+3);
    delete(x+4);
}
/* Tail */
for (x = 103/5*5; x < 103; x++) {
    delete(x);
}
Intel SSE Intrinsics
• Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions
• Vector data type: __m128d
• Load and store operations:
  Intrinsic        Corresponding SSE instruction
  _mm_load_pd      MOVAPD / aligned, packed double
  _mm_store_pd     MOVAPD / aligned, packed double
  _mm_loadu_pd     MOVUPD / unaligned, packed double
  _mm_storeu_pd    MOVUPD / unaligned, packed double
• Load and broadcast across vector:
  _mm_load1_pd     MOVSD + shuffling/duplicating
• Arithmetic:
  _mm_add_pd       ADDPD / add, packed double
  _mm_mul_pd       MULPD / multiply, packed double
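A minimal use of the intrinsics listed above, combining a broadcast load with a packed multiply and add. This is a sketch assuming an x86 target with SSE2; the function name `fma2` is an illustrative choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* out[i] = x[i] * (*c) + y[i] for i = 0, 1, using packed doubles. */
void fma2(const double *x, const double *c, const double *y, double *out) {
    __m128d vx = _mm_loadu_pd(x);   /* [x1|x0] */
    __m128d vc = _mm_load1_pd(c);   /* [c|c], value broadcast to both halves */
    __m128d vy = _mm_loadu_pd(y);   /* [y1|y0] */
    _mm_storeu_pd(out, _mm_add_pd(_mm_mul_pd(vx, vc), vy));
}
```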
Example 1: Use of SSE SIMD instructions
• Scalar loop:
  for (i = 0; i < n; i++) sum = sum + a[i];
• SIMD version (sketch):
  Set a 128-bit vector temp = 0;
  for (i = 0; i < n/4*4; i = i+4) {
      Add 4 integers (128 bits) from &a[i] to temp;
  }
  Tail: copy out the 4 integers of temp and add them together into sum.
  for (i = n/4*4; i < n; i++) sum += a[i];
Related SSE SIMD instructions
__m128i _mm_setzero_si128()                    returns a 128-bit zero vector
__m128i _mm_loadu_si128(__m128i *p)            loads the data stored at pointer p in memory into a 128-bit vector and returns it
__m128i _mm_add_epi32(__m128i a, __m128i b)    returns the vector (a0+b0, a1+b1, a2+b2, a3+b3)
void _mm_storeu_si128(__m128i *p, __m128i a)   stores the content of the 128-bit vector "a" to memory starting at pointer p
Related SSE SIMD instructions
• Add 4 integers (128 bits) from &a[i] to the temp vector, with loop body temp = temp + a[i]
• Add 128 bits, then the next 128 bits, ...

__m128i temp = _mm_setzero_si128();
__m128i temp1 = _mm_loadu_si128((__m128i *)(a + i));
temp = _mm_add_epi32(temp, temp1);
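Putting the pieces of Example 1 together, a runnable sketch of the whole sum: keep 4 partial sums in an XMM register, then combine them and handle the tail in scalar code. Assumes an x86 target with SSE2; the function name `sum_array` is an illustrative choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum a[0..n-1] using 4-wide SIMD adds plus a scalar tail. */
int sum_array(const int *a, int n) {
    __m128i temp = _mm_setzero_si128();        /* 4 partial sums = 0 */
    int i;
    for (i = 0; i < n/4*4; i += 4) {           /* 4 ints per iteration */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        temp = _mm_add_epi32(temp, v);
    }
    int part[4];
    _mm_storeu_si128((__m128i *)part, temp);   /* copy out partial sums */
    int sum = part[0] + part[1] + part[2] + part[3];
    for (i = n/4*4; i < n; i++)                /* tail */
        sum += a[i];
    return sum;
}
```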
Example 2: 2x2 Matrix Multiply

Definition of Matrix Multiply:
C(i,j) = (A×B)(i,j) = Σ(k=1..2) A(i,k) × B(k,j)

[A1,1 A1,2]   [B1,1 B1,2]   [C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2]
[A2,1 A2,2] x [B2,1 B2,2] = [C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2]

[1 0]   [1 3]   [C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3]
[0 1] x [2 4] = [C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4]
Example: 2x2 Matrix Multiply
• Using the XMM registers
  – 64-bit/double precision, two doubles per XMM register
  – C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2] (C stored in memory in column order)
  – B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2] (each value duplicated in both halves)
  – A  = [A1,i | A2,i] (column i of A)
Example: 2x2 Matrix Multiply
• Initialization: C1 = [0 | 0], C2 = [0 | 0]
• i = 1:
  – A  = [A1,1 | A2,1]
    _mm_load_pd: loads 2 doubles into an XMM register (A is stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]
    _mm_load1_pd: SSE instruction that loads a double word and stores it in both the high and low double words of the XMM register (duplicates the value in both halves)
Example: 2x2 Matrix Multiply
• First iteration intermediate result (i = 1):
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
  – A  = [A1,1 | A2,1] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
Example: 2x2 Matrix Multiply
• First iteration result, with second iteration loads (i = 2):
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1], C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]
  – A  = [A1,2 | A2,2] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
Example: 2x2 Matrix Multiply
• Second iteration intermediate result (i = 2):
  – C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
  – C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    SSE instructions first do parallel multiplies and then parallel adds in XMM registers
  – A  = [A1,2 | A2,2] (_mm_load_pd: A stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2] (_mm_load1_pd: duplicates the value in both halves of the XMM register)
Example: 2x2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__((aligned(16)));
    double B[4] __attribute__((aligned(16)));
    double C[4] __attribute__((aligned(16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for the example
    /* A = (note column order!)
       1 0
       0 1 */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4 */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0 */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2x2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C + 0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C + 1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22] */
        a = _mm_load_pd(A + i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21] */
        b1 = _mm_load1_pd(B + i + 0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22] */
        b2 = _mm_load1_pd(B + i + 1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C + 0*lda, c1);
    _mm_store_pd(C + 1*lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
Conclusion
• Flynn Taxonomy
• Intel SSE SIMD Instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch that operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through the use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler