Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
CS61C:GreatIdeasinComputerArchitecture
Lecture18:ParallelProcessing– SIMD
BernhardBoser&RandyKatz
http://inst.eecs.berkeley.edu/~cs61c
ReferenceProblem
•Matrixmultiplication−Basicoperationinmanyengineering,data,andimagingprocessingtasks
−Imagefiltering,noisereduction,…−Manycloselyrelatedoperations
§ E.g.stereovision(project4)
•dgemm−doubleprecisionfloatingpointmatrixmultiplication
CS61c Lecture18:ParallelProcessing- SIMD 5
ApplicationExample:DeepLearning
• Imageclassification(cats…)•Pick“best”vacationphotos•Machinetranslation•Cleanupaccent•Fingerprintverification•Automaticgameplaying
CS61c Lecture18:ParallelProcessing- SIMD 6
Matrices
CS61c Lecture18:ParallelProcessing- SIMD 7
𝑐"#
• Square(orrectangular)NxNarrayofnumbers− DimensionN
𝐶 = 𝐴 ' 𝐵
𝑐"# = )𝑎"+𝑏+#
�
+
𝑖
𝑗N-1
N-1
00
MatrixMultiplication
CS61c 8
𝑪 = 𝑨 ' 𝑩𝑐"# = )𝑎"+𝑏+#
�
+
𝑖
𝑗
𝑘
𝑘
Reference:Python• MatrixmultiplicationinPython
CS61c Lecture18:ParallelProcessing- SIMD 9
N Python[Mflops]32 5.4160 5.5480 5.4960 5.3
• 1Mflop =1Millionfloatingpointoperationspersecond(fadd,fmul)
• dgemm(N…)takes2*N3 flops
C
• c=axb• a,b,careNxNmatrices
CS61c Lecture18:ParallelProcessing- SIMD 10
TimingProgramExecution
CS61c Lecture18:ParallelProcessing- SIMD 11
CversusPython
CS61c Lecture18:ParallelProcessing- SIMD 12
N C[Gflops] Python[Gflops]32 1.30 0.0054160 1.30 0.0055480 1.32 0.0054960 0.91 0.0053
Whichclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!
240x!
New-SchoolMachineStructures(It’sabitmorecomplicated!)
• ParallelRequestsAssigned tocomputere.g.,Search“Katz”
• ParallelThreadsAssigned tocoree.g.,Lookup,Ads
• ParallelInstructions>[email protected].,5pipelined instructions
• ParallelData>1dataitem@one timee.g.,Addof4pairsofwords
• HardwaredescriptionsAllgates@onetime
• ProgrammingLanguages 16
SmartPhone
WarehouseScale
Computer
SoftwareHardware
HarnessParallelism&AchieveHighPerformance
LogicGates
Core Core…
Memory(Cache)
Input/Output
Computer
CacheMemory
Core
InstructionUnit(s) FunctionalUnit(s)
A3+B3A2+B2A1+B1A0+B0
Today’sLecture
Multiple-Instruction/Single-DataStream(MISD)
• Multiple-Instruction,Single-Datastreamcomputerthatexploitsmultipleinstructionstreamsagainstasingledatastream.• Historicalsignificance
20CS61c Lecture18:ParallelProcessing- SIMD
Thishasfewapplications.Notcoveredin61C.
SIMDApplications&Implementations
• Applications− Scientificcomputing
§ Matlab,NumPy− Graphicsandvideoprocessing
§ Photoshop,…− BigData
§ Deeplearning− Gaming−…
• Implementations− x86− ARM−…
CS61c Lecture18:ParallelProcessing- SIMD 24
RawDoublePrecisionThroughput(Bernhard’sPowerbook Pro)
Characteristic Value
CPU i7-5557U
Clockrate(sustained) 3.1GHz
Instructions perclock(mul_pd) 2
Parallel multipliesperinstruction 4
Peakdoubleflops 24.8Gflops
CS61c Lecture18:ParallelProcessing- SIMD 36
Actualperformanceislowerbecauseofoverhead
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
VectorizedMatrixMultiplication
CS61c 37
𝑖
𝑗
𝑘
𝑘
InnerLoop:
fori …;i+=4forj...
i+=4
“Vectorized”dgemm
CS61c Lecture18:ParallelProcessing- SIMD 38
Performance
NGflops
scalar avx32 1.30 4.56160 1.30 5.47480 1.32 5.27960 0.91 3.64
CS61c Lecture18:ParallelProcessing- SIMD 39
• 4xfaster• Butstill<<theoretical25Gflops!
PipelineHazards– dgemm
CS61c Lecture18:ParallelProcessing- SIMD 54
LoopUnrolling
CS61c Lecture18:ParallelProcessing- SIMD 55
Compilerdoestheunrolling
Howdoyouverifythatthegeneratedcodeisactuallyunrolled?
4registers
Performance
NGflops
scalar avx unroll32 1.30 4.56 12.95160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91
CS61c Lecture18:ParallelProcessing- SIMD 56