CS 61C: Great Ideas in Computer Architecture Lecture 18...

Preview:

Citation preview

CS61C:GreatIdeasinComputerArchitecture

Lecture18:ParallelProcessing– SIMD

BernhardBoser&RandyKatz

http://inst.eecs.berkeley.edu/~cs61c

ReferenceProblem

•Matrixmultiplication−Basicoperationinmanyengineering,data,andimagingprocessingtasks

−Imagefiltering,noisereduction,…−Manycloselyrelatedoperations

§ E.g.stereovision(project4)

•dgemm−doubleprecisionfloatingpointmatrixmultiplication

CS61c Lecture18:ParallelProcessing- SIMD 5

ApplicationExample:DeepLearning

• Imageclassification(cats…)•Pick“best”vacationphotos•Machinetranslation•Cleanupaccent•Fingerprintverification•Automaticgameplaying

CS61c Lecture18:ParallelProcessing- SIMD 6

Matrices

CS61c Lecture18:ParallelProcessing- SIMD 7

𝑐"#

• Square(orrectangular)NxNarrayofnumbers− DimensionN

𝐶 = 𝐴 ' 𝐵

𝑐"# = )𝑎"+𝑏+#

+

𝑖

𝑗N-1

N-1

00

MatrixMultiplication

CS61c 8

𝑪 = 𝑨 ' 𝑩𝑐"# = )𝑎"+𝑏+#

+

𝑖

𝑗

𝑘

𝑘

Reference:Python• MatrixmultiplicationinPython

CS61c Lecture18:ParallelProcessing- SIMD 9

N Python[Mflops]32 5.4160 5.5480 5.4960 5.3

• 1Mflop =1Millionfloatingpointoperationspersecond(fadd,fmul)

• dgemm(N…)takes2*N3 flops

C

• c=axb• a,b,careNxNmatrices

CS61c Lecture18:ParallelProcessing- SIMD 10

TimingProgramExecution

CS61c Lecture18:ParallelProcessing- SIMD 11

CversusPython

CS61c Lecture18:ParallelProcessing- SIMD 12

N C[Gflops] Python[Gflops]32 1.30 0.0054160 1.30 0.0055480 1.32 0.0054960 0.91 0.0053

Whichclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!

240x!

New-SchoolMachineStructures(It’sabitmorecomplicated!)

• ParallelRequestsAssigned tocomputere.g.,Search“Katz”

• ParallelThreadsAssigned tocoree.g.,Lookup,Ads

• ParallelInstructions>1instruction@onetimee.g.,5pipelined instructions

• ParallelData>1dataitem@one timee.g.,Addof4pairsofwords

• HardwaredescriptionsAllgates@onetime

• ProgrammingLanguages 16

SmartPhone

WarehouseScale

Computer

SoftwareHardware

HarnessParallelism&AchieveHighPerformance

LogicGates

Core Core…

Memory(Cache)

Input/Output

Computer

CacheMemory

Core

InstructionUnit(s) FunctionalUnit(s)

A3+B3A2+B2A1+B1A0+B0

Today’sLecture

Multiple-Instruction/Single-DataStream(MISD)

• Multiple-Instruction,Single-Datastreamcomputerthatexploitsmultipleinstructionstreamsagainstasingledatastream.• Historicalsignificance

20CS61c Lecture18:ParallelProcessing- SIMD

Thishasfewapplications.Notcoveredin61C.

SIMDApplications&Implementations

• Applications− Scientificcomputing

§ Matlab,NumPy− Graphicsandvideoprocessing

§ Photoshop,…− BigData

§ Deeplearning− Gaming−…

• Implementations− x86− ARM−…

CS61c Lecture18:ParallelProcessing- SIMD 24

RawDoublePrecisionThroughput(Bernhard’sPowerbook Pro)

Characteristic Value

CPU i7-5557U

Clockrate(sustained) 3.1GHz

Instructions perclock(mul_pd) 2

Parallel multipliesperinstruction 4

Peakdoubleflops 24.8Gflops

CS61c Lecture18:ParallelProcessing- SIMD 36

Actualperformanceislowerbecauseofoverhead

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

VectorizedMatrixMultiplication

CS61c 37

𝑖

𝑗

𝑘

𝑘

InnerLoop:

fori …;i+=4forj...

i+=4

“Vectorized”dgemm

CS61c Lecture18:ParallelProcessing- SIMD 38

Performance

NGflops

scalar avx32 1.30 4.56160 1.30 5.47480 1.32 5.27960 0.91 3.64

CS61c Lecture18:ParallelProcessing- SIMD 39

• 4xfaster• Butstill<<theoretical25Gflops!

PipelineHazards– dgemm

CS61c Lecture18:ParallelProcessing- SIMD 54

LoopUnrolling

CS61c Lecture18:ParallelProcessing- SIMD 55

Compilerdoestheunrolling

Howdoyouverifythatthegeneratedcodeisactuallyunrolled?

4registers

Performance

NGflops

scalar avx unroll32 1.30 4.56 12.95160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91

CS61c Lecture18:ParallelProcessing- SIMD 56

Recommended