CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMD Bernhard Boser & Randy Katz http://inst.eecs.berkeley.edu/~cs61c



Reference Problem

• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g., stereo vision (project 4)
• dgemm
  − Double-precision floating-point matrix multiplication


Application Example: Deep Learning

• Image classification (cats …)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing


Matrices

• Square (or rectangular) N x N array of numbers
  − Dimension N

$$C = A \cdot B$$

$$c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: N x N matrix with row index i and column index j, each running from 0 to N-1, with element c_ij marked]


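Matrix Multiplication

$$C = A \cdot B, \qquad c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: matrix multiplication diagram with row index i, column index j, and summation index k]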


Reference: Python

• Matrix multiplication in Python

N      Python [Mflops]
32     5.4
160    5.5
480    5.4
960    5.3

• 1 Mflop = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N …) takes 2*N^3 flops
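The Python listing from this slide did not survive extraction. As a minimal sketch (not the lecture's own code), a naive triple-loop dgemm and the 2*N^3 flop count could look like this; the function names `dgemm` and `dgemm_flops` and the list-of-rows layout are assumptions made for illustration.

```python
def dgemm(n, a, b):
    """Return c = a * b for n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                # One floating-point multiply and one add per iteration.
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def dgemm_flops(n):
    """Flops for one dgemm: n*n output entries, each needing n multiplies and n adds."""
    return 2 * n ** 3


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(dgemm(2, a, b))      # [[19.0, 22.0], [43.0, 50.0]]
print(dgemm_flops(32))     # 65536
```

Each innermost iteration performs one multiply and one add, which is where the factor of 2 in 2*N^3 comes from.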


C

• c = a x b
• a, b, c are N x N matrices


Timing Program Execution
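The timing code shown on this slide was lost in extraction. One plausible way to time a run and convert it to Gflops is sketched below; `measure_gflops`, the test matrices, and the best-of-three policy are assumptions for illustration, and the naive `dgemm` is only a stand-in workload.

```python
import time


def dgemm(n, a, b):
    """Naive n x n matrix product on lists of rows (stand-in workload)."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def measure_gflops(n, repeats=3):
    """Time dgemm for dimension n and convert the fastest run to Gflops."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dgemm(n, a, b)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)            # guard against a zero reading on a coarse clock
    return 2 * n ** 3 / best / 1e9    # one dgemm performs 2*N^3 flops


print(f"N=32: {measure_gflops(32):.4f} Gflops")
```

Taking the fastest of several runs reduces noise from other processes; the absolute number depends entirely on the machine and interpreter.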


C versus Python

N      C [Gflops]    Python [Gflops]
32     1.30          0.0054
160    1.30          0.0055
480    1.32          0.0054
960    0.91          0.0053

240x!

Which class gives you this kind of power? We could stop here … but why? Let's do better!
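The 240x figure can be checked directly from the C and Python Gflops values in the table. A short sketch, using only the numbers quoted on the slide (the variable names are illustrative):

```python
# (N, C Gflops, Python Gflops) as quoted in the table.
rows = [
    (32, 1.30, 0.0054),
    (160, 1.30, 0.0055),
    (480, 1.32, 0.0054),
    (960, 0.91, 0.0053),
]

speedups = {n: c_gflops / py_gflops for n, c_gflops, py_gflops in rows}

for n, s in speedups.items():
    print(f"N={n}: C is about {s:.0f}x faster than Python")
```

The ratio is close to 240x for the three smaller sizes and drops to roughly 170x at N = 960, where the C throughput itself falls to 0.91 Gflops.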


New-School Machine Structures (It's a bit more complicated!)

• Parallel Requests
  Assigned to computer, e.g., search "Katz"
• Parallel Threads
  Assigned to core, e.g., lookup, ads
• Parallel Instructions
  >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages

Harness Parallelism & Achieve High Performance

[Figure: software/hardware hierarchy from warehouse-scale computer and smartphone, to computer (cores, memory (cache), input/output), to core (instruction unit(s), functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), cache memory, and logic gates; a "Today's Lecture" marker highlights part of the hierarchy]


Multiple-Instruction/Single-Data Stream (MISD)

• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance

This has few applications. Not covered in 61C.


SIMD Applications & Implementations

• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …


Raw Double Precision Throughput (Bernhard's Powerbook Pro)

Characteristic                         Value
CPU                                    i7-5557U
Clock rate (sustained)                 3.1 GHz
Instructions per clock (mul_pd)        2
Parallel multiplies per instruction    4
Peak double flops                      24.8 Gflops

Actual performance is lower because of overhead

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
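The peak figure is simply the product of the three characteristics in the table. A small sketch of that arithmetic, using the slide's values (variable names are illustrative):

```python
clock_hz = 3.1e9                  # sustained clock rate
instructions_per_clock = 2        # mul_pd instructions issued per cycle
multiplies_per_instruction = 4    # doubles processed by one instruction

peak_flops = clock_hz * instructions_per_clock * multiplies_per_instruction
peak_gflops = peak_flops / 1e9

print(f"{peak_gflops:.1f} Gflops")   # 24.8 Gflops
```

This is an upper bound: it assumes every cycle issues two full-width multiply instructions, with no loads, stores, loop overhead, or stalls.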


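Vectorized Matrix Multiplication

[Figure: matrix multiplication diagram with row index i, column index j, and summation index k; i advances in steps of 4]

Inner Loop:

for i …; i += 4
  for j ...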


"Vectorized" dgemm
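The code listing for this slide was lost in extraction. The sketch below only emulates the loop structure of a 4-wide vectorized dgemm in plain Python; it is not the lecture's AVX code and gives no speedup. The helpers `load4`, `broadcast`, `mul4`, and `add4` are hypothetical stand-ins for 4-lane vector instructions, and the column-major flat-list layout (element (i, j) at index i + j*n) is an assumption.

```python
LANES = 4  # one vector holds 4 doubles


def load4(m, idx):
    """Emulate a vector load of 4 consecutive elements."""
    return m[idx:idx + LANES]


def broadcast(x):
    """Emulate copying one scalar into all 4 lanes."""
    return [x] * LANES


def mul4(x, y):
    """Emulate a lane-wise multiply."""
    return [p * q for p, q in zip(x, y)]


def add4(x, y):
    """Emulate a lane-wise add."""
    return [p + q for p, q in zip(x, y)]


def dgemm_vec(n, a, b, c):
    """c += a * b for column-major n x n matrices; i advances 4 rows at a time."""
    assert n % LANES == 0, "sketch assumes n is a multiple of 4"
    for i in range(0, n, LANES):
        for j in range(n):
            c0 = load4(c, i + j * n)                 # c[i..i+3][j]
            for k in range(n):
                a_col = load4(a, i + k * n)          # a[i..i+3][k]
                b_kj = broadcast(b[k + j * n])       # b[k][j] in every lane
                c0 = add4(c0, mul4(a_col, b_kj))
            c[i + j * n:i + j * n + LANES] = c0      # store 4 results at once
```

Each pass of the k loop updates four entries of one column of c, which is why the outer loop steps i by 4.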


Performance

       Gflops
N      scalar    avx
32     1.30      4.56
160    1.30      5.47
480    1.32      5.27
960    0.91      3.64

• 4x faster
• But still << theoretical 25 Gflops!
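Both claims can be checked against the table. The sketch below computes the avx-over-scalar speedup and the fraction of the 24.8 Gflops peak quoted on the throughput slide; the variable names are illustrative.

```python
PEAK_GFLOPS = 24.8  # peak double-precision throughput quoted earlier in the deck

# (N, scalar Gflops, avx Gflops) as quoted in the table.
rows = [
    (32, 1.30, 4.56),
    (160, 1.30, 5.47),
    (480, 1.32, 5.27),
    (960, 0.91, 3.64),
]

speedups = {n: avx / scalar for n, scalar, avx in rows}
peak_fraction = {n: avx / PEAK_GFLOPS for n, _, avx in rows}

for n in speedups:
    print(f"N={n}: {speedups[n]:.2f}x over scalar, "
          f"{100 * peak_fraction[n]:.0f}% of peak")
```

The speedup ranges from about 3.5x to about 4.2x, in line with 4 doubles per instruction, while even the best case stays under a quarter of the theoretical peak.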


Pipeline Hazards – dgemm


Loop Unrolling

Compiler does the unrolling

How do you verify that the generated code is actually unrolled?

4 registers
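The unrolled listing for this slide was lost in extraction. To illustrate the transformation itself, here is a hand-unrolled inner-product loop with four independent accumulators, which loosely play the role of the four registers mentioned above. This is a sketch of the general technique in plain Python, not the lecture's code; in practice the compiler performs this on the C loop, and the function names are illustrative.

```python
def dot_rolled(x, y):
    """Baseline: one multiply-add per loop iteration, one accumulator."""
    total = 0.0
    for k in range(len(x)):
        total += x[k] * y[k]
    return total


def dot_unrolled4(x, y):
    """Unrolled by 4: four independent accumulators per loop iteration."""
    n = len(x)
    assert n % 4 == 0, "sketch assumes the length is a multiple of 4"
    s0 = s1 = s2 = s3 = 0.0
    for k in range(0, n, 4):
        # The four updates do not depend on each other, so a pipelined
        # processor can overlap them instead of waiting on one chain.
        s0 += x[k] * y[k]
        s1 += x[k + 1] * y[k + 1]
        s2 += x[k + 2] * y[k + 2]
        s3 += x[k + 3] * y[k + 3]
    return (s0 + s1) + (s2 + s3)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0] * 8
print(dot_rolled(x, y), dot_unrolled4(x, y))   # 36.0 36.0
```

Unrolling cuts loop overhead to one branch per four multiply-adds and removes the single dependency chain through `total`. Because the additions are reordered, floating-point results can differ slightly from the rolled loop for general inputs.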


Performance

       Gflops
N      scalar    avx     unroll
32     1.30      4.56    12.95
160    1.30      5.47    19.70
480    1.32      5.27    14.50
960    0.91      3.64    6.91