CS 61C: Great Ideas in Computer Architecture

Lecture 18: Parallel Processing – SIMD

Bernhard Boser & Randy Katz

http://inst.eecs.berkeley.edu/~cs61c

Reference Problem

• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g. stereo vision (project 4)
• dgemm
  − double-precision floating-point matrix multiplication

Application Example: Deep Learning

• Image classification (cats …)
• Pick “best” vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing

Matrices

• Square (or rectangular) N x N array of numbers
  − Dimension N

$C = A \cdot B$

$c_{ij} = \sum_{k} a_{ik} b_{kj}$

[Figure: matrix element $c_{ij}$; column index $j$ runs to N-1]

Matrix Multiplication

$C = A \cdot B$

$c_{ij} = \sum_{k} a_{ik} b_{kj}$
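The sum over k in the definition above maps directly onto three nested loops. The following is a minimal sketch in C, not the code from the slides: the function name `dgemm_scalar` and the column-major storage layout (element (i, j) at index `i + j*n`) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: C = A * B for n x n matrices of doubles.
 * Assumes column-major storage: element (i, j) lives at index i + j*n. */
void dgemm_scalar(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double cij = 0.0;                        /* accumulates c_ij */
            for (size_t k = 0; k < n; k++) {
                cij += a[i + k * n] * b[k + j * n];  /* a_ik * b_kj */
            }
            c[i + j * n] = cij;
        }
    }
}
```

Each of the N x N output elements costs N multiplies and N adds in the innermost loop, which is where a total of 2*N^3 floating-point operations comes from.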

Reference: Python

• Matrix multiplication in Python

N     Python [Mflops]
32    5.4
160   5.5
480   5.4
960   5.3

• 1 Mflops = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N …) takes 2*N^3 flops
• c = a x b
• a, b, c are N x N matrices

Timing Program Execution
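One way to obtain flop rates like those reported in this lecture is to time a single dgemm call and divide the 2*N^3 flop count by the elapsed time. The sketch below assumes the standard C `clock()` timer, which measures CPU time; the names `dgemm_gflops`, `dgemm_naive`, and `time_dgemm`, the random test data, and the column-major layout are all hypothetical choices, not the timing program from the slides.

```c
#include <stdlib.h>
#include <time.h>

/* Gflops achieved by one n x n dgemm that ran for `seconds`,
 * using the 2*N^3 flop count (N multiplies and N adds per output). */
double dgemm_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Naive triple-loop dgemm; the column-major layout is an assumption. */
static void dgemm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = 0.0;
            for (size_t k = 0; k < n; k++)
                cij += a[i + k * n] * b[k + j * n];
            c[i + j * n] = cij;
        }
}

/* Runs one n x n dgemm (n >= 1) and returns the elapsed CPU time in
 * seconds, or -1.0 if memory allocation fails. */
double time_dgemm(size_t n)
{
    double *a = malloc(n * n * sizeof *a);
    double *b = malloc(n * n * sizeof *b);
    double *c = malloc(n * n * sizeof *c);
    double seconds = -1.0;

    if (a && b && c) {
        for (size_t i = 0; i < n * n; i++) {
            a[i] = (double)rand() / RAND_MAX;   /* arbitrary test data */
            b[i] = (double)rand() / RAND_MAX;
        }
        clock_t start = clock();
        dgemm_naive(n, a, b, c);
        clock_t stop = clock();
        volatile double sink = c[n * n - 1];    /* keep the result live */
        (void)sink;
        seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    }
    free(a);
    free(b);
    free(c);
    return seconds;
}
```

A call such as `dgemm_gflops(960, time_dgemm(960))` would then give the rate for N = 960. For small N the elapsed time can fall below the resolution of `clock()`, so repeating the call in a loop and dividing by the repetition count gives steadier numbers.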

C versus Python

N     C [Gflops]   Python [Gflops]
32    1.30         0.0054
160   1.30         0.0055
480   1.32         0.0054
960   0.91         0.0053

Which class gives you this kind of power? We could stop here … but why? Let’s do better!

New-School Machine Structures (It’s a bit more complicated!)

Software

• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today’s lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages

Hardware

[Figure: hardware levels from Warehouse Scale Computer and Smart Phone, to Computer (Core … Core, Memory (Cache), Input/Output), to Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), to Cache Memory and Logic Gates. Caption: Harness Parallelism & Achieve High Performance]

Multiple-Instruction/Single-Data Stream (MISD)

• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance

This has few applications. Not covered in 61C.

SIMD Applications & Implementations

• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …

Raw Double Precision Throughput (Bernhard’s Powerbook Pro)

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double flops                     24.8 Gflops

Actual performance is lower because of overhead
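The peak figure in the table is the product of the three rows above it. Written out as a worked equation, counting each parallel multiply as one flop as the table does:

```latex
\text{peak}
  = 3.1 \times 10^{9}\ \tfrac{\text{clocks}}{\text{s}}
    \times 2\ \tfrac{\text{instructions}}{\text{clock}}
    \times 4\ \tfrac{\text{flops}}{\text{instruction}}
  = 24.8 \times 10^{9}\ \tfrac{\text{flops}}{\text{s}}
  = 24.8\ \text{Gflops}
```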

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Vectorized Matrix Multiplication

Inner Loop:

for i …; i+=4
  for j ...

“Vectorized” dgemm
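The loop outline above (step i by 4, then loop over j) can be filled in with AVX intrinsics so that four double-precision elements of one column of C are computed per instruction. This is a sketch under stated assumptions rather than the code shown on the slide: the name `dgemm_avx`, the column-major layout, the use of unaligned loads and stores, and the scalar fallback for builds without AVX enabled (for example, without `-mavx`) are all illustrative choices. It requires n to be a multiple of 4.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sketch: C = A * B for n x n doubles, column-major (element (i, j) at
 * i + j*n). Assumes n is a multiple of 4. */
void dgemm_avx(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < n; j++) {
#if defined(__AVX__)
            __m256d c0 = _mm256_setzero_pd();   /* c[i..i+3][j] */
            for (size_t k = 0; k < n; k++) {
                /* four consecutive rows of column k of A */
                __m256d a_col = _mm256_loadu_pd(a + i + k * n);
                /* b[k][j] copied into all four lanes */
                __m256d b_kj = _mm256_broadcast_sd(b + k + j * n);
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(a_col, b_kj));
            }
            _mm256_storeu_pd(c + i + j * n, c0);
#else
            /* Portable fallback: same 4-wide structure in scalar code. */
            double c0[4] = {0.0, 0.0, 0.0, 0.0};
            for (size_t k = 0; k < n; k++) {
                double b_kj = b[k + j * n];
                for (size_t lane = 0; lane < 4; lane++)
                    c0[lane] += a[i + lane + k * n] * b_kj;
            }
            for (size_t lane = 0; lane < 4; lane++)
                c[i + lane + j * n] = c0[lane];
#endif
        }
    }
}
```

Column-major storage is what makes the four values `a[i..i+3][k]` contiguous in memory, so a single 256-bit load fetches all of them.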

Performance

N     scalar [Gflops]   avx [Gflops]
32    1.30              4.56
160   1.30              5.47
480   1.32              5.27
960   0.91              3.64

• 4x faster
• But still << theoretical 25 Gflops!

Pipeline Hazards – dgemm

Loop Unrolling

Compiler does the unrolling

How do you verify that the generated code is actually unrolled?

4 registers
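With a single accumulator, every add in the inner loop must wait for the previous add to finish, which stalls the pipeline. Keeping four independent accumulators (the "4 registers") gives the hardware independent multiply and add chains to overlap. The sketch below is one plausible form, not the slide's code: the name `dgemm_unroll`, the `UNROLL` macro, the column-major layout, and the scalar fallback are assumptions. The short fixed-count loops over `x` are written as loops and left for an optimizing compiler to unroll. It requires n to be a multiple of 16.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

#define UNROLL 4   /* number of independent 4-wide accumulators */

/* Sketch: C = A * B for n x n doubles, column-major (element (i, j) at
 * i + j*n). Assumes n is a multiple of UNROLL * 4 = 16. */
void dgemm_unroll(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i += UNROLL * 4) {
        for (size_t j = 0; j < n; j++) {
#if defined(__AVX__)
            __m256d acc[UNROLL];               /* c[i..i+15][j] */
            for (int x = 0; x < UNROLL; x++)
                acc[x] = _mm256_setzero_pd();
            for (size_t k = 0; k < n; k++) {
                /* b[k][j] is shared by all four accumulators */
                __m256d b_kj = _mm256_broadcast_sd(b + k + j * n);
                for (int x = 0; x < UNROLL; x++) {
                    __m256d a_col = _mm256_loadu_pd(a + i + x * 4 + k * n);
                    acc[x] = _mm256_add_pd(acc[x], _mm256_mul_pd(a_col, b_kj));
                }
            }
            for (int x = 0; x < UNROLL; x++)
                _mm256_storeu_pd(c + i + x * 4 + j * n, acc[x]);
#else
            /* Portable fallback: 16 independent scalar accumulators. */
            double acc[UNROLL * 4] = {0.0};
            for (size_t k = 0; k < n; k++) {
                double b_kj = b[k + j * n];
                for (size_t lane = 0; lane < UNROLL * 4; lane++)
                    acc[lane] += a[i + lane + k * n] * b_kj;
            }
            for (size_t lane = 0; lane < UNROLL * 4; lane++)
                c[i + lane + j * n] = acc[lane];
#endif
        }
    }
}
```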

Performance

N     scalar [Gflops]   avx [Gflops]   unroll [Gflops]
32    1.30              4.56           12.95
160   1.30              5.47           19.70
480   1.32              5.27           14.50
960   0.91              3.64           6.91
