CS61C:GreatIdeasinComputerArchitecture
PerformanceIronLaw,Amdahl’sLaw
Instructors:NicholasWeaver&VladimirStojanovichttp://inst.eecs.berkeley.edu/~cs61c/
New-SchoolMachineStructures(It’sabitmorecomplicated!)
• ParallelRequestsAssigned tocomputere.g.,Search“Katz”
• ParallelThreadsAssigned tocoree.g.,Lookup,Ads
• ParallelInstructions>[email protected].,5pipelined instructions
• ParallelData>1dataitem@one timee.g.,Addof4pairsofwords
• HardwaredescriptionsAllgates@onetime
• ProgrammingLanguages2
SmartPhone
WarehouseScale
Computer
SoftwareHardware
HarnessParallelism&AchieveHighPerformance
LogicGates
Core Core…
Memory(Cache)
Input/Output
Computer
CacheMemory
Core
InstructionUnit(s) FunctionalUnit(s)
A3+B3A2+B2A1+B1A0+B0
Howdoweknow?
WhatisPerformance?
• Latency(orresponsetimeorexecutiontime)– Timetocompleteonetask
• Bandwidth(orthroughput)– Taskscompletedperunittime• Ifyouhavesufficientindependenttasks,youcanalwaysthrowmoremoneyattheproblem:Throughput/$oftenamoreimportantmetricthanjustthroughput
3
CloudPerformance:WhyApplicationLatencyMatters
• Keyfigureofmerit:applicationresponsiveness– Longerthedelay,thefewertheuserclicks,thelesstheuserhappiness,andthelowertherevenueperuser
4
DefiningCPUPerformance• WhatdoesitmeantosayXisfasterthanY?
• Ferrarivs.SchoolBus?• 2013Ferrari599GTB– 2passengers,quartermilein10secs
• 2013TypeDschoolbus– 50passengers,quartermilein20secs
• ResponseTime (Latency):e.g.,timetotravel¼mile• Throughput (Bandwidth): e.g.,passenger-miin1hour
5
DefiningRelativeCPUPerformance• PerformanceX =1/ProgramExecutionTimeX• PerformanceX >PerformanceY =>1/ExecutionTimeX>1/ExecutionTimey=>ExecutionTimeY >ExecutionTimeX
• ComputerXisNtimesfasterthanComputerYPerformanceX /PerformanceY =NorExecutionTimeY /ExecutionTimeX=N
• BustoFerrariperformance:– Program:Transfer1000passengersfor1mile– Bus:3,200sec,Ferrari:40,000sec
6
MeasuringCPUPerformance
• Computersuseaclocktodeterminewheneventstakesplacewithinhardware
• Clockcycles: discretetimeintervals– akaclocks,cycles,clockperiods,clockticks
• Clockrateorclockfrequency: clockcyclespersecond(inverseofclockcycletime)
• 3GigaHertz clockrate=>clockcycletime=1/(3x109)seconds
clockcycletime=333picoseconds(ps)
7
CPUPerformanceFactors
• TodistinguishbetweenprocessortimeandI/O,CPUtimeistimespentinprocessor
• CPU Time/Program= Clock Cycles/Program
x Clock Cycle Time
• OrCPU Time/Program= Clock Cycles/Program ÷ Clock Rate
8
IronLawofPerformancebyEmer andClark
• A programexecutesinstructions• CPU Time/Program
= Clock Cycles/Program x Clock Cycle Time= Instructions/Program
x Average Clock Cycles/Instruction x Clock Cycle Time
• 1st termcalledInstructionCount• 2nd termabbreviatedCPIforaverageClockCyclesPerInstruction
• 3rdtermis1/Clockrate
9
RestatingPerformanceEquation• Time= Seconds
ProgramInstructions Clockcycles SecondsProgram Instruction ClockCycle
10
××=
WhatAffectsEachComponent?A)InstructionCount,B)CPI,C)ClockRate
AffectsWhat?(click inletterofcomponentnotaffected)
Algorithm
ProgrammingLanguageCompiler
InstructionSet Architecture11
WhatAffectsEachComponent?InstructionCount,CPI,ClockRate
AffectsWhat?Algorithm InstructionCount,
CPIProgrammingLanguage
InstructionCount,CPI
Compiler InstructionCount,CPI
InstructionSetArchitecture
InstructionCount,ClockRate,CPI
12
Clickers
• Whichcomputerhasthehighestperformanceforagivenprogram?
13
Computer Clockfrequency
Clock cyclesperinstruction
#instructionsperprogram
A 1GHz 2 1000
B 2GHz 5 800
C 500MHz 1.25 400
D 5GHz 10 2000
WorkloadandBenchmark
• Workload: Setofprogramsrunonacomputer– Actualcollectionofapplicationsrunormadefromrealprogramstoapproximatesuchamix
– Specifiesprograms,inputs,andrelativefrequencies• Benchmark:Programselectedforuseincomparingcomputerperformance– Benchmarksformaworkload– Usuallystandardizedsothatmanyusethem
14
SPEC(SystemPerformanceEvaluationCooperative)• ComputerVendorcooperativeforbenchmarks,startedin1989
• SPECCPU2006– 12IntegerPrograms– 17Floating-PointPrograms
• Oftenturnintonumberwherebiggerisfaster• SPECratio:referenceexecutiontimeonoldreferencecomputerdividebyexecutiontimeonnewcomputertogetaneffectivespeed-up
15
SPECINT2006onAMDBarcelonaDescription
Instruc-tion
Count (B)CPI
Clock cycle
time (ps)
Execu-tion
Time (s)
Refer-ence
Time (s)
SPEC-ratio
Interpreted string processing 2,118 0.75 400 637 9,770 15.3Block-sorting compression 2,389 0.85 400 817 9,650 11.8GNU C compiler 1,050 1.72 400 724 8,050 11.1Combinatorial optimization 336 10.0 400 1,345 9,120 6.8Go game 1,658 1.09 400 721 10,490 14.6Search gene sequence 2,783 0.80 400 890 9,330 10.5Chess game 2,176 0.96 400 837 12,100 14.5Quantum computer simulation 1,623 1.61 400 1,047 20,720 19.8Video compression 3,102 0.80 400 993 22,130 22.3Discrete event simulation library 587 2.94 400 690 6,250 9.1Games/path finding 1,082 1.79 400 773 7,020 9.1XML parsing 1,058 2.70 400 1,143 6,900 6.016
17
SummarizingPerformance…
Clickers:Whichsystemisfaster?
System Rate(Task1) Rate(Task2)
A 10 20
B 20 10
A:SystemAB:SystemBC:SameperformanceD:Unanswerablequestion!
18
… DependsWho’sSellingSystem Rate(Task1) Rate(Task2)
A 10 20
B 20 10
Average
15
15Averagethroughput
System Rate(Task1) Rate(Task2)
A 0.50 2.00
B 1.00 1.00
Average
1.25
1.00ThroughputrelativetoB
System Rate(Task1) Rate(Task2)
A 1.00 1.00
B 2.00 0.50
Average
1.00
1.25ThroughputrelativetoA
SummarizingSPECPerformance
• Variesfrom6xto22xfasterthanreferencecomputer
• Geometricmeanofratios:N-th rootofproductofNratios– GeometricMeangivessamerelativeanswernomatterwhatcomputerisusedasreference
• GeometricMeanforBarcelonais11.7
19
Administrivia
• Midterm2intheeveningnextMonday• Project2.1gradesin
20
BigIdea:Amdahl’s(Heartbreaking)Law• SpeedupduetoenhancementEis
Speedupw/E= ----------------------Exectimew/oEExectimew/E
• SupposethatenhancementEacceleratesafractionF(F<1)ofthetaskbyafactorS(S>1)andtheremainderofthetaskisunaffected
ExecutionTimew/E=
Speedupw/E=21
ExecutionTimew/oE´[(1-F)+F/S]
1/[(1-F)+F/S]
BigIdea:Amdahl’sLaw
22
Speedup=1(1- F)+F
SNon-speed-uppart Speed-uppart
10.5+0.5
2
10.5+0.25
= = 1.33
Example:theexecutiontimeofhalfoftheprogramcanbeacceleratedbyafactorof2.Whatistheprogramspeed-upoverall?
Example#1:Amdahl’sLaw
• Consideranenhancementwhichruns20timesfasterbutwhichisonlyusable25%ofthetime
Speedupw/E=1/(.75+.25/20)=1.31
• Whatifitsusableonly15%ofthetime?Speedupw/E=1/(.85+.15/20)=1.17
• Amdahl’sLawtellsusthattoachievelinearspeedupwith100processors,noneoftheoriginalcomputationcanbescalar!
• Togetaspeedupof90from100processors,thepercentageoftheoriginalprogramthatcouldbescalarwouldhavetobe0.1%orless
Speedupw/E=1/(.001+.999/100)=90.9923
Speedupw/E=1/ [(1-F)+F/S]
24
Iftheportionoftheprogramthatcanbeparallelizedissmall,thenthespeedupislimited
Thenon-parallelportionlimitstheperformance
StrongandWeakScaling• Togetgoodspeeduponaparallelprocessorwhilekeepingtheproblemsizefixedisharderthangettinggoodspeedupbyincreasingthesizeoftheproblem.– Strongscaling:whenspeedupcanbeachievedonaparallelprocessorwithoutincreasingthesizeoftheproblem
– Weakscaling:whenspeedupisachievedonaparallelprocessorbyincreasingthesizeoftheproblemproportionallytotheincreaseinthenumberofprocessors
• Loadbalancingisanotherimportantfactor:everyprocessordoingsameamountofwork– Justoneunitwithtwicetheloadofotherscutsspeedupalmostinhalf
25
Clickers/PeerInstruction
26
Supposeaprogramspends80%ofitstimeinasquarerootroutine.Howmuchmustyouspeedupsquareroottomaketheprogramrun5timesfaster?
A:5B:16C:20D:100E:Noneoftheabove
Speedupw/E=1/ [(1-F)+F/S]
AndInConclusion,…
27
• Time(seconds/program)ismeasureofperformanceInstructions Clockcycles SecondsProgram Instruction ClockCycle
• Floating-pointrepresentationsholdapproximationsofrealnumbersinafinitenumberofbits
××=