Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Power and Energy Normalized Speedup Models for Heterogeneous Many Core Computing
MohammedA.N.Al-hayanni1,AshurRafiev2,RishadShafik1,FeiXia1
SchoolofEEE1andCS2,NewcastleUniversityNewcastleUponTyne,NE17RU,UK
ACSD 2016
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
2
Amdahl’s Law • Fixedworkload
– (50%parallelizableP=0.5)– OnasequenHalprocessor(singlecore)takes1unitofHmetocomplete
3
Amdahl’s Law • Withtwocores…
– Parallelizablepartisdistributedbetweenthetwocores– TotalHme0.75– Speedup=1/0.75=1.333
4
Amdahl’s Law • Withthreecores…
– Speedup=1/0.667=1.5
5
Amdahl’s Law • Withtencores…
– Speedup=1.82!" ! = !(1)!(!) =
11− ! + !!
6
Amdahl’s Law • With∞cores…
– Speedup=2!" ! = !(1)!(!) =
11− ! + !!
!" ∞ = 11− !
7
Amdahl’s Law
!" ! = !(1)!(!) =1
1− ! + !!
!" ∞ = 11− ! 8
Gustafson’s Law • WorkloadscaleswithcompuHngfaciliHes
– (50%parallelizableP=0.5)– OnasequenHalprocessor(singlecore)takes1unitofHmetocompleteworkloadW(1)designedwithasinglecoreinmind
OpenCLhostoperaHon OCLKernelsfor480x270pixels
9
Gustafson’s Law • Withfourcores…
– Complete960x540imageinthesameHme– 0.5x5=2.5Hmestheworkload(speedup=2.5)
OpenCLhostoperaHon OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
10
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels Gustafson’s Law • With16cores…
– CompleteFHDimageinthesameHme– Speedup=0.5x17=8.5
OpenCLhostoperaHon OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
OCLKernelsfor480x270pixels
11
Gustafson’s Law • SpeedupistheraHobetweentheimprovedworkloadand
theworkloadbeforeimprovement
– CalculatedatfixedHme!(1) = 1− ! ! + !!!(!) = 1− ! ! + !!!!" ! =!(!)!(1) = 1− ! + !"
12
Gustafson’s Law
!(1) = 1− ! ! + !!!(!) = 1− ! ! + !!!!" ! =!(!)!(1) = 1− ! + !"
13
Sun-Ni Law • Memory-boundspeedupmodel
– Parallelworkloadpercorerestrictedbymemorystructure(mulH-levelcaches,sharedmemory/interfaces,etc.)
– Onecore’sworkloadcapabilityrestrictedbyM–thememoryofonecore,Ncores’workloadcapabilityrestrictedbyNxM
– ForthePpart: !(1) = !(!)! ! = ! !×! = !(!×!!! ! 1 )
14
Sun-Ni Law • MulH-corespeedupisderivedthusly:
! ! = 1− ! ! 1 + !×!(!×!!! ! 1 )! 1 = 1− ! ! 1 + !×!(!)! 1 = 1− ! ! 1 + !×!(!×!
!! ! 1 )!
!" ! =!(!)!(1) =
1− ! !(1)+ !×!(!×!!! ! 1 )
1− ! !(1)+ !×!(!×!!! ! 1 )!
15
Sun-Ni Law • TryingtoremoveW(1):
!" ! ! = !!! !"#ℎ !"#$%&"' ! !"# !
! !×! = !!×!"! = !!×! ! , !ℎ!"! !×!!! ! 1 = !!×!(!!! ! 1 = !!×!(1)!" ! = 1− ! + !×!(!)
1− ! + !×!(!)!,!"#ℎ ! ! = !!
16
!" ! =!(!)!(1) =1− ! !(1)+ !×!(!×!!! ! 1 )
1− ! !(1)+ !×!(!×!!! ! 1 )!
Sun-Ni Law • Dependingong(N)
– Sub-linearscaling(Amdahl’sifg(N)=1)– Linearscaling(Gustafson’sifg(N)=N)– Super-linearscaling(ifg(N)>N)
• Ifyouhadmorememorythancores,andtheproblemismemory-bound,youcanscaletohigherspeedupthanwhatyourcoresallowforcompute-boundproblems
17
Comparing the three models
18
0
S(k)
k
Speed-up, k/ 1
5 10 15
1
5
10
15
20
20
1
p = 1
p = 0
p = 0.95
p = 0.9
p = 0.85p = 0.8
...
Amdahl's Law
1
S(k) = 1(1 – p) + pk
2S(k) = (1 – p) + pk
3S(k) =
(1 – p) + pg(k)(1 – p) + pg(k)k
Gustafson's Model Sun and Ni's Model0 k5 10 151
5
10
15
20
20
1
p = 1
p = 0
p = 0.95p = 0.9p = 0.85
p = 0.5
...
p = 0.15p = 0.1p = 0.05
...
S(k)Speed-up, k/ 1
0 k5 10 151
5
10
15
20
20
1
p = 1
p = 0
p = 0.15
p = 0.1
p = 0.05
p = 0.2
...
g(k) = k3/2
p = 0.25
S(k)Speed-up, k/ 1
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
19
Motivation • Extendtoahigherdegreeofcoreheterogeneity• Extendtopower/energy/efficiencyandcovermodeslike
dynamicvoltageandfrequencyscaling(DVFS)
• PotenHalapplicaHonsinrun-Hmemanagementsystemsofparallelsystems
20
Existing Speedup Models and the Extended Model
Homogeneity Heterogeneity Power Amdahl Gustafson SunandNiAmdahl Yes No No Yes No No
Gustafson Yes No No Yes Yes No
SunandNi Yes No No Yes Yes Yes
Hill-Marty Yes Simple No Yes No No
HaoandXie Yes Simple No Yes Yes Yes
WooandLee Yes Simple Yes Yes No No
SunandChen Yes No No Yes Yes Yes
ExtendedModel
Yes Normal Yes Yes Yes Yes
21
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
22
Heterogeneity • ExisHngheterogeneity
include‘asymmetric’and‘dynamic’structures(b)L
• WeextendtocoverthenormalformofcoreheterogeneityJ
• SHlliso-ISAandnotfullygeneralL
23
Heterogeneity • ParallelcomputaHonmaynotallfinishtogether(Amdahl’s)
24
!! = min! ∙ !!!
!!!
! = !! ,!! ,… ,!!
BCE performance equivalence • CalculaHngtheperformanceequivalentnumberofBCEs
– Basedontheslowest(lasttofinish)core
max!,avr!,opt! alsopossible
25
Speedup extension Amdahl’s • CalculaHngtheperformanceequivalentnumberofBCEs
– Basedontheslowest(lasttofinish)core
!! = min! ∙ !!!
!!!
!" ! = 11− !
!! +!!!
, (!"#$ℎ!!!)
26
SequenHalonfastestcore,ifαXisfastest
Speedup extension Amdahl’s • CalculaHngtheperformanceequivalentnumberofBCEs
– Basedontheslowest(lasttofinish)core
!! = min! ∙ !!!
!!!
!" ! = 11− !
!! +!!!
, (!"#$ℎ!!!)
Parallelsyncedtotheslowest,incaseofminα !" ! = !(1)!(!) =
11− ! + !!
Nowinvectorspace
27
Speedup extension Gustafson’s • Gustafson’sspeedupmodelextension
– AgainassumingsequenHalonatypeXcoreandparallelonNα
!" ! = 1− ! + !"
SequenHalonfastestcore,ifαXisfastest
Parallelsyncedtotheslowest,incaseofminα
!" ! = 1− ! !! + !!!
28
Speedup extension Sun-Ni’s • ExtendingSun-Ni’smodel
– AgainassumingsequenHalonatypeXcoreandparallelonNα
SequenHalonfastestcore,ifαXisfastest
Parallelsyncedtotheslowest,incaseofminα !" ! = 1− ! + !!(!)
1− ! + !!(!)!
MemoryboundfuncHonforallcores
!" ! = 1− ! + !"(!)1− ! !! + !"(!)!!
29 ReducestoextendedAmdahl’sandGustafson’sasexpected
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
30
Power • DividepowerintoeffecHveandidle
!!"!#$ =!(!)+!!"#$
EffecHvepower:powerusedbyworkload
IdlepowerincludesbothstaHcpowerandacHvepowerthat’snotusedby
workload
whenNiBCEsareidle31
!(!) =!!!! ! +!!!!(!)!! ! + !!(!)
!! = !!!! ,!! is the power of one !"#
!! =!! !!!!!
!!!= !!!!
!!"#$ = !! ∙!!
Effective power
32
• Theβsaresimilartotheαs,butpertaintopower– Anith-typecoreconsumesβiW1powerandhasαispeed(speedofone
coreis1–forthroughput,wedealwithspeedup,forpower,wedealwithwaqageandnotraHos)
– Nβisthepower-equivalentnumberofBCEs– Forsynchronizingontheslowestcore:
!! = min! ∙!!!!!!
!
!!!
Effective power
33
• EffecHvepowerformulashavebeenderivedforallthreetypesofmodels/laws
– Canbeviewedasresultsof‘powerscaling’withPSfuncHons:
!(!) = !" ! ∙ !"(!) ∙!!
With!! = !! = 1,∀!,and! = !allmodelstransformtohomogeneousforms
Efficiency
34
• Power-normalizedperformance– IPS/Waq
• EnergyperinstrucHon– Joules/InstrucHon
• Energy-normalizedperformance– IPS/Joule
• Allmodelsinthepaper
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
35
Experimental platform
36
System characterization
37
• BuildparametersthroughexperimentaHon– WA7,WA15,Widle(forthe‘whole’–didnottrydifferenHaHngdifferent
Wi–coresnotturnedoffeveninNA7=0andNA15=0cases)– αA7,αA15,βA7,βA15– Maybedifferentfordifferentapps/computaHons– CPU-heavytasksmainlyexperimentedinthisiniHalstudy,minα
scheduling
– log,sqrt,andintegerarithmeHctested
Exploration with models
38
• RunmodelswithvariousexecuHonscenariostoinvesHgatetheeffectsofP,DVFS,corescaling,etc.(largedatabaseavailablefromtechnicalreport,someexampledatainthepaper)
hqp://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2016-198.pdf
Exploration with models
39
• RunmodelswithvariousexecuHonscenariostoinvesHgatetheeffectsofP,DVFS,corescaling,etc.(largedatabaseavailablefromtechnicalreport,someexampledatainthepaper)
hqp://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2016-198.pdf
P=0.9 P=0.1
Cross-validation
40
• Amdahl’swithprogrammer-controlledPvalues
Max0.24%
Max2.17%
Outline • ExisHngspeedupmodels• MoHvaHon• Extendedheterogeneousspeedupmodels• PowerconsumpHonmodels• Powerandenergynormalizedspeedup• ExperimentalresultsandcrossvalidaHon• Conclusions
41
Conclusions
42
• ExtendedpopularparallelizaHonspeedupmodelstocoverawiderrangeofiso-ISA(orcoefficient-equivalentISA)heterogeneity
• Extendedpowermodelsandefficiencymodels• Firstcross-validaHonstudysuccessful• NeedtoinvesHgateevenwiderscopesofheterogeneity• Needtostudyotherspeedupmodels(e.g.Downey’s)• NeedtoinvesHgaterealisHcmemorysubsystems(cachemisses)• Needtoexploreusingthesemodelsforrun-Hmemanagement