42
Power and Energy Normalized Speedup Models for Heterogeneous Many Core Computing Mohammed A. N. Al-hayanni 1 , Ashur Rafiev 2 , Rishad Shafik 1 , Fei Xia 1 School of EEE 1 and CS 2 , Newcastle University Newcastle Upon Tyne, NE1 7RU, UK ACSD 2016

Power and Energy Normalized Speedup Models for ...async.org.uk/presentations/ACSD_34.pdf · Gustafson’s Law • Workload scales with compuHng faciliHes – (50% parallelizable P=0.5)

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

  • Power and Energy Normalized Speedup Models for Heterogeneous Many Core Computing

    MohammedA.N.Al-hayanni1,AshurRafiev2,RishadShafik1,FeiXia1

    SchoolofEEE1andCS2,NewcastleUniversityNewcastleUponTyne,NE17RU,UK

    ACSD 2016

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    2

  • Amdahl’s Law •  Fixedworkload

    –  (50%parallelizableP=0.5)–  OnasequenHalprocessor(singlecore)takes1unitofHmetocomplete

    3

  • Amdahl’s Law •  Withtwocores…

    –  Parallelizablepartisdistributedbetweenthetwocores–  TotalHme0.75–  Speedup=1/0.75=1.333

    4

  • Amdahl’s Law •  Withthreecores…

    –  Speedup=1/0.667=1.5

    5

  • Amdahl’s Law •  Withtencores…

    –  Speedup=1.82!" ! = !(1)!(!) =

    11− ! + !!

    6

  • Amdahl’s Law •  With∞cores…

    –  Speedup=2!" ! = !(1)!(!) =

    11− ! + !!

    !" ∞ = 11− !

    7

  • Amdahl’s Law

    !" ! = !(1)!(!) =1

    1− ! + !!

    !" ∞ = 11− ! 8

  • Gustafson’s Law •  WorkloadscaleswithcompuHngfaciliHes

    –  (50%parallelizableP=0.5)–  OnasequenHalprocessor(singlecore)takes1unitofHmetocompleteworkloadW(1)designedwithasinglecoreinmind

    OpenCLhostoperaHon OCLKernelsfor480x270pixels

    9

  • Gustafson’s Law •  Withfourcores…

    –  Complete960x540imageinthesameHme–  0.5x5=2.5Hmestheworkload(speedup=2.5)

    OpenCLhostoperaHon OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    10

  • OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels Gustafson’s Law •  With16cores…

    –  CompleteFHDimageinthesameHme–  Speedup=0.5x17=8.5

    OpenCLhostoperaHon OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    OCLKernelsfor480x270pixels

    11

  • Gustafson’s Law •  SpeedupistheraHobetweentheimprovedworkloadand

    theworkloadbeforeimprovement

    –  CalculatedatfixedHme!(1) = 1− ! ! + !!!(!) = 1− ! ! + !!!!" ! =!(!)!(1) = 1− ! + !"

    12

  • Gustafson’s Law

    !(1) = 1− ! ! + !!!(!) = 1− ! ! + !!!!" ! =!(!)!(1) = 1− ! + !"

    13

  • Sun-Ni Law •  Memory-boundspeedupmodel

    –  Parallelworkloadpercorerestrictedbymemorystructure(mulH-levelcaches,sharedmemory/interfaces,etc.)

    –  Onecore’sworkloadcapabilityrestrictedbyM–thememoryofonecore,Ncores’workloadcapabilityrestrictedbyNxM

    –  ForthePpart: !(1) = !(!)! ! = ! !×! = !(!×!!! ! 1 )

    14

  • Sun-Ni Law •  MulH-corespeedupisderivedthusly:

    ! ! = 1− ! ! 1 + !×!(!×!!! ! 1 )! 1 = 1− ! ! 1 + !×!(!)! 1 = 1− ! ! 1 + !×!(!×!

    !! ! 1 )!

    !" ! =!(!)!(1) =

    1− ! !(1)+ !×!(!×!!! ! 1 )

    1− ! !(1)+ !×!(!×!!! ! 1 )!

    15

  • Sun-Ni Law •  TryingtoremoveW(1):

    !" ! ! = !!! !"#ℎ !"#$%&"' ! !"# !

    ! !×! = !!×!"! = !!×! ! , !ℎ!"! !×!!! ! 1 = !!×!(!!! ! 1 = !!×!(1)!" ! = 1− ! + !×!(!)

    1− ! + !×!(!)!,!"#ℎ ! ! = !!

    16

    !" ! =!(!)!(1) =1− ! !(1)+ !×!(!×!!! ! 1 )

    1− ! !(1)+ !×!(!×!!! ! 1 )!

  • Sun-Ni Law •  Dependingong(N)

    –  Sub-linearscaling(Amdahl’sifg(N)=1)–  Linearscaling(Gustafson’sifg(N)=N)–  Super-linearscaling(ifg(N)>N)

    •  Ifyouhadmorememorythancores,andtheproblemismemory-bound,youcanscaletohigherspeedupthanwhatyourcoresallowforcompute-boundproblems

    17

  • Comparing the three models

    18

    0

    S(k)

    k

    Speed-up, k/ 1

    5 10 15

    1

    5

    10

    15

    20

    20

    1

    p = 1

    p = 0

    p = 0.95

    p = 0.9

    p = 0.85p = 0.8

    ...

    Amdahl's Law

    1

    S(k) = 1(1 – p) + pk

    2S(k) = (1 – p) + pk

    3S(k) =

    (1 – p) + pg(k)(1 – p) + pg(k)k

    Gustafson's Model Sun and Ni's Model0 k5 10 151

    5

    10

    15

    20

    20

    1

    p = 1

    p = 0

    p = 0.95p = 0.9p = 0.85

    p = 0.5

    ...

    p = 0.15p = 0.1p = 0.05

    ...

    S(k)Speed-up, k/ 1

    0 k5 10 151

    5

    10

    15

    20

    20

    1

    p = 1

    p = 0

    p = 0.15

    p = 0.1

    p = 0.05

    p = 0.2

    ...

    g(k) = k3/2

    p = 0.25

    S(k)Speed-up, k/ 1

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    19

  • Motivation •  Extendtoahigherdegreeofcoreheterogeneity•  Extendtopower/energy/efficiencyandcovermodeslike

    dynamicvoltageandfrequencyscaling(DVFS)

    •  PotenHalapplicaHonsinrun-Hmemanagementsystemsofparallelsystems

    20

  • Existing Speedup Models and the Extended Model

    Homogeneity Heterogeneity Power Amdahl Gustafson SunandNiAmdahl Yes No No Yes No No

    Gustafson Yes No No Yes Yes No

    SunandNi Yes No No Yes Yes Yes

    Hill-Marty Yes Simple No Yes No No

    HaoandXie Yes Simple No Yes Yes Yes

    WooandLee Yes Simple Yes Yes No No

    SunandChen Yes No No Yes Yes Yes

    ExtendedModel

    Yes Normal Yes Yes Yes Yes

    21

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    22

  • Heterogeneity •  ExisHngheterogeneity

    include‘asymmetric’and‘dynamic’structures(b)L

    •  WeextendtocoverthenormalformofcoreheterogeneityJ

    •  SHlliso-ISAandnotfullygeneralL

    23

  • Heterogeneity •  ParallelcomputaHonmaynotallfinishtogether(Amdahl’s)

    24

  • !! = min! ∙ !!!

    !!!

    ! = !! ,!! ,… ,!!

    BCE performance equivalence •  CalculaHngtheperformanceequivalentnumberofBCEs

    –  Basedontheslowest(lasttofinish)core

    max!,avr!,opt! alsopossible

    25

  • Speedup extension Amdahl’s •  CalculaHngtheperformanceequivalentnumberofBCEs

    –  Basedontheslowest(lasttofinish)core

    !! = min! ∙ !!!

    !!!

    !" ! = 11− !

    !! +!!!

    , (!"#$ℎ!!!)

    26

  • SequenHalonfastestcore,ifαXisfastest

    Speedup extension Amdahl’s •  CalculaHngtheperformanceequivalentnumberofBCEs

    –  Basedontheslowest(lasttofinish)core

    !! = min! ∙ !!!

    !!!

    !" ! = 11− !

    !! +!!!

    , (!"#$ℎ!!!)

    Parallelsyncedtotheslowest,incaseofminα !" ! = !(1)!(!) =

    11− ! + !!

    Nowinvectorspace

    27

  • Speedup extension Gustafson’s •  Gustafson’sspeedupmodelextension

    –  AgainassumingsequenHalonatypeXcoreandparallelonNα

    !" ! = 1− ! + !"

    SequenHalonfastestcore,ifαXisfastest

    Parallelsyncedtotheslowest,incaseofminα

    !" ! = 1− ! !! + !!!

    28

  • Speedup extension Sun-Ni’s •  ExtendingSun-Ni’smodel

    –  AgainassumingsequenHalonatypeXcoreandparallelonNα

    SequenHalonfastestcore,ifαXisfastest

    Parallelsyncedtotheslowest,incaseofminα !" ! = 1− ! + !!(!)

    1− ! + !!(!)!

    MemoryboundfuncHonforallcores

    !" ! = 1− ! + !"(!)1− ! !! + !"(!)!!

    29 ReducestoextendedAmdahl’sandGustafson’sasexpected

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    30

  • Power •  DividepowerintoeffecHveandidle

    !!"!#$ =!(!)+!!"#$

    EffecHvepower:powerusedbyworkload

    IdlepowerincludesbothstaHcpowerandacHvepowerthat’snotusedby

    workload

    whenNiBCEsareidle31

    !(!) =!!!! ! +!!!!(!)!! ! + !!(!)

    !! = !!!! ,!! is the power of one !"#

    !! =!! !!!!!

    !!!= !!!!

    !!"#$ = !! ∙!!

  • Effective power

    32

    •  Theβsaresimilartotheαs,butpertaintopower–  Anith-typecoreconsumesβiW1powerandhasαispeed(speedofone

    coreis1–forthroughput,wedealwithspeedup,forpower,wedealwithwaqageandnotraHos)

    –  Nβisthepower-equivalentnumberofBCEs–  Forsynchronizingontheslowestcore:

    !! = min! ∙!!!!!!

    !

    !!!

  • Effective power

    33

    •  EffecHvepowerformulashavebeenderivedforallthreetypesofmodels/laws

    –  Canbeviewedasresultsof‘powerscaling’withPSfuncHons:

    !(!) = !" ! ∙ !"(!) ∙!!

    With!! = !! = 1,∀!,and! = !allmodelstransformtohomogeneousforms

  • Efficiency

    34

    •  Power-normalizedperformance–  IPS/Waq

    •  EnergyperinstrucHon–  Joules/InstrucHon

    •  Energy-normalizedperformance–  IPS/Joule

    •  Allmodelsinthepaper

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    35

  • Experimental platform

    36

  • System characterization

    37

    •  BuildparametersthroughexperimentaHon–  WA7,WA15,Widle(forthe‘whole’–didnottrydifferenHaHngdifferent

    Wi–coresnotturnedoffeveninNA7=0andNA15=0cases)–  αA7,αA15,βA7,βA15–  Maybedifferentfordifferentapps/computaHons–  CPU-heavytasksmainlyexperimentedinthisiniHalstudy,minα

    scheduling

    –  log,sqrt,andintegerarithmeHctested

  • Exploration with models

    38

    •  RunmodelswithvariousexecuHonscenariostoinvesHgatetheeffectsofP,DVFS,corescaling,etc.(largedatabaseavailablefromtechnicalreport,someexampledatainthepaper)

    hqp://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2016-198.pdf

  • Exploration with models

    39

    •  RunmodelswithvariousexecuHonscenariostoinvesHgatetheeffectsofP,DVFS,corescaling,etc.(largedatabaseavailablefromtechnicalreport,someexampledatainthepaper)

    hqp://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2016-198.pdf

    P=0.9 P=0.1

  • Cross-validation

    40

    •  Amdahl’swithprogrammer-controlledPvalues

    Max0.24%

    Max2.17%

  • Outline •  ExisHngspeedupmodels•  MoHvaHon•  Extendedheterogeneousspeedupmodels•  PowerconsumpHonmodels•  Powerandenergynormalizedspeedup•  ExperimentalresultsandcrossvalidaHon•  Conclusions

    41

  • Conclusions

    42

    •  ExtendedpopularparallelizaHonspeedupmodelstocoverawiderrangeofiso-ISA(orcoefficient-equivalentISA)heterogeneity

    •  Extendedpowermodelsandefficiencymodels•  Firstcross-validaHonstudysuccessful•  NeedtoinvesHgateevenwiderscopesofheterogeneity•  Needtostudyotherspeedupmodels(e.g.Downey’s)•  NeedtoinvesHgaterealisHcmemorysubsystems(cachemisses)•  Needtoexploreusingthesemodelsforrun-Hmemanagement