Reliability and Performance Trade-off Study of Heterogeneous Memories

Manish Gupta Ψ, David Roberts*, Mitesh Meswani*, Vilas Sridharan*, Dean Tullsen Ψ, Rajesh Gupta Ψ

MEMSYS 16 | Session 7


Outline

•  Heterogeneous Memory Architectures (HMAs)
•  Reliability Assessment of Memory Systems
•  System Configuration
•  Evaluation Methodology
•  Discussion
•  Summary

Heterogeneous Memory Architecture

•  Heterogeneous Memory Architectures consist of multiple memory modules.
    –  For example, the slide's figure shows an HMA system with die-stacked DRAM and off-package DDRx memory.

•  Most research on heterogeneous memories presents only the performance trade-offs of placing data in one memory over the other. However, reliability is also becoming important, especially for high-performance computing and data centers.

•  In this work, we present a study of the reliability vs. performance trade-off of an HMA system.

Reliability Assessment of Memory Systems

•  Faults are the underlying cause of a failure.
    –  Faults can be permanent.
        •  For example: a consistently wrong value returned from memory due to a hardware fault (stuck-at bit).
    –  Faults can be transient.
        •  For example: soft errors due to single-event upsets or temperature-dependent faults.

•  Errors are the manifestation of faults.
    –  Errors can be detected and/or corrected by error correcting codes (ECC).

•  We focus on observable or unmasked faults that manifest as uncorrectable errors.
    –  In particular, we examine the rate of uncorrectable errors.
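To make the distinction between corrected and uncorrectable errors concrete, here is a minimal sketch of what a SEC-DED code (the ECC listed for the M1 memory in this study) guarantees per codeword. The function name and the outcome labels are illustrative assumptions, not part of the talk or of FaultSim.

```python
def secded_outcome(bit_errors: int) -> str:
    """Classify a codeword by the number of flipped bits under SEC-DED.

    SEC-DED = single-error correction, double-error detection.
    The labels returned here are hypothetical names for illustration.
    """
    if bit_errors < 0:
        raise ValueError("bit_errors must be non-negative")
    if bit_errors == 0:
        return "no error"
    if bit_errors == 1:
        return "corrected"                # single-bit error is repaired
    if bit_errors == 2:
        return "detected, uncorrectable"  # flagged, but data is not recovered
    # Three or more flips fall outside what SEC-DED guarantees:
    # the error may be miscorrected or missed entirely.
    return "outside guaranteed coverage"


if __name__ == "__main__":
    for n in range(4):
        print(n, "->", secded_outcome(n))
```

Only the last two outcomes count toward the uncorrectable-error rate that the study examines.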

System Configuration

•  Processor: 16 cores, L1/L2 caches
•  M1 memory: HBM [1] (logic die + die-stacked memory), ECC: SEC-DED
•  M2 memory: DDR3 (conventional off-chip memory), ECC: ChipKill

[1] HBM v1


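$$\text{AccessRate} = \frac{\text{Number of M1 accesses}}{\text{Number of M1 accesses} + \text{Number of M2 accesses}}$$

$$FP_{\text{HMA}}[\text{time}] = \text{AccessRate} \times FP_{\text{M1}}[\text{time}] + (1 - \text{AccessRate}) \times FP_{\text{M2}}[\text{time}]$$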
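The two formulas above can be sketched directly in code. This is a minimal illustration: the function names and the numeric inputs in the usage example are hypothetical and are not values from the study.

```python
def access_rate(m1_accesses: int, m2_accesses: int) -> float:
    """Fraction of all memory accesses that are served by M1."""
    total = m1_accesses + m2_accesses
    if total == 0:
        raise ValueError("no memory accesses recorded")
    return m1_accesses / total


def fp_hma(rate: float, fp_m1: float, fp_m2: float) -> float:
    """Failure probability of the HMA at one point in time.

    Weighted sum of the per-memory failure probabilities,
    using the access rate as the weight for M1.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("access rate must lie in [0, 1]")
    return rate * fp_m1 + (1.0 - rate) * fp_m2


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    rate = access_rate(m1_accesses=750, m2_accesses=250)
    print("AccessRate:", rate)
    print("FP_HMA:", fp_hma(rate, fp_m1=1e-3, fp_m2=1e-6))
```

Because the result is a convex combination, it always lies between the two per-memory failure probabilities, so steering more accesses to the less reliable memory raises the HMA failure probability.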


Evaluation Methodology

Tools Used

•  FaultSim [1] for fault and error correction scheme simulations.
    –  FaultSim is an event-based memory fault simulator.
    –  FaultSim assumes every fault results in an error.

•  Ramulator [2] for performance simulations.
    –  Ramulator is a DRAM simulator providing performance models for different memory standards such as DDR3/4, LPDDR3/4, GDDR5, and HBM.

[1] FaultSim: https://github.com/Prashant-GTech/FaultSim-A-Memory-Reliability-Simulator, David Roberts and Prashant Nair.

[2] Ramulator: https://github.com/CMU-SAFARI/ramulator, Kim et al.

Workloads Used: Three representative sample workloads were selected.

Workload (x16)   Memory Footprint   MPKI (Misses Per Kilo Instructions)
mcf              16.02 GB           65.03 (High Bandwidth)
astar            2.63 GB            16.70 (Medium Bandwidth)
cactus           2.31 GB            3.70 (Low Bandwidth)

Discussion: Failure Probability of HMAs

[Figure: failure probability of the HMA, plotted on the z-axis.]

A performance-focused access rate control can deteriorate the failure probability of an HMA system by 1000x relative to the initial year.

The next question we must answer in this study is how the increase in access rate affects the performance of different workloads.

Discussion: Performance vs. Reliability Trade-off

•  Divide the entire memory footprint into 4 KB pages.

•  Randomly allocate a percentage of the 4 KB pages to M1 memory.
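The two steps above can be sketched as follows. This is an illustrative reading of the placement experiment, not the authors' tooling: the function name, the fixed seed, and the assumption that all remaining pages go to M2 are choices made for this example.

```python
import random

PAGE_SIZE = 4 * 1024  # 4 KB pages, as stated in the slides


def allocate_pages(footprint_bytes: int, m1_fraction: float, seed: int = 0) -> list:
    """Return a per-page placement list of "M1"/"M2" labels.

    A randomly chosen m1_fraction of the pages is placed in M1;
    every other page is assumed to be placed in M2.
    """
    if not 0.0 <= m1_fraction <= 1.0:
        raise ValueError("m1_fraction must lie in [0, 1]")
    num_pages = -(-footprint_bytes // PAGE_SIZE)  # ceiling division
    num_m1 = round(num_pages * m1_fraction)
    rng = random.Random(seed)  # seeded so the placement is reproducible
    m1_pages = set(rng.sample(range(num_pages), num_m1))
    return ["M1" if page in m1_pages else "M2" for page in range(num_pages)]


if __name__ == "__main__":
    # Hypothetical small footprint (1 MiB) to keep the example fast.
    placement = allocate_pages(footprint_bytes=1024 * 1024, m1_fraction=0.25)
    print(len(placement), "pages,", placement.count("M1"), "in M1")
```

Sweeping `m1_fraction` from 0 to 1 is one way to vary how much of the footprint lives in M1 while keeping the choice of pages unbiased.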

Example workload (x16): cactus, memory footprint 2.31 GB.

[Figure: the memory footprint divided into 4 KB pages, each page placed in either M1 or M2.]

[Figures: performance vs. reliability trade-off plots. Surviving panel labels: Week 0, Week 0-180.]

1.  Different workloads differ in how sensitive their performance is to the access rate.

2.  Low-bandwidth workloads such as cactus could gain more in reliability by reducing the access rate, without losing much performance, than high-bandwidth workloads such as mcf, in the mid and final weeks.

3.  This suggests a potential benefit from an access rate control system that tunes the access rate for different workloads as the HMA system ages.

Discussion: Aging-aware Access Rate Control

•  The workload starts operating at its peak operating point.

•  The system then hits a limiting failure probability in week T1.

•  The week when IPC degradation is below 10% is noted as T2.

[Figure: timeline marking T1, T2, and the final week, with IPC degradation.]

Workload (x16)              T1    T2    IPC Degradation
cactus (low bandwidth)      108   187   23.5%
astar (medium bandwidth)    120   174   29.6%
mcf (high bandwidth)        132   176   34.1%
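The slides do not spell out the control policy, so the following is only a sketch of one plausible aging-aware rule: as the per-memory failure probabilities grow with age, pick the largest access rate that keeps the HMA failure probability (AccessRate x FP_M1 + (1 - AccessRate) x FP_M2) at or below a limit. The function name, the limit, and the weekly values in the usage example are all hypothetical.

```python
from typing import Optional


def max_access_rate(fp_m1: float, fp_m2: float, fp_limit: float) -> Optional[float]:
    """Largest access rate in [0, 1] that keeps the HMA within fp_limit.

    Solves rate * fp_m1 + (1 - rate) * fp_m2 <= fp_limit for rate.
    Returns None when no access rate can meet the limit.
    """
    if fp_m1 <= fp_limit:
        return 1.0   # even sending every access to M1 stays within the limit
    if fp_m2 > fp_limit:
        return None  # both memories exceed the limit, so every mix does too
    # Here fp_m1 > fp_limit >= fp_m2, so the bound is well defined.
    return (fp_limit - fp_m2) / (fp_m1 - fp_m2)


if __name__ == "__main__":
    # Hypothetical aging trend: M1 degrades faster than M2.
    weekly_fp = [(0.01, 0.001), (0.05, 0.002), (0.20, 0.004)]
    for week, (fp_m1, fp_m2) in enumerate(weekly_fp):
        print("week", week, "-> access rate", max_access_rate(fp_m1, fp_m2, fp_limit=0.05))
```

Under this rule the access rate stays at its peak until the limit is first reached, which corresponds to week T1 in the slides, and is lowered afterwards at the cost of some IPC.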

Summary

The failure probability of an HMA system depends on memory type, error correcting codes, and age.

Because the change in failure probability depends on the workloads and the HMA's age, the access rate control has to be dynamic.

Dynamic access rate control:
- Enables sustained and reliable operation of HMAs.
- Reduces the need to replace memory (and the die for the die-stacked memory) as the rate of uncorrectable errors increases.

Thanks

More about the speaker: http://cseweb.ucsd.edu/~m7gupta/

Contact: [email protected]

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTIONS

© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.