Reliability and Performance Trade-off Study of Heterogeneous Memories
Manish GuptaΨ, David Roberts*, Mitesh Meswani*, Vilas Sridharan*, Dean TullsenΨ, Rajesh GuptaΨ
MEMSYS 16 | Session 7
*Ψ
Outline
• Heterogeneous Memory Architectures (HMAs)
• Reliability Assessment of Memory Systems
• System Configuration
• Evaluation Methodology
• Discussion
• Summary
Heterogeneous Memory Architecture
• Heterogeneous Memory Architectures consist of multiple memory modules.
  – For example: the figure on this slide shows an HMA system with die-stacked DRAM and off-package DDRx memory.
• Most of the research on heterogeneous memories presents only the performance trade-offs of placing data in one memory over the other. However, reliability is also becoming important, especially for high-performance computing and data centers.
• In this work, we present a study of the reliability vs. performance trade-off of an HMA system.
Reliability Assessment of Memory Systems
• Faults are the underlying cause of a failure.
  – Faults can be permanent.
    • For example: a consistently wrong value returned from memory due to a hardware fault (stuck-at bit).
  – Faults can be transient.
    • For example: soft errors due to single-event upsets or temperature-dependent faults.
• Errors are the manifestation of faults.
  – Errors can be detected and/or corrected by error correcting codes (ECC).
• We focus on observable or unmasked faults that manifest as uncorrectable errors.
  – In particular, we examine the rate of uncorrectable errors.
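The difference between a permanent and a transient fault can be illustrated with a toy one-byte memory cell. This is only a sketch for intuition: the `Cell` class, its stuck-at-1 mask, and the `upset` method are hypothetical names invented here, and nothing in it reflects how FaultSim or real DRAM models faults.

```python
class Cell:
    """Toy one-byte memory cell (hypothetical model, for illustration only)."""

    def __init__(self, stuck_at_one_mask=0):
        # Bits set in the mask are permanently stuck at 1.
        self.stuck_at_one_mask = stuck_at_one_mask
        self.value = 0

    def write(self, value):
        self.value = value & 0xFF

    def read(self):
        # Stuck-at-1 bits read back as 1 regardless of what was written.
        return self.value | self.stuck_at_one_mask

    def upset(self, bit):
        # Transient fault: flip one stored bit, e.g. a single-event upset.
        self.value ^= 1 << bit


def demo():
    # Permanent fault: the wrong value comes back after every write.
    permanent = Cell(stuck_at_one_mask=0b0000_0100)
    permanent.write(0b0000_0001)
    first = permanent.read()
    permanent.write(0b0000_0001)
    second = permanent.read()

    # Transient fault: the stored value is corrupted once,
    # and a fresh write restores the correct contents.
    transient = Cell()
    transient.write(0b0000_0001)
    transient.upset(bit=2)
    corrupted = transient.read()
    transient.write(0b0000_0001)
    repaired = transient.read()
    return first, second, corrupted, repaired
```

In this toy model the permanent fault survives rewriting while the transient one does not, which is the distinction the slide draws.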
System Configuration
• Processor: 16 cores
• L1/L2 caches
• M1 memory (HBM [1]): logic die + die-stacked memory, ECC: SEC-DED
• M2 memory (DDR3): conventional off-chip memory, ECC: ChipKill
[1] HBM v1
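\[ \text{AccessRate} = \frac{\text{Number of M1 accesses}}{\text{Number of M1 + M2 accesses}} \]
\[ FP_{HMA}[\text{time}] = \text{AccessRate} \times FP_{M1}[\text{time}] + (1 - \text{AccessRate}) \times FP_{M2}[\text{time}] \]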
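The two formulas above can be written out as a minimal Python sketch. The function names and the numbers in the example are assumptions made for illustration; they are not measurements from the study.

```python
def access_rate(m1_accesses, m2_accesses):
    """Fraction of all memory accesses that are served by M1."""
    total = m1_accesses + m2_accesses
    if total == 0:
        raise ValueError("no memory accesses recorded")
    return m1_accesses / total


def fp_hma(rate, fp_m1, fp_m2):
    """Failure probability of the HMA at one point in time.

    rate  : AccessRate, in [0, 1]
    fp_m1 : failure probability of M1 at that time
    fp_m2 : failure probability of M2 at that time
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("AccessRate must lie in [0, 1]")
    return rate * fp_m1 + (1.0 - rate) * fp_m2


# Hypothetical example values, chosen only to exercise the formulas.
rate = access_rate(m1_accesses=750, m2_accesses=250)
combined = fp_hma(rate, fp_m1=1e-3, fp_m2=1e-5)
```

Because the combination is a weighted average, the HMA failure probability always lies between the failure probabilities of the two memories, and it moves toward that of M1 as the access rate rises.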
Evaluation Methodology
Tools Used
• FaultSim [1] for fault and error correction scheme simulations.
  • FaultSim is an event-based memory fault simulator.
  • FaultSim assumes every fault results in an error.
• Ramulator [2] for performance simulations.
  • Ramulator is a DRAM simulator providing performance models for different memory standards such as DDR3/4, LPDDR3/4, GDDR5, and HBM.
[1] FaultSim: https://github.com/Prashant-GTech/FaultSim-A-Memory-Reliability-Simulator, David Roberts and Prashant Nair.
[2] Ramulator: https://github.com/CMU-SAFARI/ramulator, Kim et al.
Workloads Used: Selected three representative sample workloads.
Workload (x16)   Memory Footprint   MPKI (Misses Per Kilo Instructions)
mcf              16.02 GB           65.03 (High Bandwidth)
astar            2.63 GB            16.70 (Medium Bandwidth)
cactus           2.31 GB            3.70 (Low Bandwidth)
Discussion: Failure Probability of HMAs
[Figure: failure probability of the HMA (z-axis)]
A performance-focused access rate control can deteriorate the failure probability of an HMA system by 1000x relative to the initial year.
The next question we must answer in this study is how the increase in access rate affects the performance of different workloads.
Discussion: Performance vs. Reliability Trade-off
• Divide the entire memory footprint into 4 KB pages.
• Randomly allocate a percentage of the 4 KB pages to M1 memory.
Example workload (x16): cactus, memory footprint 2.31 GB
[Figure: the memory footprint divided into 4 KB pages, each placed in M1 or M2]
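The page placement described on this slide could be sketched as follows. The slides do not say how the random selection was implemented, so the function name `allocate_pages`, the fixed `seed`, and the rounding of the page count are all assumptions.

```python
import random

PAGE_SIZE = 4 * 1024  # 4 KB pages, as on the slide


def allocate_pages(footprint_bytes, m1_fraction, seed=0):
    """Return a list with one entry per page: "M1" or "M2".

    footprint_bytes : size of the workload's memory footprint
    m1_fraction     : share of pages to place in M1, in [0, 1]
    seed            : hypothetical parameter, makes the placement repeatable
    """
    if not 0.0 <= m1_fraction <= 1.0:
        raise ValueError("m1_fraction must lie in [0, 1]")
    # Ceiling division: a partly used last page still occupies a page.
    num_pages = -(-footprint_bytes // PAGE_SIZE)
    num_m1 = round(num_pages * m1_fraction)
    rng = random.Random(seed)
    # Pick the M1 pages uniformly at random, without replacement.
    m1_pages = set(rng.sample(range(num_pages), num_m1))
    return ["M1" if page in m1_pages else "M2" for page in range(num_pages)]


# Small example: a 64 KB footprint with a quarter of the pages in M1.
placement = allocate_pages(64 * 1024, m1_fraction=0.25)
```

Sweeping `m1_fraction` from 0 to 1 is one way to vary how much of the footprint lives in M1. The resulting access rate still depends on how often each page is touched, which this sketch does not model.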
1. Different workloads differ in how sensitive their performance is to the access rate.
2. Low-bandwidth workloads such as cactus could gain more in reliability by reducing the access rate, without losing much performance, than high-bandwidth workloads such as mcf, in mid- and final-week.
3. This suggests a potential benefit from an access rate control system that tunes the access rate for different workloads as the HMA system ages.
Discussion: Aging-aware Access Rate Control
• The workload starts operating at its peak operating point.
• The week in which the system hits a limiting failure probability is noted as T1.
• The week when IPC degradation is below 10% is noted as T2.
[Figure: timeline marking T1, T2, and the final week, with IPC degradation]
Workload (x16)             T1 (week)   T2 (week)   IPC Degradation
cactus (low bandwidth)     108         187         23.5%
astar (medium bandwidth)   120         174         29.6%
mcf (high bandwidth)       132         176         34.1%
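One way to picture aging-aware control is to ask, for each week, what the largest access rate is that keeps the HMA failure probability under a limit. The sketch below solves the linear model FP_HMA = AccessRate x FP_M1 + (1 - AccessRate) x FP_M2 for AccessRate. It is not the controller used in the study, which the slides do not specify; the function name, the limit, and the weekly failure probabilities are hypothetical.

```python
def max_access_rate(fp_m1, fp_m2, fp_limit):
    """Largest AccessRate in [0, 1] with
    rate * fp_m1 + (1 - rate) * fp_m2 <= fp_limit.
    """
    if fp_m1 <= fp_limit:
        # Even serving every access from M1 stays within the limit.
        return 1.0
    if fp_m2 > fp_limit:
        # Both memories exceed the limit, so no mix can satisfy it.
        raise ValueError("limit cannot be met at any access rate")
    # Here fp_m2 <= fp_limit < fp_m1, so the bound lies in [0, 1).
    return (fp_limit - fp_m2) / (fp_m1 - fp_m2)


# Hypothetical failure probabilities for three points in the system's life.
fp_m1_by_week = [1e-4, 4e-4, 8e-4]
fp_m2_by_week = [1e-5, 2e-5, 3e-5]
limit = 2e-4

schedule = [
    max_access_rate(m1, m2, limit)
    for m1, m2 in zip(fp_m1_by_week, fp_m2_by_week)
]
```

With these made-up inputs the allowed access rate stays at 1.0 while M1 alone is under the limit and then falls as M1 ages, which mirrors the idea of running at the peak operating point until the limit is reached.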
Summary
The failure probability of an HMA system depends on memory type, error correcting codes, and age.
Because the change in failure probability is dependent on workloads and the HMA's age, the access rate control has to be dynamic.
Dynamic access rate control:
- Enables sustained & reliable operation of HMAs.
- Reduces the need to replace memory (and the die for the die-stacked memory) with the increased rate of uncorrectable errors.
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTIONS
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.