Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO
Panagiota Nikolaou¹, Yiannakis Sazeides¹, Lorena Ndreu¹, Marios Kleanthous²
¹University of Cyprus, ²MAP S.Platis
MICRO 48, Waikiki, Hawaii, December 5th 2015
Today’s Datacenters
• Many million $ per month
• More than 510,000 datacenters worldwide [Emerson, 2011]
• More than 285 million sq ft [Emerson, 2011]
• Large-scale datacenters: >10,000 commodity servers
Datacenter Cost
• DRAM Opex & Capex expenses: 31%
• Other Capex expenses: 49%
• Other Opex expenses: 20%
[Analysis using COST-ET tool, D. Hardy 2013]
DRAM Protection Cost
• DRAM protection Opex & Capex: 8%
• DRAM Opex & Capex expenses: 23%
• Other Capex & Opex cost: 69%
[Analysis using COST-ET tool, D. Hardy 2013]
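The two cost breakdowns are consistent with each other: the 31% DRAM slice of the first chart splits into 23% for DRAM and 8% for its protection. A minimal sketch of that arithmetic, using only the percentages on the slides (the variable names are illustrative):

```python
# Percentages read off the two cost-breakdown slides (COST-ET analysis).
dram_total_share = 31        # DRAM Opex & Capex, protection included
dram_base_share = 23         # DRAM Opex & Capex without protection
dram_protection_share = 8    # DRAM protection Opex & Capex
other_capex_share = 49
other_opex_share = 20

# The second chart splits the DRAM slice of the first chart in two.
assert dram_base_share + dram_protection_share == dram_total_share
# Both charts account for the whole datacenter cost.
assert dram_total_share + other_capex_share + other_opex_share == 100

# Fraction of all DRAM-related spending that goes to protection.
protection_fraction_of_dram = dram_protection_share / dram_total_share
print(f"{protection_fraction_of_dram:.1%} of DRAM cost is protection")
```

Under these figures, roughly a quarter of the DRAM-related cost is attributable to protection.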
Do we need DRAM protection?
• Google Failure Study [Barroso, 2009]
• DRAM large field studies [V. Sridharan 2012, 2013]
[Figure: DRAM FITs per server] [Borucki, IRPS 2008], [K. Lim ISCA 2009], [Daniel Bowers, “Server Trends”]
DRAM protection is essential!
DRAM protection choices
ChipkillDC: Cost, Reliability ++++, Performance +
ChipkillSC: Cost, Reliability ++++, Performance ++
SECDED: Cost, Reliability ++++, Performance +++
DRAM protection selection
Application
AMPRA*
*Analyzer of Memory Protection and Failures Implications on TCO (AMPRA tool), site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php
Our Proposition & our Contribution
AMPRA tool
• Models: Availability/MTTF Model, DRAM SDC Model, Thermal Model, DIMM Cost Model, Energy Model, DIMM FIT Model, Server Performance Model, TCO Model, …
• Figure labels: Reliability, Performance, Application Characteristics, Power, Error protection techniques
Best DRAM Protection Technique
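One way to picture how a "best" technique could be chosen is as a constrained minimum: keep the techniques that meet a reliability target, then take the one with the lowest TCO. The sketch below is an assumption about the selection step, not the AMPRA tool's actual code; the `Candidate` type, the `pick_technique` helper, and all numbers are hypothetical placeholders rather than results from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    tco_per_month: float     # hypothetical TCO estimate, in dollars
    mttf_sdc_hours: float    # hypothetical MTTF for silent data corruption


def pick_technique(candidates: List[Candidate],
                   target_mttf_sdc_hours: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the reliability target."""
    feasible = [c for c in candidates
                if c.mttf_sdc_hours >= target_mttf_sdc_hours]
    if not feasible:
        return None  # no technique satisfies the target
    return min(feasible, key=lambda c: c.tco_per_month)


# Placeholder numbers for illustration only.
candidates = [
    Candidate("ChipkillDC", tco_per_month=1.05e6, mttf_sdc_hours=5e6),
    Candidate("ChipkillSC", tco_per_month=1.00e6, mttf_sdc_hours=2e6),
    Candidate("SECDED", tco_per_month=0.98e6, mttf_sdc_hours=1e5),
]
best = pick_technique(candidates, target_mttf_sdc_hours=1e6)
print(best.name if best else "no feasible technique")
```

With these placeholder inputs the cheapest option is excluded by the reliability target, which shows why cost and reliability have to be evaluated together.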
Related work
• [Y. Luo DSN 2014] proposes and analyzes the cost of a heterogeneous memory protection scheme
Differences:
– Performance and power implications of memory protection techniques
– Co-located services
– Datacenter cost
• No other related work considers this range of parameters
Outline
• Proposed Framework (AMPRA tool)
• Use Case
• Experimental Framework
• Results
• Conclusions
Proposed framework (AMPRA tool)
[Figure: block diagram of the AMPRA framework, connecting the models below through their inputs and outputs]
Models: DRAM FIT Model, Availability/MTTF Model, DIMM Cost Model, Energy Model, Server Performance Model, DRAM SDC Derating Model, Thermal Model, TCO Model
Labels in the diagram:
• DRAM failures: FITs per mode (transient, permanent, physical location), DRAM Grade Factor, DIMM FITS_CE, DIMM FITS_DUE, DIMM FITS_NDE, NDE Derating Factor, DIMM Derated FITS_NDE, DIMM FITS_SDC
• DIMM and device: ECC technique, Reference ECC technique, #devices/DIMM, Device width/size, DRAM frequency, DRAM brand, DRAM technology, DIMM cost
• Server: Server configuration (#cores, interleaving type, #channels, DIMMs/channel), #threads for Online Service, #threads for Offline Service, Utilization Profile per day for the online service, Average Utilization, Performance Degradation (PD), Server Energy, Component Temperature, Component Reference Temperature, Non-DRAM component Reference MTTF
• Datacenter: DC configuration, System configuration, #DC servers, HW and SW repair options and their MTTR, Model for proactive replacement, Maintenance model for replacement of faulty components, Component MTTF for replacement, Total extra servers, Target Reliability, System Reliability (MTTF_SDC), TCO
• Published data
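The Availability/MTTF Model relates MTTF and MTTR to a count of extra servers. The sketch below uses the standard steady-state availability formula, MTTF / (MTTF + MTTR); the way extra servers are derived from it (provision enough servers that the expected number of available ones equals the required count) is an assumption made for illustration, and `availability` and `extra_servers` are hypothetical helper names, not part of the tool.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time a server is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def extra_servers(required_servers: int, avail: float) -> int:
    """Assumed rule: provision N / A servers so that N are up on average."""
    if not 0 < avail <= 1:
        raise ValueError("availability must be in (0, 1]")
    return math.ceil(required_servers / avail) - required_servers


# Hypothetical inputs: 99 hours up for every 1 hour of repair.
a = availability(mttf_hours=99.0, mttr_hours=1.0)
print(a, extra_servers(50_000, a))
```

A shorter MTTF or a longer repair time lowers availability and so raises the number of spare servers that count toward TCO.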
Use Case
Bandwidth vs. Latency vs. Reliability vs. Power
• Chipkill with Dual Channel Implementation (ChipkillDC)
• Chipkill with Single Channel Implementation (ChipkillSC)
• 16 ECC bits for 128 data bits: 144-bit codeword
[Figure: memory controller with data and ECC devices; a 64B block shown as 8 x 8B, moved in 72b transfers numbered 1 to 8]
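The codeword and block sizes on this slide fit together arithmetically. A short sketch of that bookkeeping, assuming the 72b units in the figure are 72-bit transfers (64 data bits plus 8 ECC bits, the usual ECC DIMM width); the constant names are illustrative:

```python
DATA_BITS_PER_CODEWORD = 128
ECC_BITS_PER_CODEWORD = 16
BLOCK_BYTES = 64          # cache block size shown in the figure
TRANSFER_BITS = 72        # assumed: one 72-bit transfer = 64 data + 8 ECC

codeword_bits = DATA_BITS_PER_CODEWORD + ECC_BITS_PER_CODEWORD
ecc_overhead = ECC_BITS_PER_CODEWORD / DATA_BITS_PER_CODEWORD

# A 64B block carries 512 data bits, i.e. four 128-bit data words.
codewords_per_block = (BLOCK_BYTES * 8) // DATA_BITS_PER_CODEWORD
bits_per_block = codewords_per_block * codeword_bits
transfers_per_block = bits_per_block // TRANSFER_BITS
transfers_per_codeword = codeword_bits // TRANSFER_BITS

print(codeword_bits, ecc_overhead, codewords_per_block,
      transfers_per_block, transfers_per_codeword)
```

Each 144-bit codeword spans two 72-bit transfers, so it can be read either from two DIMMs at once or from one DIMM in two steps.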
FIT model
• ChipkillDC:
– Detects all the errors in 2 devices
– Corrects all the errors in 1 device
• ChipkillSC:
– Cannot detect all the errors in 2 devices
– Corrects all the errors in 1 device
ChipkillDC can provide better Reliability than ChipkillSC
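FIT rates translate directly into MTTF and into expected failure counts for a fleet, since one FIT is one failure per 10^9 device-hours. The sketch below assumes DIMMs fail independently so that a server's DRAM FIT is the sum of its DIMM FITs; the FIT value and the helper names are hypothetical, chosen only to show the conversion.

```python
HOURS_PER_YEAR = 24 * 365


def fit_to_mttf_hours(fit: float) -> float:
    """FIT is failures per 10^9 hours, so MTTF = 10^9 / FIT hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / fit


def expected_failures_per_year(fit_per_server: float, servers: int) -> float:
    """Expected failures across a fleet in one year of operation."""
    return fit_per_server * servers * HOURS_PER_YEAR / 1e9


# Hypothetical inputs, not values from the talk.
dimm_fit = 100.0          # assumed FIT of one DIMM for some error class
dimms_per_server = 2
servers = 50_000

# Assumption: independent DIMMs, so FIT rates add up per server.
server_fit = dimm_fit * dimms_per_server
print(fit_to_mttf_hours(server_fit))
print(expected_failures_per_year(server_fit, servers))
```

Even a per-server MTTF of millions of hours yields a steady stream of failures at datacenter scale.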
Performance and Power model
How it works (ChipkillDC): Read
• Requires accessing two DIMMs
• Codeword in a single burst
• Short latency
• Low bandwidth
• High power consumption
[Figure: memory controller reading a 64B block (8 x 8B) from the data and ECC devices of two DIMMs in 72b transfers]
Performance and Power model
How it works (ChipkillSC): Read
• Requires accessing one DIMM
• Codeword in two bursts
• Long latency
• High bandwidth
• Less power consumption
[Figure: memory controller reading a 64B block (8 x 8B) from the data and ECC devices of one DIMM; two 72b transfers form the 144 bits]
Design Space
• Application characteristics
– Memory intensive, compute intensive, etc.
• Co-running applications

             ChipkillDC                              ChipkillSC
Reliability  Detects all the errors in 2 devices;    Cannot detect all the errors in 2 devices;
             corrects all the errors in 1 device     corrects all the errors in 1 device
Bandwidth    Access two DIMMs                        Access one DIMM
Latency      Codeword in one burst                   Codeword in two bursts
Power        Access two DIMMs                        Access one DIMM

What happens with the cost?
Online and Offline Services
• Online services: high QoS requirements
• Offline services: do not have QoS constraints
• Co-location: improves server utilization and reduces TCO
[Figure: a four-core server (Core 0 to Core 3) sharing one memory controller and DRAM; online services run on Core 0 and Core 1, offline services on Core 2 and Core 3]
Experimental Framework
• FIT model: analytical models
• Performance, power and thermal model:
– ChipkillDC: Lockstep Mode
– ChipkillSC: Advanced Mode
• Server configuration: Intel Xeon E5-5620, 4 cores per CPU, 2 channels per CPU, 1 DIMM per channel
• Workloads:
1. Web Search (QoS requirements)
2. MapReduce:
a. 500MB (CPU intensive)
b. 49000MB (memory intensive)
• DIMM cost: public data
• TCO model: extension of the COST-ET tool [D. Hardy 2013]
• DC configuration: 50,000 server modules; DC depreciation: 15 years
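The depreciation period determines how much capital cost is charged to each month of TCO. A minimal sketch assuming straight-line depreciation over the 15 years given on the slide; the capital cost figure and the `monthly_depreciation` helper are hypothetical, and the actual COST-ET formulas may differ.

```python
DC_DEPRECIATION_YEARS = 15   # from the DC configuration on the slide


def monthly_depreciation(capex: float, years: int) -> float:
    """Straight-line depreciation: equal charge for every month."""
    if years <= 0:
        raise ValueError("years must be positive")
    return capex / (years * 12)


# Placeholder capital cost in dollars, for illustration only.
hypothetical_dc_capex = 180_000_000.0
monthly_charge = monthly_depreciation(hypothetical_dc_capex,
                                      DC_DEPRECIATION_YEARS)
print(monthly_charge)
```

A monthly TCO figure would add operating expenses such as energy and maintenance to this charge.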
DRAM Protection Implications on Performance
WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB
DRAM Protection Implications on Power
WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB
DRAM Protection Implications on Cost
• Underlines the importance of understanding the usage and characteristics of all the services to be run in a DC before making memory protection design choices
• Highlights the need for the proposed framework!
WS: Web Search; MR500: MapReduce 500MB; MR49000: MapReduce 49000MB
Usage
• Datacenter designers: select the processor and the protection technique
• Researchers: investigate the implications of new ideas related to DRAM failures and DRAM protection techniques
• Service providers: find how to charge for running offline services and make up for the increase in TCO due to co-location
More in the paper
• Detailed explanation of each model
• DRAM grades and how they affect TCO
• Results for other protection techniques (SECDED)
• Power and performance results for more applications
Conclusions
• DRAM is one of the dominant cost components in a DC
• Different protection techniques have different TCO implications
• The framework encapsulates all the parameters and tries to determine the cost-effective protection technique for a DC
• The results highlight the need for the framework
– Without it, deciding which DRAM protection technique is best for a given DC setup is not straightforward
Future Work
• Evaluate TCO for more online and offline services
• Explore the cost‐benefits of new ECC schemes
• Validation of the framework by using detailed logs from a real DC
AMPRA tool download site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php