28
Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO Panagiota Nikolaou 1 , Yiannakis Sazeides 1 , Lorena Ndreu 1 , Marios Kleanthous 2 1 University of Cyprus, 2 MAP S.Platis MICRO 48, Waikiki, Hawaii, December 5th 2015 P. Nikolaou 1

Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Modeling the Implications of DRAM Failures and Protection Techniques 

on Datacenter TCOPanagiota Nikolaou1, Yiannakis Sazeides1, Lorena Ndreu 1, Marios Kleanthous 2

1University of Cyprus, 2MAP S.Platis

MICRO 48, Waikiki, Hawaii, December 5th 2015P. Nikolaou 1

Page 2: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Many Million $ per month 

Today’s Datacenters

P. Nikolaou MICRO 48, Waikiki, Hawaii 2

> 510,000 DC in all over the world [Emerson, 2011] > 285 Million Sqft [Emerson, 2011]

Large scale Datacenters: >10,000 commodity servers

Page 3: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Datacenter Cost 

P. Nikolaou MICRO 48, Waikiki, Hawaii 3

Other Capex 

Expenses49%Other 

Opex Expenses

20%

DRAM          Opex & Capex 

Expenses31%

[Analysis using COST‐ET tool, D. Hardy 2013]

Other Capex 

Expenses49%Other 

Opex Expenses

20%

DRAM          Opex & Capex 

Expenses31%

Page 4: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

DRAM Protection Cost

P. Nikolaou MICRO 48, Waikiki, Hawaii 4

DRAMProtectionOpex & Capex8%

Other Capex & Opex Cost

69%

DRAM Opex & Capex Expenses

23%

[Analysis using COST‐ET tool, D. Hardy 2013]

Data ECC

Page 5: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Do we need DRAM protection?

P. Nikolaou MICRO 48, Waikiki, Hawaii 5

• Google Failure Study [Barroso, 2009]

• DRAM large field studies [V. Shridharan 2012, 2013]

DRAM FITS/ Server 

[Borucki, IRPS 2008], [K. Lim ISCA 2009], [Daniel Bowers, “Server Trends”] 

DRAM protection is essential !!

Page 6: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

DRAM protection choices

P. Nikolaou MICRO 48, Waikiki, Hawaii 6

ChipkillDC ChipkillSC SECDED

Cost

Reliability

++++

Performance +

Cost

Reliability

++++

Performance ++

Cost

Reliability

++++

Performance +++

Page 7: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

DRAM protection selection

P. Nikolaou MICRO 48, Waikiki, Hawaii 7

Application

*Analyzer of Memory Protection and Failures Implications on TCO (AMPRA tool), site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php

AMPRA* 

Page 8: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Our Proposition & our Contribution

P. Nikolaou MICRO 48, Waikiki, Hawaii 8

AMPRA tool

Best DRAM Protection Technique

Availability/MTTF Model

DRAM SDCModel

Thermal Model

DIMM Cost Model

Energy Model

DIMM FIT Model

Server Performance 

Model

ReliabilityPerformanceApplication 

CharacteristicsPower Error 

protection techniques

TCOModel

Page 9: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Related work• [Y. Luo DSN 2014] Proposes and analyzes cost of a heterogeneous memory protection scheme 

Differences:– Performance, power implications  of memory protection techniques

– Co‐located services

– Datacenter cost

• No other related work considers various parameters 

P. Nikolaou MICRO 48, Waikiki, Hawaii 9

Page 10: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Outline

• Proposed Framework (AMPRA tool)• Use Case • Experimental Framework• Results• Conclusions

P. Nikolaou MICRO 48, Waikiki, Hawaii 10

Page 11: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

DRAM FIT Model

Availability/MTTFModel

TCOModel

DIMMFITS_CE

Total extra servers

ComponentMTTF for

replacement

ECC technique

System configuration

DIMMFITS_DUE

HW and SW repair options and their MTTR

DIMMFITS_NDE

Fits per mode(transient, permanent, physical

location)

#devices/DIMM

DC configuration

Device width/size

DRAM Grade Factor

Model for proactive replacement

Maintenance model for replacement on faulty components

#DC servers Energy Model

ServerPerformance

Model

Server Energy

DIMMCost Model

DIMM cost

Device size

DRAM frequency

PublishedData

ECC technique

#devices/DIMM

Device width

DRAM brand

DRAM technology

Server configuration(#cores,Interleaving

type, #channels, DIMMs/channel)

Performance Degradation

(PD)

#threads for Online Service

#threads for Offline Service

PublishedData

#threads for Online Service#threads for Offline Service

DRAM SDCDerating

Model

DIMMDerated

FITS_NDE

Server Configuration

System Reliability

(MTTF_SDC)TCO

#threads for Online Service

#threads for Offline Service

Utilization Profile per day for the online service

ThermalModel

Non DRAM component Reference MTTF

#threads for Online Service

#threads for Offline ServiceComponent Temperature

ECC technique

Server configuration(#cores, Interleaving type,

#channels, DIMMs/channel)

ECC technique

Server configuration(#cores,Interleaving type,

#channels, DIMMs/channel)

Reference ECC technique

Component Reference Temperature

Average Utilization

PublishedData

NDE Derating FactorDIMM

FITS_SDC

Server configuration(#cores,Interleaving type,

#channels, DIMMs/channel)

Target Reliability

Proposed framework (AMPRA tool)

P. Nikolaou MICRO 48, Waikiki, Hawaii 11

Page 12: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Outline

• Proposed Framework (AMPRA tool)• Use Case • Experimental Framework• Results• Conclusions

P. Nikolaou MICRO 48, Waikiki, Hawaii 12

Page 13: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Use Case

P. Nikolaou MICRO 48, Waikiki, Hawaii 13

Bandwidth vs. Latency vs. Reliability vs. Power Chipkill with Dual Channel Implementation (ChipkillDC)Chipkill with Single Channel Implementation (ChipkillSC)16 ECC bits for 128 Data bits‐144 bit codeword

Memory Controller

Data ECC

8B8B8B8B8B8B8B8B

64BBlock

72b72b72b72b

12345678

Page 14: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

FIT model 

• ChipkillDC:– Detects all the errors in 2 devices  – Corrects all the errors in 1 device

• ChipkillSC:– Cannot detect all the errors in 2 devices  – Corrects all the errors in 1 device

ChipkillDC can provide better Reliability than ChipkillSC

P. Nikolaou MICRO 48, Waikiki, Hawaii 14

Page 15: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Performance and Power model How it works: (ChipkillDC)• Read

P. Nikolaou MICRO 48, Waikiki, Hawaii 15

•Requires accessing two DIMMs•Codeword in a single burst•Latency short•Low Bandwidth •High Power Consumption

Data ECC

Memory Controller

Data ECC

8B8B8B8B8B8B8B8B

64BBlock

72b72b 72b72b

Page 16: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Performance and Power model How it works: (ChipkillSC)• Read

P. Nikolaou MICRO 48, Waikiki, Hawaii 16

Data ECC

Memory Controller

•Requires accessing one DIMM•Codeword in two bursts•Latency long•High Bandwidth•Less Power Consumption

144 bits

8B8B8B8B8B8B8B8B

64BBlock

72b72b

Page 17: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Design Space

P. Nikolaou MICRO 48, Waikiki, Hawaii 17

• Application characteristics• Memory intensive, compute intensive etc. 

• Co‐running applications

What happens with the Cost?ChipkillDCChipkillSC

Reliability

Bandwidth

Latency

Cannot detect all the errors in 2 devices  Corrects all the errors in 1 device

Access one DIMM Access two DIMMs

Codeword in two bursts Codeword in one burst

Power Access one DIMM Access two DIMMs

Detect all the errors in 2 devices  Corrects all the errors in 1 device

Page 18: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

P. Nikolaou MICRO 48, Waikiki, Hawaii 18

Online Services: High QoS requirements

Offline Services: Do not have QoS constrains

Online and Offline Services

Co‐location: Improve server utilization and reduce TCO 

Core 0 Core 1 Core 3Core 2

Memory Controller

DRAM

Online Service 

Online Service

Core 0 Core 1

Offline Service

Offline Service

Core 2 Core 3

DRAM

Page 19: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Outline

• Proposed Framework (AMPRA tool)• Use Case• Experimental Framework• Results• Conclusions

P. Nikolaou MICRO 48, Waikiki, Hawaii 19

Page 20: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Experimental Framework

P. Nikolaou MICRO 48, Waikiki, Hawaii 20

Fit Model

Analytical Models

Performance, Power and Thermal Model ChipkillDC

–Lockstep Mode

ChipkillSC‐

Advance Mode

Server Configuration Intel Xeon E5‐56204 cores per CPU2 channels per CPU1 DIMM per Channel 

Workloads1. Web Search (QoS requirements) 2. MapReduce:a. 500MB (CPU intensive)b. 49000MB (memory  intensive)

DIMM Cost Public Data 

TCO Model

Extension COST‐ET Tool[D. Hardy 2013]

DC ConfigurationServer Modules: 50,000DC depreciation: 15year

Page 21: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

P. Nikolaou

DRAM Protection Implications on Performance

MICRO 48, Waikiki, Hawaii 21

WS: Web SearchMR500: Map Reduce 500MBMR49000: Map Reduce 49000MB

Page 22: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

P. Nikolaou

DRAM Protection Implications on Power

MICRO 48, Waikiki, Hawaii 22

WS: Web SearchMR500: Map Reduce 500MBMR49000: Map Reduce 49000MB

Page 23: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

P. Nikolaou MICRO 48, Waikiki, Hawaii 23

DRAM Protection Implications on Cost

• Underlines the importance of understanding the usage and characteristics of all the services to be run in a DC before making memory protection design choices

• Highlights the need of proposed framework !!

WS: Web SearchMR500: Map Reduce 500MBMR49000: Map Reduce 49000MB

Page 24: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Usage• Datacenter designers: Select processor and protectiontechnique

• Researchers: Investigate the implications of new ideasrelated to DRAM failures and DRAM protectiontechniques

• Service providers: Find how to charge for runningoffline services and to makeup for the increase in TCOdue to co‐location

P. Nikolaou MICRO 48, Waikiki, Hawaii 24

Page 25: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

More in the paper

• Detailed explanation of each model

• DRAM grades and how affect TCO

• Results for other protection techniques (SECDED)

• Power and performance results for more applications

P. Nikolaou MICRO 48, Waikiki, Hawaii 25

Page 26: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

• DRAM is one of the dominant cost consumers in a DC

• Different protection techniques have different TCO implications

• Framework to encapsulates all the parameters and tries to determine the cost‐effective protection technique for a DC

• Highlight the need of the framework – It is not straightforward to decide which DRAM protection technique is best for a DC setup in the lack of this framework

Conclusions

P. Nikolaou MICRO 48, Waikiki, Hawaii 26

Page 27: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

Future Work

• Evaluate TCO for more online and offline services

• Explore the cost‐benefits of new ECC schemes

• Validation of the framework by using detailed logs from a real DC

P. Nikolaou MICRO 48, Waikiki, Hawaii 27

Page 28: Modeling the Implications of DRAM and Techniques on TCO · Modeling the Implications of DRAM Failures and Protection Techniques on Datacenter TCO PanagiotaNikolaou 1, YiannakisSazeides

P. Nikolaou MICRO 48, Waikiki, Hawaii 28

AMPRA tool download site: http://www2.cs.ucy.ac.cy/carch/xi/ampra_tco.php