Software Architecture for Dynamic Thermal Management in Datacenters
Tridib Mukherjee, Graduate Research Assistant
IMPACT Lab (www.impact.asu.edu), Department of Comp. Sc. & Engg., Arizona State University


Page 1

Software Architecture for Dynamic Thermal Management in Datacenters

Tridib Mukherjee
Graduate Research Assistant
IMPACT Lab (www.impact.asu.edu)
Department of Comp. Sc. & Engg.
Arizona State University

Page 2

Outline

• Motivation
• Dynamic Thermal Management in Datacenters
• Thermal-aware task scheduling
• Software Architecture
• Conclusions and Future work

Page 3

Motivation

Computing clusters are increasingly deployed in current datacenters limited by power and thermal capacity
• High server density to achieve higher computation capability: leads to high heat density
• Reliability and longevity of the overheated servers are affected: system downtime may increase

Rising cost for datacenters
• Large-scale datacenters can run into millions of dollars: cooling cost comprises almost half of this
• The current trend of overcooling based on worst-case thermal characteristics leads to high utilities cost

A dynamic thermal-aware control platform is necessary for online thermal evaluation that can achieve a tradeoff between these extremes.

Page 4

Thermal Management of Datacenter

Motivation and significance
• Compute-intensive applications (online gaming, computer movie animation, data mining) require increased utilization of the datacenter
• Maximizing computing capacity is a demanding requirement
• New blade servers can be packed more densely
• Energy cost is rising dramatically

Goal
• Improving thermal performance
• Lowering hardware failure rate
• Reducing energy cost

Page 5

Typical layout of a datacenter

[Figure: datacenter layout, labeled with rack outlet temperature T_out, rack inlet temperature T_in, and air conditioner supply temperature T_s]

Page 6

Schematic View of Thermal Management

[Figure: control and feedback loop around the datacenter. Diagram labels: Control; Feedback; Transducer; Target; Datacenter; Sensor Data Database; History Sensor Data; Current Sensor Data; Collecting environmental data and load information from sensors; Onsite survey; CFD simulation software; Abstract Heat Model; Correlation of load & power; Map load to power consumption; Cost Analysis; Policy Controller; Scheduling Policy; Control Policy; Moab Scheduler; Incoming task; Process Migration; Other Impact factors]

Page 7

Research Issues of Thermal Management in Datacenter

[Figure: research issues grouped under Understanding and Control. Diagram labels: Abstract HeatFlow Model; Power & Load Characterization; Modeling Thermal Performance; Multiscale & Multimodal Info Analysis; Thermal Performance Evaluation; Cost Optimization; Scheduler; Other Impact Factors]

Page 8

Task Scheduling and Thermal Distribution Correlation

Reaction Chain
Task Assignment → Power Consumption Distribution → Temperature Distribution → Energy Cost

Scheduling Requirements
• Real-time measurement
• Online lightweight temperature prediction
• Thermal-awareness in the scheduling decisions

[Figure: Task Assignment → Power Consumption Distribution → Inlet temperature distribution without cooling (25C marked) → Cooling lowered: inlet temperature lowered below redline threshold (25C marked) → Demand for cooling load/energy]
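The reaction chain can be made concrete with a small numeric sketch. Everything below is an illustrative assumption rather than the model used in the talk: the linear idle/peak power model, the wattages, the `heat_gain` recirculation coefficients, and treating the 25 C value from the figure as the redline inlet temperature.

```python
from typing import List

# Assumed values for illustration only; none of these come from the slides.
IDLE_WATTS = 200.0
PEAK_WATTS = 400.0
REDLINE_C = 25.0  # assumption: the 25 C marked in the figure is the redline


def power_distribution(utilization: List[float]) -> List[float]:
    """Task assignment (utilization in [0, 1] per chassis) -> power draw in watts."""
    return [IDLE_WATTS + (PEAK_WATTS - IDLE_WATTS) * u for u in utilization]


def inlet_rise(power: List[float], heat_gain: List[List[float]]) -> List[float]:
    """Temperature rise at each inlet caused by recirculated heat.

    heat_gain[i][j] is a hypothetical coefficient (degrees C per watt) for how
    much the heat of chassis j raises the inlet temperature of chassis i.
    """
    return [sum(g * p for g, p in zip(row, power)) for row in heat_gain]


def inlet_temperatures(power, supply_temp_c, heat_gain):
    """Power distribution -> inlet temperature distribution."""
    return [supply_temp_c + r for r in inlet_rise(power, heat_gain)]


def required_supply_temp(power, heat_gain, redline_c=REDLINE_C):
    """Highest supply temperature that keeps every inlet at or below the redline."""
    return redline_c - max(inlet_rise(power, heat_gain))
```

A task assignment that produces a larger worst-case inlet rise forces a lower supply temperature, which is the link from scheduling decisions to cooling demand in the chain.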

Page 9

Thermal-aware Scheduling Techniques

Uniform Task distribution (UT)
• Assign all chassis the same amount of tasks (power consumption)

Uniform Outlet Profile (UOP)
• Assign tasks so that outlet temperatures are balanced (uniform distribution)

Minimum Computing Energy (coolest inlet) (MCE)
• Assign tasks so that the number of active (powered-on) chassis is as small as possible

Recirculation Minimized Scheduling (XInt)
• Use a profiling process to calculate cross-interference coefficients
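Two of these policies are simple enough to sketch directly. The functions below are a minimal illustration under stated assumptions, not the implementation evaluated in the talk: tasks are treated as identical units, `capacity` is a hypothetical per-chassis task limit, and the coolest-inlet ordering is one plausible reading of the MCE description.

```python
from typing import Dict, List


def uniform_task(num_tasks: int, chassis: List[str]) -> Dict[str, int]:
    """UT: give every chassis the same number of tasks.

    Any remainder is handed out one task at a time to the first chassis.
    """
    base, extra = divmod(num_tasks, len(chassis))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(chassis)}


def coolest_inlet_first(
    num_tasks: int, inlet_temp_c: Dict[str, float], capacity: int
) -> Dict[str, int]:
    """MCE-style placement: fill the chassis with the coolest inlets first,
    so that as few chassis as possible need to be powered on."""
    if num_tasks > capacity * len(inlet_temp_c):
        raise ValueError("not enough capacity for the requested tasks")
    assignment = {c: 0 for c in inlet_temp_c}
    remaining = num_tasks
    for c in sorted(inlet_temp_c, key=inlet_temp_c.get):
        if remaining == 0:
            break
        assignment[c] = min(capacity, remaining)
        remaining -= assignment[c]
    return assignment
```

UOP and XInt need a thermal model (outlet temperatures and cross-interference coefficients obtained by profiling), so they are not reduced to a few lines here.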

Page 10

Total Energy Cost Comparisons

[Figure: total energy cost comparisons]

Page 11

System Model & Cluster Set-up

[Figure: a remote client connects over the network to the server saguaro.fulton.asu.edu, which fronts the Saguaro cluster. The cluster is drawn as racks (Rack 0 to Rack 3) of chassis (Chassis 0 to Chassis 4), with units labeled PowerEdge 2450 and "Intel 64-bit Xeon EM64T Dual-processor Servers"]

Saguaro Cluster is the main cluster maintained by the High Performance Computing Initiative at ASU.
• 4 racks, 5 chassis per rack, 10 dual-processors per chassis
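The cluster dimensions multiply out as follows. This sketch reads "10 dual-processors per chassis" as 10 dual-processor servers per chassis, which is an interpretation; the zero-based rack and chassis indices follow the figure labels.

```python
RACKS = 4
CHASSIS_PER_RACK = 5
SERVERS_PER_CHASSIS = 10    # assumed reading of "10 dual-processors per chassis"
PROCESSORS_PER_SERVER = 2   # dual-processor servers


def chassis_ids():
    """Enumerate (rack, chassis) pairs: Rack 0..3, Chassis 0..4."""
    return [(r, c) for r in range(RACKS) for c in range(CHASSIS_PER_RACK)]


total_chassis = len(chassis_ids())
total_servers = total_chassis * SERVERS_PER_CHASSIS
total_processors = total_servers * PROCESSORS_PER_SERVER
```

Under that reading the cluster has 20 chassis, 200 servers, and 400 processors.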

Page 12

Cluster Management S/W Infrastructure

We used the Moab scheduler for job allocation in this cluster.
• Easy to use
• Provides a good graphical interface in the form of Moab Cluster Manager (MCM)
• Job re-allocation is allowed based on priority
• Uses the underlying resource management software (such as TORQUE) and enforces the scheduling policies (such as fair-share) selected from the GUI

Thermal awareness is integrated into the Moab Scheduler.
• Priority is set as a function of temperature, utilization, etc.

PHP based datacenter visualization.

[Figure: software stack, top to bottom: Moab Cluster Management GUI; Moab Server; Resource Management (Torque); Data Center]
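A priority that depends on temperature and utilization could take many forms; the slides do not give the actual function. The sketch below is one hypothetical choice: a weighted sum of thermal headroom and idle capacity, with the 25 C redline and both weights chosen purely for illustration.

```python
def node_priority(
    inlet_temp_c: float,
    utilization: float,
    redline_c: float = 25.0,  # assumed redline, not taken from the scheduler config
    w_temp: float = 1.0,      # hypothetical weight on thermal headroom
    w_util: float = 1.0,      # hypothetical weight on idle capacity
) -> float:
    """Return a score where a higher value means a more preferred node.

    Cooler inlets and lower utilization both raise the score.
    """
    thermal_headroom = (redline_c - inlet_temp_c) / redline_c
    idle_fraction = 1.0 - utilization
    return w_temp * thermal_headroom + w_util * idle_fraction
```

With equal weights, a node at 20 C and 50% utilization scores higher than a node at 24 C with the same load, so new work drifts toward the cooler node.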

Page 13

Chassis Level Sensor Data Collection

• SNMP based script periodically queries sensors and updates server database
• PHP script periodically accesses the database for presenting the thermal history in the webpage

Sensor Placement at each chassis*
• 11 outlet temperature sensors at back of the chassis
• 3 housing temperature sensors at middle of the chassis

* There is only one inlet sensor at the front of the chassis
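The collection step (query the sensors periodically, then append to a history database) can be sketched as below. This is not the script from the talk: the SNMP query is replaced by a stand-in `fake_reader` function, SQLite stands in for the server database, and the table layout and chassis names are hypothetical.

```python
import sqlite3
import time
from typing import Callable, Dict, Iterable, Optional

# A reader maps a chassis name to {sensor name: temperature in C}.
SensorReader = Callable[[str], Dict[str, float]]


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) history table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "ts REAL, chassis TEXT, sensor TEXT, temp_c REAL)"
    )


def poll_once(
    conn: sqlite3.Connection,
    read_sensors: SensorReader,
    chassis_ids: Iterable[str],
    now: Optional[float] = None,
) -> int:
    """Query every chassis once and append the readings; return the row count."""
    ts = time.time() if now is None else now
    rows = [
        (ts, chassis, sensor, temp)
        for chassis in chassis_ids
        for sensor, temp in read_sensors(chassis).items()
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def poll_periodically(conn, read_sensors, chassis_ids, interval_s, rounds):
    """Repeat poll_once a fixed number of times, sleeping between rounds."""
    chassis_ids = list(chassis_ids)
    for i in range(rounds):
        poll_once(conn, read_sensors, chassis_ids)
        if i + 1 < rounds:
            time.sleep(interval_s)


def fake_reader(chassis: str) -> Dict[str, float]:
    """Stand-in for the SNMP query; returns fixed values for illustration."""
    return {"inlet": 21.5, "outlet_01": 33.0, "housing_01": 28.0}
```

A real deployment would swap `fake_reader` for a function that issues the SNMP requests; the visualization side only needs to read the `readings` table.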

Page 14

Visualization and Scheduler Integration

• Temperature data is included as Generic Metric (GMETRIC) in Moab.
• Node priority is set based on Moab GMETRIC data.
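One way to hand temperature readings to the scheduler is a small provider script that prints one line per node. The sketch below assumes a line format of `<node> GMETRIC[temp]=<value>`, modeled on how Moab's native resource manager interface reports node attributes; that format, the metric name `temp`, and the node names are assumptions that should be checked against the Moab documentation for the installed version.

```python
from typing import Dict, List


def format_gmetric_lines(temps_c: Dict[str, float], metric: str = "temp") -> List[str]:
    """Render per-node temperatures as '<node> GMETRIC[<metric>]=<value>' lines.

    The output format is an assumption, not a confirmed Moab contract.
    """
    return [
        f"{node} GMETRIC[{metric}]={temp:.1f}"
        for node, temp in sorted(temps_c.items())
    ]


if __name__ == "__main__":
    # Hypothetical readings; a real provider would read the sensor database.
    for line in format_gmetric_lines({"node002": 23.0, "node001": 21.5}):
        print(line)
```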

Page 15

Putting it all together: Software Architecture

[Figure: three-layer architecture.
Presentation: Remote Client; Local Desktop 129.219.33.232; MCM; Server history in web page.
Scheduling Control: Server saguaro.fulton.asu.edu; Moab Scheduler; TORQUE Resource Manager Server; History of Sensor Reading; Nagios Script; PHP Script; Moab GMETRIC Data Provider.
Datacenter Servers: Intel 64-bit Xeon EM64T Dual-processor Servers; TORQUE Resource Manager Client; access data from the chassis level sensors]

Page 16

Modularized Implementation of Thermal Awareness in Task Scheduling

Page 17

Conclusions

Proposed Architecture
• enables dynamic on-line thermal management during datacenter operation.
• provides visualization of thermal distribution

Implemented in fully operational ASU datacenter.

Prototype development and demonstration at the Research @ Intel day.

Page 18

Questions?