
Slide-1

LANL High Performance Computing Facilities Operations

Rick Rivera and Farhad Banisadr
Facility and Data Center Management, High Performance Computing Division

Operated by Los Alamos National Security, LLC for DOE/NNSA (UNCLASSIFIED)
LAUR-09-04524


Slide-2

Outline

•  Overview - LANL Computer Data Centers

•  Power and Cooling

•  CFD Data Center Modeling

•  Data Center Power Usage and Efficiency

•  Exa-Scale Challenges


Slide-3

High Performance Computing Division Mission



Slide-4

World-class facilities to support high performance computing systems

Nicholas C. Metropolis Center for Modeling and Simulation (SCC), 2002
- 43,500 sq ft computer room floor
- 9.6 MW Rotary UPS power
- 9.6 MW commercial power
- 6,000 tons chilled-water cooling

Laboratory Data Communications Complex (LDCC), 1989
- 20,000 sq ft computer room
- 8 MW commercial power
- 3,150 tons chilled-water cooling


Slide-5

High Performance Computing Systems

Secure Classified Computers (SCC):
• Redtail: IBM, 64.6 teraflop/s
• Roadrunner: IBM (Cell), 1,376 teraflop/s
• Hurricane: Appro, 80 teraflop/s
• Typhoon: Appro, 100 teraflop/s
• Cielo: Cray XE6, 1,370 teraflop/s

Open Unclassified Computers (LDCC):
• Yellowrail: IBM, 4.9 teraflop/s
• Cerrillos: IBM, 160 teraflop/s
• Turing: Appro, 22 teraflop/s
• Conejo: SGI, 52 teraflop/s
• Mapache: SGI, 50 teraflop/s
• Mustang: Appro, 353 teraflop/s

Rack power density: Roadrunner 15 kW/rack, Cielo 54 kW/rack



Slide-6


Slide-7

Electrical Power Distribution: Metropolis Center (SCC), Building Specific

SCC features:
• Multiple 13.2 kV feeders
• Double-ended substations
• Rotary uninterruptible power supplies (RUPS)
• Scalability
• Power conditioning: transient voltage surge suppressor (TVSS)

[Diagram: substation and switchgear feeding the RUPS power distribution and PDU power distribution paths to the supercomputing systems.]


Slide-8

Metropolis Center (SCC) Infrastructure Upgrade Project

• Split substations (19.2 MW)
• Added 1,200-ton chiller
• Added cooling tower
• 14 additional air handling units (AHUs)
• Water cooling (9 MW load)


Slide-9

Air Systems
Cold air generated by the air handling units is forced up through perforated tiles above the computer room subfloor. The air passes through and removes heat from the supercomputers, then is vented through the ceiling tiles and return plenums to complete the path back to the air handling units below.

As heat densities associated with supercomputers increase, modeling becomes critical. This graphic shows a computer room air temperature profile.


Slide-10

Heat Flow
Supercomputers require continuous and significant cooling. This diagram traces the path of heat flow through the different cooling plant components, fluids, and systems.


Slide-11


Slide-12

LANL/Cielo Update


Slide-13


Slide-14


Slide-15

Comparative Energy Costs


Slide-16

Power Usage Effectiveness (PUE)
• Total Facility Power (kW)
• IT Equipment Power (kW)


Slide-17

Calculating PUE
• What to measure and how often?
• Importance of consistency
• How to measure
  – Manual readings of BAS, UPS, PDUs
  – Instrumentation: real-time measurement
    • Wireless meters and sensors
    • Branch circuit monitoring
    • Power usage software
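Whichever measurement approach is used, turning the readings into PUE and DCiE is a simple ratio. A minimal Python sketch follows, with hypothetical readings (the values and variable names are illustrative, not actual LANL measurements):

def pue(total_facility_kw, it_equipment_kw):
    # Power Usage Effectiveness: total facility power divided by IT equipment power
    return total_facility_kw / it_equipment_kw

def dcie(total_facility_kw, it_equipment_kw):
    # Data Center Infrastructure Efficiency (%): the inverse of PUE, times 100
    return 100.0 * it_equipment_kw / total_facility_kw

# Hypothetical manual readings (kW) from PDUs and the facility BAS meter
pdu_readings_kw = [820.0, 790.0, 805.0]   # IT loads measured at the PDUs
facility_total_kw = 3400.0                # total facility power from the BAS

it_total_kw = sum(pdu_readings_kw)
print(f"PUE  = {pue(facility_total_kw, it_total_kw):.2f}")     # ~1.41
print(f"DCiE = {dcie(facility_total_kw, it_total_kw):.1f} %")  # ~71.0 %

Consistency matters as much as precision: the same meters, read at the same points in the distribution, should feed both the numerator and the denominator every time.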


Slide-18

SCC Power Usage Effectiveness

[Diagram: the 13.2 kV power distribution switchgear feeds substations A, B, C, H (99% efficient) and DUS-1, D, E (99% efficient). Computing power (7.89 MW) flows through the RUPS (92% efficient), switchboards (SWB, 100% efficient), and PDUs (96% efficient) to the IT load; cooling power (2.78 MW) feeds the cooling towers, chiller plant, and AHUs. Load breakdown: IT load 67%, AHUs/pumps 18%, chillers 14%, misc. 1%.]

SCC computer load:
• Facility power: 11.8 MW
• IT power: 7.89 MW
• PUE = 11.8 / 7.89 = 1.49

HVAC:
• HVAC power: 2.78 MW
• HVAC load: 2,206 tons (7.76 MW)
• HVAC kW/ton: 1.26
• COP = 7.76 / 2.78 = 2.79
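The figures above can be reproduced directly; a short sketch using the values from this slide (the 3.517 kW-per-ton conversion is the standard refrigeration constant and is assumed here, not stated on the slide):

KW_PER_TON = 3.517            # 1 ton of refrigeration is roughly 3.517 kW of heat removal

facility_power_mw = 11.8      # total SCC facility power
it_power_mw = 7.89            # IT power delivered to the machines
hvac_power_mw = 2.78          # electrical power drawn by the HVAC plant
hvac_load_tons = 2206         # heat removed by the HVAC plant

hvac_load_mw = hvac_load_tons * KW_PER_TON / 1000.0     # about 7.76 MW of heat removed
pue = facility_power_mw / it_power_mw                   # 11.8 / 7.89 = 1.49
kw_per_ton = hvac_power_mw * 1000.0 / hvac_load_tons    # about 1.26 kW/ton
cop = hvac_load_mw / hvac_power_mw                      # 7.76 / 2.78 = 2.79

print(f"PUE = {pue:.2f}, HVAC kW/ton = {kw_per_ton:.2f}, COP = {cop:.2f}")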


Slide-19

HPC current and future efficiency practices to reduce PUE

•  Data Center Metering/Measurements
•  Variable Speed Drives (VSD)
•  Cold Aisle Containment
•  Close-Coupled Liquid Cooling
•  Air-Side Economizers
•  Water-Side Economizers


Slide-20

Circulation power and cooling fluid choice

•  Air cooling
  –  Circulation power grows very rapidly: about 9% of total power is typical for low-power-density data centers using CRAC units, and since fan power scales roughly with the cube of airflow, doubling power density could increase it 8-fold
  –  Large heat sinks limit rack packing efficiency
•  Water cooling
  –  About 3,500 times more effective than air at heat removal (see the check after this list), with lower pumping cost
  –  Plumbing costs; earthquake bracing needed
  –  Risk of leaks and of condensation near computers in humid climates
  –  Attractive at high power densities
  –  Variant: short air loop to a nearby heat exchanger
•  Freon or similar fluids
  –  Stable temperature, governed by the phase transition
  –  Risk of leaks; inert, but displaces oxygen
  –  More complex plant, additional pumping costs, much variation in design and efficiency
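The roughly 3,500-to-1 advantage of water cited above follows from the volumetric heat capacities of the two fluids. A quick check using standard room-temperature property values (assumed here, not given on the slide):

# Approximate room-temperature properties
rho_water, cp_water = 1000.0, 4186.0   # density (kg/m^3) and specific heat (J/(kg*K)) of water
rho_air, cp_air = 1.2, 1005.0          # density and specific heat of air

ratio = (rho_water * cp_water) / (rho_air * cp_air)
print(f"Water carries ~{ratio:,.0f}x more heat per unit volume per kelvin than air")
# prints roughly 3,470x, i.e. the "about 3,500 times" figure above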


Slide-21

Multiple Equipment Summary View and Individual Substation View

[Screenshots: power-monitoring displays showing a multiple-equipment summary view and an individual substation view.]


Slide-22

Power Usage Effectiveness (PUE) PUE is a measure of how efficiently a computer data center uses its power. It is a ratio of total facility power to IT equipment power.

PUE = Total Facility Power / IT Equipment Power

Facility power: lighting, cooling, etc. IT equipment: computer racks, storage, etc.

NOTE: There was a communication loss on May 7th.



Slide-23

Data Center Infrastructure Efficiency (DCIE)
DCIE is a percentage measure of computer data center efficiency, calculated as the inverse of the PUE.

DCIE = (IT Equipment Power / Total Facility Power) x 100

NOTE: There was a communication loss on May 7th.



Slide-24

Individual Machine Monitoring and Resulting PUE

[Charts: Cielo, Redtail, and Roadrunner power (kW) over time, with the resulting PUE.]

NOTE: There was a communication loss on May 7th.


Slide-25

Metropolis Center Overall Performance, May 2011

[Charts: IT load (kW), total load (kW), PUE, and DCIE (%) for the month.]

NOTE: There was a communication loss on May 7th.


Slide-26

Identify Direct Correlations

[Charts: Cielo, Redtail, and Roadrunner power (kW), total IT load (kW), total facility load (kW), PUE, and DCiE (%) plotted together to show their direct correlations.]


Slide-27

Wireless Network Topology for Thermal Monitoring



Slide-28

Ceiling Temperatures, Humidity, and Under-Floor Air Pressure

[Sensor map: ceiling temperatures ranging from 65°F to 103°F, humidity readings of 49% and 53%, and under-floor static pressure of 0.5 in. w.c.]


Slide-29

Wireless Information Summary from Cielo

[Charts: Cielo exhaust, ceiling, supply, and return temperatures (°F) alongside power (kW).]


Slide-30

SCC Data Center Efficiencies
Live values recorded May 31, 2011 at 8:00 am

• PUE (Power Usage Effectiveness) = 1.337
• DCiE (Data Center Infrastructure Efficiency) = 74.81%

Reference points (PUE / DCiE): 2.0 / 50% = average; 1.5 / 67% = efficient; 1.2 / 83% = very efficient

IT loads:
• Cielo: 1,578 kW
• Roadrunner: 1,568 kW
• Redtail: 1,746 kW
• Misc. IT load: 1,257 kW
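The live values above are mutually consistent; a small sketch that cross-checks them (the total facility power is implied by the PUE rather than shown on the slide):

# Live SCC values recorded May 31, 2011 at 8:00 am
it_loads_kw = {"Cielo": 1578, "Roadrunner": 1568, "Redtail": 1746, "Misc. IT": 1257}
pue = 1.337
dcie_pct = 74.81

it_total_kw = sum(it_loads_kw.values())   # 6,149 kW of IT load
facility_total_kw = it_total_kw * pue     # implied total facility power, about 8,220 kW
print(f"IT load:            {it_total_kw} kW")
print(f"Implied total load: {facility_total_kw:,.0f} kW")
print(f"1/PUE = {100.0 / pue:.2f} %, reported DCiE = {dcie_pct} %")   # 74.79 % vs 74.81 %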



Slide-31

Challenges: Exa-Scale and Beyond

•  Power
•  Cooling
•  Weight
•  Environmental


Slide-32

Energy Efficient High Performance Computing Working Group (EE HPC WG)

•  Supported by the DOE Sustainability Performance Office
•  Organized and led by Lawrence Berkeley National Lab
•  Participants from DOE national laboratories, academia, and various federal agencies
•  HPC vendor participation
•  The group selects energy-related topics to develop


Slide-33

Liquid Cooling Sub-committee
Goal: encourage highly efficient liquid cooling through use of high-temperature fluids

•  Eliminate compressor cooling (chillers) at National Laboratory locations
•  Standardize temperature requirements: a common understanding between HPC manufacturers and sites
•  Ensure practicality of recommendations: collaboration with the HPC vendor community to develop attainable recommended limits
•  Industry endorsement of recommended limits: collaboration with ASHRAE to adopt the recommendations in a new thermal guidelines white paper


Slide-34

Direct Liquid Cooling Architectures

[Diagrams: cooling-tower-based and dry-cooler-based direct liquid cooling architectures.]


Slide-35

Summary Table to ASHRAE

Liquid Cooling Class | Main Cooling Equipment    | Supplemental Cooling Equipment | Max. Building-Supplied Cooling Liquid Temperature
L1                   | Cooling tower and chiller | Not needed                     | 17°C (63°F)
L2                   | Cooling tower             | Chiller                        | 32°C (89°F)
L3                   | Dry cooler                | Spray dry cooler or chiller    | 43°C (110°F)
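Read row by row, the table maps a liquid cooling class to the cooling plant it requires and the warmest building-supplied liquid it accepts. A small illustrative lookup (the function and data structure are hypothetical; the values come from the table above):

# Liquid cooling classes summarized for ASHRAE:
# class -> (max supply temperature in degC, main plant, supplemental plant)
LIQUID_COOLING_CLASSES = {
    "L1": (17, "Cooling tower and chiller", "Not needed"),
    "L2": (32, "Cooling tower", "Chiller"),
    "L3": (43, "Dry cooler", "Spray dry cooler or chiller"),
}

def plant_for_class(cooling_class):
    # Return the plant requirements and temperature limit for a given liquid cooling class
    max_temp_c, main, supplemental = LIQUID_COOLING_CLASSES[cooling_class]
    return {"max_supply_temp_c": max_temp_c, "main": main, "supplemental": supplemental}

print(plant_for_class("L2"))
# {'max_supply_temp_c': 32, 'main': 'Cooling tower', 'supplemental': 'Chiller'}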


Slide-36

Thank You Questions?

LA-UR 11-03816