Operated by Los Alamos National Security, LLC for DOE/NNSA | UNCLASSIFIED
High Performance Computing (HPC) Division
LAUR-09-04524
Slide-1
LANL High Performance Computing Facilities Operations
Rick Rivera and Farhad Banisadr
Facility Data Center Management, High Performance Computing Division
Slide-2
Outline
• Overview - LANL Computer Data Centers
• Power and Cooling
• CFD Data Center Modeling
• Data Center Power Usage and Efficiency
• Exa-Scale Challenges
Slide-3
High Performance Computing Division Mission
Slide-4
World-class facilities to support high performance computing systems
Nicholas C. Metropolis Center for Modeling and Simulation (SCC), 2002
- 43,500 sq ft computer room floor
- 9.6 MW Rotary UPS power
- 9.6 MW commercial power
- 6,000 tons chilled water cooling

Laboratory Data Communications Complex (LDCC), 1989
- 20,000 sq ft computer room
- 8 MW commercial power
- 3,150 tons chilled water cooling
Slide-5
High Performance Computing Systems

Secure classified computers (SCC):
• Redtail (IBM): 64.6 teraflop/s
• Roadrunner (IBM Cell): 1,376 teraflop/s
• Hurricane (Appro): 80 teraflop/s
• Typhoon (Appro): 100 teraflop/s
• Cielo (Cray XE6): 1,370 teraflop/s

Open unclassified computers (LDCC):
• Yellowrail (IBM): 4.9 teraflop/s
• Cerrillos (IBM): 160 teraflop/s
• Turing (Appro): 22 teraflop/s
• Conejo (SGI): 52 teraflop/s
• Mapache (SGI): 50 teraflop/s
• Mustang (Appro): 353 teraflop/s

Rack power densities: Roadrunner 15 kW/rack; Cielo 54 kW/rack
Slide-7
Electrical Power Distribution: Metropolis Center (SCC), Building-Specific

SCC features:
• Multiple 13.2 kV feeders
• Double-ended substations
• Rotary uninterruptible power supply (RUPS)
• Scalability
• Power conditioning: transient voltage surge suppressor (TVSS)
(Diagram: substation and switchgear feeding RUPS and PDU power distribution to the supercomputing systems.)
Slide-8
Metropolis Center (SCC) Infrastructure Upgrade Project
• Split substations (19.2 MW)
• Added 1,200-ton chiller
• Added cooling tower
• 14 additional air handling units (AHUs)
• Water cooling (9 MW load)
Slide-9
Air Systems
Cold air generated by the air handling units is forced up through perforated tiles above the computer room subfloor. The air passes through the supercomputers and removes their heat, then is vented through the ceiling tiles and return plenums to complete the path back to the air handling units below.
As heat densities associated with supercomputers increase, modeling becomes critical. This graphic shows a computer room air temperature profile.
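To illustrate why rising densities strain air systems, here is a rough, back-of-the-envelope Python sketch of the airflow needed to carry away the heat of a single 54 kW rack (Cielo's density, from the earlier slide), assuming roughly a 20°F air temperature rise across the rack and typical air properties. The numbers are illustrative, not measured values.

```python
# Rough estimate of chilled-air flow needed to remove one rack's heat.
# Assumptions (illustrative): 54 kW rack, ~20 F (11 K) air temperature rise,
# air density 1.2 kg/m^3, specific heat 1006 J/(kg*K).
rack_heat_w = 54_000.0
delta_t_k = 11.0
air_density = 1.2          # kg/m^3
air_cp = 1006.0            # J/(kg*K)

mass_flow = rack_heat_w / (air_cp * delta_t_k)   # kg/s of air through the rack
volume_flow = mass_flow / air_density            # m^3/s
cfm = volume_flow * 2118.88                      # cubic feet per minute

print(f"~{cfm:,.0f} CFM of chilled air for a single {rack_heat_w/1000:.0f} kW rack")
```

At tens of racks per row, flows of this size are what make CFD modeling of the underfloor plenum and return path worthwhile.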
Slide-10
Supercomputers require continuous and significant cooling
Heat Flow: this diagram traces the path of heat flow through the different cooling plant components, fluids, and systems.
Slide-12
LANL/Cielo Update
Slide-15
Comparative Energy Costs
Slide-16
Power Usage Effectiveness (PUE)
• Total Facility Power (kW)
• IT Equipment Power (kW)
Slide-17
Calculating PUE
• What to measure and how often?
• Importance of consistency (see the sketch below)
• How to measure
  – Manual readings of BAS, UPS, and PDUs
  – Instrumentation
  – Real-time measurement
    • Wireless meters and sensors
    • Branch circuit monitoring
    • Power usage software
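As a simple illustration of the consistency point above, the Python sketch below averages facility-side and IT-side meter readings over the same interval before forming the ratio. The meter names and readings are hypothetical placeholders, not LANL's actual instrumentation.

```python
# Minimal sketch: compute PUE from periodic meter readings, averaging the
# facility-side and IT-side samples over the same interval for consistency.
# The readings below are hypothetical placeholders, not actual LANL data.
from statistics import mean

facility_kw = [8210.0, 8195.0, 8230.0]   # e.g. substation / BAS readings
it_kw = [6140.0, 6150.0, 6145.0]         # e.g. PDU / branch-circuit readings

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT equipment power."""
    return total_facility_kw / it_equipment_kw

print(f"PUE = {pue(mean(facility_kw), mean(it_kw)):.2f}")
```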
Slide-18
SCC Power Usage Effectiveness
(Power-flow diagram)
• Computing power: 7.89 MW; cooling power: 2.78 MW
• IT power path: 13.2 kV power distribution switchgear → substations A, B, C, H (99% efficient) → RUPS (92% efficient) → switchboards (SWB, 100% efficient) → PDUs (96% efficient) → supercomputing load
• Cooling power path: substations DUS-1, D, and E (99% efficient) → cooling towers, chiller plant, and AHUs
• Load breakdown: IT load 67%, chillers 14%, AHU/pumps 18%, misc 1%

SCC computer load:
• Facility power 11.8 MW
• IT power 7.89 MW
• PUE = 11.8 / 7.89 = 1.49

HVAC:
• HVAC power 2.78 MW
• HVAC load 2,206 tons (7.76 MW)
• 1.26 kW/ton
• COP = 7.76 / 2.78 = 2.79 (see the worked check below)
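The PUE and COP figures above can be reproduced directly from the slide's numbers; the short Python check below uses the standard 3.517 kW-per-ton conversion, which is the only value not taken from the slide.

```python
# Worked check of the SCC numbers above.
TON_TO_KW = 3.517                 # 1 ton of refrigeration ~ 3.517 kW of heat removal

facility_power_mw = 11.8          # total facility power (from slide)
it_power_mw = 7.89                # IT power (from slide)
hvac_power_mw = 2.78              # HVAC electrical power (from slide)
hvac_load_tons = 2206             # HVAC cooling load (from slide)

pue = facility_power_mw / it_power_mw                    # ~1.49
hvac_load_mw = hvac_load_tons * TON_TO_KW / 1000.0       # ~7.76 MW of heat removed
kw_per_ton = hvac_power_mw * 1000.0 / hvac_load_tons     # ~1.26 kW/ton
cop = hvac_load_mw / hvac_power_mw                       # ~2.79

print(f"PUE {pue:.2f}, HVAC {kw_per_ton:.2f} kW/ton, COP {cop:.2f}")
```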
Slide-19
HPC current and future efficiency practices to reduce PUE
• Data Center Metering/Measurements
• Variable Speed Drives (VSDs)
• Cold Aisle Containment
• Close-Coupled Liquid Cooling
• Air-Side Economizers
• Water-Side Economizers
Slide-20
Circulation power and cooling fluid choice
• Air cooling
  – Circulation power grows very rapidly
    • 9% is typical for low-power-density data centers using CRAC units
    • Doubling power density could increase this 8-fold
  – Large heat sinks limit rack packing efficiency
• Water cooling
  – About 3,500 times more effective than air at heat removal, with lower pumping cost (see the sketch below)
  – Plumbing costs; earthquake bracing needed
  – Risk of leaks and condensation near computers in humid climates
  – Attractive at high power densities
  – Variant: short air loop to a nearby heat exchanger
• Freon or similar fluids
  – Stable temperature, governed by phase transition
  – Risk of leaks; inert but displaces oxygen
  – More complex plant, additional pumping costs, much variation in design and efficiency
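The figure of roughly 3,500 in the water-cooling bullet follows from the volumetric heat capacities of the two fluids. The sketch below uses approximate room-temperature properties, so the exact ratio it prints depends on the property values assumed.

```python
# Back-of-the-envelope comparison of heat carried per unit volume per degree.
# Property values are approximate (room-temperature water, sea-level air).
water_density, water_cp = 998.0, 4186.0   # kg/m^3, J/(kg*K)
air_density, air_cp = 1.2, 1006.0         # kg/m^3, J/(kg*K)

water_vol_heat = water_density * water_cp   # ~4.2e6 J/(m^3*K)
air_vol_heat = air_density * air_cp         # ~1.2e3 J/(m^3*K)

print(f"Water carries ~{water_vol_heat / air_vol_heat:,.0f}x more heat "
      "per unit volume per kelvin than air")
```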
Slide-21
Multiple Equipment Summary View
Individual Substation View
Slide-22
Power Usage Effectiveness (PUE)
PUE is a measure of how efficiently a computer data center uses its power. It is the ratio of total facility power to IT equipment power.
PUE = Total Facility Power / IT Equipment Power
Facility power: lighting, cooling, etc. IT equipment: computer racks, storage, etc.
NOTE: There was a communication loss on May 7th.
Slide-23
Data Center Infrastructure Efficiency (DCiE)
DCiE is a percentage measure of computer data center efficiency, derived by taking the inverse of the PUE.
DCiE = (IT Equipment Power / Total Facility Power) × 100 = 100 / PUE
NOTE: There was a communication loss on May 7th.
Slide-24
Individual Machine Monitoring and Resulting PUE
(Trend charts: Cielo, Redtail, and Roadrunner power in kW, and the resulting PUE.)
NOTE: There was a communication loss on May 7th.
Slide-25
Metropolis Center Overall Performance, May 2011
(Trend charts: IT load and total load in kW, PUE, and DCiE in %.)
NOTE: There was a communication loss on May 7th.
Slide-26
Identify Direct Correlations
(Overlaid trend charts: Cielo, Redtail, and Roadrunner power, total IT load, and total facility load in kW, plus PUE and DCiE in %; see the correlation sketch below.)
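As a sketch of how the "direct correlation" question can be checked numerically, the snippet below computes a Pearson correlation between one machine's power draw and total facility load. The time series are hypothetical placeholders standing in for the metered data.

```python
# Minimal sketch: correlate one machine's power with total facility load.
# The series below are hypothetical placeholders, not measured LANL data.
import numpy as np

cielo_kw = np.array([1560.0, 1575.0, 1590.0, 1580.0, 1570.0])
total_facility_kw = np.array([8150.0, 8190.0, 8245.0, 8215.0, 8180.0])

r = np.corrcoef(cielo_kw, total_facility_kw)[0, 1]   # Pearson correlation coefficient
print(f"Cielo power vs. total facility load: r = {r:.2f}")
```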
Slide-27
Wireless Network Topology for Thermal Monitoring
Slide-28
Ceiling Temperatures, Humidity, and Under Floor Air Pressure
(Sensor readings: temperatures of 103°F and 65°F, underfloor static pressure of 0.5 in. w.c., and humidity of 53% and 49%.)
Slide-29
Wireless Information Summary from Cielo
(Trend charts: exhaust, ceiling, supply, and return temperatures in °F, and power in kW.)
Slide-30
SCC Data Center Efficiencies (live values recorded May 31, 2011 at 8:00 am)
• PUE (Power Usage Effectiveness): 1.337
• DCiE (Data Center Infrastructure Efficiency): 74.81%
• Benchmarks (PUE/DCiE): 2.0/50% = average; 1.5/67% = efficient; 1.2/83% = very efficient
• IT loads: Cielo 1,578 kW; Roadrunner 1,568 kW; Redtail 1,746 kW; misc. IT load 1,257 kW (see the check below)
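A small Python check of the snapshot above: summing the listed IT loads and converting the reported PUE to DCiE. The "implied facility load" line is derived rather than read from the slide, since total facility power is not shown.

```python
# Worked check of the May 31, 2011 snapshot (values from the slide above).
it_loads_kw = {"Cielo": 1578, "Roadrunner": 1568, "Redtail": 1746, "Misc IT": 1257}

total_it_kw = sum(it_loads_kw.values())          # 6,149 kW of IT load
pue = 1.337                                      # reported PUE
dcie_pct = 100.0 / pue                           # ~74.8 %, matching the reported 74.81 %
implied_facility_kw = total_it_kw * pue          # derived, not shown on the slide

print(f"IT load {total_it_kw} kW, DCiE {dcie_pct:.1f} %, "
      f"implied facility load ~{implied_facility_kw:,.0f} kW")
```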
Slide-31
Challenges: Exa-Scale and Beyond
• Power
• Cooling
• Weight
• Environmental
Slide-32
Energy Efficient High Performance Computing Working Group (EE HPC WG)
• Supported by the DOE Sustainability Performance Office
• Organized and led by Lawrence Berkeley National Lab
• Participants from DOE national laboratories, academia, and various federal agencies
• HPC vendor participation
• The group selects energy-related topics to develop
Slide-33
Liquid Cooling Subcommittee
Goal: Encourage highly efficient liquid cooling through the use of high-temperature fluids
• Eliminate compressor cooling (chillers) at National Laboratory locations
• Standardize temperature requirements: a common understanding between HPC manufacturers and sites
• Ensure practicality of recommendations - Collaboration with HPC vendor community to develop attainable recommended limits
• Industry endorsement of recommended limits - Collaboration with ASHRAE to adopt recommendations in new thermal guidelines white paper
Slide-34
Direct Liquid Cooling Architectures
(Diagrams: cooling-tower-based and dry-cooler-based direct liquid cooling architectures.)
Slide-35
Summary Table to ASHRAE
Liquid Cooling Class | Main Cooling Equipment    | Supplemental Cooling Equipment | Building Supplied Cooling Liquid Maximum Temperature
L1                   | Cooling tower and chiller | Not needed                     | 17°C (63°F)
L2                   | Cooling tower             | Chiller                        | 32°C (89°F)
L3                   | Dry cooler                | Spray dry cooler or chiller    | 43°C (110°F)
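The table lends itself to a simple lookup. The sketch below encodes the three classes and a helper (the function name is illustrative, not part of any ASHRAE guideline) that reports which classes can accept a given building-supplied liquid temperature.

```python
# Encoding of the liquid-cooling class summary table above.
# Each entry: (main cooling equipment, supplemental equipment, max supply temp in C).
LIQUID_COOLING_CLASSES = {
    "L1": ("Cooling tower and chiller", "Not needed", 17),
    "L2": ("Cooling tower", "Chiller", 32),
    "L3": ("Dry cooler", "Spray dry cooler or chiller", 43),
}

def classes_accepting(supply_temp_c: float) -> list:
    """Classes whose maximum building-supplied liquid temperature covers supply_temp_c."""
    return [name for name, (_, _, max_c) in LIQUID_COOLING_CLASSES.items()
            if supply_temp_c <= max_c]

print(classes_accepting(30.0))   # -> ['L2', 'L3']
```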