NCAR’s Data-Centric Supercomputing Environment: Yellowstone
December 21, 2011
Anke Kamrath, OSD Director/CISL
anke@ucar.edu
Overview
• Strategy
  – Moving from Process- to Data-Centric Computing
• HPC/Data Architecture
  – What we have today at ML
  – What’s planned for NWSC
• Storage/Data/Networking – Data in Flight
  – WAN and High-Performance LAN
  – Central Filesystem Plans
  – Archival Plans

NWSC Now Open!
Evolving the Scientific Workflow
• Common data movement issues
  – Time-consuming to move data between systems
  – Bandwidth to archive system is insufficient
  – Lack of sufficient disk space
• Need to evolve data management techniques
  – Workflow management systems
  – Standardize metadata information
  – Reduce/eliminate duplication of datasets (e.g., CMIP5)
• User education
  – Effective methods for understanding workflow
  – Effective methods for streamlining workflow
Traditional Workflow
Process-Centric Data Model
Evolving Scientific Workflow
Information-Centric Data Model
Current Resources @ Mesa Lab
BLUEFIRE, LYNX, GLADE
NWSC: Yellowstone Environment
[Architecture diagram, components:]
• Partner sites, XSEDE sites
• Data transfer services
• Science gateways: RDA, ESG
• High-bandwidth, low-latency HPC and I/O networks: FDR InfiniBand and 10Gb Ethernet
• Remote Vis
• 1Gb/10Gb Ethernet (40Gb+ future)
• NCAR HPSS Archive: 100 PB capacity, ~15 PB/yr growth
• Geyser & Caldera: DAV clusters
• GLADE: central disk resource, 11 PB (2012), 16.4 PB (2014)
• Yellowstone: HPC resource, 1.55 PFLOPs peak
NWSC-1 Resources in a Nutshell
• Centralized Filesystems and Data Storage Resource (GLADE)
  – >90 GB/sec aggregate I/O bandwidth, GPFS filesystems
  – 10.9 PetaBytes initially -> 16.4 PetaBytes in 1Q2014
• High Performance Computing Resource (Yellowstone)
  – IBM iDataPlex cluster with Intel Sandy Bridge EP† processors with Advanced Vector Extensions (AVX)
  – 1.552 PetaFLOPs – 29.8 bluefire-equivalents
  – 4,662 nodes – 74,592 cores
  – 149.2 TeraBytes total memory
  – Mellanox FDR InfiniBand full fat-tree interconnect
• Data Analysis and Visualization Resource (Geyser & Caldera)
  – Large-memory system with Intel Westmere EX processors
    • 16 nodes, 640 cores, 16 TeraBytes memory, 16 NVIDIA Kepler GPUs
  – GPU-computation/vis system with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 1 TeraByte memory, 32 NVIDIA Kepler GPUs
  – Knights Corner system with Intel Sandy Bridge EP processors with AVX
    • 16 nodes, 128 SB cores, 992 KC cores, 1 TeraByte memory – Nov ’12 delivery

† “Sandy Bridge EP” is the Intel® Xeon® E5-2670 processor
GLADE
• 10.94 PB usable capacity; 16.42 PB usable (1Q2014)
• Estimated initial file system sizes
  – collections ≈ 2 PB (RDA, CMIP5 data)
  – scratch ≈ 5 PB (shared, temporary space)
  – projects ≈ 3 PB (long-term, allocated space)
  – users ≈ 1 PB (medium-term work space)
• Disk storage subsystem
  – 76 IBM DCS3700 controllers & expansion drawers
    • 90 2-TB NL-SAS drives/controller
    • add 30 3-TB NL-SAS drives/controller (1Q2014)
• GPFS NSD servers
  – 91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
• I/O aggregator servers (GPFS, GLADE-HPSS connectivity)
  – 10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
• High-performance I/O interconnect to HPC & DAV
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth/node
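As a sanity check, the quoted usable capacities are consistent with the drive counts above. A quick sketch (the usable-to-raw ratio is inferred from the 10.94 PB figure, not stated on the slide; it is consistent with RAID-6 8+2 parity overhead):

```python
# Back-of-envelope check of GLADE capacity from drive counts.
CONTROLLERS = 76
RAW_2012 = CONTROLLERS * 90 * 2    # TB: 90 x 2-TB NL-SAS drives per controller
RAW_ADDED = CONTROLLERS * 30 * 3   # TB: 30 x 3-TB drives added per controller, 1Q2014
USABLE_RATIO = 10.94e3 / RAW_2012  # inferred from the quoted 10.94 PB usable

print(f"raw 2012: {RAW_2012/1000:.2f} PB, usable ratio ~{USABLE_RATIO:.2f}")
print(f"implied 2014 usable: {(RAW_2012 + RAW_ADDED) * USABLE_RATIO / 1000:.2f} PB")
# ~16.4 PB, consistent with the slide's 16.42 PB figure
```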
NCAR Disk Storage Capacity Profile
[Chart: total usable centralized filesystem storage (PB), Jan-2010 through Jan-2016, 0–18 PB, broken out by bluefire, GLADE (Mesa Lab), and GLADE (at NWSC)]
Yellowstone: NWSC High-Performance Computing Resource
• Batch Computation
  – 4,662 IBM dx360 M4 nodes – 16 cores, 32 GB memory per node
  – Intel Sandy Bridge EP processors with AVX – 2.6 GHz clock
  – 74,592 cores total – 1.552 PFLOPs peak
  – 149.2 TB total DDR3-1600 memory
  – 29.8 Bluefire equivalents
• High-Performance Interconnect
  – Mellanox FDR InfiniBand full fat-tree
  – 13.6 GB/s bidirectional bandwidth/node
  – <2.5 µs latency (worst case)
  – 31.7 TB/s bisection bandwidth
• Login/Interactive
  – 6 IBM x3650 M4 nodes; Intel Sandy Bridge EP processors with AVX
  – 16 cores & 128 GB memory per node
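The peak figure follows directly from the core count and clock. A quick sketch (the 8 FLOPs/cycle factor is the standard double-precision AVX rate for Sandy Bridge EP: 4 adds + 4 multiplies per cycle):

```python
# Verify the quoted 1.552 PFLOPs peak from the node specs.
CORES = 74_592
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 8  # AVX double precision on Sandy Bridge EP

peak_pflops = CORES * CLOCK_GHZ * FLOPS_PER_CYCLE / 1e6  # GFLOPs -> PFLOPs
print(f"{peak_pflops:.3f} PFLOPs")  # prints 1.552 PFLOPs, matching the slide
```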
NCAR HPC Profile
[Chart: peak PFLOPs at NCAR, Jan-2004 through Jan-2016, 0.0–1.5 PFLOPs. Systems: IBM POWER4/Colony (bluesky); IBM BlueGene/L (frost, plus upgrade); IBM p575/8 (78) POWER5/HPS (bluevista); IBM p575/16 (112) POWER5+/HPS (blueice); IBM Power 575/32 (128) POWER6/DDR-IB (bluefire); Cray XT5m (lynx); IBM iDataPlex/FDR-IB (yellowstone) – 30x Bluefire performance]
Geyser and Caldera: NWSC Data Analysis & Visualization Resource
• Geyser: large-memory system
  – 16 IBM x3850 nodes – Intel Westmere-EX processors
  – 40 cores, 1 TB memory, 1 NVIDIA Kepler Q13H-3 GPU per node
  – Mellanox FDR full fat-tree interconnect
• Caldera: GPU computation/visualization system
  – 16 IBM x360 M4 nodes – Intel Sandy Bridge EP/AVX
  – 16 cores, 64 GB memory, 2 NVIDIA Kepler Q13H-3 GPUs per node
  – Mellanox FDR full fat-tree interconnect
• Knights Corner system (November 2012 delivery)
  – Intel Many Integrated Core (MIC) architecture
  – 16 IBM Knights Corner nodes – 16 Sandy Bridge EP/AVX cores, 64 GB memory, 1 Knights Corner adapter per node
  – Mellanox FDR full fat-tree interconnect
Erebus: Antarctic Mesoscale Prediction System (AMPS)
• IBM iDataPlex compute cluster
  – 84 IBM dx360 M4 nodes; 16 cores, 32 GB memory per node
  – Intel Sandy Bridge EP; 2.6 GHz clock
  – 1,344 cores total – 28 TFLOPs peak
  – Mellanox FDR InfiniBand full fat-tree
  – 0.54 Bluefire equivalents
• Login nodes
  – 2 IBM x3650 M4 nodes
  – 16 cores & 128 GB memory per node
• Dedicated GPFS filesystem
  – 57.6 TB usable disk storage
  – 9.6 GB/sec aggregate I/O bandwidth

Erebus, on Ross Island, is Antarctica’s most famous volcanic peak and one of the largest volcanoes in the world – within the top 20 in total size and reaching a height of 12,450 feet.
Yellowstone Software
• Compilers, libraries, debugger & performance tools
  – Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer) – 50 concurrent users
  – Intel VTune Amplifier XE performance optimizer – 2 concurrent users
  – PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof) – 50 concurrent users
  – PGI CDK GPU version (Fortran, C, C++, pgdbg debugger, pgprof) – DAV systems only, 2 concurrent users
  – PathScale EkoPath (Fortran, C, C++, PathDB debugger) – 20 concurrent users
  – Rogue Wave TotalView debugger – 8,192 floating tokens
  – IBM Parallel Environment (POE), including IBM HPC Toolkit
• System software
  – LSF-HPC batch subsystem / resource manager
    • IBM has purchased Platform Computing, Inc. (developers of LSF-HPC)
  – Red Hat Enterprise Linux (RHEL) Version 6
  – IBM General Parallel File System (GPFS)
  – Mellanox Universal Fabric Manager
  – IBM xCAT cluster administration toolkit
NCAR HPSS Archive Resource
• NWSC
  – Two SL8500 robotic libraries (20k cartridge capacity)
  – 26 T10000C tape drives (240 MB/sec I/O rate each) and T10000C media (5 TB/cartridge, uncompressed) initially; +20 T10000C drives ~Nov 2012
  – >100 PB capacity
  – Current growth rate ~3.8 PB/year
  – Anticipated NWSC growth rate ~15 PB/year
• Mesa Lab
  – Two SL8500 robotic libraries (15k cartridge capacity)
  – Existing data (14.5 PB): 1st & 2nd copies will be ‘oozed’ to new media @ NWSC, beginning 2012
  – New data @ Mesa: disaster-recovery data only
  – T10000B drives & media to be retired
  – No plans to move Mesa Lab SL8500 libraries (more costly to move than to buy new under the AMSTAR subcontract)

Plan to release an “AMSTAR-2” RFP in 1Q2013, with a target of first equipment delivery during 1Q2014, to further augment the NCAR HPSS Archive.
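The AMSTAR-2 timing makes sense against the capacity numbers above. A rough time-to-fill sketch (assumes the anticipated ~15 PB/yr NWSC growth rate holds and ignores compression and future media swaps):

```python
# Rough headroom estimate for the NWSC archive.
CAPACITY_PB = 100        # 20k cartridge slots x 5 TB/cartridge, uncompressed
EXISTING_PB = 14.5       # data migrating from Mesa Lab
GROWTH_PB_PER_YR = 15    # anticipated NWSC growth rate

years_to_fill = (CAPACITY_PB - EXISTING_PB) / GROWTH_PB_PER_YR
print(f"~{years_to_fill:.1f} years of headroom")  # prints ~5.7 years of headroom
```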
Yellowstone Physical Infrastructure
• Total power required: ~2.13 MW
  – Yellowstone ~1.9 MW
  – GLADE 0.134 MW
  – Geyser & Caldera 0.056 MW
  – Erebus (AMPS) 0.087 MW
• Racks by resource
  – Yellowstone: 65 iDataPlex racks (72 nodes per rack); 9 19” racks (9 Mellanox FDR core switches); 1 19” rack (login, service, management nodes)
  – GLADE: 20 NSD server, controller, and storage racks; 1 19” rack (I/O aggregator nodes, management, IB & Ethernet switches)
  – Geyser & Caldera: 1 iDataPlex rack (GPU-Comp & Knights Corner); 2 19” racks (large memory, management, IB switch)
  – Erebus (AMPS): 1 iDataPlex rack; 1 19” rack (login, IB, NSD, disk & management nodes)
Yellowstone Allocations (% of resource)
NCAR’s 29% represents 170 million core-hours per year for Yellowstone alone (compared to less than 10 million per year on Bluefire), plus a similar fraction of the DAV and GLADE resources.
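A cross-check of the 170-million figure (a sketch; the ~90% delivered-hours factor is inferred from the numbers, not stated on the slide):

```python
# NCAR's 29% share of Yellowstone's theoretical core-hours per year.
CORES = 74_592
HOURS_PER_YEAR = 8_760
SHARE = 0.29

theoretical = CORES * HOURS_PER_YEAR * SHARE / 1e6   # million core-hours/yr
print(f"theoretical share: {theoretical:.0f}M core-hours/yr")  # ~189M
print(f"quoted 170M implies ~{170/theoretical:.0%} of wall-clock delivered")
```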
Yellowstone Schedule
[Gantt chart, late Oct 2011 through mid-Oct 2012, in two-week increments: Infrastructure Preparation; Storage & InfiniBand Delivery & Installation; Test Systems Delivery & Installation; Production Systems Delivery; Integration & Checkout; Acceptance Testing; ASD & Early Users; Production Science. Current schedule shown.]
Data in Flight to NWSC
• NWSC networking
• Central filesystem
  – Migrating from GLADE-ML to GLADE-NWSC
• Archive
  – Migrating HPSS data from ML to NWSC
BiSON to NWSC
• Initially three 10G circuits active
  – Two 10G connections back to Mesa Lab for internal traffic
  – One 10G direct to FRGP for general Internet2 / NLR / Research and Education traffic
  – Options for dedicated 10G connections for high-performance computing to other BiSON members
• System is engineered for 40 individual lambdas
  – Each lambda can be a 10G, 40G, or 100G connection
  – Independent lambdas can be sent each direction around the ring (two ADVA shelves at NWSC – one for each direction)
  – With a major upgrade, the system could support 80 lambdas: 100 Gbps × 80 channels × 2 paths = 16 Tbps
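The ring-capacity arithmetic above, spelled out:

```python
# Upper-bound capacity of the BiSON ring after a major upgrade.
LAMBDAS = 80            # channels after upgrade (40 as engineered today)
GBPS_PER_LAMBDA = 100   # fastest supported connection per lambda
PATHS = 2               # independent lambdas each direction around the ring

total_tbps = LAMBDAS * GBPS_PER_LAMBDA * PATHS / 1000
print(f"{total_tbps:.0f} Tbps")  # prints 16 Tbps, as quoted on the slide
```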
High-Performance LAN: Data Center Networking (DCN)
• High-speed data center computing with 1G and 10G client-facing ports for supercomputer, mass storage, and other data center components
• Redundant design (e.g., multiple chassis and separate module connections)
• Future option for 100G interfaces
• Juniper awarded RFP after NSF approval
  – Access switches: QFX3500 series
  – Network core and WAN edge: EX8200 series switch/routers
  – Includes spares for lab/testing purposes
• Juniper training for NETS staff early Jan-2012
• Deploy Juniper equipment late Jan-2012
Moving Data… Ugh
Migrating GLADE Data
• Temporary work spaces (/glade/scratch, /glade/user)
  – No data will automatically be moved to NWSC
• Allocated project spaces (/glade/projxx)
  – New allocations will be made for the NWSC
  – No data will automatically be moved to NWSC
  – Data transfer option so users may move data they need
• Data collections (/glade/data01, /glade/data02)
  – CISL will move data from ML to NWSC
  – Full production capability will need to be maintained during the transition
• Network impact
  – Current storage max performance is 5 GB/s
  – Can sustain ~2 GB/s for reads while under a production load
  – Will move 400 TB in a couple of days; however, we will saturate a 20 Gb/s network link
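The "couple of days" estimate follows from the sustained read rate quoted above. A sketch:

```python
# Transfer-time estimate for moving a 400 TB collection ML -> NWSC.
DATA_TB = 400
READ_GB_PER_S = 2.0   # sustained read rate under production load

seconds = DATA_TB * 1e12 / (READ_GB_PER_S * 1e9)
# 2 GB/s = 16 Gb/s on the wire, most of a 20 Gb/s link
print(f"~{seconds/86_400:.1f} days at {READ_GB_PER_S*8:.0f} Gb/s on the wire")
```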
Migrating Archive
• What migrates?
  – Data: 15 PB and counting…
  – Format: MSS to HPSS
  – Tape: Tape B (1 TB tapes) to Tape C (5 TB tapes)
  – Location: ML to NWSC
• HPSS at NWSC to become primary site in Spring 2012
  – 1-day outage when metadata servers get moved
• ML HPSS will remain as the disaster recovery site
• Data migration will take until early 2014
  – Will throttle migration to avoid overloading the network
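The early-2014 target implies a modest sustained rate. A sketch (the ~2-year window is my reading of "Spring 2012" to "early 2014", and ignores throttling pauses):

```python
# Average sustained rate needed to migrate the archive in the stated window.
DATA_PB = 15
YEARS = 2.0
SECONDS = YEARS * 365 * 86_400

mb_per_s = DATA_PB * 1e15 / SECONDS / 1e6
# roughly the streaming rate of a single T10000C drive (240 MB/s)
print(f"~{mb_per_s:.0f} MB/s sustained")  # prints ~238 MB/s sustained
```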
WHEW… A LOT OF WORK AHEAD!
QUESTIONS?