Larry Marx and the Project Athena Team
Outline
- Project Athena Resources
- Models and Machine Usage
- Experiments
- Running Models
- Initial and Boundary Data Preparation
- Post Processing, Data Selection and Compression
- Data Management

Project Athena Resources
- Athena (dedicated, Oct’09 – Mar’10, 79 million core-hours): 4,512 nodes @ 4 cores, 2 GB memory
- Kraken (shared, Oct’09 – Mar’10, 5 million core-hours): 8,256 nodes @ 12 cores, 16 GB memory
- Verne (dedicated, Oct’09 – Mar’10, post-processing): 5 nodes @ 32 cores, 128 GB memory
- Disk: scratch, 78 TB (Lustre); homes, 8 TB (NFS); nakji, 360 TB (Lustre)
- Tape: 800+ TB HPSS archive

Models and Machine Usage
- NICAM was initially the primary focus of implementation:
  - Limited flexibility in scaling, due to the icosahedral grid
  - Limited testing on multicore/cache processor architectures; production primarily on the vector-parallel (NEC SX) Earth Simulator
- Step 1: Port a low-resolution version with simple physics to Athena
- Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used
  - Unique solution: G-level = 10, i.e. 10,485,762 cells (7-km spacing), using exactly 2,560 cores (see the sketch after this list)
- Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by University of Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble. However, the 2003 case could never be stabilized and was abandoned.

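A short sketch of the grid arithmetic behind that unique solution, assuming the standard recursive division of the icosahedral grid (each g-level quadruples the 10 base diamonds, with 2 polar points added; the grid is split into domains by r-level). The formulas reproduce the numbers on the slide:

    # Icosahedral grid arithmetic for NICAM's glevel/rlevel scheme.
    def cells(glevel: int) -> int:
        # 10 base diamonds, each split 4x per g-level, plus 2 polar points
        return 10 * 4 ** glevel + 2

    def regions(rlevel: int) -> int:
        # number of horizontal domains the grid is divided into
        return 10 * 4 ** rlevel

    assert cells(10) == 10_485_762   # G-level 10: ~7-km spacing
    assert regions(4) == 2_560       # one domain per core -> exactly 2,560 cores
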
Models and Machine Usage (cont’d)
- IFS’s flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores
- We defined one “slot” as 2,560 cores and managed a mix of NICAM and IFS jobs at 1 job per slot, for maximally efficient use of the resource
- Having equal-size slots for both models permits either model to be queued and run in the event of a job failure; selected jobs were given higher priority so that they continue to run ahead of others
- Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048, for 99% machine utilization; the remaining 128 processors were used for pre- and post-processing and as spares (to postpone reboots) (see the arithmetic after this list)
- Lower-resolution IFS experiments (T159 and T511) were run on Kraken
- IFS runs were initially made by COLA; once the ECMWF SMS (Supervisor Monitor Scheduler) suite management system was installed, runs could be made by either COLA or ECMWF

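The slot partition is easy to verify; a minimal sketch of the arithmetic, using only the numbers from the slide:

    # Athena "slot" partition arithmetic.
    nodes, cores_per_node = 4512, 4
    total = nodes * cores_per_node       # 18,048 cores
    slot = 2560                          # one NICAM or IFS job per slot
    slots = total // slot                # 7 slots
    used = slots * slot                  # 17,920 cores
    spare = total - used                 # 128 cores for pre/post-processing and spares
    print(f"{slots} slots, {used}/{total} cores = {used / total:.1%} utilization")
    # -> 7 slots, 17920/18048 cores = 99.3% utilization
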
Project Athena Experiments

Initial and Boundary Data Preparation
- IFS:
  - Most input data prepared by ECMWF; large files shipped by removable disk
  - Time Slice experiment input data prepared by COLA
- NICAM:
  - Initial data from GDAS 1° files, available for all dates
  - Boundary files other than SST included with NICAM
  - SST from the ¼° NCDC OI daily analysis (version 2); data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations, while earlier data do not include AMSR-E
  - All data interpolated to the icosahedral grid (a sketch follows)

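The slides do not say how this interpolation was implemented; purely as an illustration, a lat-lon field such as the ¼° OI SST can be moved onto icosahedral cell centers by nearest-neighbor lookup on unit-sphere coordinates (function names here are hypothetical):

    # Nearest-neighbor regridding from a lat-lon grid to icosahedral cell
    # centers, via a k-d tree on 3-D unit-sphere coordinates.
    import numpy as np
    from scipy.spatial import cKDTree

    def to_xyz(lat_deg, lon_deg):
        """1-D arrays of degrees -> (n, 3) unit-sphere Cartesian coordinates."""
        lat, lon = np.radians(lat_deg), np.radians(lon_deg)
        return np.column_stack((np.cos(lat) * np.cos(lon),
                                np.cos(lat) * np.sin(lon),
                                np.sin(lat)))

    def regrid_nearest(src_lat, src_lon, src_vals, cell_lat, cell_lon):
        """Assign each icosahedral cell center the value of its nearest source point."""
        tree = cKDTree(to_xyz(src_lat, src_lon))
        _, idx = tree.query(to_xyz(cell_lat, cell_lon))
        return src_vals[idx]
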
Post Processing, Data Selection and Compression
- All IFS (GRIB-1) data interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data
- All IFS spectral data truncated to T159 coefficients and transformed to the N80 full grid
- Key fields at full model resolution were processed, including transforming spectral coefficients to grids and compressing to NetCDF-4 via GrADS (illustrated below)
- Processing was done on Kraken, because Athena lacks sufficient memory and computing power per node
- All the common-comparison data and selected high-resolution data were electronically transferred to COLA via bbcp (up to 40 MB/s sustained)

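The project used GrADS for the NetCDF-4 conversion; as an illustration of the same idea, here is deflate-compressed NetCDF-4 output written with the netCDF4 Python library (variable names and sizes are hypothetical):

    # Writing a field as chunked, deflate-compressed NetCDF-4.
    import numpy as np
    from netCDF4 import Dataset

    field = np.random.rand(91, 160, 320).astype("f4")  # hypothetical (lev, lat, lon)

    with Dataset("field.nc4", "w", format="NETCDF4") as nc:
        for name, size in zip(("lev", "lat", "lon"), field.shape):
            nc.createDimension(name, size)
        var = nc.createVariable("t", "f4", ("lev", "lat", "lon"),
                                zlib=True, complevel=4,    # NetCDF-4 deflate
                                chunksizes=(1, 160, 320))  # one chunk per level
        var[:] = field
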
Post Processing, Data Selection and Compression (cont’d)
- Nearly all (91) NICAM diagnostic variables were saved, each variable in 2,560 separate files (one per model domain), resulting in over 230,000 files; the sheer number of files quickly saturated the Lustre file system
- The original program to interpolate data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog
- Selected 3-d fields were interpolated from z-coordinate to p-coordinate levels (sketched below)
- Selected 2-d and 3-d fields were compressed (NetCDF-4) and electronically transferred to COLA
- All selected fields were coarsened to the N80 full grid

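The z-to-p step is a column-by-column vertical interpolation; a minimal sketch, assuming fields ordered surface-to-top and log-pressure as the interpolation coordinate (all names hypothetical):

    # Interpolate a field from model height (z) levels to pressure (p) levels.
    import numpy as np

    def z_to_p(field_z, p_at_z, p_out):
        """
        field_z: (nz, npts) field on z-levels, ordered surface -> top
        p_at_z:  (nz, npts) pressure (Pa) at each z-level and point
        p_out:   (nlev,)    target pressure levels (Pa)
        returns: (nlev, npts) field on p-levels
        """
        out = np.empty((p_out.size, field_z.shape[1]))
        for i in range(field_z.shape[1]):
            # np.interp needs an increasing x-axis; -log(p) increases with height
            out[:, i] = np.interp(-np.log(p_out), -np.log(p_at_z[:, i]),
                                  field_z[:, i])
        return out
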
Data Management: NICS
- All data archived to HPSS, approaching 1 PB
- The workflow required complex data movement:
  - All model runs at high resolution were done on Athena
  - Model output was stored on scratch or nakji, and all of it was copied to tape on HPSS
  - IFS data interpolation/truncation was done directly from retrieved HPSS files
  - NICAM data were processed using Verne and nakji (more capable CPUs and larger memory)

Data Management: COLA
- Project Athena was allocated 50 TB (26%) on COLA disk servers
- Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility
- A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms, and ensemble members

Data Management: Future
- New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power
- Some or all of the data will eventually be made publicly available, once long-term disposition is determined
  - TeraGrid Science Portal??
  - Earth System Grid??

Summary
- A large, international team of climate and computer scientists, using dedicated and shared resources, introduces many challenges for production computing, data analysis, and data management
- The sheer volume and complexity of the data “breaks” everything:
  - Disk capacity
  - File name space
  - Bandwidth connecting systems within NICS
  - HPSS tape capacity
  - Bandwidth to remote sites for collaborating groups
  - Software for analysis and display of results (GrADS modifications)
- COLA overcame these difficulties as they were encountered, in 24×7 production mode, preventing the dedicated computer from sitting idle