Larry Marx and the Project Athena Team




Page 1

Larry Marx and the Project Athena Team

Page 2

Outline
- Project Athena Resources
- Models and Machine Usage
- Experiments
- Running Models
- Initial and Boundary Data Preparation
- Post Processing, Data Selection and Compression
- Data Management

Page 3

Project Athena Resources

Compute:
- Athena: 4,512 nodes @ 4 cores, 2 GB memory; dedicated, Oct '09 – Mar '10; 79 million core-hours (a rough core-hours check follows this list)
- Kraken: 8,256 nodes @ 12 cores, 16 GB memory; shared, Oct '09 – Mar '10; 5 million core-hours
- Verne: 5 nodes @ 32 cores, 128 GB memory; dedicated, Oct '09 – Mar '10; post-processing

Storage:
- scratch: 78 TB (Lustre)
- homes: 8 TB (NFS)
- nakji: 360 TB (Lustre)
- HPSS tape archive: 800+ TB
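As a rough sanity check on the 79 million core-hour figure, a minimal sketch; the ~182-day round-the-clock dedicated window is an assumption inferred from the dates above, not stated on the slide.

```python
# Rough check of Athena's dedicated allocation.
# Assumption: ~182 days (1 Oct 2009 - 31 Mar 2010) of full-machine access.
ATHENA_CORES = 4_512 * 4      # 18,048 cores
DEDICATED_DAYS = 182

core_hours = ATHENA_CORES * DEDICATED_DAYS * 24
print(f"{core_hours / 1e6:.1f} million core-hours")   # ~78.8 million, i.e. ~79M
```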

Page 4

Models and Machine Usage

NICAM was initially the primary focus of implementation:
- Limited flexibility in scaling, due to the icosahedral grid
- Limited testing on multicore/cache processor architectures; production primarily on the vector-parallel (NEC SX) Earth Simulator
- Step 1: Port a low-resolution version with simple physics to Athena
- Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used. The unique solution: g-level = 10, i.e. 10,485,762 cells (7-km spacing), using exactly 2,560 cores (see the sketch after this list)
- Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by University of Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble. However, the 2003 case could never be stabilized and was abandoned.
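To make the g-level arithmetic concrete, a minimal sketch (not from the presentation) of how the cell count and approximate spacing follow from the icosahedral refinement level; the inputs are the standard NICAM cell-count formula 10 × 4^g + 2 and an assumed mean Earth radius.

```python
# NICAM icosahedral grid size at a given g-level.
# Cell count: 10 * 4**glevel + 2; spacing estimated as sqrt(area per cell).
import math

EARTH_RADIUS_M = 6.371e6   # assumed mean Earth radius

def nicam_cells(glevel: int) -> int:
    return 10 * 4**glevel + 2

def mean_spacing_km(glevel: int) -> float:
    area_per_cell = 4.0 * math.pi * EARTH_RADIUS_M**2 / nicam_cells(glevel)
    return math.sqrt(area_per_cell) / 1e3

print(nicam_cells(10))                 # 10485762 cells
print(round(mean_spacing_km(10), 1))   # ~7 km spacing
```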

Page 5

Models and Machine Usage (cont'd)
- IFS's flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores.
- We defined one "slot" as 2,560 cores and managed a mix of NICAM and IFS jobs at one job per slot for maximally efficient use of the resource. Having equal-sized slots for both models permits either model to be queued and run in the event of a job failure. Selected jobs were given higher priority so that they continue to run ahead of others.
- Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048, i.e. 99% machine utilization, with 128 processors kept for pre- and post-processing and as spares (to postpone reboots); see the sketch after this list.
- Lower-resolution IFS experiments (T159 and T511) were run on Kraken.
- IFS runs were initially made by COLA. Once the ECMWF SMS workflow management system was installed, runs could be made by either COLA or ECMWF.
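A minimal sketch of the slot-partition arithmetic, with the core counts taken from the slide:

```python
# Partition of Athena's cores into equal "slots", one NICAM or IFS job per slot.
TOTAL_CORES = 4_512 * 4    # 18,048 cores on Athena
SLOT_CORES = 2_560         # one job per slot

slots = TOTAL_CORES // SLOT_CORES    # 7 slots
used = slots * SLOT_CORES            # 17,920 cores running models
spare = TOTAL_CORES - used           # 128 cores for pre/post-processing and spares
print(slots, used, spare, f"{used / TOTAL_CORES:.0%}")   # 7 17920 128 99%
```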

Page 6

Project Athena Experiments

Page 7

Initial and Boundary Data Preparation

IFS:
- Most input data prepared by ECMWF; large files shipped by removable disk.
- Time Slice experiment input data prepared by COLA.

NICAM:
- Initial data from GDAS 1° files, available for all dates.
- Boundary files other than SST included with NICAM.
- SST from the ¼° NCDC OI daily analysis (version 2). Data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations; earlier data do not include AMSR-E.
- All data interpolated to the icosahedral grid (a minimal regridding sketch follows this list).
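For illustration only, a minimal sketch of regridding a regular latitude-longitude field onto icosahedral cell centres by nearest-neighbour lookup; this is not the project's actual interpolation tool, and the SST values and cell-centre coordinates below are synthetic placeholders.

```python
# Nearest-neighbour regridding from a regular 1/4-degree grid to arbitrary cell centres.
import numpy as np
from scipy.spatial import cKDTree

def lonlat_to_xyz(lon_deg, lat_deg):
    """Unit-sphere Cartesian coordinates; chordal distance preserves nearest neighbours."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# Source grid: regular 1/4-degree lat-lon with a fake zonally symmetric SST (kelvin)
src_lon, src_lat = np.meshgrid(np.arange(0.0, 360.0, 0.25),
                               np.arange(-89.875, 90.0, 0.25))
sst_src = 300.0 - 0.5 * np.abs(src_lat)

# Target: placeholder "icosahedral" cell centres (random points stand in for the real grid)
rng = np.random.default_rng(0)
cell_lon = rng.uniform(0.0, 360.0, 10_000)
cell_lat = np.degrees(np.arcsin(rng.uniform(-1.0, 1.0, 10_000)))

tree = cKDTree(lonlat_to_xyz(src_lon.ravel(), src_lat.ravel()))
_, nearest = tree.query(lonlat_to_xyz(cell_lon, cell_lat))
sst_on_cells = sst_src.ravel()[nearest]    # SST sampled at each cell centre
print(sst_on_cells.shape)                  # (10000,)
```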

Page 8

Post Processing, Data Selection and Compression
- All IFS (GRIB-1) data were interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data. All IFS spectral data were truncated to T159 coefficients and transformed to the N80 full grid.
- Key fields at full model resolution were processed, including transforming spectral coefficients to grids and compressing to NetCDF-4 via GrADS (a sketch of an equivalent compression step follows this list). Processing was done on Kraken, because Athena lacks sufficient memory and computing power on each node.
- All the common-comparison and selected high-resolution data were electronically transferred to COLA via bbcp (up to 40 MB/s sustained).
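As an illustration of the kind of NetCDF-4 compression described above, a minimal sketch using the netCDF4-python library rather than the GrADS route the project actually used; the variable name, grid sizes, and deflate settings are placeholders.

```python
# Write a field as a deflate-compressed NetCDF-4 variable (illustrative settings).
import numpy as np
from netCDF4 import Dataset

field = np.random.rand(10, 160, 320).astype("f4")   # fake (time, lat, lon) data

with Dataset("sample_compressed.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)
    nc.createDimension("lat", 160)
    nc.createDimension("lon", 320)
    var = nc.createVariable("t2m", "f4", ("time", "lat", "lon"),
                            zlib=True, complevel=4,       # lossless deflate
                            chunksizes=(1, 160, 320))     # one 2-D slice per chunk
    var.units = "K"
    var[:] = field
```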

Page 9

Post Processing, Data Selection and Compression (cont'd)
- Nearly all (91) NICAM diagnostic variables were saved. Each variable was saved as 2,560 separate files, one per model domain, resulting in over 230,000 files; the number of files quickly saturated the Lustre file system.
- The original program for interpolating data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog.
- Selected 3-D fields were interpolated from z-coordinate to p-coordinate levels (a minimal sketch follows this list).
- Selected 2-D and 3-D fields were compressed (NetCDF-4) and electronically transferred to COLA.
- All selected fields were coarsened to the N80 full grid.
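For the z-to-p step, a minimal illustrative sketch (not the project's code) that interpolates one field column by column, assuming model level 0 is the lowest level and that a 3-D pressure field on the same levels is available.

```python
# Interpolate a field from model height levels to target pressure levels, per column.
import numpy as np

def z_to_p(field_z, p_z, p_targets):
    """
    field_z   : (nz, ny, nx) field on height (z) levels, level 0 at the bottom
    p_z       : (nz, ny, nx) pressure on the same levels, Pa (decreasing with height)
    p_targets : (np,) target pressure levels, Pa
    returns   : (np, ny, nx) field on pressure levels (NaN where out of range)
    """
    nz, ny, nx = field_z.shape
    out = np.full((len(p_targets), ny, nx), np.nan)
    xt = -np.log(np.asarray(p_targets, dtype=float))
    for j in range(ny):
        for i in range(nx):
            x = -np.log(p_z[:, j, i])          # increases with height
            out[:, j, i] = np.interp(xt, x, field_z[:, j, i],
                                     left=np.nan, right=np.nan)
    return out
```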

Page 10

Data Management: NICS
- All data archived to HPSS, approaching 1 PB in total.
- Workflow required complex data movement:
  - All high-resolution model runs done on Athena.
  - Model output stored on scratch or nakji, and all of it copied to tape on HPSS.
  - IFS data interpolation/truncation done directly from retrieved HPSS files.
  - NICAM data processed using Verne and nakji (more capable CPUs and larger memory).

Page 11

Data Management: COLA
- Project Athena was allocated 50 TB (26%) on COLA disk servers.
- Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility.
- A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms, and ensemble members.

Page 12

Data Management: Future
- New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power.
- Some or all of the data will eventually be made publicly available once its long-term disposition is determined. TeraGrid Science Portal? Earth System Grid?

Page 13

Summary
- A large, international team of climate and computer scientists, using dedicated and shared resources, introduces many challenges for production computing, data analysis, and data management.
- The sheer volume and complexity of the data "breaks" everything:
  - Disk capacity
  - File name space
  - Bandwidth connecting systems within NICS
  - HPSS tape capacity
  - Bandwidth to remote sites for collaborating groups
  - Software for analysis and display of results (GrADS modifications)
- COLA overcame these difficulties as they were encountered in 24×7 production mode, preventing the dedicated computer from sitting idle.