Scalable Systems Software for Terascale Computer Centers Coordinator: Al Geist Participating Organizations ORNL ANL LBNL

Scalable Systems Software for Terascale Computer Centers

www.scidac.org/ScalableSystems

Coordinator: Al Geist

Participating Organizations

ORNL, ANL, LBNL, PNNL

PSC, SDSC, IBM, Compaq

SNL, LANL, Ames, NCSA

SGI, Scyld, Intel, Unlimited Scale

The Problem Today

System administrators and managers of terascale computer centers are facing a crisis:

• Computer centers use an incompatible, ad hoc set of systems tools

• Present tools are not designed to scale to multi-teraflop systems

• Commercial solutions are not happening because business forces drive industry towards servers, not HPC

Scope of the Effort

• Resource & queue management

• Accounting & user management

• System build & configure

• Job management

• System monitoring

• Security

• Allocation management

• Fault tolerance

• Checkpoint/restart

• Submit jobs to batch queue

• Start parallel processes

• Job monitoring

Goals


Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability. The specification will proceed through a series of open meetings following a format similar to that used by the MPI Forum.

Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.

Research and develop more advanced versions of the components required to support the scalability, fault tolerance, and performance requirements of large science applications.

Carry out a software lifecycle plan for support and maintenance of the systems software suite.

Impact


Fundamentally change the way future high-end systems software is developed and distributed

Reduced facility management costs

• reduce need to support ad hoc software

• better systems tools available

• able to get machines up and running faster and keep them running

More effective use of machines by scientific applications

• scalable launch of jobs and checkpoint/restart

• job monitoring and management tools

• allocation management interface

Four Working Groups to interact with


1. Node build, configuration, and information service

2. Resource management, scheduling, and allocation

3. Process management, system monitoring, and checkpointing

4. Validation and Integration

• Allows groups to keep track of other groups' progress and comment on the items of overlap

• Allows Center members and interested parties to see what is being defined and implemented

A main notebook for general information and meeting notes, and individual notebooks for each working group

Electronic Notebooks keep WG on track

Interactions

Principal customers are sysadmins and supercomputer managers

CCA looks to Scalable Systems to provide services to launch parallel components on large systems and provide event services for fault detection and monitoring.

DOE Science Grid will be involved with Scalable Systems through its integration of Grid tools with the monitoring and resource management services layer of the systems software.

Applications using the terascale SciDAC resources, including climate, accelerator design, and astrophysics, will use the job submission, job monitoring, user-assisted checkpointing, and allocation tools developed by the Center.

Other organizations and vendors are participating in the Scalable Systems effort even though they are not funded by SciDAC.

