Scalable Systems Software for Terascale Computer Centers www.scidac.org/ScalableSystems Coordinator: Al Geist Participating Organizations ORNL ANL LBNL PNNL PSC SDSC IBM Compaq SNL LANL Ames NCSA SGI Scyld Intel Unlimited Scale


Page 1

Scalable Systems Software for Terascale Computer Centers

www.scidac.org/ScalableSystems

Coordinator: Al Geist

Participating Organizations

ORNL, ANL, LBNL, PNNL

PSC, SDSC, IBM, Compaq

SNL, LANL, Ames, NCSA

SGI, Scyld, Intel, Unlimited Scale

Page 2

The Problem Today

System administrators and managers of terascale computer centers are facing a crisis:

• Computer centers use an incompatible, ad hoc set of systems tools

• Present tools are not designed to scale to multi-teraflop systems

• Commercial solutions are not emerging because business forces drive industry toward servers, not HPC

Page 3

Scope of the Effort

Component areas shown in the diagram:

• Resource & Queue Management

• Accounting & user mgmt

• System Build & Configure

• Job management

• System Monitoring

• Security

• Allocation management

• Fault Tolerance

• Checkpoint/restart

Steps shown in the diagram:

• Allocation management

• Submit jobs to batch queue

• Start parallel processes

• Job monitoring

• Checkpoint/restart

Page 4

Goals


Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability. The specification will proceed through a series of open meetings following a format similar to that used by the MPI Forum.

Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.

Research and develop more advanced versions of the components required to support the scalability, fault tolerance, and performance requirements of large science applications.

Carry out a software lifecycle plan for support and maintenance of the systems software suite.

Page 5

Impact


Fundamentally change the way future high-end systems software is developed and distributed

Reduced facility management costs

• reduce need to support ad hoc software

• better systems tools available

• able to get machines up and running faster and keep them running

More effective use of machines by scientific applications

• scalable launch of jobs and checkpoint/restart

• job monitoring and management tools

• allocation management interface

Page 6

Four Working Groups to interact with


1. Node build, configuration, and information service

2. Resource management, scheduling, and allocation

3. Process management, system monitoring, and checkpointing

4. Validation and Integration

Electronic Notebooks keep WG on track

A main notebook for general information & mtg notes, and individual notebooks for each working group

• Allows groups to keep track of other groups' progress and comment on the items of overlap

• Allows Center members and interested parties to see what is being defined and implemented

Page 7

Interactions

Principal customers are sysadmins and supercomputer managers

CCA looks to Scalable Systems to provide services to launch parallel components on large systems and provide event services for fault detection and monitoring.

DOE Science Grid will be involved with Scalable Systems through its integration of Grid tools with the monitoring and resource management services layer of the systems software.

Applications using the terascale SciDAC resources, such as climate, accelerator design, and astrophysics, will use the job submission, job monitoring, user-assisted checkpointing, and allocation tools developed by the Center.

Other organizations and vendors are participating in the Scalable Systems effort even though they are not funded by SciDAC.

