Scalable Systems Software for Terascale Computer Centers
www.scidac.org/ScalableSystems
Coordinator: Al Geist
Participating Organizations
ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM, Compaq
SNL, LANL, Ames, NCSA
SGI, Scyld, Intel, Unlimited Scale
The Problem Today
System administrators and managers of terascale computer centers are facing a crisis:
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
• Commercial solutions are not emerging because business forces drive industry toward servers, not HPC
Scope of the Effort
• Resource & Queue Management
• Accounting & user management
• System Build & Configure
• Job management
• System Monitoring
• Security
• Allocation management
• Fault Tolerance
• Checkpoint/restart
Figure labels: Allocation management; Submit jobs to batch queue; Start parallel processes; Job monitoring; Checkpoint/restart
Goals
Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability. The specification will proceed through a series of open meetings following a format similar to that used by the MPI forum.
Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.
Research and develop more advanced versions of the components required to support the scalability, fault tolerance, and performance requirements of large science applications.
Carry out a software lifecycle plan for support and maintenance of the systems software suite.
Impact
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs
• reduce need to support ad hoc software
• better systems tools available
• get machines up and running faster and keep them running
More effective use of machines by scientific applications
• scalable launch of jobs and checkpoint/restart
• job monitoring and management tools
• allocation management interface
Four Working Groups to interact with
1. Node build, configuration, and information service
2. Resource management, scheduling, and allocation
3. Process management, system monitoring, and checkpointing
4. Validation and Integration
Electronic Notebooks keep WG on track
A main notebook for general information and meeting notes, and individual notebooks for each working group
• Allows groups to keep track of other groups' progress and comment on items of overlap
• Allows Center members and interested parties to see what is being defined and implemented
Interactions
Principal customers are system administrators and supercomputer center managers.
CCA looks to Scalable Systems to provide services to launch parallel components on large systems and provide event services for fault detection and monitoring.
DOE Science Grid will be involved with Scalable Systems through its integration of Grid tools with the monitoring and resource management services layer of the systems software.
Applications using the terascale SciDAC resources, including climate, accelerator design, and astrophysics, will use the job submission, job monitoring, user-assisted checkpointing, and allocation tools developed by the Center.
Other organizations and vendors are participating in the Scalable Systems effort even though they are not funded by SciDAC.