
Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx


Page 1: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Resource Management Working Group

SSS Quarterly Meeting

November 28, 2001

Dallas, Tx

Page 2: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Resource Management and Accounting Working Group

• Working group scope and components

• Progress made

• Current and future issues

• Next steps

Page 3: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Working Group Scope

The Resource Management Working Group encompasses the areas of resource management, scheduling and accounting.

This working group will focus on the following software components:

• Queue Manager
• Scheduler
• Allocation Manager
• Meta Scheduler

Our charter will also encompass the following capabilities:

• Accounting
• Usage Reports

Page 4: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Phase 1 Milestones

• 6 months: Contribute to checkpoint/restart report with regard to scheduling related aspects

• 12 months: Establish and release initial resource management interface specifications

• 12 months: Establishment of the CVS repository and module structure, agreement on document conventions

• 12 months: Finalized API for system initiated checkpoint/restart of parallel MPI jobs on Linux systems

• 18 months: Release v1.0 of the Center’s resource management system based on existing open source code and the results of the scalability testing.

Page 5: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

High Level Progress

• Establishing high level design covering initial component functionality and required interfaces

• Determining inter-group requirements (GUI, security, IS, process management, etc)

• Preparing existing tools (Maui, Silver, QBank) for use within SSS

• Creating infrastructure within which to develop and test RM deliverables

• Creating infrastructure within which to develop and test intra- and inter-group interfaces

Page 6: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Proposed Component Architecture

[Figure: component architecture diagram. Components shown: Queue Manager, Allocation Manager, Collector, Meta Scheduler, Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service.]

Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Config and Infrastructure.

Page 7: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Component Interaction Diagram
Job submitted to Queue Manager

[Figure: interaction diagram connecting User Interface, Collector, Meta Scheduler, Queue Manager, Allocation Manager, Scheduler and Process Manager, with arrows numbered 1 to 11.]

Page 8: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Component Interaction Trace
Job submitted to Queue Manager

1. A user submits a job to the Queue Manager
2. The Queue Manager does a sanity balance check with the Bank
3. The Queue Manager notifies the Scheduler that a new job has arrived
4. The Scheduler queries node and job status until job can run
5. A bank reservation is made with the Allocation Manager
6. The Scheduler requests the Queue Manager to run the job
7. The Queue Manager passes job control to the Process Manager
8. The Process Manager notifies Queue Manager of job completion
9. The Queue Manager notifies Scheduler of job completion
10. A bank withdrawal is made with the Allocation Manager
11. The user is notified of job completion
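
The eleven-step trace above can be replayed as a small message log. This is only an illustrative sketch, not the working group's design: the function name, the `balance`/`cost` inputs, the rejection path for an insufficient balance, the "StatusSource" receiver in step 4, and the choice of which component sends steps 5, 10 and 11 are all assumptions, since the slides leave those details open.

```python
def run_local_job(balance, cost, polls_until_free):
    """Return (message log, remaining balance) for one job sent to the Queue Manager."""
    log = []

    def send(sender, receiver, action):
        # Each entry is (step number, sender, receiver, action).
        log.append((len(log) + 1, sender, receiver, action))

    send("User", "QueueManager", "submit job")                           # step 1
    send("QueueManager", "AllocationManager", "sanity balance check")    # step 2
    if balance < cost:
        # Assumed behaviour: the slides do not say what happens on failure.
        send("QueueManager", "User", "job rejected: insufficient balance")
        return log, balance
    send("QueueManager", "Scheduler", "new job arrived")                 # step 3
    # Step 4 is a polling loop in the slides; it is collapsed to one entry here.
    send("Scheduler", "StatusSource",
         f"query node and job status ({polls_until_free} polls)")        # step 4
    send("Scheduler", "AllocationManager", "bank reservation")           # step 5 (sender assumed)
    send("Scheduler", "QueueManager", "run job")                         # step 6
    send("QueueManager", "ProcessManager", "pass job control")           # step 7
    send("ProcessManager", "QueueManager", "job complete")               # step 8
    send("QueueManager", "Scheduler", "job complete")                    # step 9
    send("Scheduler", "AllocationManager", "bank withdrawal")            # step 10 (sender assumed)
    balance -= cost
    send("QueueManager", "User", "job complete")                         # step 11 (sender assumed)
    return log, balance


if __name__ == "__main__":
    events, remaining = run_local_job(balance=100, cost=30, polls_until_free=3)
    for step, sender, receiver, action in events:
        print(f"{step:2d}. {sender} -> {receiver}: {action}")
    print("remaining balance:", remaining)
```

Keeping the log as plain tuples makes it easy to compare a prototype's actual message order against the trace.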

Page 9: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Component Interaction Diagram
Job submitted to Meta Scheduler

[Figure: interaction diagram connecting User Interface, Collector, Meta Scheduler, Queue Manager, Allocation Manager, Scheduler and Process Manager, with arrows numbered 1 to 15.]

Page 10: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Component Interaction Trace
Job submitted to Meta Scheduler

1. A user submits a job to the Meta Scheduler
2. The Meta Scheduler contacts Schedulers to determine which systems could run the job the soonest
3. The Schedulers request quotes from Allocation Banks to determine which systems would run the job for the lowest cost
4. A Scheduler reservation is created for the job on the resource providing the best service -- this reservation can be moved or improved upon until the job is staged
5. The job is staged and queued at the system where it is to run
6. The Queue Manager notifies the Scheduler that a new job has arrived
7. The Scheduler queries node and job status until job can run
8. A bank reservation is made with the Allocation Manager
9. The Scheduler requests the Queue Manager to run the job
10. The Queue Manager passes job control to the Process Manager
11. The Process Manager notifies Queue Manager of job completion
12. The Queue Manager notifies Scheduler of job completion
13. A bank withdrawal is made with the Allocation Manager
14. The Scheduler notifies the Meta Scheduler of job completion
15. The user is notified of job completion
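
Steps 2 to 4 of the trace above amount to picking one system from a set of start-time and cost quotes. The slides do not say how "soonest" and "lowest cost" are weighed against each other, so the sketch below simply assumes soonest start wins and cost breaks ties. The `Quote` type, its field names, and the system names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    system: str          # name of the candidate system (hypothetical)
    earliest_start: int  # seconds until the local Scheduler could start the job
    cost: float          # charge quoted by that system's allocation bank


def choose_system(quotes):
    """Pick the quote with the soonest start, breaking ties by lowest cost.

    This ordering is an assumption; a real Meta Scheduler might use a
    weighted policy instead.
    """
    if not quotes:
        raise ValueError("no system can run the job")
    return min(quotes, key=lambda q: (q.earliest_start, q.cost))


if __name__ == "__main__":
    quotes = [
        Quote("alpha", 600, 12.0),
        Quote("beta", 120, 20.0),
        Quote("gamma", 120, 15.0),
    ]
    print(choose_system(quotes).system)
```

Because the trace allows the reservation to be "moved or improved upon until the job is staged", a function like this could be re-run whenever fresh quotes arrive.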

Page 11: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Design/Interface Progress

• Initial high level RMS architecture defined
• Resource management dictionary created defining objects within resource management ‘world’
• Object ‘tokens’ declared for major objects
• Component functional interfaces identified
• Initial XML request/response syntax proposed
• Prototypes being constructed to test communication protocols
• Initial detailed extra-group component requirements document created

Page 12: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Local Scheduler Rationale

Local interfaces with majority of inter and intra RM components

Establish test platform from which interfaces can be tested

Leverage existing capabilities to accelerate SSS development

Establish infrastructure within which scheduling and metascheduling services and capabilities can be developed

Establish ‘driver’ to evaluate other resource management components

Page 13: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Local Scheduler Progress

• Baseline scheduler established (Maui 3.2) for SSS scheduling services integrating production and development capabilities

• Prototype interface enabling XML communication with queue manager, metascheduler, and node manager

• Extended QoS infrastructure integrated
• Extended Job prioritization infrastructure integrated
• Prototype created for object-oriented data access
• Advanced metascheduling interface integrated

Page 14: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Meta Scheduler Progress

• Initial distribution packaging created to allow collaborative development

• Documentation enhanced and extended

• Prototype XML scheduler to metascheduler query interface developed

• Initial fault tolerance framework designed

Page 15: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Queue Manager Design

• Established need for unified queue manager design common to Scheduler and Metascheduler

• Queue manager will interface directly with Process manager

• In process of refining the queue manager tasks

• Queue manager will provide an interface to obtain information about any job regardless of job state including completed jobs (i.e. it will maintain a job information archive)
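
The job information archive described in the last bullet could look roughly like the following. This is a minimal sketch under assumptions: the class name, method names, job identifiers, and state strings are hypothetical and do not come from the queue manager design itself. The one property taken from the slide is that a record stays queryable after the job completes.

```python
class JobArchive:
    """Keeps one record per job and never drops it, whatever its state."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_id, owner):
        if job_id in self._jobs:
            raise ValueError(f"duplicate job id: {job_id}")
        self._jobs[job_id] = {"owner": owner, "state": "queued"}

    def set_state(self, job_id, state):
        # State names are illustrative; completed jobs are kept, not deleted.
        self._jobs[job_id]["state"] = state

    def query(self, job_id):
        """Return a copy of the job record, including for completed jobs."""
        return dict(self._jobs[job_id])


if __name__ == "__main__":
    archive = JobArchive()
    archive.submit("job-1", "alice")
    archive.set_state("job-1", "running")
    archive.set_state("job-1", "completed")
    print(archive.query("job-1"))
```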

Page 16: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Allocation Manager Progress

• QBank placed under revision control

• Java prototype created which sends requests in XML

• Experimenting with protocol frameworks (simple octet-counting, octet-stuffing, SOAP, BEEP)
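
Of the framing options listed above, octet-counting is the simplest to illustrate: each message is preceded by its length, so the receiver knows where the XML document ends without scanning for a terminator (which is what octet-stuffing does instead). The sketch below assumes a decimal length followed by a newline as the header; that wire format and the sample XML element names are hypothetical, not the syntax the group proposed.

```python
def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length in octets and a newline."""
    return str(len(payload)).encode("ascii") + b"\n" + payload


def unframe(stream: bytes):
    """Split a byte stream into payloads; return (payloads, leftover bytes)."""
    payloads = []
    pos = 0
    while True:
        nl = stream.find(b"\n", pos)
        if nl == -1:
            break  # length header not complete yet
        size = int(stream[pos:nl].decode("ascii"))
        start, end = nl + 1, nl + 1 + size
        if end > len(stream):
            break  # payload not complete yet
        payloads.append(stream[start:end])
        pos = end
    return payloads, stream[pos:]


if __name__ == "__main__":
    # Hypothetical messages; newlines inside a payload are harmless
    # because only the counted octets are consumed.
    request = b"<Request action='Query'/>"
    response = b"<Response>\nok\n</Response>"
    messages, rest = unframe(frame(request) + frame(response))
    print(messages, rest)
```

Returning the leftover bytes lets a caller append the next network read and call `unframe` again, which is how a partial message at the end of a buffer would be handled.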

Page 17: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Next Steps (In Progress)

• Software Lifecycle Infrastructure
  – Online intra-RM schedule and dependencies document
  – Detailed extra-RM working group requirements
  – Coordinate creation of component level regression test suite
  – Bug tracking systems activated (used to track internal defects and development plans)
• Interface
  – Produce validating intra-RM XML schema
  – Produce prototype RM components communicating in initial protocol
• Feature Enhancements
  – Contribution to checkpoint/restart report
  – Creation of queue manager prototype

Page 18: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Next Steps (6 Months)

• Usability
  – GUI-server interface, GUI format, security determined and prototypes created
  – Documentation of initial meta job constraints/features and specification language
• Inter-group Collaboration
  – Creation of early scheduler XML implementation for use as RM driver
  – Development of initial dynamic job scheduler-queue manager interface
  – Extension of RM specifications/requirement document
  – Extension of internal component test infrastructure
  – Determination of ‘best practices’ in documentation maintenance
  – Evaluation and adoption of web project management and collaboration tools
  – Creation of prototype queue manager with scheduler/task manager interfaces

Page 19: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Next Steps (6 Months)

• Fault Tolerance
  – Enhance metascheduler to ‘survive’ local daemon failure
  – Enhancement of threaded scheduling interface.
  – Development of threaded metascheduling interface.
• Resource Optimization
  – Development of local optimization features of meta workload
• Feature Enhancements
  – Creation of resource manager extension features.
  – Development of direct metascheduler to queue manager staging roadmap.
• Interfaces
  – Specification of ‘best guess’ security infrastructure and evaluation of impact on system internals and communication protocols

Page 20: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Next Steps (1 year)

• Software Lifecycle Infrastructure
  – Create multi-component regression tests
  – Generate ‘alpha’ package of scheduling, metascheduling, and allocation management packages.
• Interfaces
  – Development of functional XML interfaces for all components
  – Early adoption of security infrastructure
  – Creation of optional information service interfaces
  – Admin and end-user GUI’s proposed to enable use of new functionality
• Inter-group Collaboration
  – Enhanced suspend/resume and checkpoint/restart features with detailed roadmap specified for all remaining suspend/resume and checkpoint restart deliverables

Page 21: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Current Issues

• Should there be an enveloping protocol framework which handles framing (where the XML document begins and ends), authentication, multiplexing, streaming data, etc? (should we look at something like BEEP, or start from scratch and invent something of our own?)

• The queue manager/collector to node/process manager functionality and data interface requires further refinement.

• Queue manager/collector and node/process manager development schedules must be determined and coordinated.

Page 22: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Issues

• Continued effort is required to complete an ‘intra-RM’ XML schema to handle initial RMS interaction needs. A boundary between the internal ‘intra-RM’ schema and the global XML schema needs to be defined.

• Understanding of open source requirements (i.e. can software be included in SSS distribution that requires registration and usage agreements)

Page 23: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Inter-Group Issues

• Need for coordination of resource management system across working groups – so that the pieces all function together properly and no part is overlooked. Need to coordinate schedules for delivery of RMWG-dependent non-RMWG components.

• Early vendor/industry collaborations (We’d better do this while it can still influence our design. Need to talk to decision makers and develop business plans)

Page 24: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Inter-group Issues

• Information service – should we rather be looking for something existing? (e.g. MDS2)

• Need to solidify SSS-wide standards for packaging, revision control, documentation content, format, and packaging, problem tracking, … and establish mechanisms and places to home them.

• Creation of regression and integration test suite (w/ Validation and Testing WG – we need this from an early stage)

Page 25: Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

Conclusions

• Questions…