Progress on Release, API Discussions, Vote on APIs, and PI Mtg
Al Geist
January 14-15, 2004
Chicago, IL
Coordinator: Al Geist
Participating Organizations
ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM, SGI
SNL, LANL, Ames, NCSA
Cray, Intel, Unlimited Scale
Participating Organizations
Changes
IBM, Cray, Intel, SGI
Scalable Systems Software
Participating Organizations
ORNL, ANL, LBNL, PNNL
NCSA, PSC, SDSC
SNL, LANL, Ames
Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
Resource Management
Accounting & user mgmt
System Build & Configure
Job management
System Monitoring
To learn more visit www.scidac.org/ScalableSystems
Potential Impact of Project
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs
• reduce need to support ad hoc software
• better systems tools available
• able to get machines up and running faster and keep running
More effective use of machines by scientific applications
• scalable launch of jobs and checkpoint/restart
• job monitoring and management tools
• allocation management interface
Scalable Systems Software Suite
First Release at SC2003
[Figure: suite architecture diagram. Component labels: Grid Interfaces; Meta Services (Meta Scheduler, Meta Monitor, Meta Manager); Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint / Restart; Hardware Infrastructure Manager; Validation & Testing; Packaging & Install. Other labels: Standard XML interfaces; authentication; communication. Working components and interfaces are shown in bold.]
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
Scalable Systems Software Center
September 11-12
Washington DC
Review of Last Meeting
Details in Main project notebook
Highlights from Sept. mtg
Rusty Lusk – Using SSS as the production systems software on Chiba City for a number of months now. Use restriction syntax for everything. Got blessing of ANL sysadmin group.
Scott Jackson – Standard Error reporting and codes across components. Discuss dividing up code space in consistent way.
Eric Debenedictus – Issues for peta-scale systems
Redstorm and Bluelight use a mesh rather than a switch, which means that topology is an important consideration for SSS:
• XML attribute to specify topology and I/O resources
• XML attribute to specify data arrangement on disk
• OS functionality hints to help auto placement
Thomas Naughton – SSS deployment using OSCAR
• A release of OSCAR that contains all SSS software
• Roll SSS components into OSCAR packages – RPM format
• Create repository for OSCAR package uploads
Highlights from Sept. mtg (cont.)
Al Geist – Plans for SC2003
Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider
Long Term Strategy –
• Get Computer Centers involved and using suite
• Get vendors to be compliant with APIs
Slides can be found in Main Notebook
Consensus and Voting:
Communication Infrastructure Spec
• Wire protocols – need to add security envelope protocol
• Added service location. Bootstrapped using /etc/sss/
Vote to Accept as spec for
• Wire Protocol definition to get new ones accepted
• Service Directory interface
• Event Manager interface
Second vote: 16 yes 2 abstaining 0 no
Agreement for having common error objects with 3 digit codes and messages. Message is human readable string. Two special ones: 000 success, 999 unknown
Straw vote: 15 no 1 Abs 0
Al suggests these general error classes – success, warning, temp failure, partial failure, failure
People need to come up with counter proposal if they care
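The common error object agreed on above could be sketched in Python roughly as follows. This is an illustration only: the names SSSError and parse_code, and the choice to store the code as a string, are assumptions and not part of any SSS spec. Only the 3 digit code, the human readable message, and the two special codes 000 and 999 come from the notes. The suggested error classes are not mapped to code ranges here because no mapping was agreed.

```python
from dataclasses import dataclass

SUCCESS = "000"  # special code named in the notes
UNKNOWN = "999"  # special code named in the notes


@dataclass(frozen=True)
class SSSError:
    """Hypothetical common error object: 3 digit code plus readable message."""
    code: str
    message: str

    def __post_init__(self):
        # Enforce the "3 digit code" agreement.
        if len(self.code) != 3 or not (self.code.isascii() and self.code.isdigit()):
            raise ValueError(f"error code must be exactly 3 digits: {self.code!r}")

    @property
    def ok(self) -> bool:
        """True only for the special success code 000."""
        return self.code == SUCCESS


def parse_code(raw) -> str:
    """Normalize an int or string to a 3 digit code; anything else becomes 999."""
    text = str(raw).strip()
    if text.isascii() and text.isdigit() and len(text) <= 3:
        return text.zfill(3)
    return UNKNOWN
```

A component could then return SSSError(parse_code(status), "job not found") and let callers test the ok property instead of comparing raw strings.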
Scalable Systems Software Center
September-January
Progress Since Last Meeting
Systems Software Suite Release
Open Source License – Fred asks that we come up with one general text that all organizations can agree on and then he will bless it. DONE
SSS-OSCAR – Packaging done of all components (working around those components with license issues)
First Release – Announced at SC2003. Available from project web site www.scidac.org/ScalableSystems
SC2003 Scalable System Demos and Talks
Rusty – fancy dancing meatball in wxPython
Thomas – SSS-OSCAR working
Will – fancy graphic demonstration of APITest ????
Brett – demonstrate swapping components in SSS architecture
Paul – chkpoint interacting with PM on Chiba
Locations: All Across the show floor
SciDAC booth – Talks by Rusty, Craig
OSCAR BOF on Tuesday 5:00-6:00 mentions SSS-OSCAR
Five Project Notebooks
A main notebook for general information
And individual notebooks for each working group
• Over 297 total pages – 16 added since last meeting
• BC and PM groups need to get specs into their notebooks
• Add Telecom meeting notes even if short
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or at “project notebooks” at bottom of page
Bi-Weekly Working Group Telecoms
Starting back up after Holidays
Resource management, scheduling, and accounting
Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”
Validation and Testing Group
No need for telecoms recently
Process management, system monitoring, and checkpointing
Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910
Node build, configuration, and information service
Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)
Scalable Systems Software Center
January 14-15, 2004
This Meeting
Major Topics this Meeting
Stability of Systems Software Suite – first release is out. Are we ready for a more robust second release?
Large Scale test run – NCSA has dedicated some time tonight to run our suite on their 1250 dual node cluster
Quarterly Report Due – would like to get one to Fred by end of January. Will need text from WG leaders.
Formal API presentations and voting - it is that time in the project when we are finalizing on some APIs.
SciDAC PI Mtg - March 22-24 in Charleston SC. We will need poster(s), talk, and 2 page summary document
Agenda - January 15
8:30 Al Geist – Project Status
9:15 Thomas Naughton – SSS OSCAR software suite release
Working Group Reports
• Progress report on what their group has done
• API Proposals for adoption by the group
• Progress on software suite improvements
9:30 Narayan Desai – Node Build, Configure
10:30 Break
11:00 Will McClendon – Validation and Testing
12:00 Lunch (on own – cafeteria room B)
1:00 Paul Hargrove – Process Management
2:00 Scott Jackson – Resource Management
3:00 Break
3:30 Narayan - Review of "restriction syntax" style of XML
4:00 Rusty - Discussion of restriction syntax for scheduler and queue mgr
4:30 Craig – Brief on big testbed run
5:00 Eric – competitive system to SSS
5:30 Adjourn
Evening: Working groups may want to help with large NCSA test run
Agenda – January 16
8:30 Discussion, proposals, votes
Rusty - Process Manager API (discussion/vote)
Narayan - Node state API (discussion/vote)
Scott – Allocation Manager API (discussion/vote)
Brett – Queue manager API (discussion/vote)
Scott – SSSRMAP interface
Al - Progress report
Al - SciDAC mtg 2 pager, posters, talks
10:30 Break
11:00 Al Geist – Summary
SciDAC PI Mtg March 22-24, Charleston SC
next meeting date: May 13-14
location: Argonne
12:00 meeting ends
Meeting notes
Al presents his slides
Thomas Naughton – SSS deployment using OSCAR
Good –
• RPMs created for all SSS components!
• OSCAR packaging (varying levels)
• SourceForge project supplied central CVS location
Bad –
• not all scripts are created equal – new untested submissions
• Some pain getting SF accounts
• Time constraints forced script hacks
• OSCAR testing framework
Status –
• Tarball available, fairly toxic but builds full working cluster w/ SSS
• Updated OSCAR pkg HowTo
ToDo –
• clean up hacks, integrate remaining SSS components (qbank)
• Add SSS interface to OSCAR itself
Would like to establish release schedule – March 1
Not clear that anyone has downloaded yet
Discussion of how many orgs in our group could shake down the tarball
Group feels it is better to have a few very reliable components than all components
Narayan – node build progress report
Only had a few minor bug fixes
Infrastructure has been reliable for 6 months
Library updates:
• Portability - OSX support, 64-bit tested, Tru64 support
• Thread-safety
• SSL wire-protocol module – soon to be the default protocol in ssslib
Node state manager – reliable
Build System – building vs configuration interface/conflict issues
Hardware infrastructure – model needs refinement WRT topology info
Restriction Syntax augmentations
• New operators added – negations, numeric, regular expression
• Integrated into all python components
Next steps –
• work on new models for hardware infrastructure
• Work on multiple implementations of BCM components
• Performance tuning – for ssslib, event manager, service directory
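The three kinds of operators added to the restriction syntax (negation, numeric, regular expression) can be illustrated with a small Python sketch. The notes do not give the actual XML encoding of these operators, so the operator names ("eq", "ne", "lt", "gt", "re") and the matches function are hypothetical stand-ins, not the ssslib API.

```python
import re


def matches(value, op, operand):
    """Test one attribute value against one restriction (hypothetical operators)."""
    if op == "eq":                      # plain equality
        return value == operand
    if op == "ne":                      # negation
        return value != operand
    if op in ("lt", "gt"):              # numeric comparison
        try:
            left, right = float(value), float(operand)
        except (TypeError, ValueError):
            return False                # non-numeric values never match
        return left < right if op == "lt" else left > right
    if op == "re":                      # regular expression, whole value must match
        return re.fullmatch(operand, str(value)) is not None
    raise ValueError(f"unknown operator: {op}")
```

For example, matches("ccn12", "re", r"ccn\d+") selects node names by pattern, while matches("4", "gt", "2") compares attribute values as numbers even when they arrive as strings.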
Will McClendon – Component Interface testing report
Description of his work for the new folks
SC2003 demo of APItest v.1 in ASCI booth (GUI HTTP interface)
• built on Twisted Framework www.twistedmatrix.com
• Db interfacing, distributed component testing, HTTPD mode
APItest development. Lessons learned.
V.2 new test file formats – collab with Jackson
• separate individual tests from batch grouping
Runs through some examples.
Feedback is encouraged
Hope to get some real test suites going this quarter
Ron Oldfield – introduced
Shows graphical APItest demo that was given at SC2003
Paul Hargrove – Process management report
SSS-OSCAR release
Coming to a point where components have to interact more, e.g. Chkpt
Real deployment/testing on Chiba (ANL), XTORC (ORNL)
Checkpoint manager – progress
• ported to RH9 (hard – Red Hat kernel’s…)
• checkpoint using LAM/MPI
• stand-alone package w/ LAM/MPI for chkpt
• suspend/resume interface working with queue manager
Outstanding issues –
• need to design restart-time interactions
• need to implement a full interface - restriction syntax, event generation, error reporting
• basic ideas on file management
Monitoring progress
• in SSS-OSCAR
• Scalability work – thread pool, internal protocol changes
• fix service directory connections
• write documentation
Process manager (cont)
Rusty Lusk – Process Manager functionality overview
Shows schematic of process management components
Various commands that are in the syntax
Progress – already a stable component, fixed several bugs at SC03
Improved queries and error codes
Future
• INTEGRATION! Stable software makes this possible
• Chiba production use has forced the issue
• Continued development
Scott Jackson – Resource Manager report
Short overview for new attendees
Progress –
• released in SSS-OSCAR: Bamboo, Maui, Gold, Warehouse
• Updated RM web page for new components being available
• Deployed user oriented problem response system
• Created SSSRMAP C-implementation module
• Completed per-component interface documents
Schedule Progress
- Completed chkpt/restart based SSS calls. Blocked until can test with checkpoint guys
- support for dynamic jobs: blocked until support provided in PM and QM. Discussion of the dynamic jobs feature and how/if we should work on it
- resource limit enforcement and tracking: need rusage on process exit. Blocked until support from PM and QM progress
Too much blocking. It seems the RM group lacks coordination with other groups.
Scott Jackson – Resource Manager report (cont)
Initial release of Bamboo and wrote API document
Accounting and allocation
Qbank was an initial solution, replaced by Gold
Gold –
• released under BSD open source license
• packaged as tarball, and initial OSCAR rpm created
• added support for Service Directory registration
• implemented status codes
• implemented instance-level role-based authorization
Gold running on 11 TF cluster at PNNL
GUI improved to include user, project, machine management views
Meta-scheduler –
• added thread support
• improved Silver installation procedure
• testing of (grid level) data staging
Future
- draft of SSSRMAP v3 protocol spec (chunking)
- release alpha versions of Bamboo, Maui, Gold, Warehouse
- complete design spec documents for above components.
Discussion of having two XML syntax styles (functional, object)
Al says he would like to see one common one across the suite, and that he didn’t care which one as long as the whole group could agree.
Rusty brought up a second issue, wire protocol, and having a single library that has all the protocols used by the components in the SSS suite.
Narayan – Restriction Syntax Overview
Command syntax –
• incorporates imperative and database operations
• allows uniform data queries across components
• easy to process
• improves atomicity of operations
Semantics – Examples given. Going across, attributes are ANDed, and multiple lines are ORed.
An issue of uniqueness was brought up and will be taken into consideration by Narayan.
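The AND/OR semantics described above can be sketched in Python. This is a simplified model, assuming each restriction line is reduced to a dictionary of attribute names and required values; the real syntax is XML, and the select function and the node attributes used here are hypothetical.

```python
def select(records, restriction_lines):
    """Return records matching ANY restriction line (lines are ORed).

    A line matches a record only if ALL of its attributes match
    (attributes within one line are ANDed).
    """
    def line_matches(record, line):
        return all(record.get(attr) == want for attr, want in line.items())

    return [
        record for record in records
        if any(line_matches(record, line) for line in restriction_lines)
    ]


# Hypothetical node records, for illustration only.
nodes = [
    {"name": "n1", "state": "up", "arch": "x86"},
    {"name": "n2", "state": "down", "arch": "x86"},
    {"name": "n3", "state": "up", "arch": "ia64"},
]
```

With this data, select(nodes, [{"state": "up", "arch": "x86"}]) returns only n1 because both attributes must match, while select(nodes, [{"state": "down"}, {"arch": "ia64"}]) returns n2 and n3 because either line may match.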
Rusty – Restriction Syntax on Chiba City
David would like to see a paper on the requirements that the Chiba effort required.
Narayan – Hack of quick interfaces for Queue Manager
Restriction Interface has 4 commands (add, del, run, get)
Doesn’t show Scheduler Interface
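As a rough illustration of a four-command restriction interface (add, del, run, get), here is a toy in-memory queue in Python. The notes name the commands but not their semantics, so everything below is an assumption: the ToyQueue class, the job attributes, the "queued"/"running" states, and the idea that del, run, and get each select jobs by attribute restriction. The method is called delete because del is a reserved word in Python.

```python
class ToyQueue:
    """Hypothetical queue manager exposing add, del, run, and get."""

    def __init__(self):
        self.jobs = []
        self._next_id = 1

    def _match(self, restriction):
        # All given attributes must match (ANDed).
        return [j for j in self.jobs
                if all(j.get(k) == v for k, v in restriction.items())]

    def add(self, **attrs):
        """Queue a new job and return its id."""
        job = {**attrs, "jobid": self._next_id, "state": "queued"}
        self._next_id += 1
        self.jobs.append(job)
        return job["jobid"]

    def get(self, **restriction):
        """Return copies of all jobs matching the restriction."""
        return [dict(j) for j in self._match(restriction)]

    def delete(self, **restriction):
        """Remove matching jobs and return how many were removed."""
        doomed = {j["jobid"] for j in self._match(restriction)}
        self.jobs = [j for j in self.jobs if j["jobid"] not in doomed]
        return len(doomed)

    def run(self, **restriction):
        """Start matching queued jobs and return their ids."""
        started = [j for j in self._match(restriction) if j["state"] == "queued"]
        for job in started:
            job["state"] = "running"
        return [j["jobid"] for j in started]
```

Usage might look like q = ToyQueue(); q.add(user="alice", nodes=4); q.run(user="alice"); q.get(state="running"). Selecting by restriction in every command keeps the four operations uniform, which is the property the restriction syntax overview emphasizes.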
Craig – 1280 dual xeon cluster “Titanium” is available this evening to test the scalability of the SSS suite. One node will be used as head node to install our suite and run on the entire cluster.
Could build everything but Bamboo and ssslib due to Xerces
Will begin to be available at 6pm
Eric – A competing package. From his Russian “secret city” trip Oct. 03
Package for distributed calculations, metacomputing, Grid.
System is based on XML, web-based user interface.
Configure, manage, and submit jobs. Challenges: auto load balance.
Late night session on 1280 node testbed
PM ran at 1280, worked at 4000, hung at 6000
Warehouse had a problem at 1280 and took out head node
RM components ran on head node OK until Warehouse crashed it
Rusty – Process Manager Spec for first vote
Presentation and discussion…
Who is responsible for limit enforcement, PM or QM? I.e. must use certain amount of memory, must not execute OS command (in general - things that happen after fork)
Rusty says the question is good and he needs to think about how this may affect the interface.
Other items to think about
- use of wildcard as “to be returned” operator – OK
- Inclusion but don’t show me.
- Dynamic jobs and PM.
- improve readability
Delay vote until we have a written proposal.
How to write spec to describe how XML should be extended to future needs.
Narayan – Node State Manager spec (no written doc so no vote)
Presentation and lots of discussion…
Scott – Allocation Manager spec (has written doc in notebook)
Goes through examples in the document. Discussion.
Switches to discussion of comparison between both XML syntaxes.
Andrew Lusk thinks that a translator could be created for queries (but not for output). Rusty thinks it is a bad idea and feels it is not a problem to have two syntaxes.
David says the translation is good because it could buy time to switch syntax
Andrew and Paul and Craig offer to help build a prototype translator to see how / if it is possible.
Investigate standardization of tokens across the two syntaxes