A View from the Top: Preparing for Review
Al Geist
February 24-25, 2003
Chicago, IL
www.scidac.org/ScalableSystems
Coordinator: Al Geist
Scalable Systems Software

Participating Organizations:
ORNL, ANL, LBNL, PNNL, PSC, SDSC, NCSA, SNL, LANL, Ames, IBM, Cray, Intel, Unlimited Scale

Main web site: www.scidac.org/ScalableSystems
Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems

Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers

Impact
• Reduced facility management costs
• More effective use of machines by scientific applications

[Diagram: Resource Management, Accounting & User Mgmt, System Build & Configure, Job Management, System Monitoring]

To learn more visit www.scidac.org/ScalableSystems
Progress so far on Integrated Suite

Working components and interfaces (shown in bold in the original diagram):
Grid Interfaces, Accounting, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, Meta Services, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint/Restart, Validation & Testing, Hardware Infrastructure Manager

Standard XML interfaces; authentication and communication layers

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.
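The language-neutral integration rests on every component speaking the same XML message format over the common communication layer. A minimal sketch of that idea in Python, building and parsing a registration message for the Service Directory; the <Register> and <Component> element names and attributes are illustrative assumptions, not the actual SSS wire format.

```python
# Hypothetical sketch: a component announces itself to the Service
# Directory by exchanging a small XML message. The <Register> and
# <Component> names are illustrative, not the actual SSS schema.
import xml.etree.ElementTree as ET

def make_register_message(component, host, port):
    """Serialize a registration message as an XML string."""
    root = ET.Element("Register")
    ET.SubElement(root, "Component",
                  name=component, host=host, port=str(port))
    return ET.tostring(root, encoding="unicode")

def parse_register_message(xml_text):
    """Recover (name, host, port) from a registration message."""
    root = ET.fromstring(xml_text)
    c = root.find("Component")
    return c.get("name"), c.get("host"), int(c.get("port"))

msg = make_register_message("process-manager", "node0", 5150)
print(parse_register_message(msg))  # -> ('process-manager', 'node0', 5150)
```

Because the contract is the XML itself, the same message could just as easily be produced by a Perl or C component, which is the point of standardizing on the wire format rather than on a language binding.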
Scalable Systems Software Center
October 10-11, 2002, Houston, TX

Review of Last Meeting

Details in main project notebook

Progress Reports at Oct. mtg

Al Geist – preparation for Supercomputing 2002, booth space, posters, demos

Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider

Demonstrations of prototype components – prep for SC demo
Slides can be found in Main Notebook page 29
[Architecture diagram: Meta Services (Meta Scheduler, Meta Monitor, Meta Manager); Service Directory and Event Manager – these interface to all components; Grid Interfaces; Accounting; Scheduler; User DB; Allocation Management; Usage Reports; Process Manager; System & Job Monitor; Checkpoint/Restart; Job Queue Manager; Node Configuration & Build Manager; File System; User Utilities; High Performance Communication & I/O; Application Environment]
Scalable Systems Software Center
November 2002 – February 2003

Progress Since Last Meeting

SciDAC Booth
SC2002 Systems Posters

Five Project Notebooks filling up
A main notebook for general information
And individual notebooks for each working group
• Over 216 total pages – 20 added since last meeting
• A lot of XML schemas to comment on
• New subscription feature
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on the side bar or on “project notebooks” at the bottom of the page
Weekly Working Group Telecons
Resource management, scheduling, and accounting
Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”
Validation and Testing (hasn’t met since last year)
Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157
Process management, system monitoring, and checkpointing
Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910
Node build, configuration, and information service
Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)
Scalable Systems Software Center
February 24-25, 2003
This Meeting
Agenda – February 24
 8:30  Al Geist – Project status, SciDAC PI mtg, and external project review
 9:00  Matt Sottile – Science Appliance project
Working Group Reports:
 9:30  Scott Jackson – Resource Management
10:30  Break
11:00  Erik Debenedictis – Validation and Testing
12:00  Lunch (on own – walk to cafeteria)
 1:00  Paul Hargrove – Process Management
 2:00  Narayan Desai – Node Build, Configure
 3:00  Break
 3:30  Large-scale run on Chiba – debugging components
 5:00  Open discussion of review report
 5:30  Adjourn – working groups may wish to hack in the evening
Agenda – February 25
 8:30  Discussion, proposals, straw votes
       • Write paper on each component
       • Draft report in main notebook
       • Comments on “restricted interface” XML shown by Rusty
       • External review demo – can we?
10:30  Break
11:00  Al Geist – Summary: PI mtg talk and poster, external review agenda,
       next meeting date (June 5-6 at Argonne), thank our hosts ANL
12:00  Meeting ends
SciDAC PI mtg – all 50 projects

March 10-11, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode

20-minute talk presented by Al
ISICs: Scalable Systems, CCA, PERC, SDM
Poster Presentation
External SciDAC Review mtg

March 12-13, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman. (Rusty)

Four ISIC projects are reviewed separately – Scalable Systems, CCA, PERC, SDM

External review panel (8 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, and Tony Mezzacappa [John Grosh?]

We owe them a review report

Day 1 – Each project gets 1¾ hours to present
Day 2 – Each project gets grilled by the panel for 1½ hrs
External Review mtg Agenda
Wednesday, March 12
 7:45  Welcome, charge to reviewers
 8:15  Plenary session for Common Component Architecture ISIC
10:00  Break
10:15  Plenary session for Scalable Systems Software ISIC
12:00  Reviewer caucus
12:15  Lunch
 1:15  Plenary session for Scientific Data Management ISIC
 3:00  Break
 3:15  Plenary session for Performance Engineering ISIC
 5:00  Reviewer caucus
 5:30  Adjourn
External Review mtg Agenda (cont.)
Thursday, March 13
 8:00  Meetings between reviewers and ISIC members
       A. Common Component Architecture
       B. Scalable Systems Software
 9:45  Break
10:00  Meetings between reviewers and ISIC members
       C. Scientific Data Management
       D. Performance Engineering
11:45  Reviewer caucus / end of ISIC reviews
12:15  Lunch (on your own)
 1:15  Programming Models Review Session I
 3:00  Break
 3:15  Programming Models Review Session II
 5:00  Programming Models reviewer caucus
 5:30  Meeting adjourns
Meeting Notes

Matt – Pink: a 1024-node science appliance
• Provides pseudo-SSI that scales to 1024 nodes; tolerates failure; single point for management
• Reduce boot and install time by 100x; reduce the number of FTEs per number of nodes
• Science Appliance has very little in common with older Linux
• Software is called Clustermatic – LinuxBIOS, BProc, V9fs, Supermon, Panasas or Lustre (parallel file system by someone else)
• Beoboot, asymmetric SSI, private name spaces from Plan 9, BJS (BProc Job Scheduler)
• Other work – ZPL (automatic checkpoint), debuggers (parallel, relative debugging – Guard), port of TotalView, latency-tolerant applications
• Users – SNL/CA, U Penn, Clemson
• What are the overlap opportunities? Each piece can be separated out: Supermon, BProc
• Remy will be sending more material on collaboration soon
Meeting Notes

Scott – RM update
• Diagram of architecture and infrastructure services
• SC02 demo showed which components were working; they used polling, now moving to event-driven components
• Release of initial RM suite from website http://sss.scl.ameslab.gov/software/
  – OpenPBS-sss 2.3.15-1
  – Maui scheduler 3.2.6
  – QBank 2.10.4 (accounting system)
• SSSRMAP protocol using HTTP validated
• Scalability testing performed on all components
• Progress on scheduler, queue manager, and accounting and allocation manager (QBank and Gold prototype)
• Meta-scheduler progress – Globus interface, Gold information service
• Next work: Release 2 of RM interface; implement and test SSSRMAP security authentication (XML digital sigs)
• Discuss need to have SSS wrappers on initial RM suite
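The SSSRMAP work pairs an XML request body carried over HTTP with a signature the receiver verifies before acting. A rough sketch of that sign/verify flow in Python; the HMAC digest and pre-shared key below are illustrative stand-ins only, since real SSSRMAP authentication specifies XML digital signatures.

```python
# Stand-in sketch for SSSRMAP-style authentication: sign the XML
# request body, verify on receipt. Real SSSRMAP uses XML digital
# signatures; HMAC and a pre-shared key are substitutes here.
import base64
import hashlib
import hmac

SHARED_KEY = b"example-key"  # hypothetical pre-shared secret

def sign_request(xml_body):
    """Return a base64 keyed digest of the request body."""
    digest = hmac.new(SHARED_KEY, xml_body.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def verify_request(xml_body, signature):
    """Check the signature before processing the request."""
    return hmac.compare_digest(sign_request(xml_body), signature)

request = '<Request action="Query"><Object>Job</Object></Request>'
sig = sign_request(request)
print(verify_request(request, sig))        # -> True
print(verify_request(request + " ", sig))  # -> False (tampered body)
```

The design point carries over regardless of the digest mechanism: any change to the request body invalidates the signature, so the allocation manager can reject tampered or replayed accounting requests.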
Meeting Notes

Will – Validation and Testing update
• Users expect a high degree of quality in today’s HPC
• Strategies:
  – QMTest – the RM group is using it (www.codesourcery.com); they like it: “easy”
  – App test packages
  – APITEST – growing out of the October discussion; C++ driven, XML-schema scriptable tests of network components
    · Blackbox testing: TCP, ssslib, Portals support, fault injection
    · Whitebox testing: try to exercise all paths in a known suite
    · v0.1a underway, 75% done
• Discussion of how this could be useful to Scalable Systems
• Cluster Integration Toolkit (CIT) – James Laros [email protected]
  – Management tasks on Cplant – scalable to 1800 nodes, done in Perl
  – Creating a Scalable Systems interface to CIT would be a good test of the flexibility of the standard
  – USI, IBM, and Linux Networx looking at it
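The blackbox mode described for APITEST drives a network component from an XML test script. A toy illustration of that pattern in Python; the tag names and one-exchange script are invented for the example, and the real APITEST schema is richer and C++ driven.

```python
# Toy version of XML-scripted blackbox testing: an invented, minimal
# script format that sends one payload and expects one reply. Not the
# actual APITEST schema.
import socket
import xml.etree.ElementTree as ET

SCRIPT = """
<test name="echo-roundtrip">
  <send>ping</send>
  <expect>ping</expect>
</test>
"""

def run_test(xml_script, sock):
    """Run a single send/expect exchange against a connected socket."""
    t = ET.fromstring(xml_script)
    sock.sendall(t.findtext("send").encode())
    reply = sock.recv(64).decode()
    return reply == t.findtext("expect")
```

Against a loopback echo peer, run_test(SCRIPT, sock) returns True when the component answers as scripted; in this model, fault injection amounts to pointing the same script at a peer that deliberately corrupts or drops replies.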
Meeting Notes

Paul – Process management report: moving beyond prototypes
• Checkpoint manager: beta code, April release awaiting legal OK; will do scalability test today; working on XML interface for checkpoint/restart (draft in May)

Mike – Monitoring – job, system, node, and meta-version
• What data is needed – an extensible framework; defined stream and single-item forms; working on scalability now

Rusty – Process Manager
• Schematic of the PM component
• MPD-2 in Python, distributed with MPICH-2 – supports separate executables, arguments, and environment variables
• New XML for PM (with queries that allow wildcards and ranges)
• The combination of published interfaces, XML, and a communication library gives us a power greater than the sum of its parts
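Queries with wildcards and ranges, as in the new PM XML, require the server to expand compact node expressions into concrete node names. A small sketch of that expansion step in Python; the bracket grammar and helper are assumptions for illustration, not the PM's actual syntax.

```python
# Illustrative helper for range queries: expand a compact node
# expression like "ccn[1-3,7]" into individual node names. The
# bracket grammar here is an assumption, not the PM's actual syntax.
import re

def expand_node_range(expr):
    """Expand 'prefix[a-b,c]' into a list of individual node names."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", expr)
    if not m:
        return [expr]  # plain node name, nothing to expand
    prefix, spec = m.groups()
    nodes = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            nodes.extend(f"{prefix}{i}" for i in range(lo, hi + 1))
        else:
            nodes.append(f"{prefix}{int(part)}")
    return nodes

print(expand_node_range("ccn[1-3,7]"))  # -> ['ccn1', 'ccn2', 'ccn3', 'ccn7']
```

Keeping the query compact on the wire and expanding it server-side is what lets a single XML request address thousands of processes without listing each one.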
Meeting Notes

Narayan – Build and configure report
• Tests suggest scalability to 2000-host clusters
• Communication infrastructure: more protocol support, high-availability option
• Build and configuration: complete implementation on Chiba City; second OSCAR implementation underway
• Three components:
  – Hardware manager (needs a more modular, extensible design)
  – Build system
  – Node manager (admin control panel for a cluster), system diagnostics
• Restriction-based syntax for XML interfaces
• API augmentation: APIs need more documentation to describe the event-handling protocol
Meeting Notes

• John Dawson asks about the license. Al says like MPI.
• Don (Cray) asks about the license – not GNU – and about holding a workshop for industry
• Talk with Remy about Science Appliance collaboration
• Talk with Rusty about writing a paper on each component
Groups work on large-scale scalability test on Chiba City and XTORC