A View from the Top Preparing for Review Al Geist February 24-25 Chicago, IL

A View from the Top
Preparing for Review

Al Geist
February 24-25

Chicago, IL

Participating Organizations

Coordinator: Al Geist

ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM
SNL, LANL, Ames, NCSA
Cray, Intel, Unlimited Scale

Main Web Site: www.scidac.org/ScalableSystems

Scalable Systems Software

Participating Organizations

ORNL, ANL, LBNL, PNNL
NCSA, PSC, SDSC
SNL, LANL, Ames
IBM, Cray, Intel, Unlimited Scale

Problem

• Computer centers use incompatible, ad hoc set of systems tools

• Present tools are not designed to scale to multi-Teraflop systems

Goals

• Collectively (with industry) define standard interfaces between systems components for interoperability

• Create scalable, standardized management tools for efficiently running our large computing centers

Impact

• Reduced facility mgmt costs.

• More effective use of machines by scientific applications.

Resource Management
Accounting & user mgmt
System Build & Configure
Job management
System Monitoring

To learn more visit www.scidac.org/ScalableSystems

Progress so far on Integrated Suite

[Diagram: components of the integrated suite. Labels: Standard XML interfaces; authentication; communication; Working Components and Interfaces (bold).]

Components shown: Grid Interfaces, Accounting, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Meta Services, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint / Restart, Validation & Testing, Hardware Infrastructure Manager

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
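To illustrate why XML interfaces make the implementation language irrelevant, here is a minimal sketch of one component encoding a request as XML and another decoding it. The element and attribute names (`request`, `component`, `action`) and the component name `queue-manager` are hypothetical placeholders, not the suite's actual schema.

```python
import xml.etree.ElementTree as ET


def build_request(component: str, action: str, fields: dict) -> bytes:
    """Encode a request as XML bytes (element names are illustrative only)."""
    root = ET.Element("request", {"component": component, "action": action})
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = str(value)  # XML carries text, so values are stringified
    return ET.tostring(root, encoding="utf-8")


def parse_request(payload: bytes):
    """Decode the XML produced by build_request back into plain values."""
    root = ET.fromstring(payload)
    fields = {child.tag: child.text for child in root}
    return root.get("component"), root.get("action"), fields


if __name__ == "__main__":
    payload = build_request("queue-manager", "submit", {"user": "alice", "nodes": 4})
    print(parse_request(payload))
```

Any language with an XML parser could produce or consume the same bytes, which is the property the slide relies on. Note that numeric values come back as strings, so a real interface would need a schema to recover types.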

Scalable Systems Software Center
October 10-11
Houston TX

Review of Last Meeting

Details in Main project notebook

Progress Reports at Oct. mtg

Al Geist – preparation for Supercomputing 2002, booth space, posters, demos

Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider

Demonstrations of Prototype Components
Prep for SC demo

Slides can be found in Main Notebook page 29

[Diagram: suite components, with the annotation "These Interface To all".]

Components shown: Accounting, File System, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Meta Services, Scheduler, User DB, Allocation Management, Process Manager, Usage Reports, User Utilities, High Performance Communication & I/O, Application Environment, System & Job Monitor, Checkpoint / Restart, Grid Interfaces, Job Queue Manager, Node Configuration & Build Manager

Scalable Systems Software Center

November-February

Progress Since Last Meeting

SciDAC Booth

SC2002 Systems Posters

Five Project Notebooks filling up

A main notebook for general information

And individual notebooks for each working group

• Over 216 total pages – 20 added since last meeting

• A lot of XML schema to comment on

• New subscription feature

Get to all notebooks through main web site www.scidac.org/ScalableSystems

Click on side bar or at “project notebooks” at bottom of page

Weekly Working Group Telecoms

Resource management, scheduling, and accounting

Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Validation and Testing (hasn’t met since last year)

Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157

Process management, system monitoring, and checkpointing

Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service

Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)

Scalable Systems Software Center

February 24-25, 2003

This Meeting

Agenda – February 24

8:30 Al Geist – Project Status. SciDAC PI mtg and External Project review
9:00 Matt Sottile – Science Appliance Project

Working Group Reports
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own - walk to cafeteria)
1:00 Paul Hargrove – Process Management
2:00 Narayan Desai – Node Build, Configure
3:00 Break
3:30 Large Scale Run on Chiba, debugging components
5:00 Open Discussion of Review report
5:30 Adjourn

Working groups may wish to hack in evening

Agenda – February 25

8:30 Discussion, proposals, straw votes
• Write paper on each component
• Draft report in main notebook
• Comments on “restricted interface” XML shown by Rusty
• External review demo – can we?
10:30 Break
11:00 Al Geist – Summary
• PI mtg talk and poster
• External review agenda
• Next meeting date: June 5 & 6 at Argonne
• Thank our hosts, ANL
12:00 Meeting ends

SciDAC PI mtg – all 50 projects

March 10-11, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode

20 minute talk – presented by Al
Scalable Systems, CCA, PERC, SDM

Poster Presentation

External SciDAC Review mtg

March 12-13, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman. (Rusty)

Four ISIC Projects are reviewed separately – Scalable Systems, CCA, PERC, SDM

External review panel (8 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, and Tony Mezzacappa [John Grosh?]

We owe them a Review report

Day 1 – Each gets 1¾ hours to present project
Day 2 – Each project gets grilled by panel for 1½ hrs

External Review mtg Agenda

Wednesday, March 12

7:45 Welcome, charge to reviewers
8:15 Plenary session for Common Component Architecture ISIC
10:00 Break
10:15 Plenary session for Scalable Systems Software ISIC
12:00 Reviewer caucus
12:15 Lunch
1:15 Plenary session for Scientific Data Management ISIC
3:00 Break
3:15 Plenary session for Performance Engineering ISIC
5:00 Reviewer caucus
5:30 Adjourn

External Review mtg Agenda

Thursday, March 13

8:00 Meetings between reviewers and ISIC members
  A. Common Component Architecture
  B. Scalable Systems Software
9:45 Break
10:00 Meetings between reviewers and ISIC members
  C. Scientific Data Management
  D. Performance Engineering
11:45 Reviewer Caucus/End of ISIC Reviews
12:15 Lunch (on your own)
1:15 Programming Models Review Session I
3:00 Break
3:15 Programming Models Review Session II
5:00 Programming Models Reviewer Caucus
5:30 Meeting adjourns

Meeting Notes

Matt - Pink: a 1024 node science appliance.
  Provide pseudo SSI that scales to 1024.
  Tolerates failure.
  Single point for management.
  Reduce boot and install time by x100.
  Reduce number of FTP per number of nodes.
Science Appliance – very little in common with older Linux.
Software is called Clustermatic – LinuxBIOS, Bproc, V9fs, supermon, Panasas or Lustre (parallel file system by someone else)
Beoboot, asymmetric SSI, private name spaces from Plan 9, BJS (Bproc Job Scheduler)
Other work – ZPL (automatic check point)
Debuggers (parallel, relative debugging – Guard), port TotalView.
Latency tolerant applications
Users – SNL/CA, U Penn, Clemson
What are overlap opportunities? Each piece can be separated out. Supermon, Bproc
Remy will be sending more material on collaboration soon

Meeting Notes

Scott - RM update.
Diagram of architecture and infrastructure services
SC02 demo – what components working. They used polling.
Now moving to event driven components
Release of initial RM suite – from website http://sss.scl.ameslab.gov/software/
  OpenPBS-sss 2.3.15-1
  Maui scheduler 3.2.6
  Qbank 2.10.4 (accounting system)
SSSRMAP protocol using HTTP validated
Scalability testing performed on all components
Scheduler progress
Queue Manager progress
Accounting and Allocation Manager progress (Qbank and Gold prototype)
Meta-scheduler progress – Globus interface, Gold Information service.
Next work
  Release 2 of RM interface
  Implement and test SSSRMAP security authentication (XML digital sigs)
Discuss need to have SSS wrappers on initial RM suite

Meeting Notes

Will - Validation and Testing update
  Users expect a high degree of quality in today’s HPC.
Strategies
QMTest – RM group using it (www.codesourcery.com)
  They like it: “easy”
App test packages
  APITEST – growing out of October discussion
  C++ driven XML schema scriptable test of network components
  blackbox testing. TCP, ssslib, portals support, fault injection
  whitebox testing. Try to exercise all paths in a known suite
  v0.1a underway, 75% done
  Discussion how this could be useful to Scalable Systems
Cluster Integration Toolkit (CIT) – James Laros [email protected]
  management tasks on Cplant – scalable to 1800 nodes
  done in Perl
  create Scalable Systems interface to CIT
  would be a good test of implementation of flexibility of standard.
  USI, IBM, and Linux Networx looking at it.

Meeting Notes

Paul – Process management report. Moving beyond prototypes of:
Checkpoint manager
  beta-code April release awaiting legal OK
  will do scalability test today
  working on XML interface for checkpoint/restart (draft in May)
Mike - Monitoring – job, system, node, and meta-version
  what data is needed – an extensible framework
  defined stream and single item.
  working on scalability now
Rusty - Process Manager
  schematic of PM component
  MPD-2 in Python and distributed with MPICH-2
  - supports separate executables, arguments, and environment variables
  New XML for PM (with queries that allow wildcards and ranges)
Combination of published interfaces, XML, and communication lib gives us a power greater than the sum of its parts.

Meeting Notes

Narayan – Build and configure report
Tests suggest scalability to 2000 host clusters
Communication Infrastructure
  more protocol support, high availability option.
Build and configuration
  complete implementation on Chiba City
  second OSCAR implementation underway
  three components
  - hardware manager (needs more modular, extensible design)
  - build system
  - node manager (admin control panel for a cluster)
  system diagnostics
Restriction Based Syntax for XML interfaces
API augmentation
  APIs need more documentation to describe event handling protocol

Meeting Notes

John Dawson asks about license. Al says like MPI.
Don (Cray) asks about license !GNU and holding a workshop for industry
Talk with Remy about Science Appliance collaboration
Talk with Rusty about writing a paper on each component.

Groups Work on large scalability test on Chiba City and XTORC