2010 ASQ World Conference on Quality and Improvement
May 24-26, 2010, St. Louis, MO
Quality in Chaos: a view from the TeraGrid environment
John Towns
TeraGrid Forum Chair
Director of Persistent Infrastructure, National Center for Supercomputing Applications
University of Illinois
[email protected]
with the assistance of many TeraGrid colleagues!
What is Cyberinfrastructure?
• Computing systems,
• data storage systems, and data repositories,
• visualization environments,
• and people,
• all linked together by high performance networks.
The Vision of TeraGrid
• Three-part mission:
  – support the most advanced computational science in multiple domains
  – empower new communities of users
  – provide resources and services that can be extended to a broader cyberinfrastructure
• TeraGrid is…
  – an advanced, nationally distributed, open cyberinfrastructure composed of supercomputing, storage, and visualization systems, data collections, and science gateways, integrated by software services and high bandwidth networks, coordinated through common policies and operations, and supported by computing and technology experts, that enables and supports leading edge scientific discovery and promotes science and technology education
  – a complex collaboration of over a dozen organizations and NSF awards working together to provide collective services that go beyond what can be provided by individual institutions
What is TeraGrid? (simple definition)
A complex collaboration of over a dozen organizations working together to provide cyberinfrastructure that goes beyond what can be provided by individual institutions, to improve research productivity and enable breakthroughs not otherwise possible.
TeraGrid Objectives
• DEEP Science: Enabling Petascale Science
  – make science more productive through an integrated set of very-high capability resources
    • address key challenges prioritized by users
• WIDE Impact: Empowering Communities
  – bring TeraGrid capabilities to the broad science community
    • partner with science community leaders - “Science Gateways”
• OPEN Infrastructure, OPEN Partnership
  – provide a coordinated, general purpose, reliable set of services and resources
    • partner with campuses and facilities
What you can do with the TeraGrid: Simulation of cell membrane processes
Work by Emad Tajkhorshid and James Gumbart, of University of Illinois Urbana-Champaign.
  – Mechanics of Force Propagation in TonB-Dependent Outer Membrane Transport. Biophysical Journal 93:496-504 (2007).
  – Results of the simulation may be seen at www.life.uiuc.edu/emad/TonB-BtuB/btub-2.5Ans.mpg
• Modeled mechanisms for transport of molecules through cell membrane.
• Used 400,000 CPU hours [45 processor-years] on systems at National Center for Supercomputing Applications, IU, Pittsburgh Supercomputing Center
Image courtesy of Emad Tajkhorshid, UIUC
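As a quick check on the bracketed processor-years figure on this slide, CPU hours can be converted by dividing by the number of hours in a year. This is a minimal sketch; the function name and the 365-day year are assumptions made for illustration and do not come from the slide.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def cpu_hours_to_processor_years(cpu_hours: float) -> float:
    """Convert CPU hours to the years one processor would need running nonstop."""
    return cpu_hours / HOURS_PER_YEAR


years = cpu_hours_to_processor_years(400_000)
print(f"{years:.1f} processor-years")
```

Under that assumption, 400,000 CPU hours comes to a little under 46 processor-years, consistent with the slide's figure of 45 when rounded down.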
TG App: SCEC-PSHA
• Part of Southern California Earthquake Center (Tom Jordan, USC)
• Using large scale simulation data, estimate probabilistic seismic hazard (PSHA) curves for sites in southern California (probability that ground motion will exceed some threshold over a given time period)
• Used by hospitals, power plants, schools, etc. as part of their risk assessment
• For each location, need a CyberShake run followed by roughly 840,000 parallel short jobs
  – parallelize across locations, not individual workflows
• Completed over 300 locations to date, targeting 2000 sites in 2010
Managing these requires effective grid workflow tools for job submission, data management and error recovery, using Pegasus (ISI) and DAGMan (Wisconsin)
Information/image courtesy of Phil Maechling
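To give a feel for the scale described on this slide, and for why error recovery matters at that scale, here is a minimal sketch. It is not the Pegasus or DAGMan interface: `total_short_jobs`, `run_with_retry`, and `make_flaky_job` are hypothetical names, and the retry policy (a fixed number of attempts) is an assumption chosen for illustration.

```python
JOBS_PER_SITE = 840_000  # from the slide: short jobs following each CyberShake run


def total_short_jobs(sites: int, jobs_per_site: int = JOBS_PER_SITE) -> int:
    """Total short jobs needed to cover the given number of sites."""
    return sites * jobs_per_site


def run_with_retry(job, max_attempts: int = 3):
    """Run a callable, retrying on RuntimeError up to max_attempts times.

    Returns (result, attempts_used); re-raises the last error if every
    attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise


def make_flaky_job(failures_before_success: int):
    """Build a stand-in job that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated transient failure")
        return "done"

    return job


print(total_short_jobs(300))    # sites completed to date
print(total_short_jobs(2000))   # sites targeted for 2010
print(run_with_retry(make_flaky_job(2)))
```

With hundreds of millions of short jobs, even a very small failure rate produces many failed jobs, which is why automated retry and recovery in the workflow tools matters more than manual resubmission.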
What is the TeraGrid?
• An instrument that delivers high-end IT resources/services: computation, storage, visualization, and data/services
  – a computational facility: over a PetaFLOP in parallel computing capability
  – a data storage and management facility: over 20 PetaBytes of storage (disk and tape), over 100 scientific data collections
  – a high-bandwidth national data network
• A service: help desk and consulting, Advanced Support for TeraGrid Applications (ASTA), education and training events and resources
• Something you can use without financial cost
  – research accounts allocated via peer review
  – Startup and Education accounts automatic
• World’s largest distributed cyberinfrastructure for scientific research
  – supported by National Science Foundation
11 Resource Providers, One Facility
Map of TeraGrid sites (figure not reproduced): SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC/ISI, UNC/RENCI, UW, NICS, LONI
Map legend: Resource Provider (RP); Software Integration Partner; Grid Infrastructure Group (UChicago); Network Hub
TeraGrid Resources and Services
• Computing
  – more than one petaflop of computing power today and growing
    • 500 TFlop Ranger (Sun Constellation) at Texas Advanced Computing Center (TACC)
    • 1.03 PFlop Kraken (Cray XT5) at National Institute for Computational Sciences (NICS), University of Tennessee
• Remote visualization servers and software
  – 60 TFlop Condor-based viz resource at Purdue University
• Data
  – allocation of data storage facilities
  – over 100 Scientific Data Collections
• Central allocations process
• Technical Support
  – central point of contact for support of all systems
  – Advanced Support for TeraGrid Applications (ASTA)
  – education and training events and resources
  – over 30 Science Gateways
How is TeraGrid Organized?
• TG is set up like a large cooperative research group
  – evolved from many years of collaborative arrangements between the centers
  – still evolving!
• Federation of 12 awards
  – Resource Providers (RPs)
    • provide the computing, storage, and visualization resources
  – Grid Infrastructure Group (GIG)
    • central planning, reporting, coordination, facilitation, and management group
• Strategically led by the TeraGrid Forum
  – made up of the PIs from each RP and the GIG
  – led by the TG Forum Chair, who is responsible for coordinating the group (elected position)
    • John Towns, TG Forum Chair
  – responsible for the strategic decision making that affects the collaboration
• Day-to-Day Functioning via Working Groups (WGs):
  – each WG under a GIG Area Director (AD), includes RP representatives and/or users, and focuses on a targeted area of TeraGrid
Impacting Many Agencies
Two pie charts (figure not reproduced):
• Supported Research Funding by Agency: $91.5M Direct Support of Funded Research
• Resource Usage by Agency: 10B NUs Delivered
Agency shares, in the order the two charts appear in the extracted slide text:
Agency          First chart   Second chart
NSF             52%           49%
DOE             13%           11%
NIH             19%           15%
NASA            10%            9%
DOD              1%            5%
International    0%            3%
University       2%            1%
Other            2%            6%
Industry         1%            1%
So why are you here anyhow??
• For moderate scale research projects funded by federal agencies, quality is an afterthought
  – $10s of millions/year
  – well... perhaps just assumed as implicitly needed
  – no explicit treatment of quality in many programs
• Of course, large scale projects have quality as a first class concern
  – $100s of millions/year
  – DOD has recognized the importance of quality in modeling and simulation efforts
    • specifically designed verification, validation, and accreditation (VV&A) processes
    • understand the simulation’s capabilities, limitations, and performance relative to the real-world objects it simulates
    • http://vva.msco.mil/
  – NSF MREFC planning processes have quality concerns stated in solicitations for these projects
TeraGrid is no exception
• Initially defined largely as a technology research activity with intent to support production (academic definition) resources
• Behaved organizationally much like an individual investigator research team
  – lack of clear structure and processes
• In the end, TeraGrid is both operations and research
  – operations:
    • facilities/services on which researchers rely
    • infrastructure on which other providers build
  – research:
    • learning how to do distributed, collaborative science on a global, federated infrastructure
    • learning how to run multi-institution shared infrastructure
• Further, lack of recognition of what TeraGrid really is
  – an emerging and evolving infrastructure for enabling science and engineering
  – (initially) treated as a research project
• Thus, something of a “perfect storm”
but…
• TeraGrid has become quite successful anyhow
  – the picture was perhaps not so bleak
    • participant centers embodied a great deal of experience and expertise
    • no lack of vision (perhaps too much) or passion amongst participants
    • we came to some basic realizations
• Fundamentally, we had to mature as a distributed infrastructure organization
  – while we provided many technically interesting things, we had lost sight of the “quality” of what we provided
    • we had to understand what that meant!
Quality in a TeraGrid Context
• What did this mean for us?
  – TeraGrid must deliver important services reliably and without barriers to entry to a community of scientists and engineers not interested in the nerdy details we TeraGrid geeks loved to wallow in…
TeraGrid faced many challenges on this front …
• Relied on software we obtained elsewhere and had little control over
  – “academic grade” quality was a generous description for much of it
• We integrated this software along with many services into a distributed environment
  – resources at various sites governed by conflicting policies
  – software often not based on standards or did not comply with them
• The distributed organization presented many faces to the user community for many of the services provided
  – participants desire to maintain their own identity while playing nice in the larger environment
• TeraGrid had (has) many organizational challenges
  – no strong central management/authority
  – participants frequently pitted against one another in life/death funding competitions
• And the list goes on…
What TeraGrid needed to do…
• Create a more stable distributed environment and facilitate use by the user community
  – institute basic quality assurance mechanisms
    • Quality Assurance working group
  – increase stability/reliability of software infrastructure
    • Inca system
  – new interfaces to environment
    • Science Gateways, workflow support
• Reduce the number of faces presented to the user community
  – reduce electronic interfaces
    • User Portal, POPS, trouble ticket submission
    • create common user environment across multiple heterogeneous systems
  – reduce “human faces”
    • centralized helpdesk, integrated/coordinated advanced support functions
• Focus on facilitating use and not new technology development
  – support for new and advanced users
  – understand the challenges users face in our environment
Something Important Going for Us
• TeraGrid was not a revolutionary idea suddenly instituted as a project
  – built on a long history of NSF-funded supercomputing centers
    • initially funded in 1985
  – a progression of NSF programs
    • a handful of major centers funded early on
    • some loose collaboration of those centers through the 1980s and 1990s
    • first NSF program to fund collections of centers in 1997
    • evolution of that program to TeraGrid
• This provided an important resource
  – staff with a passion for delivering resources and services to support science and engineering
  – a culture of striving to do our best developed out of this
• But…
  – most staff were subject matter experts and not process driven
  – we regularly work with cutting edge technologies
    • no luxury of spending 2 years developing software using traditional software engineering processes
Creating a stable and reliable environment: QA Working Group
• Goal: improve reliability of production TeraGrid software components/services
• Increase reliability of services:
  – prioritize testing/debugging of services most relevant to users
  – identify existing tests to be used and/or develop new tests
    • improving the use of the Inca monitoring framework
• Increase availability of CTSS services:
  – improve time from detected failure to notification
  – map errors to potential problem resolution procedures
• Develop/propose a more formal process for CTSS software deployment
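The two availability goals on this slide, shortening the time from detected failure to notification and mapping errors to resolution procedures, can be sketched as follows. This is an illustration only and not the Inca interface: the error categories, procedure texts, and function names are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical error categories and procedures, for illustration only.
RESOLUTION_PROCEDURES = {
    "auth_failure": "Check credential expiry and renew the service certificate.",
    "service_unreachable": "Confirm the service host is up, then restart the service.",
    "version_mismatch": "Compare the deployed version against the expected release.",
}
DEFAULT_PROCEDURE = "Escalate to the resource provider's support staff."


def resolution_for(error_category: str) -> str:
    """Look up the suggested procedure, falling back to escalation."""
    return RESOLUTION_PROCEDURES.get(error_category, DEFAULT_PROCEDURE)


def notification_delay(detected_at: datetime, notified_at: datetime) -> timedelta:
    """Time elapsed between detecting a failure and notifying someone about it."""
    if notified_at < detected_at:
        raise ValueError("notification cannot precede detection")
    return notified_at - detected_at


detected = datetime(2010, 5, 24, 9, 0)
notified = datetime(2010, 5, 24, 9, 12)
print(notification_delay(detected, notified))
print(resolution_for("service_unreachable"))
print(resolution_for("unknown_error"))
```

Tracking the delay as a number makes the first goal measurable over time, and keeping the mapping in one table means an unmapped error is visible as a gap (it falls through to the default) instead of being handled ad hoc.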
Creating a stable and reliable environment: Build & Test Facility
• Lower software build and support costs across providers
• Improve software quality
• Make software builds reproducible
• Faster software turnaround time
• Provide public access to software manufacture process
Diagram labels (figure not reproduced): TG Submit Host, build.teragrid.org; TeraGrid Software (Sources, Binaries); NMI Build/Test Framework; Condor / Condor-G / DAGMan; TG Build Pool; NMI Framework Wrapper; Condor Startd; TG Software Build Scripts; TG Build Tools; TeraGrid Recipes (Build Scripts, Build Specs); Build Results DB; Build Job Info
Diagram legend: Central framework; Evolved TG component; New TG component
Reducing the number of electronic faces
• TeraGrid User Portal
  – access RP resources and special multi-site services
  – current, up-to-date information about TG environment
  – manage and monitor allocations via common tools
  – first line of support for users
    • documentation, information about hardware and software resources
    • education, outreach and training events and resources
• Common User Environment Working Group
  – remove barriers to user movement between TeraGrid resources
  – coordinate with RP staff and TG WG
    • CUE Management System, CUE Build Environment, CUE Testing Platform, CUE Variable Collection
• Science Gateways
  – user access without allocation request
    • simplifies access to resources
    • immediate reach to communities of researchers
What is a Science Gateway?
• A Science Gateway
  – enables scientific communities of users with a common scientific goal
  – uses high performance computing
  – has a common interface
  – leverages community investment
• Three common forms:
  – web-based portals
  – application programs running on users' machines but accessing services in TeraGrid
  – coordinated access points enabling users to move seamlessly between TeraGrid and other grids
How can a Gateway help?
• Make science more productive
  – researchers use same tools
  – complex workflows
  – common data formats
  – data sharing
• Bring TeraGrid capabilities to the broad science community
  – lots of disk space
  – lots of compute resources
  – powerful analysis capabilities
  – nice interface to information
Support for New and Established Users
• Advanced Support for TeraGrid Applications (ASTA)
  – request help with code optimization, workflow improvement and gateways through
• TeraGrid Pathways
  – new user support, mentoring, fellowships
• Campus Champions
  – individuals at your institution to offer support
• HPC University
  – online public resources
• TeraGrid Annual Conference
  – showcases capabilities, achievements and impact of TeraGrid in research
  – presentations, demos, posters, visualizations
  – tutorials, training and peer support
  – student competitions and volunteer opportunities
Understanding the Challenges Users Face
• Established User Interaction Council
  – key group of project leaders chaired by Director of Science
• Regular analysis of trouble tickets to identify problem areas
  – leverage expertise and experience of other staff in resolving
    • often results in an agreement among support teams at the 11 RPs how to (better, faster) resolve problems in future
  – relevant insights are promptly reflected in the online materials
    • documentation, User Portal, Knowledge Base
  – cross-cutting operational issues identified and reported to the User Interaction
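The regular trouble-ticket analysis described on this slide can be sketched as a simple tally. The ticket records, site labels, and function names below are hypothetical and invented for illustration; the slide does not describe the actual ticket system or its data format.

```python
from collections import Counter

# Hypothetical tickets: (resource provider, problem category).
tickets = [
    ("RP-A", "login"),
    ("RP-B", "job submission"),
    ("RP-A", "job submission"),
    ("RP-C", "data transfer"),
    ("RP-B", "job submission"),
    ("RP-C", "login"),
]


def problem_areas(tickets, top_n=3):
    """Rank problem categories by how many tickets mention them."""
    counts = Counter(category for _, category in tickets)
    return counts.most_common(top_n)


def cross_cutting(tickets, min_sites=2):
    """Categories reported at min_sites or more distinct resource providers."""
    sites_by_category = {}
    for site, category in tickets:
        sites_by_category.setdefault(category, set()).add(site)
    return sorted(c for c, sites in sites_by_category.items() if len(sites) >= min_sites)


print(problem_areas(tickets))
print(cross_cutting(tickets))
```

Separating the two views reflects the slide's distinction: the ranked counts point to problem areas worth documenting, while categories seen at several providers are candidates for the cross-cutting issues that get escalated.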
But what did we learn?
• We did many things to improve the quality of the product we delivered to our customers
  – established many practices and procedures
    • adopted many formal software engineering practices
  – improved the user experience in using our resources and services
    • paid attention to the experiences our users had in making use of the environment
• But these were not the heart of what has made us successful
It’s all about the people!
• Staff with a passion to produce a quality product in the form of integrated software, services and resources
  – who were willing to go beyond traditional research activities to attain the goal
• Staff with a passion to enable the work of scientists and engineers
  – with the expertise in the use of advanced technologies
• Staff with a vision for excellence
  – who connected with our user community on many levels
Never underestimate the value of the staff working on your projects!