33
March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March 22-24, 2004 Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher’s right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness. A-UR-04-0388 proved for public release; stribution is unlimited

March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

Embed Size (px)

Citation preview

Page 1: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

1

The Coming Crisis in Computational Science

Douglass PostLos Alamos National Laboratory

SCIDAC Workshop

Charleston, South Carolina, March 22-24, 2004

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher’s right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.

LA-UR-04-0388Approved for public release;Distribution is unlimited

Page 2: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

2

Outline

• Introduction• The Performance and Programming

Challenges• The Prediction Challenge• “Lessons Learned from ASCI”

– Case study lessons– Quantitative Estimation– Verification and Validation– Software Quality

• Conclusions and Path Forward

Page 3: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

3

Introduction

• The growth of computing power over the last 50 years has enabled us to address and “solve” many important technical problems for society

• Codes contain Realistic models, good spatial and temporal resolution, realistic geometries, realistic physical data, etc.

G. Gisler et al-Impact of Dinosaur Killer Asteroid

L. Winter et al-Rio Grande Watershed

Stanford ASCI Alliance—Jet Engine Simulation

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

U. Of Illinois ASCI Alliance—Shuttle Rocket Booster Simulation

Page 4: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

4

Three ChallengesPerformance, Programming and Prediction

• Performance Challenge-Building powerful computers

• Programming Challenge—Programming for Complex Computers– Rapid code development– Optimize codes for good performance

• Prediction Challenge—Developing predictive codes with complex scientific models– Develop codes that have reliable predictive capability

• Verification• Validation• Code Project Management and Quality

Page 5: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

5

• Introduction• The Performance and Programming

Challenges• The Prediction Challenge• “Lessons Learned from ASCI”

– Case study lessons– Quantitative Estimation– Verification and Validation– Software Quality

• Conclusions and Path Forward

Outline

Page 6: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

6

Programming Challenge

• Modern Computers are complicated machines– 1000’s of processors linked by a intricate set of networks for data

transfer– Terabytes of data

• Challenges:– Memory bandwidth not keeping up with processor speed– Memory management—cache utilization– Interprocessor communication and message passing conflicts– Reliability—fault tolerance

Page 7: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

7

High Performance Computing Community must address Programming Challenge.

• Problem is severe:– Code development time is often longer than platform turnaround

time, many codes have a 20 year or longer lifetime, with a 5 to 10 or more year replacement time

– Performance is often poor and not likely to improve– Trade-off between portability and performance favors portability

• “Moore” needs to be done to simplify application code development

• Better tools, language enhancements for easier parallelization, stable architectures are needed

• Addressing Programming Challenge is a major thrust of the DARPA High Productivity Computing Systems Project (HPCS).

Page 8: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

8

• Introduction• The Performance and Programming

Challenges• The Prediction Challenge• “Lessons Learned from ASCI”

– Case study lessons– Quantitative Estimation– Verification and Validation– Software Quality

• Conclusions and Path Forward

Outline

Page 9: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

9

Predictive Challenge is even more serious than Programming Challenge.

• Programming Challenge is a matter of efficiency– Programming for more complex computers takes longer

and is more difficult and takes more people, but with enough resources and time, codes can be written to run on these computers

• But the Predictive Challenge is a matter of survival:– If the results of the complicated application codes

cannot be believed, then there is no reason to develop the codes or the platforms to run them on.

Page 10: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

10

Many things can be wrong with a computer generated prediction.

• Experimental and theoretical science are mature methodologies but computational science is not.

• Code could have bugs in either the models or the solution methods that result in answers that are incorrect – e.g. 2+2=54.22, sin(90O)= 673.44, etc.

• Models in the code could be incomplete or not applicable to problem or have wrong data– E.g. climate models without an ocean current model

• User could be inexperienced, not know how to use the code correctly– CRATER analysis of Columbia Shuttle damage

• Two examples: Columbia Space Shuttle, Sonoluminescence

Page 11: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004 11

Lessons Learned are important

• 4 stages of design maturity for a methodology to mature—Henry Petroski—Design Paradigms

• Suspension bridges—case studies of failures (and successes) were essential for reaching reliability and credibility

1

2

3

4

Case studies conducted after each crash.Lessons learned identified and adopted by community

Tacoma Narrows Bridge buckled and fell 4 months after construction!

Page 12: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

12

Computational Science is at the third stage.

• Computational Science is in the midst of the third stage. • Prior generations of code developers were deeply scared of

failure, didn’t trust the codes.• New generation of code developers trained as computational

scientists• New codes are more complex and more ambitious but not as

closely coupled to experiments and theory– Disasters occurring now

• We need to assess the successes and failures, develop “lessons learned” and adopt them

• Otherwise we will fail to fulfill the promise of computational science

• Computational science has to develop the same professional integrity as theoretical and experimental science

Page 13: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

13

Computational analysis of Columbia Space Shuttle damage was flawed.

• The Columbia Space Shuttle wing failed during re-entry due to hot gases entering a portion of the wing damaged by a piece of foam that broke off during launch

• Shortly after launch, Boeing did an analysis using the code CRATER to determine likelihood that the wing was seriously damaged

• Problems with analysis:– The analysis was carried out by an inexperienced user, someone who had only used the code once

before– CRATER was designed to study the effects of micrometeorite impacts, and had been validated only

for projectiles less that 1/400 the size and mass of the piece of foam that struck the wing– Didn’t use a code like LS-DYNA that was the industry standard for assessing impact damage

• The prior CRATER validation results indicated that the code gave conservative predictions• Analysis indicated that there might be some damage, but probably not at a level to warrant

concern• Concerns due to CRATER analysis were downplayed by middle management and not passed

up the system.• Result was that no one tried hard to look at the wing and figure out how to do something to

avoid the crash (maybe there was no way to fix it).

Columbia breakup

Validation object

Foam “objects”

— NASA Columbia Shuttle Accident Report

Columbia re-entry

Page 14: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

14

Computational predictions led us astray with “sonoluminescent fusion”

• Taleyarkhan and co-workers at ORNL used intense sound waves to form sonoluminescent bubbles with deuterated acetone

• They observed Tritium formation and 14 MeV neutrons, indicating that nuclear fusion was occurring

• “Observed” Temperature was ~ 107 oK instead of the usual 103 oK• If true, we could have a fusion reactor in every house• Computer simulations were done that matched the “experimental results” if

the driving pressure in the codes was increased by a factor of 10 (well outside reasonable uncertainties)

• Based on the “agreement” of “theory” and “experiment”, the results were published in Science and generated intense interest

• Experiments were repeated, especially at ORNL. No significant Tritium or 14 MeV neutrons were found.

• “Sigh” — No fusion reactor in everyone’s house.• Simulation was misleading• Bad experiment + bad simulation = ?

—Taleyarkhan et al, Science 295(2002), p. 1868; Shapira, et al, PRL, 89(2002), p.104302.

Page 15: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

15

• Introduction• The Performance and Programming

Challenge The Prediction Challenge• “Lessons Learned from ASCI”

– Case study lessons– Quantitative Estimation– Verification and Validation– Software Quality

• Conclusions and Path Forward

Outline

Page 16: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

16

ASCI• In late 1996, the DOE launched the Accelerated Strategic

Computing Initiative (ASCI) to develop the“enhanced” predictive capability by 2004 at LANL, LLNL and SNL that was required to certify the US nuclear stockpile without testing– ASCI codes were to have much better physics, better resolution

and better materials data– Need a 105 increase in computer power from 1995 level– Develop massively parallel platforms (20 TFlops at LANL this year,

100 TFlops at LLNL in 2005-2006)– ASCI included development of applications, development and

analysis tools, massively parallel platforms, operating and networking systems and physics models

• ~ $6 B expended so far

Page 17: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

17

Lessons Learned

• Build on successful code development history and prototypes • Highly competent and motivated people in a good team are essential • Risk identification, management and mitigation are essential• Software Project Management: Run the code project like a project• Schedule and resources are determined by the requirements• Customer focus is essential

– For code teams and for stakeholder support

• Better physics is much more important than better computer science• Use modern but proven Computer Science techniques,

– Don’t make the code project a Computer Science research project

• Provide training for the team• Software Quality Engineering: Best Practices rather than Processes• Validation and Verification are essential

Page 18: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

18

LLNL and LANL had no “big code project” experience before ASCI

• But they did have a lot of successful small team experience

• Code teams of 1 to 5 staff developed multi-physics codes one module at a time and then integrated them

• ASCI needed rapid code development on an accellerated time scale

• LLNL and LANL launched large code projects with very mixed success

• Didn’t look at the “lessons learned” from other communities

Page 19: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

19

Teams

• Tom DeMarco states that there are four essentials of good management:– “Get the right people– Match them to the right jobs– Keep them motivated.– Help their teams to jell and stay jelled.

(All the rest is Administrivia.)”— “The Deadline”

• Success is all about Teams!• Management’s key role is the support and nurturing

of teams

Crestone ProjectTeam

T. DeMarco, 2000; DeMarco and Lister, 1999; Cockburn and Highsmith, 2001; Thomsett, 2002; McBreen, 2001

QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture.

Page 20: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

20

• Tom DeMarco lists five major risks for software projects:1. Uncertain or rapidly changing Requirements, Goals and Deliverables

• Almost always fatal

2. Inadequate resources or schedule to meet the requirements3. Institutional turmoil, including lack of management support for code project team,

rapid turnover, unstable computing environment, etc.4. Inadequate reserve and allowance for requirements creep and scope changes5. Poor Team performance. To these we add two:6. Inadequate support by stakeholder groups that need to supply essential modules,

etc.7. Problem is too hard to be solved within existing constraints

• Poor team performance is usually blamed for problems – But, all risks but #5 are the responsibility of management!

• ASCI experience: Management attention to 1—4, 6,7 has been inadequate.• Risk mitigation by contingency, back-up staffs and activities is key

Risk identification, management and mitigation are essential

T. DeMarco, 2002a

Page 21: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

21

Software Project Management

• Good organization of the work is essential• Good project management is more important for

success than good Software Quality Aassurance• Manage the code project as a project

– Clearly defined deliverables, a work breakdown structure for the tasks, a schedule and a plan tied to resources

• Execute the plan, monitor and track progress with quantitative metrics, re-deploy resources to keep the project on track as necessary

• Insist on support from sponsors and stakeholders• Project leader must control the resources,

otherwise the “leader” is just a “cheerleader”!Brooks, 1987; Remer, 2000; Rifkin, 2002; Thomsett, 2002; Highsmith, 2001

Page 22: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

22

Planning Software Projects is more restrictive than planning conventional projects

– Normally, one can pick two out of requirements, schedule and resources, and the third is then determined

– For software, the requirements determine the optimal schedule and the optimal resources. You can do worse, but not better than optimal.

• Schedule– The rate of software development, like all knowledge based work, is limited by the

speed that people can think, analyze problems and develop solutions.

• Resources– The size of the code team is determined by the number of people who can

coordinate their work together

• Specifying the schedule and/or resources plus the requirements has been one of the greatest problems for ASCI (and for many other code development projects in our experience).

• Then the code development plan is over-specified.

Requirements, Schedule and Resources must be consistent.

Schedule

ResourcesRequirements

D. Remer, 2000; T. Jones, 1988; Post and Kendal, 2002; S. Rifkin,2002

Page 23: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

23

ASCI codes take about 8 years to develop.

• We have studied the record of most of the ASCI code projects at LANL and LLNL to identify the necessary resources and schedule

• The requirements are well known and fixed. LANL and LLNL have been modeling nuclear weapons for 50 to 60 years.

• We find that it takes about 8 years and a team of at least 10 to 20 professionals to develop a code with the minimum capability to provide a 3 Dimensional simulation of a nuclear explosion

• No code project that hasn’t had 8 years has succeeded• All of the code projects that have succeeded have been at

least 8 years old (although some have failed even with 8 years of work)

Post and Cook, 2000: Post and Kendall, 2002

Page 24: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

24

Initial use of metrics to estimate schedule and resource requirements

FPC SLOC

53C SLOC

128F77 SLOC

107

Key parameter is a “function point” —FP—, a weighted total of inputs, outputs, inquiries, logical files and interfaces

SLOC: Single Line of Code

T. Capers-Jones, 1998Schedule FPx 0.4 x 0.5; use x .47

Team Size FP

150

Corrections for Lab environment compared to industrySchedule= FP schedule + delays

Delays up to 1.5 years for recruiting, clearance, learning curveSchedule multiplier of 1.6 for complexity of project

1.6 based on contingency required compared to industry due to complexity of classified/unclassified computing environment, unstable and evolving ASCI platforms, paradigm shift to massively parallel computing, need for algorithm R&D, complexity of models, etc.

predicted bugs FP1.25 predicted documentation FP1.15

D. Remer, 2000; E. Yourdon, 1997

Page 25: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

25

LLNL LANL

Capers Jones ASCI A ASCI B Legacy A

AnteroCode

ProjectShavano

Code ProjectCrestoneCode Project

Single Lines ofCode 184000 640000 410550 300000 500000 314000Function Points(Eq. 1) 4800 6100 5400 2900 4800 2900Estimatedschedule(Eq.4) 8.7 9.0 6.9 6.6 8.1 6.7Project age—initialmilestone date 3 9 N/A 4 3.5 8Successful—initialASCI Milestone No Yes N/A No No YesEstimated staffrequirements(Eq.3) 22 27 24 14 22 14real team size (1997-2002) 20 22 8 17 8 12

Comparison of ASCI code history and estimation results

Post and Cook, 2000: Post and Kendall, 2002

Page 26: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

26

ASCI codes require about 8 years and 20 professionals

range of weaponssimulation codes

Project time

peak codeteam size

Time

Cod

e P

roje

ctS

taff

Leve

l

Classic staffing curve

Staffing curve for continued Development and user support

• Estimation procedures are consistent with LANL and LLNL history

• About 8 years and 20 professionals are required.

D. Remer, 2000; Post and Kendall, 2002

Page 27: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

27

Result of not estimating the time and resources needed to complete the requirements was failure of

many code projects to meet their milestones.

• Code Projects were asked to do 8 years of work in 3.5 years+ Management changed the requirements 6 months before the due date

• Result:– 3 out of 5 code projects “failed” to meet the milestones (those with less than

8 years of experience) – 2 out of 5 code project succeeded in meeting the milestones (those with

more than 8 years of experience)

• Management at LLNL and LANL blamed the code teams, and in some cases, punished them

• Morale was devastated, project teams were reduced• Only now—3 years later—some “failed” projects are beginning to

recover• Real cause was no realistic planning and estimation by upper level

management• DOE and labs now drafting more appropriate milestones and schedules

Post and Cook, 2000: Post and Kendall, 2002

Page 28: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

28

Focus on Customer• Customer focus is a feature of every successful project

• Inadequate customer focus is a feature of many, if not most, unsuccessful projects

• Code projects are unsuccessful unless the users can solve real problems with their codes

• LLNL and LANL encourage customer focus by:– Organizationally embedding the code teams in user groups– Collocation of users and code teams– Encourages daily informal interactions– Involve users in the code development process, e.g. testing, setting

requirements, assessing progress,….– Code development teams share in success of users

• Trust between users and developers is essential

• Common management helps balance competing concerns and issues

• Incremental delivery has helped build and maintain customer focus

User

McBreen, 2001; D. Phillips, 1997: R. Thomsett, 2002; E. Verzuh, 1999

Page 29: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

29

Better Physics is the key• Paradigm shift from interpolation among nuclear

test data points to extrapolation to new regimes and conditions requires much better physics– Can’t rely on adjusting parameters to exploit

compensating errors

• Predictive ability of codes depends on the quality of the physics and the solution algorithms– Correct and appropriate equations for the problem,

physical data and models, accurate solutions of equations,…

• ASCI has funded a parallel effort to develop better algorithms and physical databases and physics models (turbulence, opacities, cross sections,..)

R. Laughlin, 2002;Post and Kendall, 2002

Page 30: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

30

Employ modern computer science techniques, but don’t do computer science research.

• Main value of the project is improved science (e.g. physics)• Implementing improved physics (3-D, higher resolution,

better algorithms, etc.) on the newest, largest massively parallel platforms that change every two years is challenging enough. Don’t increase the challenge!!!

• Use standard industrial computer science. Computer science must be a reliable and useable tool!!!!! — Don’t do computer science research

• Beyond state-of-the-art computer science (new frameworks, languages, etc.) has generally proven to be at best an impediment and at worst a project killer

• LANL spent over 50% of its code development resources on a project that had a major computer science research component. It was a massive failure.

T. Demarco, T. Lister (2002); Post and Cook, 2000: Post and Kendall, 2002

Page 31: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

31

Software Quality Engineering: Practices rather than Processes

• Attention to project management is more important• SQA and SQE are big issues for DOE and DOD codes (10 CFR 830,

etc.)• Stan Rifkin defines 3 categories of software (S. Rifkin, 2002) :

– Operationally excellent—one bug kills, airplane control software– Customer intimate—flexible with lots of features, MS Office– Product innovative—new and innovative capabilities, scientific codes

• Heavy emphasis on process necessary for “operationally excellent” software, where bugs are fatal

• Heavy process stifles innovative software—the kind of innovations necessary to solve computational scientific and technical problems

– Scientists trained to question authority, look for value added for any procedure

• ASCI had more success emphasizing “Best practices” than “Good processes”

• If the code team doesn’t implement SQA on their own terms, the sponsors may implement on theirs, and the teams won’t like it much less

DeMarco and Boehm, 2002; DPhillips,1997; Remer, 2000; DeMarco and Lister, 2002; Rifkin, 2002; Post and Cook, 2000: Post and Kendall, 2002

Page 32: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

32

Verification and Validation• Customers (e.g. DOD) want to know why they should

believe code results • Codes are only a model of reality• Verification and Validation are essential• Verification

– Equations are solved correctly– Mathematical exercise– Regression suites of test problems, convergence tests,

manufactured solutions, analytic test problems

• Validation– Ensure models reflect nature – Check code results with experimental data – NNSA is funding a large experimental program to provide validation

data• National Ignition Facility, DAHRT, ATLAS, Z,…

NIF

DAHRT

Roach, 1998; Roache, 2002; Salari and Knupp, 2000; Lindl, 1998; Lewis, 1992; Laughliin, 2002)

QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture.

Page 33: March 22, 2004 1 The Coming Crisis in Computational Science Douglass Post Los Alamos National Laboratory SCIDAC Workshop Charleston, South Carolina, March

March 22, 2004

33

Conclusions• If Computational Science is to fulfill its promise for society, it will need to

become as mature as theoretical and experimental methodologies.• Prediction Challenge

• Need to analyze past experiences, successes and failures, develop “lessons

learned” and implement them—DARPA HPCS doing case studies of ~ 20

major US code projects (DoD, DOE, NASA, NOAA, academia, industry,…)• Major lesson is that we need to improve:

•Verification•Validation

•Software Project Management and Software Quality

• Programming Challenge• HPC community needs to reduce the difficulty of developing codes for

modern platforms—DARPA HPCS developing new benchmarks,

performance measurement methodologies, encouraging new development

tools, etc.