LLNL-PRES-676991

Responding to the Software Crisis in DOE Scientific Computing

September 14, 2015

DOE, Germantown, MD.

Gregory Pope, CSQE

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC




Topics

1. ASC Program Software Engineering Lessons Learned at LLNL

2. Modeling Code Reliability Requirements for Exascale

3. Study Group Results

4. Next Steps, Risk Management


ASC Program Lessons Learned at LLNL

Software Engineering progress came in stages:

(1) Whine, Nag, Know It All

(2) Documentation

(3) Measurement

(4) Measurement Based Improvements

(5) Improvement Implementers

To get this paper: A Software Quality Engineering Maturity Model, Pope and Hill, LLNL-CONF-413143

http://silverbuckshot.net/SQE-Model


ASC Program Lessons Learned at LLNL – Level 1

Top-down Theory X management style is less effective with knowledge workers.


ASC Program Lessons Learned at LLNL – Level 1

Whining, nagging, and know-it-all behavior are ineffective with knowledge workers.

"You must do it my way because I say so."


ASC Program Lessons Learned at LLNL

DOE/NNSA Compliance is Confusing – Level 2

Risk Management: 2.1
Planning: 2.6
Design Reviews: 3.3.4
Procurement: 3.6
SQA Process: 3.16 a
Applicable Parts: 3.16 b
Safety and Weapon Related: 3.16 c
Complexity and Risk of Failure: 3.16 d
Risk Based Approach: 3.16 e
Methodology: 3.16 f
Requirements: 3.16 g
Software CM: 3.16 h
Software Verification and Validation: 3.16 i

Software PM and Quality Planning: WA 1
Software Risk Management: WA 2
Software Configuration Management: WA 3
Procurement and Supplier Management: WA 4
Requirements Identification and Management: WA 5
Software Design and Implementation: WA 6
Software Safety: WA 7
Software Verification and Validation: WA 8
Problem Reporting and Corrective Action: WA 9
Training: WA 10

Management Reviews: 1.1
Peer Reviews: 1.2
Unit Testing: 1.3
Integration Testing: 1.4
Regression Testing: 1.5
User Acceptance Testing: 1.6
Training Verify. Methods Techniques: 1.7
Project Planning: 2.1
Requirements Management: 2.2
Design Management: 2.3
Implementation: 2.4
Maintenance: 2.5
Collaborations: 2.6
Process Improvement: 2.7
Disaster Recovery: 2.8
Training Software Practice Methods: 2.9
Configuration Control: 3.1
Release Management: 3.2
Training Configuration Management: 3.3


ASC Program Lessons Learned at LLNL

Compliance Flow Down and Simplification – Level 2

[Diagram: compliance flow-down. Labels: DOE O 414.1D; NAP-24; LLNL ISQAP; LLNL ASC Required Practices; ASC SQE: Guidelines; SQAP Phys. Code 1; SQAP Phys. Code 2; SQAP Phys. Code 3; SQAP Eng. Code; Many SQAPs Lib/Feeders; Agility/Discipline; Risk Grade]


ASC Program Lessons Learned at LLNL

ASC Program SE Lessons Learned at LLNL – Level 3

Group of dedicated Software Quality Engineers (SQEs).

Embedded SQEs into the development teams.

Make compliance easy and automated.

Find and roll out tools and processes.

SQEs are most effective if they are developers themselves.

SQEs are now being requested by code teams to focus on automated building, testing, release, static and dynamic code analysis, and tool research.


ASC Program Lessons Learned at LLNL

Dual Reporting Chain – Level 3

[Organization chart. Entries:]

Charlie Verdon: WCI PAD
Joe Sefcik: ASC V&V
Mike McCoy: ASC Program Manager
Chris Clouse: Computational Physics Lead
Project Lead: Physics Code 1
Project Lead: Physics Code 2
Project Lead: Engineering
Project Lead: Physics Code 3
Gregory Pope: SQE Project Lead
Evelyn Chen: Library Data (50%)
Tammy Dahlgren: Tools, UQ Pipeline
Michael Chang: Engineering, Static Analysis
Bill Aimonetti: Physics Code 1
Ellen Hill: Physics Code 2 and Feeders
Stephanie Dempsey: Physics Code 3, ATS


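ASC Program Lessons Learned at LLNL

Measuring Things – Level 3

Major Codes, one row per code, listed by lead SQE:

Lead SQE | Req | Des | Imple | Oper | Maint | Retire | SCM | V&V | PM | Overall
RG | 6 | 6 | 9 | 8 | 8 | n/a | 8 | 8 | 6 | 7.4
JH | 5 | 5 | 8 | 8 | 8 | n/a | 7 | 8 | 6 | 6.9
NK | 7 | 7 | 8 | 8 | 8 | n/a | 8 | 8 | 6 | 7.5
BO | 6 | 3 | 6 | 8 | 8 | n/a | 6 | 6 | 4 | 5.9
BO | 6 | 4 | 8 | 8 | 8 | | 8 | 7 | 6 | 6.9
EH | 5 | 4 | 8 | 7 | 7 | n/a | 7 | 7 | 6 | 6.4
EH | 4 | 2 | 6 | 8 | 8 | n/a | 5 | 6 | 4 | 5.4
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 7 | 3 | 5.1
BO | 6 | 6 | 8 | 8 | 7 | n/a | 7 | 7 | 6 | 6.9
BO | 6 | 6 | 8 | 8 | 7 | n/a | 7 | 7 | 7 | 7.0
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
SA | 4 | 2 | 5 | 7 | 7 | n/a | 6 | 5 | 3 | 4.9
EH | 6 | 6 | 6 | 7 | 7 | n/a | 8 | 7 | 6 | 6.6
RG | 5 | 3 | 8 | 8 | 9 | n/a | 6 | 6 | 4 | 6.1
AVERAGES | 5.0 | 3.7 | 6.6 | 7.5 | 7.4 | n/a | 6.6 | 6.3 | 4.6 | 6.0

One SA row shows only two scores (7 and 6) and an overall of 6.5; its column positions are not recoverable.

Entries listing a lead SQE but no scores: BO (12), EH (4).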
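The Overall figures appear to be the plain mean of the scored phases in each row, skipping cells that are n/a or blank. That is an inference from the numbers, not something the slide states. A minimal Python sketch under that assumption (the function name `overall_score` is hypothetical):

```python
from decimal import Decimal, ROUND_HALF_UP

def overall_score(scores):
    """Mean of the scored phases, rounded half-up to one decimal.

    `scores` holds one table row; None marks an n/a or blank cell
    and is skipped. Assumption: Overall is an unweighted mean,
    inferred from the tabulated values.
    """
    rated = [Decimal(s) for s in scores if s is not None]
    if not rated:
        return None
    mean = sum(rated) / Decimal(len(rated))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# First row of the table (lead SQE "RG"); None marks the n/a Retire cell.
rg_row = [6, 6, 9, 8, 8, None, 8, 8, 6]
print(overall_score(rg_row))  # 7.4
```

Half-up rounding is used because several tabulated values (for example 7.25 shown as 7.3) would come out differently under Python's default rounding of binary floats.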


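ASC Program Lessons Learned at LLNL

Improve Things – Level 4

Lead SQE | Req | Des | Imple | Oper | Maint | Retire | SCM | V&V | PM | Overall
SD | 9 | 9 | 9 | 10 | 10 | n/a | 10 | 10 | 10 | 9.6
EH | 10 | 10 | 10 | 10 | 10 | n/a | 10 | 10 | 10 | 10.0
NK | 10 | 10 | 10 | 10 | 10 | n/a | 10 | 10 | 10 | 10.0
BO | 9 | 6 | 9 | 10 | 10 | n/a | 10 | 10 | 7 | 8.9
BO | 6 | 6 | 8 | 8 | 8 | | 8 | 7 | 6 | 7.1
EH | 7 | 6 | 8 | 7 | 7 | n/a | 9 | 7 | 6 | 7.1
EH | 7 | 6 | 8 | 8 | 8 | | 7 | 6 | 8 | 7.3
EH | 9 | 7 | 9 | 9 | 9 | n/a | 10 | 10 | 10 | 9.1
EH | 9 | 7 | 9 | 9 | 9 | n/a | 10 | 10 | 10 | 9.1
EH | 9 | 7 | 9 | 9 | 9 | n/a | 10 | 10 | 10 | 9.1
EH | 9 | 7 | 9 | 9 | 9 | n/a | 10 | 10 | 10 | 9.1
EH | 9 | 7 | 9 | 9 | 9 | n/a | 10 | 10 | 10 | 9.1
EH | 5 | 5 | 6 | 9 | 9 | n/a | 10 | 10 | 10 | 8.0
EH | 8 | 8 | 9 | 9 | 9 | n/a | 10 | 9 | 9 | 8.9
BO | 4 | 4 | 5 | 7 | 7 | n/a | 6 | 7 | 5 | 5.6
BO | 7 | 6 | 9 | 9 | 8 | n/a | 10 | 10 | 6 | 8.1
BO | 7 | 6 | 9 | 9 | 8 | n/a | 10 | 10 | 7 | 8.3
BO | 7 | 7 | 9 | 9 | 9 | | 8 | 7 | 7 | 7.9
BO | 7 | 7 | 9 | 9 | 9 | | 8 | 7 | 7 | 7.9
GP | 7 | 6 | 5 | 7 | 7 | n/a | 5 | 6 | 4 | 5.9
GP | 7 | 6 | 5 | 7 | 7 | n/a | 5 | 6 | 4 | 5.9
GP | 7 | 6 | 5 | 7 | 7 | n/a | 5 | 6 | 4 | 5.9
GP | 7 | 6 | 5 | 7 | 7 | n/a | 5 | 6 | 4 | 5.9
GP | 7 | 6 | 5 | 7 | 7 | n/a | 5 | 6 | 4 | 5.9
EH | 10 | 7 | 9 | 7 | 7 | n/a | 10 | 10 | 9 | 8.6
EH | 8 | 8 | 9 | 10 | 10 | n/a | 10 | 10 | 8 | 9.1
JH | 6 | 6 | 8 | 8 | 8 | | 8 | 7 | 6 | 7.1
BO | 7 | 6 | 9 | 9 | 8 | n/a | 10 | 10 | 7 | 8.3
BO | 7 | 6 | 8 | 8 | 9 | | 9 | 8 | 7 | 7.8
SD | 9 | 5 | 8 | 8 | 9 | n/a | 10 | 10 | 10 | 8.6
EH | 9 | 9 | 10 | 10 | 10 | n/a | 10 | 10 | 9 | 9.6
AVERAGES | 7.7 | 6.6 | 8.0 | 8.5 | 8.4 | n/a | 8.6 | 8.5 | 7.5 | 8.0

Entries listing a lead SQE but no scores: BO (3), EH (3), JH (2).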


Major Codes

Feeder Codes

2008 to 2015


ASC Program Lessons Learned at LLNL

Implement Improvements – Level 5

Cannot check in code without passing smoke test.

Cannot run smoke test without a peer review check.

Dedicated System Testing resources.

Distributed CM tools like Stash/Git to product main branch.

Design trade-offs embedded into code – e.g. Doxygen.

Requirements tracked in tracking and linkage tools – e.g. JIRA.

Acquisition and maintenance of automated testing tools – e.g. ATS.

Nightly or CI automated test reports/DB, visualization.

Static and Dynamic Code Analysis.

Create records as artifacts of Value Added Tools.
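The first two rules on this slide chain together: no check-in without a passing smoke test, and no smoke test without a peer review. A minimal Python sketch of such a gate follows. The `Change` record and the `run_smoke_test` callable are hypothetical stand-ins for illustration, not the actual LLNL tooling.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Change:
    """Hypothetical record for a proposed code change."""
    author: str
    reviewers: tuple = ()

def peer_reviewed(change: Change) -> bool:
    # A review counts only if someone other than the author signed off.
    return any(r != change.author for r in change.reviewers)

def can_check_in(change: Change, run_smoke_test: Callable[[Change], bool]) -> bool:
    """Apply the two gates in order: peer review first, then smoke test."""
    if not peer_reviewed(change):
        return False  # the smoke test is never run without a review
    return run_smoke_test(change)

if __name__ == "__main__":
    always_pass = lambda change: True
    print(can_check_in(Change("alice"), always_pass))            # False
    print(can_check_in(Change("alice", ("bob",)), always_pass))  # True
```

Ordering the checks this way means the cheaper human sign-off is verified before any test resources are spent.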


Modeling Code Reliability Requirements

Application Domain | Number Projects | Error Range (Errors/KESLOC) | Normative Error Rate (Errors/KESLOC) | Notes
Automation | 55 | 2 to 8 | 5 | Factory automation
Banking | 30 | 3 to 10 | 6 | Loan processing, ATM
Command & Control | 45 | 0.5 to 5 | 1 | Command centers
Data Processing | 35 | 2 to 14 | 8 | DB-intensive systems
Environment/Tools | 75 | 5 to 12 | 8 | CASE, compilers, etc.
Military - All | 125 | 0.2 to 3 | < 1.0 | See subcategories
Military - Airborne | 40 | 0.2 to 1.3 | 0.5 | Embedded sensors
Military - Ground | 52 | 0.5 to 4 | 0.8 | Combat center
Military - Missile | 15 | 0.3 to 1.5 | 0.5 | GNC system
Military - Space | 18 | 0.2 to 0.8 | 0.4 | Attitude control system
Scientific | 35 | 0.9 to 5 | 2 | Seismic processing
Telecommunications | 50 | 3 to 12 | 6 | Digital switches
Test | 35 | 3 to 15 | 7 | Test equipment, devices
Trainers/Simulations | 25 | 2 to 11 | 6 | Virtual reality simulator
Web Business | 65 | 4 to 18 | 11 | Client/server sites
Other | 25 | 2 to 15 | 7 | All others

Reifer, Donald, “Industry Software Cost, Quality, and Productivity Benchmarks,” DoD Software Tech News, June 2004, http://www.softwaretechnews.com/stn7-2/reifer.html.


Tracking and Comparing Fault Density on ASC

Physics Code | Estimated | Measured
A | 3.0 | 2.5
B | 3.5 | 1.6
C | 2.0 | .91

Benchmark row for comparison:

Application Domain | Number Projects | Error Range (Errors/KESLOC) | Normative Error Rate (Errors/KESLOC) | Notes
Scientific | 35 | 0.9 to 5 | 2 | Seismic processing

So how long before a code with .91 defects/KSLOC fails?
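The comparison on this slide can be sketched in a few lines of Python: compute a fault density as defects per thousand source lines, then check it against the benchmark range for the Scientific domain. The function names and the defect counts in the last line are hypothetical; only the measured densities and the 0.9 to 5 range come from the slide.

```python
def defects_per_ksloc(defects: int, sloc: int) -> float:
    """Fault density: defects per thousand source lines of code."""
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    return defects / (sloc / 1000)

# Error range for the Scientific domain, from the benchmark row.
SCIENTIFIC_RANGE = (0.9, 5.0)

def within_range(density: float, bounds=SCIENTIFIC_RANGE) -> bool:
    """True if the density falls inside the benchmark range (inclusive)."""
    low, high = bounds
    return low <= density <= high

# Measured fault densities for the three physics codes on the slide.
measured = {"A": 2.5, "B": 1.6, "C": 0.91}
for code, density in measured.items():
    status = "within range" if within_range(density) else "outside range"
    print(code, density, status)

# Hypothetical counts, for illustration only: 91 defects in 100,000 SLOC.
print(defects_per_ksloc(91, 100_000))
```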


Converting Defect Density (.91) to Failure Rate

Single 1.6GHz Processor (12.8 GFLOPS), 578K SLOCs

Hours | Years | Probability of Failure Free | Probability of Failure
1 | | 99.99% | 0.01%
3 | | 99.97% | 0.03%
8 | | 99.91% | 0.09%
160 | | 98.24% | 1.76%
480 | | 94.81% | 5.19%
1000 | | 89.48% | 10.52%
5000 | 0.6 | 57.37% | 42.63%
10000 | 1.1 | 32.91% | 67.09%
50000 | 5.7 | 0.39% | 99.61%
100000 | 11.4 | 0.00% | 100.00%

Probability of Failure Free in 8 hours = 99.91%

Get the Article: http://onlinelibrary.wiley.com/doi/10.1002/0471028959.sof327/abstract

Encyclopedia of Software Engineering, William W. Everett, John D. Musa
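The tabulated probabilities for the single-processor case are consistent with a constant failure rate, R(t) = exp(-rate * t). The Python sketch below assumes that exponential form, which is an inference from the numbers rather than the derivation given in the cited article, and backs the rate out of the 1000-hour entry. The function names are illustrative.

```python
import math

def rate_from_point(hours: float, p_failure_free: float) -> float:
    """Failure rate (per hour) implied by one table entry,
    assuming R(t) = exp(-rate * t)."""
    return -math.log(p_failure_free) / hours

def reliability(hours: float, rate: float) -> float:
    """Probability of running `hours` without a failure."""
    return math.exp(-rate * hours)

# Calibrate on the single-processor 1000-hour entry (89.48% failure free).
rate = rate_from_point(1000, 0.8948)

for hours in (1, 8, 160, 5000, 10000):
    r = reliability(hours, rate)
    print(f"{hours:>6} h  failure free {r:7.2%}  failure {1 - r:7.2%}")
```

Running it reproduces the other rows of the table to within rounding, which is the only evidence offered here for the exponential assumption.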


Converting Defect Density to Failure Rate

100 Processor Case

100 1.6GHz Processors (1.28 TFLOPS), 20K SLOCs

Probability of Failure Free in 8 hours = 99.82%

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 100.00% |
0.016666667 (1 minute) | | 100.00% |
1 | | 99.98% | 0.02%
3 | | 99.93% | 0.07%
8 | | 99.82% | 0.18%
160 | | 96.50% | 3.50%
480 | | 89.87% | 10.13%
1000 | | 80.05% | 19.95%
5000 | 0.6 | 32.87% | 67.13%
10000 | 1.1 | 10.80% | 89.20%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%


Failure Rate Versus Speed (Today)

[Figure: Today's .91 per KSLOC Reliability Versus Number Processors and FLOPS. Y axis: Probability of Failure Free 8 Hours, 0% to 100%. X axis: Number of Cores, 1 to 1,000,000, with corresponding FLOPS from 12.8 GF to 12.8 PF. Annotation: Sequoia = 17PF.]
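In the 100- to 1,000,000-processor tables of this deck, the implied failure rate grows in proportion to the processor count. The Python sketch below assumes that linear scaling together with an exponential reliability model, and calibrates the per-processor rate from the 100-processor, 1000-hour entry (80.05% failure free). Both assumptions are inferred from the tabulated values rather than stated on the slides, and the names are illustrative.

```python
import math

# Per-processor failure rate (per hour), backed out of the
# 100-processor table: 80.05% failure free at 1000 hours.
RATE_PER_PROCESSOR = -math.log(0.8005) / (1000 * 100)

def p_failure_free(processors: int, hours: float = 8.0) -> float:
    """Probability of a failure-free run, assuming the failure rate
    scales linearly with the number of processors."""
    return math.exp(-RATE_PER_PROCESSOR * processors * hours)

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} processors: {p_failure_free(n):7.2%} failure free for 8 hours")
```

The single-processor table uses a different code size (578K SLOCs), so it is deliberately left out of this calibration.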


Failure Rate Versus Speed (Future)

[Figure: Needed .01 per KSLOC Reliability Versus Number Processors and FLOPS. Y axis: Probability of Failure Free 8 Hours, 0% to 100%. X axis: Number of Cores, 1 to 100,000,000, with corresponding FLOPS from 12.8 GF to 1.28 EF.]


Some Thoughts 

Need code 10 times better than space shuttle.

Assuming when we find bugs we fix them.

Then we must find and fix bugs up to 100 times faster.

We must improve our abilities to prevent, find, and fix bugs or have short simulations.

Blue Gene/Q (Sequoia) is 17 PFLOPS Rmax; we have not come close to reaching capacity.


Study Group Ground Rules

The technique used was to solicit elevator speeches: short, concise write-ups done as if the author were a speaker with only a few minutes to convince a decision maker of their top issues.


Study Group Contributors: Multiple Labs, Academia, DOE.

Tom McAbee, LLNL
Patty Loo, INL
Andy Salinger, SNL
Rich Hornung, LLNL
Jeffery Carver, University of Alabama
Rob Neely, LLNL
Robert Blyth, DOE-ID
David E. Bernholdt, ORNL
Roscoe A. Bartlett, ORNL
Michael Heroux, SNL
Jim Willenbring, SNL
Tamara Dahlgren, LLNL
Ellen Hill, LLNL
Greg Pope, LLNL
Burl Hall, LLNL
David Jefferson, LLNL
Stephanie Dempsey, LLNL
Tom Epperly, LLNL
Mark Miller, LLNL
Jeff Keasler, LLNL

To get the report: http://silverbuckshot.net/study-group


Study Group Highest Results (2.8‐2.5)

Using Appropriate Software Engineering Practices.

Upping Perceived Values of Software Engineering.

Adequate Testing Resources.

Use of Graded Approach.

Computational science community needs more awareness, knowledge, understanding, and experience with good software engineering practices for scientific applications. 


Next Steps, Risk Management

RISK: Development in hero mode and code correctness/maintainability suffer.
MITIGATION: Adopt and follow a team-based software development model.

RISK: Best practices not communicated.
MITIGATION: Hold Best Practices meetings, exclude management.

RISK: Testing treated as an afterthought.
MITIGATION: Plan for robust testing environment, separate from release and development.

RISK: Level of rigor too high or too low.
MITIGATION: Use risk grading tool, compliance automated with tools.

RISK: Contemporary tools not available.
MITIGATION: Target use of open source and COTS tools.


Conclusion

Good Software Engineering requires an organization and dedicated resources; it does not happen without effort.

We will need to get much better at Software Engineering practices for scientific codes (Error prevention, detection, and removal).

Our DOE complex developers know this.

Manage today’s concerns as risk mitigation, learning and improving as we go.


Back Up Slides


Converting Defect Density to Failure Rate

1,000 Processor Case

1000 1.6GHz Processors (12.8 TFLOPS), 20K SLOCs

Probability of Failure Free in 8 hours = 98.24%

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 100.00% |
0.016666667 (1 minute) | | 100.00% |
1 | | 99.78% | 0.22%
3 | | 99.33% | 0.67%
8 | | 98.24% | 1.76%
160 | | 70.05% | 29.95%
480 | | 34.37% | 65.63%
1000 | | 10.80% | 89.20%
5000 | 0.6 | 0.00% | 100.00%
10000 | 1.1 | 0.00% | 100.00%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%


Converting Defect Density to Failure Rate

10,000 Processor Case

10000 1.6GHz Processors (128 TFLOPS), 20K SLOCs

Probability of Failure Free in 8 hours = 83.69%

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 100.00% |
0.016666667 (1 minute) | | 99.96% |
1 | | 97.80% | 2.20%
3 | | 93.54% | 6.46%
8 | | 83.69% | 16.31%
160 | | 2.84% | 97.16%
480 | | 0.00% | 100.00%
1000 | | 0.00% | 100.00%
5000 | 0.6 | 0.00% | 100.00%
10000 | 1.1 | 0.00% | 100.00%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%


Converting Defect Density to Failure Rate

100,000 Processor Case

100000 1.6GHz Processors (1.28 PFLOPS), 20K SLOCs

Probability of Failure Free in 8 hours = 16.86%

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 99.99% |
0.016666667 (1 minute) | | 99.63% |
1 | | 80.05% | 19.95%
3 | | 51.30% | 48.70%
8 | | 16.86% | 83.14%
160 | | 0.00% | 100.00%
480 | | 0.00% | 100.00%
1000 | | 0.00% | 100.00%
5000 | 0.6 | 0.00% | 100.00%
10000 | 1.1 | 0.00% | 100.00%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%


1000000 1.6GHz Processors (12.8 PFLOPS), 20K SLOCs

Probability of Failure Free in 8 hours = 0%

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 99.94% |
0.016666667 (1 minute) | | 96.36% |
1 | | 10.80% | 89.20%
3 | | 0.13% | 99.87%
8 | | 0.00% | 100.00%
160 | | 0.00% | 100.00%
480 | | 0.00% | 100.00%
1000 | | 0.00% | 100.00%
5000 | 0.6 | 0.00% | 100.00%
10000 | 1.1 | 0.00% | 100.00%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%


Achieving Today's Failure Free Rates on Tomorrow's Computers

So what defect rate would be necessary for >90% failure free for 8 hours?

Fault density of .05 gives 94.77% failure free for 8 hours.

18.2 times better than our best scientific research code.

Half the defect rate of the Space Shuttle code.

Hours | Years | Probability of Failure Free | Probability of Failure
0.000277778 (1 second) | | 100.00% |
0.016666667 (1 minute) | | 99.99% |
1 | | 99.33% | 0.67%
3 | | 98.00% | 2.00%
8 | | 94.77% | 5.23%
160 | | 34.14% | 65.86%
480 | | 3.98% | 96.02%
1000 | | 0.12% | 99.88%
5000 | 0.6 | 0.00% | 100.00%
10000 | 1.1 | 0.00% | 100.00%
50000 | 5.7 | 0.00% | 100.00%
100000 | 11.4 | 0.00% | 100.00%
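The question on this slide can be inverted under an exponential reliability model: for a target probability R of running t hours failure free, the failure rate may be at most -ln(R)/t. The Python sketch below assumes that model, which matches the tabulated values but is not stated on the slide; the function names are illustrative.

```python
import math

def max_failure_rate(target: float, hours: float) -> float:
    """Largest constant failure rate (per hour) that still gives
    `target` probability of `hours` failure-free operation,
    assuming R(t) = exp(-rate * t)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return -math.log(target) / hours

def mean_time_between_failures(rate: float) -> float:
    """MTBF in hours for a constant failure rate."""
    return 1.0 / rate

rate = max_failure_rate(0.90, 8)
mtbf = mean_time_between_failures(rate)
print(f"rate <= {rate:.5f} failures/hour, MTBF >= {mtbf:.1f} hours")
```

Under this assumption, the 94.77% entry in the table corresponds to a somewhat lower rate than the 90% threshold requires, so a fault density of .05 clears the target with margin.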


Study Group Results (2.4‐2.3)

HPC Software Engineering Centers

Tools for Automated Processes

Staffing, Hiring, and Retention

Standards Support

Management of Advanced Code Projects

Simulation Ensembles


Study Group Results (2.2‐2.0)

Schedule Expectations

Maintaining a large scientific software base on multiple architectures

Data, Libraries, Operating Systems

COTS Tools

Onboard SQA

Data and Algorithm Organization in Physics Codes

Lean/Agile Lifecycle Model for Research Driven CSE/HPC software

Visual Debugging


Study Group Results (1.9‐1.8)

Fine Grained Parallelism Challenges

Managing Ever Increasing Need for Compliance Driven Agile Development

Compliance to DOE Standards

Thread Rescoping Tool

Attributes of a highly Effective Software Environment


Study Group Ranking

Management:

22 | 2.2 | 2 8 | Schedule Expectation [1]
24 | 2.4 | 5 4 1 | Staffing [2]; Hiring and Retention of Staff [3]
19 | 1.9 | 2 5 3 | Managing the Ever‐Increasing Need for Compliance‐Driven Agile Development [4]
23 | 2.3 | 4 5 1 | Management of Advanced Code Projects [5]

Process:

27 | 2.7 | 7 3 | Graded Approach [6]
28 | 2.8 | 9 1 | Use of Appropriate Software Engineering Practices [7]
28 | 2.8 | 8 2 | Upping the Perceived Value of Software Engineering [8]
16 | 1.8 | 3 1 5 | Attributes of a highly effective software environment [9]
28 | 2.5 | 7 3 1 | The computational science community needs more awareness, knowledge, understanding, and experience with best practices in software engineering [10]
20 | 2.0 | 3 4 3 | Lean/Agile lifecycle model for research‐driven CSE/HPC software [11]
24 | 2.4 | 5 4 1 | HPC Software Engineering Centers [12]
19 | 1.9 | 4 1 5 | Compliance to DOE Standards [13]
23 | 2.3 | 4 5 1 | Standards support [14]

Tools:

21 | 2.1 | 2 7 1 | COTS Tools [15]
21 | 2.1 | 3 5 2 | On Board SQA [16]
22 | 2.2 | 5 2 3 | Maintaining a large scientific software base on multiple architectures (RAJA) [17]
24 | 2.4 | 5 4 1 | Tools for Automating Processes [18]
23 | 2.3 | 4 5 1 | Simulation Ensembles [19]
18 | 1.8 | 2 4 4 | Thread Rescoping Tool [20]
19 | 1.9 | 3 3 4 | Fine‐grained parallelism challenges [21]
20 | 2.0 | 2 6 2 | Visual Debugging [22]
27 | 2.7 | 8 1 1 | Testing Resources [23]

Data:

22 | 2.2 | 2 8 | Data, Libraries, and Operating Systems [24]; Libraries [25]; Operating Systems [26]
17 | 2.1 | 3 3 2 | Data and algorithm organization in physics codes [25]
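The leading numbers in each ranking row are consistent with a weighted tally: the trailing numbers read as counts of 3-, 2-, and 1-point votes, the first number as the point total, and the second as the average per voter. The slide does not label these columns, so this reading is an assumption. A short Python sketch of it, with a hypothetical function name:

```python
def rank_topic(threes: int, twos: int, ones: int):
    """Return (total points, average per voter) for one topic,
    assuming each vote is worth 3, 2, or 1 points."""
    voters = threes + twos + ones
    if voters == 0:
        raise ValueError("no votes recorded")
    total = 3 * threes + 2 * twos + 1 * ones
    return total, round(total / voters, 1)

# The Staffing row: 5 three-point, 4 two-point, and 1 one-point vote.
print(rank_topic(5, 4, 1))  # (24, 2.4)
```

Rows with fewer than ten voters, such as the last Data row with eight, still reproduce the listed average under this reading.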


Next Steps

RISK: Schedule overly optimistic

RISK: Inability to attract CS top talent

RISK: Attrition of existing employees

RISK: Standards and compliance overly restrictive

RISK: Lack of SE experience 

RISK: Best practices not communicated

RISK: Contemporary tools not available 

RISK: Dev approach not agile enough

RISK: Level of rigor too high or too low

RISK: Development in hero mode and code correctness/maintainability suffer.


Risk Management 

RISK: Funding for hardening, productizing, and maintaining tools is not available.

RISK: Compliance to standard will be cumbersome

RISK: Latest compilers, libraries, and operating systems not available or supported

RISK: Not able to take advantage of mainstream productivity tools

RISK: Insufficient resources to support user needs 

RISK: Manual recoding adds defects to reliable codes

RISK: Not testing all platform types supported

RISK: Not having sufficient simulation result management tools

RISK: Code modification for porting causes decrease in code reliability


Risk Management 

RISK: Not having a quick way to identify anomalies at scale decreases productivity

RISK: Testing treated as an afterthought 

RISK: Development environment overly specialized or expensive

RISK: Architecture sprawls and is not optimized
