DESCRIPTION
A very informal talk I gave to Hausi Muller's group at UVic in June 2009. I have included, without permission, slides from Daniel Hook's excellent presentation at SE-CSE 2009 (http://www.cs.ua.edu/~SECSE09/schedule.htm).
2. This presentation: Tell a good story
3. Get feedback from you
4. What am I missing?
5. "If we knew what we were doing, it wouldn't be called research, would it?" - attributed to Albert Einstein
6. What is climate modelling? A kind of computational science
7. What is computational science? (Wikipedia, computational science)
8. What is computational science?
9. Program outputs are the results of the experiment.
10. Scientific software development: my focus
11. What is climate modelling?
12. What is climate modelling? Source: Easterbrook, CUSEC'09 (Source: IPCC AR4, 2007)
13. What is climate modelling? Source: Easterbrook, CUSEC'09 (Source: IPCC AR4, 2007)
14. General Circulation Models. Source: Easterbrook, CUSEC'09 (Crown Copyright)
15. Scientific software development
16. Scientific software development
17. Verification and Validation
Science Review and Code Review
18. Code review by designated code owners
19. Model intercomparisons. Source: Easterbrook, CUSEC'09
20. Basic Validation Steps
Reproduce past climate change
Reproduce pre-historic climates
Source: Easterbrook, CUSEC'09
21. Validation Notes. Source: Easterbrook, CUSEC'09 (Crown Copyright)
22. Core problems with V&V
23. Validation notes: bit reproducibility. Core problems with V&V
24. In other words,
25. What does this mean for software quality?
26. (Note, we're not talking about model quality!)
27. How do we judge quality in scientific software?
28. Software Quality: We're used to thinking of the quality of software as how well it is designed and how well it matches our requirements.
29. Measuring Quality: Defect Density
30. Can we benchmark quality using defect density?
Preliminary observation: defect density for climate models is lower than comparably-sized industrial projects.
31. Hadley defect rates. Some comparisons:
NASA Space Shuttle: 0.1 failures/KLOC
Best military systems: 5 faults/KLOC
Worst military systems: 55 faults/KLOC
Apache: 0.5 faults/KLOC
XP: 1.4 faults/KLOC
Hadley's Unified Model: avg of 24 bug fixes per release; avg of 50,000 lines edited per release; 2 defects/KLOC make it through to released code
Source: Easterbrook, CUSEC'09
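To make the comparison concrete, here is a minimal sketch (Python; the function and the normalisation note are mine, while the per-release figures are the ones quoted above) of how a defects-per-KLOC number is computed:

    # The slide's figures can be normalised in different ways: defects per KLOC
    # of code *edited* in a release is not the same as defects per KLOC of the
    # released code base, so cross-project comparisons need a common denominator.
    def defect_density(defects_fixed: int, kloc: float) -> float:
        """Defects per thousand lines of code (KLOC)."""
        return defects_fixed / kloc

    # Illustrative only, reusing the per-release figures quoted above for the
    # Unified Model: ~24 bug fixes against ~50 KLOC of edited code.
    print(defect_density(24, 50.0))  # => 0.48 defects per KLOC of edited code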
32. Few Defects Post-release
33. Model crashes during a run
34. Model runs, but variables drift out of tolerance
35. Runs don't bit-compare (when they should); see the bit-compare sketch after this list
Subtle errors (model runs appear valid):
36. The right results for the wrong reasons (e.g. over-tuning)
37. Expected improvement not achieved
Source: Easterbrook, CUSEC'09
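The bit-compare check above amounts to a byte-for-byte comparison of the outputs of two runs that should be identical. A minimal sketch in Python, with hypothetical file names:

    import hashlib

    def file_digest(path: str) -> str:
        """SHA-256 of a file's raw bytes, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical output files from two runs that should be bit-identical
    # (same code, same inputs, same configuration).
    run_a = "run_a/ocean_temperature.nc"
    run_b = "run_b/ocean_temperature.nc"
    if file_digest(run_a) == file_digest(run_b):
        print("runs bit-compare")
    else:
        print("runs do NOT bit-compare")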
38. Measuring Quality: Defect Density. So, is climate modelling software really that good?
39. On Benchmarking
41. How do we factor in severity?
Bug type: what counts as a bug in one project is not necessarily a bug in another.
42. No standards in the literature
43. What are we measuring?
44. We could ask ...
45. We could ask ...
46. What makes a piece of software bad?
47. How do you know when you're done?
48. How do you train newcomers?
49. or...
50. When have you had to delay releasing due to a bug? Why?
51. Tell me the story behind these and other bugs...
52. Questions to ask...
Finding Activity: What activity discovered the problem or defect?
Finding Mode: How was the problem or defect found?
Problem Type: What is the nature of the problem? If a defect, what kind?
Criticality: How critical or severe is the problem or defect?
Related Changes: What are the prerequisite changes?
....
Why did the bug go unnoticed?
Why is it important to have fixed this bug then?
How was the bug fixed? Why is the fix appropriate?
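One way to picture these questions is as the record collected for each defect story; this is a hypothetical sketch (Python), not an instrument from the talk:

    from dataclasses import dataclass

    @dataclass
    class DefectStory:
        finding_activity: str   # what activity discovered the problem (e.g. code review, test run)
        finding_mode: str       # how it was found (e.g. crash, drift, failed bit-compare)
        problem_type: str       # nature of the problem; if a defect, what kind
        criticality: str        # how critical or severe the problem is
        related_changes: str    # prerequisite changes
        narrative: str          # why it went unnoticed, why fixing it mattered, how and why the fix is appropriate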
53. Why study climate modellers?
54. Already have connections with CM groups
55. Preliminary data suggesting the quality of their models is high:
57. My study:
58. What do climate modellers do when coding to guarantee correctness?
59. What are their notions of quality with respect to code?
60. How can we benchmark computational scientists' code quality?
61. My study:
62. Discover through bug reports and version control comments
Defect density over releases (trends); see the mining sketch after this list
63. Breakout by defect types (but, what are they?)
64. Maybe: static fault density using automated tool
65. Examine several climate models (>3?)
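A minimal sketch of what that mining could look like (Python driving git; the repository name, release tags, and keyword heuristic are assumptions for illustration, not the talk's actual procedure):

    import subprocess

    def commits_between(repo: str, old_tag: str, new_tag: str) -> list[str]:
        """Commit subject lines between two release tags."""
        out = subprocess.run(
            ["git", "-C", repo, "log", "--pretty=%s", f"{old_tag}..{new_tag}"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.splitlines()

    def bugfix_count(subjects: list[str]) -> int:
        """Crude keyword heuristic for identifying bug-fix commits."""
        keywords = ("fix", "bug", "defect", "correct")
        return sum(any(k in s.lower() for k in keywords) for s in subjects)

    # Hypothetical repository and release tags for one climate model.
    subjects = commits_between("climate-model", "v7.0", "v7.1")
    print(f"{bugfix_count(subjects)} bug-fix commits out of {len(subjects)} in this release")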
66. My study:
67. Use questions given previously to guide conversation
68. Investigate the story of a defect and the judgement calls that were made
69. ~5 defect stories per climate modelling centre
70. Cross-case analysis
71. Outcomes
72. Future: a relevant quality benchmark. Benchmarking statistics for climate modelling code
Learn from CS; where can we help?
73. Questions?
74. Objective of the study?
75. Issues with the study itself?
76. What would I look for? Others?