You Got Your Engineering in my Data Science - Addressing the Reproducibility Crisis with Software...

YOU GOT YOUR ENGINEERING IN MY DATA SCIENCE

ADDRESSING THE REPRODUCIBILITY CRISIS WITH SOFTWARE ENGINEERING

1

WE SEE PATTERNS

2

SCIENCE USED TO BE A SOLO OPERATION…

3

THE OVERALL HIGGS ANALYSIS WAS PERFORMED BY A TEAM OF MORE THAN 600 PHYSICISTS.

“Who Really Found the Higgs Boson” - Neal Hartman, Nautilus, Issue 18

…BUT NOW IT’S NOT

4

DATA SCIENCE IMPROVES EVERYTHING

Clinical recommendations discouraging the use of CYP2D6 gene testing to guide tamoxifen therapy in breast cancer patients are based on studies with flawed methodology and should be reconsidered, according to the results of a Mayo Clinic study published in the Journal of the National Cancer Institute.

Joe Dangor, Mayo Clinic News Network, December 9, 2014

5

SEARCHING FOR PATTERNS

6

7

8

PROBLEMS WITH ANALYSIS TOOLS

FALSE POSITIVES IN FMRI RESEARCH

▸ After crunching the numbers, “we think that around 3,000 studies could be affected,” says Dr Eklund. But without revisiting each and every study, it is impossible to know which those 3,000 are.

9

PROBLEMS WITH PROCESS

PSYCHOLOGICAL RESEARCH

▸ “Estimating the reproducibility of psychological science”

▸ Brian Nosek, Science, August 2015

▸ 270 co-authors tried to reproduce 100 studies

▸ 36% could be reproduced

10

PROBLEMS WITH PROCESS

PSYCHOLOGICAL RESEARCH

“Nosek said there were three possible reasons for his results: that the original effect could have been false positive, that the replication was a false negative, or that both the original and replication results are accurate but that each experiment’s methodology differed in significant ways.”

- Colleen Flaherty, Inside Higher Ed, August 2015

11

PROBLEMS WITH DATA

11% OF STUDIES REPRODUCIBLE

12

PROBLEMS WITH DATA

“For results that could not be reproduced, however, data were not routinely analyzed by investigators blinded to the experimental versus control groups. Investigators frequently presented the results of one experiment, such as a single Western-blot analysis. They sometimes said they presented specific experiments that supported their underlying hypothesis, but that were not reflective of the entire data set. There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”

- C. Glenn Begley

13

IT CAN BE PROVEN THAT MOST CLAIMED RESEARCH FINDINGS ARE FALSE.

- John Ioannidis

14

THE REPRODUCIBILITY CRISIS

15

16

IT WORKS ON MY MACHINE

Every Single Software Developer Ever

REPRODUCIBILITY IN SOFTWARE ENGINEERING

17

VERSION YOUR CODE AND DATA

VERSION CONTROL

18
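Versioning data can start with something as small as recording a content hash for every input file next to the analysis code, so a later run can tell whether the data changed. The sketch below is one possible illustration using only the Python standard library; the function names (`file_sha256`, `build_manifest`, `changed_files`) and the manifest layout are assumptions made for this example, not something taken from the talk.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (by relative path) to its content hash."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files that were added, removed, or modified between two manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

Committing the manifest alongside the code ties each result to both a code revision and an exact set of input files.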

USE A BUILD SCRIPT

19
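For an analysis, a build script can be read as a single entry point that runs every step in a fixed order, from raw input to final numbers, with any randomness seeded. The following is a minimal sketch of that idea; the step names (`load_data`, `clean`, `analyze`, `build`), the synthetic data, and the seed value are all hypothetical stand-ins.

```python
import random
import statistics


def load_data(seed, n=100):
    """Stand-in for reading raw data: generates a seeded synthetic sample."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def clean(values, limit=3.0):
    """Drop values whose magnitude exceeds the limit."""
    return [v for v in values if abs(v) <= limit]


def analyze(values):
    """Compute the summary statistics reported by the analysis."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def build(seed=42):
    """Run every step, in order, from raw input to final result."""
    return analyze(clean(load_data(seed)))


if __name__ == "__main__":
    print(build())
```

Because nothing depends on manual steps or an unseeded random generator, running `build()` twice with the same seed should give the same result on the same Python version.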

REVIEW YOUR CODE

20

21

DEFINE STANDARD FORMATS

22
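A standard format helps most when code enforces it at the point where data enters the analysis. As a rough sketch, assuming a CSV input and a hypothetical three-column schema (`subject_id`, `age`, `score`), a loader can reject anything that does not match:

```python
import csv
import io

# Hypothetical schema: column name -> converter that raises on bad input.
SCHEMA = {"subject_id": str, "age": int, "score": float}


def parse_rows(text, schema=SCHEMA):
    """Parse CSV text, rejecting input whose header or values break the schema."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(schema):
        raise ValueError(f"expected columns {list(schema)}, got {reader.fieldnames}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, raw in enumerate(reader, start=2):
        try:
            rows.append({name: convert(raw[name]) for name, convert in schema.items()})
        except (TypeError, ValueError) as exc:
            raise ValueError(f"line {line_no}: {exc}") from exc
    return rows
```

Failing loudly on a malformed file is a deliberate choice here: a silently skipped or mis-typed row is much harder to notice after the results are published.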

FUZZING

23
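Fuzzing means feeding code random, often malformed input and checking that it fails only in the ways it is documented to fail. The sketch below is a deliberately small, hand-rolled version of the idea rather than a real fuzzing tool; `parse_measurement` is a hypothetical function under test, and the seed makes any failure repeatable.

```python
import random
import string


def parse_measurement(text):
    """Hypothetical function under test: parse 'value unit', e.g. '3.5 mg'."""
    parts = text.split()
    if len(parts) != 2:
        raise ValueError(f"expected 'value unit', got {text!r}")
    value, unit = parts
    return float(value), unit


def fuzz(func, runs=1000, seed=0, max_len=12):
    """Call func on random printable strings and collect unexpected exceptions."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        length = rng.randint(0, max_len)
        text = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            func(text)
        except ValueError:
            pass  # the documented way of rejecting bad input
        except Exception as exc:  # anything else is a bug worth reporting
            failures.append((text, exc))
    return failures
```

An empty list from `fuzz(parse_measurement)` means no unexpected exception type was seen in that run; it does not prove the parser is correct.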

USE IT, RELEASE IT

OPEN SOURCE

24

TAKE ADVANTAGE OF MODERN TECHNOLOGY

25

CREATING INTERACTIVE PUBLICATIONS

“Truly Interactive Science Publishing was shown to have enough educational value that readers were willing to invest in the needed set–up and learning phases. Problems encountered in network and computer speed can now be minimized by running the ISP software in a cloud computing environment which will minimize the dependence on local computer and network speeds. The social aspects of data sharing and the enlarged review process may be the hardest obstacles to overcome.”

- Dr. Michael Ackerman

26

PUTTING IT ALL TOGETHER

BEST PRACTICES FOR SOFTWARE ENGINEERING AND DATA SCIENCE

▸ Version your code and data

▸ Provide a build script

▸ Review your code

▸ Run automated positive and negative tests

▸ Stick to standards

▸ Use open source when you can

▸ Release your own work as open source when you can

▸ Take advantage of technology

27
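As an illustration of the "positive and negative tests" item in the list above: a positive test checks that valid input gives the expected answer, and a negative test checks that invalid input is rejected. This is a minimal sketch using Python's standard `unittest` module; `normalize` is a hypothetical function invented for the example.

```python
import unittest


def normalize(values):
    """Hypothetical function under test: scale values so they sum to 1."""
    total = sum(values)
    if not values or total == 0:
        raise ValueError("values must be non-empty with a non-zero sum")
    return [v / total for v in values]


class NormalizeTests(unittest.TestCase):
    def test_positive_known_input(self):
        # Positive test: valid input produces the expected output.
        self.assertEqual(normalize([1, 1, 2]), [0.25, 0.25, 0.5])

    def test_negative_empty_input(self):
        # Negative test: empty input must be rejected, not silently accepted.
        with self.assertRaises(ValueError):
            normalize([])

    def test_negative_zero_sum(self):
        # Negative test: a zero total would otherwise divide by zero.
        with self.assertRaises(ValueError):
            normalize([1, -1])


def run_tests():
    """Run the suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTests)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Running the suite from the build script on every change keeps the tests from drifting out of date.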

THERE IS NO SILVER BULLET

28

THANKS TO

▸ Andrew Schechtman-Rook

▸ Jacqueline Kazil

▸ Jeanie Drury

29

WHO AM I

JONATHAN BODNER

▸ Tech Fellow, Capital One

▸ jonathan.bodner@capitalone.com

▸ @jonbodner

30

Image and Content Credits:

2. http://www.telescope.com/assets/images/starcharts/2016-10-starchart_col.png

3. https://xkcd.com/1584/

4. http://nautil.us/issue/18/genius/who-really-found-the-higgs-boson

5. https://news.virginia.edu/content/capital-one-cio-talks-big-data-innovation-ahead-tonight-s-information-session, http://newsnetwork.mayoclinic.org/discussion/mayo-clinic-genotyping-errors-plague-cyp2d6-testing-for-tamoxifen-therapy/, https://www.google.com/patents/US8615473, https://www.bloomberg.com/news/articles/2016-09-20/microsoft-develops-ai-to-help-cancer-doctors-find-the-right-treatments

6. By Lokilech - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1804667

7. http://news.stanford.edu/news/2012/september/austen-reading-fmri-090712.html

8. http://www.popsci.com/science/article/2010-05/hollywood-science-how-your-brain-reacts-horror-movies

9. http://www.economist.com/news/science-and-technology/21702166-two-studies-one-neuroscience-and-one-palaeoclimatology-cast-doubt

11. https://www.insidehighered.com/news/2015/08/28/landmark-study-suggests-most-psychology-studies-dont-yield-reproducible-results

12. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html

14. http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

31

Image and Content Credits:

15. http://xkcd.com/1574/

16. https://www.flickr.com/photos/vannispen/4608436679

18. https://xkcd.com/1597/

20. https://xkcd.com/1695/

21. http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

22. https://xkcd.com/927/

23. https://www.flickr.com/photos/lamenta3/4349576638

24. https://www.flickr.com/photos/jalbertbowdenii/5682524083

25. http://quod.lib.umich.edu/j/jep/3336451.0018.201?view=text;rgn=main

28. https://www.flickr.com/photos/eschipul/4160817135

32
