Tools for reproducible and accessible science

VMs, KnitR and OMERO

Rob Davidson

Cardiac Physiome Workshop, Auckland, April 8th 2015

All Your Research Objects

• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• RAW data
• Meta-data
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text

Source: DOI: 10.6084/m9.figshare.1330219

GigaSolution: deconstructing the paper

Combines and integrates:

Open-access journal

Data Publishing Platform

Data Analysis Platform

Today’s message

• Tools that fit with GigaDB
– General purpose Research Object store

• Enhancing
– Accessibility
– Reproducibility

• Of some of your research objects
– Software
– Images

Problems with scientific software - reproducibility

Measuring software reproducibility

• Systematic study:
– 515 papers (429 conference, 86 journal)
– <30% reproducible

Source: http://reproducibility.cs.arizona.edu

Reasons for failure

“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

Source: http://reproducibility.cs.arizona.edu

Cost of failure

• Wasted time
• Wasted money
– Ioannidis 2014: 85% of resources wasted
• Frustration
• Distrust

DOI: 10.1371/journal.pmed.1001747

Literate programming - KnitR

Literate programming

• “Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
– Donald E. Knuth, Literate Programming, 1984

Literate programming options

• See listing: http://www.gigasciencejournal.com/content/3/1/19
– R: KnitR, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy

KnitR is versatile

• Diagram labels: R, Python, Ruby, Haskell, Perl, SAS, CoffeeScript, command line, .txt, LaTeX, HTML, D3.js, R Markdown, HTML5 slides, WordPress, any text?

KnitR – how does it work?

• Code chunks
– Basic text (or LaTeX or Markdown), interrupted by ‘chunks’ of code
• For LaTeX, similar to Sweave

…some text \Sexpr{rfunc(var)} more text…
…some text
<<language, chunk_name, chunk_options>>=
Some code
@

• Process this combined text/code with knit() in R
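
The weave step can be pictured with a toy sketch. The Python below is not how knitr is implemented; it only assumes Sweave-style chunk markers like those on this slide, runs each chunk, splices in its printed output and then fills in any \Sexpr{} expressions. The function name weave, the sample document and its numbers are made up for illustration.

```python
import contextlib
import io
import re

# A Sweave-style chunk: a "<<options>>=" header line, the code body,
# and a closing "@" at the start of a line.
CHUNK = re.compile(r"^<<(?P<opts>[^>]*)>>=\n(?P<code>.*?)^@[ \t]*\n?", re.M | re.S)
# An inline expression written in the \Sexpr{...} style.
INLINE = re.compile(r"\\Sexpr\{(?P<expr>[^}]*)\}")


def weave(source):
    """Toy weave: run every chunk, replace it with its printed output,
    then evaluate inline expressions in the same namespace."""
    namespace = {}

    def run_chunk(match):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group("code"), namespace)
        return buffer.getvalue()

    def run_inline(match):
        return str(eval(match.group("expr"), namespace))

    woven = CHUNK.sub(run_chunk, source)
    return INLINE.sub(run_inline, woven)


# Made-up example document; the values are illustrative only.
DOCUMENT = """Three measurements were taken.
<<python, summary>>=
rates = [62, 71, 68]
print("n =", len(rates))
@
The mean was \\Sexpr{sum(rates) / len(rates)}.
"""

if __name__ == "__main__":
    print(weave(DOCUMENT))
```

In practice the real work is done by knit() in R; the sketch only shows why text and code can live in one file and stay in step.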

KnitR uses: easy to explain

Source: http://reproducibility.cs.arizona.edu

KnitR uses: reproducible analysis

• Can string different tools/languages together
• Stores parameters
• Just like a pipeline/workflow system
– E.g. Galaxy, Taverna, KNIME

• But also: codifies your figures…

KnitR uses: codified figures

• Classic problems:
– No description of error bars
– No description of distributions

• Admittedly this could be fixed by ‘proper’ peer review

Source code: http://bit.ly/1NQZlHh

KnitR uses: codified figures

• Code can be found quickly
– Using text as markers

• Plot can be altered – 1 line of code

• New visualisation produced instantaneously

• Better evaluation of results

Source code: http://bit.ly/1NQZlHh

GigaScience KnitR example

• “This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”

DOI: 10.1186/2047-217X-3-3

Environment wrappers - VMs

Your environment

• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS? Or just upgrade?
• Dependencies / Versions
• System settings
• Hard for you, horrendous for others!

Share your environment

• Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time

DOI: 10.1186/2047-217X-3-23

Share your environment

• Docker
– ‘Light’ VM
– Discrete unit of code+environment
– Can be called from command line
– Can be linked together

• New possibilities e.g. nucleotid.es
– Benchmarking -> “data-driven peer-review”?

Source: http://nucleotid.es/

Share your environment

• Some concerns:
– http://ivory.idyll.org/blog/vms-considered-harmful.html
– VM = black box?
– Docker == black box!

Solution -> codify the environment

Codify your environment

• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc)
• Builds in extra documentation
• Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
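
One small step in this direction, well short of a full provisioning script, is to write the environment down as text that can be archived next to the code. The Python sketch below is only an illustration under that assumption: it records the interpreter, operating system and installed Python package versions. The function names describe_environment and to_requirements are hypothetical, and system-level dependencies are not covered.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Collect interpreter, operating system and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installs whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def to_requirements(env):
    """Render the recorded packages as pinned 'name==version' lines."""
    return "\n".join(
        f"{name}=={version}" for name, version in env["packages"].items()
    )


if __name__ == "__main__":
    # Print the manifest as JSON so it can be stored with the analysis.
    json.dump(describe_environment(), sys.stdout, indent=2)
```

A manifest like this documents what was installed; rebuilding the machine from it is the job of the provisioning systems listed on the next slide.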

Short list of provisioning systems

• Vagrant
• Chef
• Salt
• Puppet
• Ansible

• Many more – see link for info

Source: http://bit.ly/1wrYiuI

Images: release ALL the images with OMERO

“And now for something completely different”

NO

Phenotyping with microCT

DOI: 10.1186/2047-217X-2-14

NO

Phenotyping with microCT

DOI: 10.1186/2047-217X-3-6

Hosting Images

• Image LIMS

• Links to GigaDB
• Can handle most formats
• Web embedding

• View online, no need for software

• Open Source

www.openmicroscopy.org/site/products/omero

OMERO: providing access to imaging data

View, filter, measure raw images with direct links from journal article.

See all image data, not just cherry picked examples.

Download and reprocess.

OMERO: Adding value

http://jcb-dataviewer.rupress.org/

The alternative...

...look but don't touch

Thanks for listening!

Acknowledgements

• GigaTeam
– Scott Edmunds
– Peter Li
– Chris Hunter
– Jesse Xiao
– Nicole Edmunds
– Laurie Goodman

Where to get these slides

• FigShare DOI:
