Software workflows as research objects & GigaGalaxy Rob L Davidson, Chris I Hunter ISI CODATA International Training Workshop on Big Data 11th March 2015 DOI: 10.6084/m9.figshare.1330219


Page 1

Software workflows as research objects & GigaGalaxy

Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data

11th March 2015
DOI: 10.6084/m9.figshare.1330219

Page 2

Article: http://econ.st/1o12gCN


Page 4

• Big data! (The new oil)

• New dot-com bubble?

Article: http://bit.ly/1AN8ysJ

Page 5

Source: @flowchainsensei

Analysis

Software

Page 6

Article: http://bit.ly/1xdCxbY

Page 7

Article: http://bit.ly/1Mdll03

Page 8

Yay, we’re all unicorns!

From: Are you recruiting a data scientist or a unicorn?

http://ubm.io/1Gpxizh

Page 9

But why are we sad unicorns?

Page 10

Measuring software reproducibility

• Systematic study:
• 515 papers (429 conference, 86 journal)
• <30% reproducible

http://reproducibility.cs.arizona.edu

Page 11

Measuring software reproducibility

http://reproducibility.cs.arizona.edu

Page 12

Reasons for failure

“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”

http://reproducibility.cs.arizona.edu

Page 13

Cost of failure

• Wasted time
• Wasted money
• Frustration
• Distrust

Page 14

How to fix it

Page 15

The path to enlightenment

• Look to the experts (4 x 10 simple rules)
• Share code
  – Licenses
• Share environment
  – Codify the environment
• Share workflows
  – All parameters, versions, order of steps
  – GalaxyProject.org
• Share outputs
  – Share intermediate results
  – Share code for figures
  – Codify publications

Page 16

Look to the experts

Page 17

Look to the experts

Page 18

A word from the experts: 1

• Keep it simple
  – Don’t be a perfectionist
  – Aim for multiple versions
  – Optimise/improve later
  – Get feedback/help from community

• Hastings #1 + Prlic #5

Page 19

A word from the experts: 2

• Versioning
  – Use a version control system (e.g. Git, hosted on GitHub)
  – Allow others to know what version they use
  – Release early, release often (Linus Torvalds)
  – Get help from community

• Seemann #3, Hastings #10, Sandve #3/4

Page 20

A word from the experts: 3

• Use good coding practice
  – You don’t have to be the best
  – Learn from others
  – Become involved in a community
  – Write as though others will be watching

• Prlic #2 + all of Seemann and Hastings

Page 21

A word from the experts: highlight

• Start simple
• Release early
• Use versioning
• Build a community
• Get community feedback, testing, support

• …but wait, won’t that mean???

Page 22

Sharing code

Page 23

Sharing code

• “Scientific software…public release is then only considered around the time of publication” – Prlic #4

• “the fear of getting scooped”
  – Reality: “staking a claim in the field”

Page 24

Sharing code: don’t worry

• Share early
  – Be simple
  – Don’t be a perfectionist

• CRAPL license

Source: http://matt.might.net/articles/crapl/

Page 25

Sharing code: licenses

• Know your licenses
  – Apache License 2.0
  – BSD 3-Clause “New” or “Revised”
  – BSD 2-Clause “Simplified” or “FreeBSD”
  – GNU General Public License (GPL)
  – MIT
  – Mozilla Public License 2.0
  – etc.

Source: http://opensource.org/licenses

Page 26

Sharing code: repositories

• GitHub
• SourceForge
• Zenodo
• GigaDB/GigaGalaxy

• Versioning, sharing, collaboration, community feedback

Page 27

Sharing environment

Page 28

Your environment

• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS?

• If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
  – Create a virtual machine or ‘Docker’ image that can be shared whole.
  – Time-stamp of working system

Page 29

Share your environment

• Virtual machine
  – Copy your exact environment
  – If it works for you, it works for anyone
  – Reproducibility, frozen in time

DOI: 10.1186/2047-217X-3-23

Page 30

Share your environment

• Docker
  – ‘Light’ VM
  – Discrete unit of code + environment
  – Can be called like a compiled tool

• New possibilities, e.g. nucleotid.es benchmarking
  – Data-driven peer review

http://nucleotid.es/

Page 31

Share your environment

• VM = black box?
• Docker == black box!
• http://ivory.idyll.org/blog/vms-considered-harmful.html

Page 32

Codify your environment

• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc.)
• Builds in extra documentation
• Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
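As a much smaller cousin of a provisioning script, the sketch below records the environment an analysis ran in, so the record can be shared alongside the code. It is an illustrative assumption, not something shown in the talk: the function name `describe_environment` and the choice of fields are hypothetical, and it only describes a Python environment rather than rebuilding one.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect a plain-data description of the current Python environment."""
    # str() guards against distributions with missing metadata fields.
    packages = sorted(
        (str(dist.metadata["Name"]), str(dist.version))
        for dist in metadata.distributions()
    )
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "packages": [{"name": n, "version": v} for n, v in packages],
    }


# The snapshot is ordinary data, so it can be saved as JSON next to the code.
snapshot = describe_environment()
snapshot_json = json.dumps(snapshot, indent=2)
```

A record like this documents the setup but does not replace a provisioning script or a shared machine image; it only makes differences between two environments easier to spot.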

Page 33

Short list of provisioning systems

• Vagrant
• Chef
• Salt
• Puppet
• Ansible

• Many more – see link for info

http://bit.ly/1wrYiuI

Page 34

Sharing workflows

Page 35

Share your workflow

• Any analysis is a string of tools with a great many parameters
• The order of the sequence, the version of each part and the inputs and outputs are never fully explained
• These should be shared!
• Help is at hand: there are many ‘workflow’ systems for this
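To make the point concrete, here is a minimal sketch of what a shareable workflow record might contain: the order of steps, the version of each tool, its parameters, and its inputs and outputs. The `Step` and `Workflow` classes, the tool names, versions and file names are all hypothetical and chosen only for illustration; real workflow systems such as Galaxy use their own formats.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One tool invocation: what ran, which version, with which settings."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    outputs: list


@dataclass
class Workflow:
    """An ordered list of steps; list order is the execution order."""
    name: str
    steps: list = field(default_factory=list)

    def add(self, step):
        self.steps.append(step)
        return self  # returning self allows calls to be chained

    def to_json(self):
        # asdict() recurses into the nested Step dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical tools, versions and file names, for illustration only.
workflow = (
    Workflow("toy-assembly")
    .add(Step("trim_reads", "0.3.1", {"min_quality": 20},
              ["reads.fastq"], ["trimmed.fastq"]))
    .add(Step("assemble", "2.0.4", {"kmer": 31},
              ["trimmed.fastq"], ["contigs.fasta"]))
)
record = workflow.to_json()
```

Because the record is plain JSON, it can be attached to a paper or deposited in a repository like any other file.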

Page 36

Workflow systems

• Galaxy
• KNIME
• Taverna
• Many more…

• GigaScience uses Galaxy
  – galaxy.cbiit.cuhk.edu.hk

Page 37

Galaxy

Over 36,000 main Galaxy server users

Over 1,000 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

http://galaxyproject.org

Page 38

Galaxy User Interface

Tool List | Tool Parameters | History/results

Page 39

Galaxy: Under the hood

<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>

Basic XML 'wrapper':

• Describes inputs and outputs
• Calls command
• Monitors for output
• Logs/returns to 'history'
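The wrapper is plain XML, so any tool can read it. The sketch below parses a well-formed variant of the slide's simplified wrapper with Python's standard library and lists the command, inputs and outputs it declares. This is not how Galaxy itself loads tools; the function name `summarise_wrapper` is hypothetical, and the simplified wrapper omits attributes a real Galaxy tool definition would need.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the simplified wrapper shown on the slide.
WRAPPER = """
<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>
"""


def summarise_wrapper(xml_text):
    """Return the tool name, command line, and declared inputs and outputs."""
    root = ET.fromstring(xml_text)
    return {
        "tool": root.get("name"),
        "command": root.findtext("command", default="").strip(),
        "inputs": [(p.get("name"), p.get("format")) for p in root.iter("param")],
        "outputs": [(d.get("name"), d.get("format")) for d in root.iter("data")],
    }


summary = summarise_wrapper(WRAPPER)
```

The wrapper therefore doubles as machine-readable documentation of what a tool consumes and produces.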

Page 40

Galaxy Workflow: visualise

Page 41

Galaxy Workflow: visualise

Page 42

Galaxy Workflow: visualise

Page 43

Galaxy Workflow: export

Page 44

Citable workflow

Add as supplemental files or publish with distinct DOI via GigaDB or FigShare

Page 45

Galaxy Toolshed

https://toolshed.g2.bx.psu.edu/

Many 'omics, stats, visualisations

2700+ tools!

Download; run instantly

Page 46

GigaGalaxy

Web site: galaxy.cbiit.cuhk.edu.hk

Page 47

SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk

Page 48

SOAPdenovo2 workflows implemented in

Implemented entire workflow in our Galaxy server, including:

• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools

Also will be available to download by >50K Galaxy users in

galaxy.cbiit.cuhk.edu.hk

Page 49

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

Page 50

Sharing outputs

Page 51

Share outputs – intermediate results

• Workflow systems help with this
  – Results in history

• If a part of your analysis can’t be replicated
  – Requires a license
  – Is no longer compatible
  – Just plain won’t work

• The rest of the analysis can still be used
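One way to keep intermediate results without a workflow system is to write the output of every step to disk together with a checksum. The sketch below assumes a toy three-step analysis whose results are JSON-serialisable; the function `run_and_keep`, the step names and the file layout are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_and_keep(steps, data, out_dir):
    """Run steps in order, writing every intermediate result to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for index, (name, func) in enumerate(steps, start=1):
        data = func(data)
        payload = json.dumps(data, sort_keys=True)
        path = out_dir / f"{index:02d}_{name}.json"
        path.write_text(payload, encoding="utf-8")
        # The checksum lets a reader confirm they hold the same file.
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        manifest.append({"step": name, "file": path.name, "sha256": digest})
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return data, manifest


# Hypothetical three-step analysis on a toy list of numbers.
steps = [
    ("drop_negatives", lambda xs: [x for x in xs if x >= 0]),
    ("square", lambda xs: [x * x for x in xs]),
    ("total", lambda xs: sum(xs)),
]

with tempfile.TemporaryDirectory() as tmp:
    result, manifest = run_and_keep(steps, [3, -1, 2], tmp)
    saved = sorted(p.name for p in Path(tmp).iterdir())
```

If the second step could no longer be run, a reader could still start from its saved output file and carry on with the third.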

Page 52

Share outputs – code for figures

• Data transform for figures
  – Remove points?
  – 3D: choose ‘best angle’?
  – PCA: choose ‘best components’?

• Figure choice
  – Bar chart or box & whisker?

• Allow reinterpretation!!!
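A sketch of why sharing figure code matters: the function below makes a "remove points" decision explicit, with the cut-off as a named parameter and the excluded points returned rather than silently dropped. The rule (distance from the mean in standard deviations), the function name `prepare_for_plot` and the data are hypothetical, chosen only to illustrate the idea.

```python
import statistics


def prepare_for_plot(values, max_z=2.0):
    """Split values into (kept, removed) using an explicit, repeatable rule."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    kept, removed = [], []
    for v in values:
        # A point is removed when it lies more than max_z SDs from the mean.
        (kept if abs(v - mean) <= max_z * sd else removed).append(v)
    return kept, removed


# Toy measurements with one suspicious point.
raw = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 9.7]
kept, removed = prepare_for_plot(raw, max_z=2.0)
```

With the code and the raw data in hand, a reader can change `max_z`, keep the excluded point, or pick a different chart type and see whether the conclusion survives.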

Page 53

Share outputs – codify publication

• “This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”

DOI: 10.1186/2047-217X-3-3
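The quoted article uses R and knitr. The same idea can be sketched in a few lines of Python, assuming a toy data set and a one-sentence "paper": the numbers in the text are computed when the document is rendered rather than typed by hand. The template text and variable names are hypothetical, and this is far simpler than what knitr does.

```python
import statistics
from string import Template

# Toy data standing in for the real measurements.
measurements = [2.1, 2.4, 1.9, 2.6]

# Placeholders are filled in at render time, never edited manually.
TEXT = Template("We took $n measurements; the mean was $mean (SD $sd).")


def render(values):
    """Fill the text with statistics computed from the data."""
    return TEXT.substitute(
        n=len(values),
        mean=f"{statistics.mean(values):.2f}",
        sd=f"{statistics.stdev(values):.2f}",
    )


report = render(measurements)
```

If the data change, re-rendering updates every reported number, so text and analysis cannot drift apart.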

Page 54

Literate coding options

• See listing: http://www.gigasciencejournal.com/content/3/1/19
  – R: knitr, Sweave, R Markdown
  – JavaScript: Tangle, Active Markdown (CoffeeScript)
  – Python: IPython Notebooks
  – iReport links this functionality for Galaxy

Page 55

SUMMARY

Page 56

The path to enlightenment

• Look to the experts (4 x 10 simple rules)
• Share code
  – Licenses
• Share environment
  – Codify the environment
• Share workflows
  – All parameters, versions, order of steps
  – GalaxyProject.org
• Share outputs
  – Share intermediate results
  – Share code for figures
  – Codify publications

Page 57

All Your Research Objects

• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• Raw data
• Metadata
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text

Page 58

@gigascience
facebook.com/GigaScience

Scott Edmunds
Peter Li
Chris Hunter
Rob Davidson
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)

www.gigadb.org
galaxy.cbiit.cuhk.edu.hk

www.gigasciencejournal.com

blogs.biomedcentral.com/gigablog/
