17
Reproducible Computational Workflows with Continuous Analysis Brett K. Beaulieu-Jones 1 , and Casey S. Greene 2,+ 1 Genomics and Computational Biology Graduate Group. Perelman School of Medicine. University of Pennsylvania. 2 Department of Systems Pharmacology and Translational Therapeutics. Perelman School of Medicine. University of Pennsylvania + Corresponding Author email: [email protected] address: 3400 Civic Center Blvd. 10-131 SCTR Philadelphia, PA 19103 . CC-BY 4.0 International license a certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under The copyright holder for this preprint (which was not this version posted August 11, 2016. ; https://doi.org/10.1101/056473 doi: bioRxiv preprint

Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

ReproducibleComputationalWorkflowswithContinuousAnalysis

BrettK.Beaulieu-Jones1,andCaseyS.Greene2,+

1GenomicsandComputationalBiologyGraduateGroup.PerelmanSchoolofMedicine.UniversityofPennsylvania.2DepartmentofSystemsPharmacologyandTranslationalTherapeutics.PerelmanSchoolofMedicine.UniversityofPennsylvania+CorrespondingAuthoremail:[email protected]:3400CivicCenterBlvd.10-131SCTRPhiladelphia,PA19103

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 2: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Abstract

Reproducingexperimentsisvitaltoscience.Beingabletoreplicate,validateandextendpreviousworkalsospeedsnewresearchprojects.Reproducingcomputationalbiologyexperiments,whicharescripted,shouldbestraightforward.Butreproducingsuchworkremainschallengingandtimeconsuming.Intheidealworldwewouldbeabletoquicklyandeasilyrewindtotheprecisecomputingenvironmentwhereresultsweregenerated.Wewouldthenbeabletoreproducetheoriginalanalysisorperformnewanalyses.Weintroduceaprocesstermed“continuousanalysis”whichprovidesinherentreproducibilitytocomputationalresearchataminimalcosttotheresearcher.ContinuousanalysiscombinesDocker,acontainerservicesimilartovirtualmachines,withcontinuousintegration,apopularsoftwaredevelopmenttechnique,toautomaticallyre-runcomputationalanalysiswheneverrelevantchangesaremadetothesourcecode.Thisallowsresultstobereproducedquickly,accuratelyandwithoutneedingtocontacttheoriginalauthors.Continuousanalysisalsoprovidesanaudittrailforanalysesthatusedatawithsharingrestrictions.Thisallowsreviewers,editors,andreaderstoverifyreproducibilitywithoutmanuallydownloadingandrerunninganycode.Exampleconfigurationsareavailableatouronlinerepository(https://github.com/greenelab/continuous_analysis).

TheCurrentStateofReproducibility

Leadingscientificjournalshavetargetedreproducibilitytoincreasereaders’

confidenceinresultsandreduceretractions1–5.Inarecentsurvey,90%ofresearchersacknowledgedareproducibilitycrisis6.Researchthatusescomputationalprotocolsshouldbeparticularlyamenabletoreproducibleworkflowsbecauseallofthestepsarescriptedintoamachine-readableformat.Butwrittendescriptionsofcomputationalapproachescanbedifficulttounderstandandmaylackrequiredparameters.Evenwhenresultscanbereproduced,theprocessoftenrequiresasubstantialtimeinvestmentandhelpfromtheoriginalauthors.Garijoetal.7estimateditwouldtake280hoursforanon-experttoreproduceapaperdescribingacomputationalconstructionofadrug-targetnetworkforMycobacteriumtuberculosis8.Thesearethegoodscenarios:theresultsbehindmostcomputationalpapersarenotreadilyreproducible7,9–11.

Thepracticeof“openscience”hasbeendiscussedasameanstoaidreproducibility3,12.Inopensciencethedataandsourcecodeareshared.Sharingcanalsoextendtointermediateresultsandprojectplanning13.Sharingdataandsourcecodeiscurrentlynecessarybutnotsufficienttomakeresearchreproducible.Evenwhencodeanddataareshared,itremainsdifficulttoreproduceresultsduetodifferingcomputingenvironments,operatingsystems,librarydependenciesetc.Itiscommontouseoneormoreopensourcelibrariesonaproject,andresearchcodequicklybecomesdependentonoldversionsoftheselibrariesassoftwareadvances14.Theseoldorbrokendependenciesmakeitdifficultforreadersand

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 3: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

reviewerstorecreatetheenvironmentoftheoriginalresearchers,whethertovalidateorextendtheirwork. Anexampleofwheresharingdatadoesnotautomaticallymakesciencereproducibleoccursinthemoststandardofplaces:differentialgeneexpressionanalysis.Suchanalysesareroutine.Ourunderstandingofthegenome,includingtranscriptomeannotations,haveimprovedandupdatedprobesetdefinitionsareavailable15.Analysesrelyingonunspecifiedprobesetdefinitionscannotbereproducedusingcurrentdefinitions.

WeanalyzedthefifteenmostrecentlypublishedpapersthatciteDaietal.,acommonsourceforcustomchipdescriptionfiles(CustomCDF),thatwereaccessibleatourinstitution16–31.WeidentifiedthesemanuscriptsusingWebofScienceonMay31,2016.WerecordedthenumberofpapersthatcitedaversionofCustomCDF,aswellaswhichversionwascited.Weexpectthisanalysistoprovideanupperboundonreproduciblework:thesepapersspecificallycitedthesourceoftheirCDFs.Ofthesefifteenpapers,nine(60%)specifiedwhichversiontheyused.Thesenineusedversions11,15,16,17,18,and19oftheBrainArrayCustomCDF.

Thisinitialanalysiswasperformedbasedonarticlerecencywithoutregardtoarticleimpact.Todeterminetheextenttowhichthisissueaffectshighimpactpapers,weperformedaparallelevaluationforthetenmostcitedpapers32–41thatciteDaietal.WedeterminedthetenmostcitedpapersusingWebofScienceonMay31,2016.Ofthesetenpapers,one38(10%)specifiedwhichversionoftheCustomCDFwasused.Thatpaperusedversion11oftheBrainArrayCustomCDF. Wesoughttodeterminewhichversionswerecurrentlyinuseinthefield.Weaskedthreeindividualswhoperformedmicroarrayanalysisrecently,andweaccessedandevaluatedtwoclustersystemsusedforprocessingdata.Wefoundthateachindividualhadoneofthethreemostrecentlyreleasedversionsinstalled(18,19,and20),andversions18and19wereinstalledonclustersystems. ToevaluatetheimpactofdifferingCDFversions,wedownloadedarecentlypublishedpublicgeneexpressiondataset(GEOSeriesAscensionnumberGSE47664).ThisexperimentexamineddifferentialexpressionbetweennormalHeLacellsandHeLacellswithTIA1andTIARknockeddown42.Weperformedaparallelanalysisusingeachofthethreeversionsthatwefoundinstalledonmachinesthatwecouldaccess(18,19,and20).Eachversionidentifiesadifferentnumberofsignificantlyalteredgenes(Figure1A),demonstratingthechallengeofreproducibleanalysis.WesimulatedaparallelanalysisofdifferentialexpressionusingDockercontainersonmismatchedmachines43.ThisspecifiestheCDFversionandproducesthesamenumberandsetofdifferentiallyexpressedgenesforagivenversionacrossmachines(v18exampleinFigure1B).HadcontinuousanalysisbeenusedforpaperscitingtheBrainArrayCustomCDFtheircomputationalresultswouldbeeasilyreplicated.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 4: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure1.Currentstateofresearchcomputingvs.container-basedapproaches.A.)Thestatusquorequiresareaderorreviewertofindandinstallspecificversionsofdependencies.Thesedependenciescanbecomedifficulttofindandmaybecomeincompatiblewithnewerversionsofothersoftwarepackages.Differentversionsofpackagesidentifydifferentnumbersofsignificantlydifferentiallyexpressedgenesfromthesamesourcecodeanddata.B.)Containersdefineacomputingenvironmentthatcapturesdependencies.Incontainer-basedsystems,theresultsarethesameregardlessofthehostsystem.

ContinuousAnalysisinComputationalWorkflows

Wedevelopedcontinuousanalysistoproduceaverifiableend-to-endrunofcomputationalresearchwithminimalstart-upcosts.Incontrastwiththestatusquo,continuousanalysispreservesthecomputingenvironmentandmaintainstheversionsofdependencies.Wedescribedthebenefitsofcontainerizedapproachesabove,butmaintaining,runninganddistributingDockerimagesmanuallywouldbecometimeconsuming.IntegratingDockerintoacontinuousscientificanalysispipelinemeetsthreecriteria:(1)anyonecanre-runcodeinacomputingenvironmentmatchingtheoriginalauthors(SupplementalFigure1);(2)readersandreviewerscanfollowexactlywhatwasdoneinan“audit”fashionwithoutrunningcode(SupplementalFigure2&3);and(3)thesolutionimposeszeroto

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 5: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure2.Continuousanalysiscanbesetupinthreeprimarysteps(numbered1,2,and3).(1)TheresearchercreatesaDockercontainerwiththerequiredsoftware.(2)TheresearcherconfiguresacontinuousintegrationservicetousethisDockerimage.(3)Theresearcherpushescodethatincludesascriptcapableofrunningtheanalysesfromstarttofinish.ThecontinuousintegrationproviderrunsthelatestversionofcodeinthespecifiedDockerenvironmentwithoutmanualintervention.ThisgeneratesaDockercontainerwithintermediateresultsthatallowsanyonetorerunanalysisinthesameenvironment,producesupdatedfigures,andstoreslogsdescribingeverythingthatoccurred.Exampleconfigurationsareavailableinthesupplementarymaterialsaswellasouronlinerepository(https://github.com/greenelab/continuous_analysis).Becausecodeisruninanindependent,reproduciblecomputingenvironmentandproducesdetailedlogsofwhatwasexecuted,thispracticereducesoreliminatestheneedforreviewerstore-runcodetoverifyreproducibility.

minimalcostintermsoftimeandmoneyontheresearcher,dependingontheircurrentresearchprocess. Continuousanalysisextendscontinuousintegration44,acommonpracticeinsoftwaredevelopmentanddeployment.Continuousintegrationisasoftwaredevelopmentworkflowthattriggersanautomatedbuildprocesswheneverdeveloperschecktheircodeintoasourcecontrolrepository.Thisautomatedbuildprocessrunstestscriptsiftheyexist.Thesetestscancatchbugsintroducedintosoftware.Softwarethatpassestestsisautomaticallydeployedtoremoteservers.

Forcontinuousanalysis(Figure2),werepurposetheseservicesinordertoruncomputationalanalyses,updatefigures,andpublishchangestoonlinerepositorieswheneverrelevantchangesaremadetothesourcecode.Whenanauthorisreadytoreleasecodeorpublishtheirworktheycanexportthemostrecentcontinuousintegrationrun.Becausethisprocessgeneratesresultsinacleanandclearlydefinedcomputingenvironmentwithoutmanualintervention,reviewerscanbeconfidentthattheanalysesarereproducible.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 6: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Ineachprojectwemaintaindependencieswiththefreeopen-source

softwaretoolDocker45.Dockerdefinesan“image”thatallowsuserstodownloadandrunacontainer,aminimalistvirtualmachinewithapredefinedcomputingenvironment.Dockerimagescanbeseveralgigabytesinsize,butoncedownloadedcanbestartedinamatterofsecondsandhasminimaloverhead14.Inaddition,Dockerimagescanbeeasilytaggedtocoincidewithsoftwarereleasesandpaperrevisions.Atthetimeofsubmission,authorscanrunthe`dockersave`commandtoexportastaticfilethatcanbeuploadedtoservicessuchasFigshareorZenodotoreceiveaDOI.Forexample,wehaveuploadedourcontinuousanalysisenvironmentfortheexamplesinthispaper46.

Tosetupcontinuousanalysis,aresearcherneedstodothreethings.First

theymustcreateaDockerfile,whichspecifiesalistofdependencies.Second,theyneedtoconnectacontinuousintegrationservicetotheirversioncontrolsystemandprovidethecommandstoruntheiranalysis.Finally,theyneedtocommitandpushchangestotheirversioncontrolsystem.Manyresearchersalreadyperformthefirstandthirdtasksintheirstandardworkflow.

Thecontinuousintegrationsystemwillautomaticallyrerunthespecified

analysiswitheachchange,preciselymatchingthesourcecodeandresults.Itcanalsobesettolistenandrunonlywhenchangesaremarkedinaspecificway,e.g.bycommittingtoaspecific‘staging’branch.Forthefirstproject,thisprocesscanbeputintoplaceinlessthanaday.Forsubsequentprojects,thiscanbedoneinunderanhour.SettingupContinuousAnalysis

WehavecreatedaGitHubrepositorywithinstructionsforpaid,local,and

cloud-basedcontinuousanalysissetups47.Thesearefullydetailedinthesupplementarymaterialsandonlinerepository.HerewedescribehowcontinuousanalysiscanbesetupusingthefreeandopensourceDronesoftwareonaresearcher’spersonalcomputerandconnectedtotheGitHubversioncontrolservice.Thissetupisfreetousers.

1. InstallDockeronthecomputer.2. PulltheDroneimageviadocker:

sudodockerpulldrone/drone:latest

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 7: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure3.RegisteranewapplicationfortheDronecontinuousintegrationserver.SetthehomepageURLtobetheIPaddressoftheDronecomputer.SetthecallbackURLtothesameIPaddressfollowedby/authorize.

3. CreateanewapplicationinGitHub(Figure3).

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 8: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure4.RegisteranewapplicationfortheDronecontinuousintegrationserver.ThepayloadURLshouldbeintheformatofyour-ip/api/hook/github.com/client-id

4. AddawebhooktotheGitHubproject(Figure4).Thiswillnotifythecontinuousintegrationserverofanyupdatespushedtotherepository.

5. CreateaconfigurationfileontheDronecomputerat/etc/drone/dronerc

fillingintheclientinformationprovidedbyGitHubREMOTE_DRIVER=githubREMOTE_CONFIG=https://github.com?client_id=....&client_secret=....

6. Runthedronecontainer

sudodockerrundrone/drone:latest

Continuousanalysiscanbeperformedwithdozensoffullserviceproviders

oraprivateinstallationonalocalmachine,clusterorcloudservice47.Fullserviceproviderscanbesetupinminutesbutmayhavecomputationalresourcelimitsormonthlyfees.Privateinstallationsrequireconfigurationbutcanscaletoalocalclusterorcloudservicetomatchthecomputationalcomplexityofallwalksofresearch.Withfree,open-sourcecontinuousintegrationsoftware48,computingresourcesaretheonlyassociatedcosts.UsingContinuousAnalysis Aftersetup,runningcontinuousanalysisissimpleandfitsintoexistingresearchworkflowsthatusesourcecontrolsystems.Wehaveusedcontinuous

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 9: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

analysisinourownwork49.Wehavealsopreparedthreeexamplerepositories(detailedinsupplementalmaterials):

1. Anexampledemonstratingthesetupofcontinuousanalysiswithawidevarietyofservicesandconfigurations(highlightedbelow).

2. Aneasytofollowbasicphylogenytreebuildingexample,combiningsequencealignmentusingMAFFT50,formatconversionusingEMBOSSSeqret51,andtreecalculationanddrawingusing.

3. AnRNAexpressionanalysisworkflowexaminingorganoidmodelsofpancreaticcancerinmicebasedonworkfromBojetal.52usingdetailsandsourcecodepublishedbyBalli53.Thisexampleshowstheabilityofcontinuousanalysistoscaletolargecomputations.Thisexampleuseskallisto54,limma55,56,andsleuth57toanalyze150GBofgeneexpressiondataandapproximately480millionreads.

Todemonstratethesetupprocessanddifferentconfigurationsofcontinuousanalysisweshowasimpleexampleofcontinuousanalysiswithkallisto.TherecentlypublishedsoftwaretoolkallistoquantifiestranscriptabundanceinRNA-seqdata.Ourexamplere-runstheexamplesprovidedinkallistowitheachcommittoarepository.

1. Addascriptfiletore-runcustomanalysis.ForDrone,thisisa.drone.ymlfilethatspecifiescommandstoruneachstepoftheanalysis.AnexampleconfigurationisavailableinthecontinuousanalysisGitHubrepositoryaswellasthesupplementalmaterials.

2. Commitchangestothesourcecontrolrepository.3. PushchangestoGitHub.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 10: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure5.AuditlogsfromacontinuousintegrationrunwiththeserviceShippableforthekallistoexample. Theconfiguredcontinuousintegrationserviceautomaticallyrunsthespecifiedscript.Weconfiguredthistoreruntheanalysis,regeneratethefigures,andcommitupdatedversionstotherepository.Theserviceprovidesacompleteauditlogofwhatwasruninthecleancontinuousintegrationenvironment(Figure5).Bygeneratingandpushingupdatedfigures,thisprocessalsogeneratesacompletechangelogforeachresult(Figure6).Interactivedevelopmenttools,suchasJupyter58,59,RMarkdown60,61andSweave62canbeincorporatedtopresentthecodeandanalysisinalogicalgraphicalmanner.Forexample,werecentlyusedJupyterwithcontinuousanalysisinourownpublication63andcorrespondingrepository49.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 11: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

Figure6.ResultingfiguresfromtherunarecommittedbacktoGithubwherechangesbetweenrunscanbeviewed.A.)Theeffectofaddinganadditionalgene(HumanTw2)toaphylogenetictree-buildingexample.B.)Theeffectofaddinganadditionalgene(mt8)toanRNA-seqdifferentialexpressionexperimentPCAplot.

Insummary,continuousanalysisprovidestheresultsofaverifiableend-to-

endrunina“clean”environment.Becausecontinuousanalysisrunsautomaticallyinthebackground,notransitionisneededbetweentheexplorationandpublicationphasesofascientificproject.Theaudittrailprovidedbycontinuousanalysisallowsreviewersandeditorstoprovidesoundjudgmentonreproducibilitywithoutalargetimecommitment.Ifreadersorreviewerswouldliketore-runthecodeontheirown(e.g.tochangeaparameterandevaluatetheimpactonresults),theycaneasilydosowiththeDockercontainercontainingthefinalcomputingenvironmentandintermediateresults.Versioncontrolsystemsprovidethecapabilitytowatchforupdates.Readerscan“star”or“watch”arepositoryonservicessuchasGithub,Gitlab,andBitbuckettobeautomaticallynotifiedofchangesandupdatedruns.Wideadoptionofthesesystemsthroughoutthepublicationprocesscouldallowreviewersandeditorstoautomaticallybenotifiedofupdatedresults. Continuousanalysisprovidesanaudittrailforreproducibleanalysesofcloseddata. Continuousanalysiscanbeevenmorepowerfulwhenworkingwithcloseddatathatcannotbereleased.Withoutcontinuousanalysis,reproducing

C. D.

AdditionalSample

A. B.AdditionalSample

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 12: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

computationalanalysesbasedoncloseddataisdependentontheoriginalauthorscompletelyandexactlydescribingeachstep,aprocessthatmaybeanafterthoughtandrelegatedtoextendedmethods.Readersmustthendiligentlyfollowcomplexwritteninstructionswithoutintermediateconfirmationtheyareontherighttrack.Thecontainersproducedduringcontinuousanalysisincludeamatchingenvironmentforreplicationaswellasintermediateresults.Thisallowsreaderstodeterminewheretheirresultsdivergefromtheoriginalworkandtodeterminewhetherdivergenceisduetosoftware-basedordata-baseddifferences.Bestpracticeswithcontinuousanalysis Wesuggestadevelopmentworkflowwherecontinuousanalysisrunsonlyonasinglebranch(SupplementalFigure4).Researcherscanpushtothisbranchwhentheybelievetheyarereadyforareleasetoavoidrunningthefullprocessduringincompleteupdates.Iftheupdatestothisbranchsucceed,thechangesarethenautomaticallycarriedovertothemasterorproductionbranchandreleased.WerecommendexportingboththebeforeandafterprocessingDockerimagesanduploadingtoanarchivalservicelikeFigshareorZenodo.Thearchivedimagescanthenbecitedtoguidereaderstotheversionusedinthemanuscript46.Forconvenience,theimagescanalsobesharedthroughtheDockerHubregistry. Itmaycurrentlybeimpracticaltousecontinuousanalysisforgenericpreprocessingstepsinvolvingverylargedataoranalysesrequiringparticularlyhighcomputationalcosts.Inparticular,stepsthattakedaystorunorincursubstantialcostsincomputationalresourcesmaynotbeamenablewithexistingproviders64.Oneday,continuousanalysissystemsspecificallydesignedforscientificworkflowsmayfacilitatereproducibleworkflowsinthesesettings.Fornow,researchersmayneedtousediscretionwhenpreprocessingviacontinuousanalysis,asitmaybecomputationallyintractabletoreanalyzeaftereachcommittoastagingbranch.Researchersmayelecttorunonlythefinalworkflowthroughthisprocess,ormayelecttoemploycontinuousanalysisafterstandardbutcomputationallyexpensivepreprocessingstepsarecompleted. Forsmalldatasetsandlessintensivecomputationalworkflowsitiseasiesttouseafullservicecontinuousintegrationservice.Theseserviceshavethesmallestsetuptimes.Withprivatedataorwhendatasizeandcomputationalcomplexityscaleitbecomesnecessarytosetupalocalprivatelyhostedcontinuousintegrationserver.Clusterorcloudbasedcontinuousintegrationserverscanhandlethelargestworkflows.Theimpactofreproduciblecomputationalresearch

Reproducibilitycanhavewide-reachingbenefitsfortheadvancementofscience.Forauthors,easilyreproducibleworkisasignofqualityandcredibility.Continuousanalysisaddressesthereproducibilityofcomputationallyanalysesinthenarrowsense:generatingthesameresultsfromthesameinputs.Itdoesnot

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 13: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

solvereproducibilityinthebroadersense:howrobustresultsaretoparametersettings,startingconditionsandpartitionsinthedata.Continuousanalysislaysthegroundworkneededtoaddressreproducibilityandrobustnessoffindingsinthebroadsense.AcknowledgementsThisworkwassupportedbytheGordonandBettyMooreFoundationunderaDataDrivenDiscoveryInvestigatorAwardtoCSG(GBMF4552)andsupportedbyaCommonwealthUniversalResearchEnhancement(CURE)ProgramgrantfromthePennsylvaniaDepartmentofHealth.WewouldliketothankDavidBalliforprovidingtheRNA-seqanalysisdesign,KatieSiewertforprovidingthephylogeneticanalysisdesign,andAlexWhanforcontributingaTravis-CIimplementation.References1. Rebootingreview.NatBiotech.2015;33(4):319.

http://dx.doi.org/10.1038/nbt.3202.2. Softwarewithimpact.NatMeth.2014;11(3):211.

http://dx.doi.org/10.1038/nmeth.2880.3. PengRD.ReproducibleResearchinComputationalScience.Science(80-).

2011;334(6060):1226-1227.doi:10.1126/science.1213847.4. McNuttM.Reproducibility.Science(80-).2014;343(6168):229.

http://science.sciencemag.org/content/343/6168/229.abstract.5. Illuminatingtheblackbox.Nature.2006;442(7098):1.

http://dx.doi.org/10.1038/442001a.6. BakerM.1,500scientistsliftthelidonreproducibility.Nature.

2016;533(7604):452-454.doi:10.1038/533452a.7. GarijoD,KinningsS,XieLL,etal.Quantifyingreproducibilityincomputational

biology:Thecaseofthetuberculosisdrugome.PLoSOne.2013;8(11).doi:10.1371/journal.pone.0080278.

8. KinningsSL,XieLL,FungKH,JacksonRM,XieLL,BournePE.TheMycobacteriumtuberculosisdrugomeanditspolypharmacologicalimplications.PLoSComputBiol.2010;6(11).doi:10.1371/journal.pcbi.1000976.

9. BellAW,DeutschEW,AuCE,etal.AHUPOtestsamplestudyrevealscommonproblemsinmassspectrometry-basedproteomics.NatMethods.2009;6(6):423-430.doi:10.1038/nmeth.1333.

10. IoannidisJPA,AllisonDB,BallCA,etal.Repeatabilityofpublishedmicroarraygeneexpressionanalyses.NatGenet.2009;41(2):149-155.doi:10.1038/ng.295.

11. HothornT,LeischF.Casestudiesinreproducibility.BriefBioinform.2011;12(3):288-300.doi:10.1093/bib/bbq084.

12. GrovesT,GodleeF.Openscienceandreproducibleresearch.BMJ.2012;344.doi:10.1136/bmj.e4383.

13. ThinkLab.https://thinklab.com/.AccessedJanuary1,2016.14. BoettigerC.AnintroductiontoDockerforreproducibleresearch,with

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 14: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

examplesfromtheRenvironment.ACMSIGOPSOperSystRevSpecIssueRepeatabilitySharExpArtifacts.2015;49(1):71-79.doi:10.1145/2723872.2723882.

15. DaiM,WangP,BoydAD,etal.Evolvinggene/transcriptdefinitionssignificantlyaltertheinterpretationofGeneChipdata.NucleicAcidsRes.2005;33(20):e175.doi:10.1093/nar/gni179.

16. KopljarI,GallacherDJ,DeBondtA,etal.FunctionalandTranscriptionalCharacterizationofHistoneDeacetylaseInhibitor-MediatedCardiacAdverseEffectsinHumanInducedPluripotentStemCell-DerivedCardiomyocytes.StemCellsTranslMed.2016;5(5):602-612.doi:10.5966/sctm.2015-0279.

17. KarpińskiP,FrydeckaD,SąsiadekM.Reducednumberofperipheralnaturalkillercellsinschizophreniabutnotinbipolardisorder.Brain,Behav.2016.http://www.sciencedirect.com/science/article/pii/S0889159116300265.AccessedMay31,2016.

18. BrummelmanJ,RaevenR,HelmK.TranscriptomesignaturefordampenedTh2dominanceinacellularpertussisvaccine-inducedCD4+TcellresponsesthroughTLR4ligation.Scientific.2016.http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4846868/.AccessedMay31,2016.

19. BilgrauA,EriksenP,RasmussenJ.GMCM:Unsupervisedclusteringandmeta-analysisusinggaussianmixturecopulamodels.JStat.2016.https://www.jstatsoft.org/article/view/v070i02/v70i02.pdf.AccessedMay31,2016.

20. GandinV,MasvidalL,CargnelloM,GyenisL.mTORC1andCK2coordinateternaryandeIF4Fcomplexassembly.Nature.2016.http://www.nature.com/ncomms/2016/160404/ncomms11127/full/ncomms11127.html.AccessedMay31,2016.

21. KilleenA,DiskinM,MorrisD.Endometrialgeneexpressioninhigh-andlow-fertilityheifersinthelatelutealphaseoftheestrouscycleandacomparisonwithmidlutealgeneexpression.Physiological.2016.http://physiolgenomics.physiology.org/content/48/4/306.abstract.AccessedMay31,2016.

22. CollettiN,LiuH,GowerA,AlekseyevY.Tlr3signalingPromotestheinductionofUniquehumanBDca-3DendriticcellPopulations.Front.2016.http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4789364/.AccessedMay31,2016.

23. LeeM,HuangR,TongW.Discoveryoftranscriptionaltargetsregulatedbynuclearreceptorsusingaprobabilisticgraphicalmodel.ToxicolSci.2015.http://toxsci.oxfordjournals.org/content/early/2015/12/07/toxsci.kfv261.abstract.AccessedMay31,2016.

24. TroyN,HollamsE,HoltP.Differentialgenenetworkanalysisfortheidentificationofasthma-associatedtherapeutictargetsinallergen-specificT-helpermemoryresponses.BMCMed.2016.http://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-016-0171-z.AccessedMay31,2016.

25. ManiéE,PopovaT,BattistellaA.Genomichallmarksofhomologous

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 15: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

recombinationdeficiencyininvasivebreastcarcinomas.JCancer.2016.http://onlinelibrary.wiley.com/doi/10.1002/ijc.29829/full.AccessedMay31,2016.

26. DekkersB,HeH,HansonJ,WillemsL.TheArabidopsisDELAYOFGERMINATION1geneaffectsABSCISICACIDINSENSITIVE5(ABI5)expressionandgeneticallyinteractswithABI3duringArabidopsis.ThePlant.2016.http://onlinelibrary.wiley.com/doi/10.1111/tpj.13118/full.AccessedMay31,2016.

27. HoltP,StricklandD,BoscoA,BelgraveD.DistinguishingbenignfrompathologicTH2immunityinatopicchildren.JAllergy.2015.http://www.sciencedirect.com/science/article/pii/S0091674915013342.AccessedMay31,2016.

28. LückS,WestermarkP.CircadianmRNAexpression:insightsfrommodelingandtranscriptomics.CellMolLifeSci.2016.http://link.springer.com/article/10.1007/s00018-015-2072-2.AccessedMay31,2016.

29. BoscoA,WiehlerS,ProudD.Interferonregulatoryfactor7regulatesairwayepithelialcellresponsestohumanrhinovirusinfection.BMCGenomics.2016.http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2405-z.AccessedMay31,2016.

30. FauteuxF,HillJ,JaramilloM,PanY,PhanS.Computationalselectionofantibody-drugconjugatetargetsforbreastcancer.Oncotarget.2015.http://europepmc.org/abstract/med/26700623.AccessedMay31,2016.

31. NapolitanoF,SirciF,CarrellaD,BernardoDdi.Drug-setenrichmentanalysis:anoveltooltoinvestigatedrugmodeofaction.Bioinformatics.2016.http://bioinformatics.oxfordjournals.org/content/32/2/235.short.AccessedMay31,2016.

32. CarrollJ,MeyerC,SongJ,LiW,GeistlingerT.Genome-wideanalysisofestrogenreceptorbindingsites.Nature.2006.http://www.nature.com/ng/journal/v38/n11/abs/ng1901.html.AccessedMay31,2016.

33. LupienM,EeckhouteJ,MeyerC,WangQ,ZhangY.FoxA1translatesepigeneticsignaturesintoenhancer-drivenlineage-specifictranscription.Cell.2008.http://www.sciencedirect.com/science/article/pii/S0092867408001189.AccessedMay31,2016.

34. WangQ,LiW,ZhangY,etal.Androgenreceptorregulatesadistincttranscriptionprograminandrogen-independentprostatecancer.Cell.2009.http://www.sciencedirect.com/science/article/pii/S0092867409005170.AccessedMay31,2016.

35. LefterovaM,ZhangY,StegerD.PPARγandC/EBPfactorsorchestrateadipocytebiologyviaadjacentbindingonagenome-widescale.Genes&.2008.http://genesdev.cshlp.org/content/22/21/2941.short.AccessedMay31,2016.

36. TuupanenS,TurunenM,LehtonenR,HallikasO.ThecommoncolorectalcancerpredispositionSNPrs6983267atchromosome8q24conferspotentialtoenhancedWntsignaling.Nature.2009.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 16: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

http://www.nature.com/ng/journal/v41/n8/abs/ng.406.html.AccessedMay31,2016.

37. ObadS,SantosCdos,PetriA,HeidenbladM.SilencingofmicroRNAfamiliesbyseed-targetingtinyLNAs.Nature.2011.http://www.nature.com/ng/journal/v43/n4/abs/ng.786.html.AccessedMay31,2016.

38. HeH,MeyerC,ShinH,BaileyS,WeiG,WangQ.Nucleosomedynamicsdefinetranscriptionalenhancers.Nature.2010.http://www.nature.com/ng/journal/v42/n4/abs/ng.545.html.AccessedMay31,2016.

39. OzsolakF,SongJ,LiuX,FisherD.High-throughputmappingofthechromatinstructureofhumanpromoters.NatBiotechnol.2007.http://www.nature.com/nbt/journal/v25/n2/abs/nbt1279.html.AccessedMay31,2016.

40. ZuoT,WangL,MorrisonC,ChangX,ZhangH,LiW.FOXP3isanX-linkedbreastcancersuppressorgeneandanimportantrepressoroftheHER-2/ErbB2oncogene.Cell.2007.http://www.sciencedirect.com/science/article/pii/S0092867407005454.AccessedMay31,2016.

41. EnardW,GehreS,HammerschmidtK,HölterS.AhumanizedversionofFoxp2affectscortico-basalgangliacircuitsinmice.Cell.2009.http://www.sciencedirect.com/science/article/pii/S009286740900378X.AccessedMay31,2016.

42. NunezM,Sanchez-JimenezC,AlcaldeJ,IzquierdoJM.Long-termreductionofT-cellintracellularantigensrevealsatranscriptomeassociatedwithextracellularmatrixandcelladhesioncomponents.PLoSOne.2014;9(11).doi:10.1371/journal.pone.0113141.

43. Beaulieu-JonesB,GreeneC.ContinuousAnalysisBrainArray:SubmissionReleaseContinuousAnalysisBrainArray:SubmissionRelease.August2016.doi:10.5281/zenodo.59892.

44. DuvallP,MatyasS,GloverA.ContinuousIntegration:ImprovingSoftwareQualityandReducingRisk.;2007.http://portal.acm.org/citation.cfm?id=1406212.

45. Docker.Docker.https://www.docker.com.46. Beaulieu-JonesBK,GreeneCS.ContinuousAnalysisExampleDockerImages.

2016.10.6084/m9.figshare.3545156.v1.47. Beaulieu-JonesBK,GreeneCS.ContinuousAnalysis.GitHubrepository.

https://github.com/greenelab/continuous_analysis.Published2016.48. Drone.io.https://drone.io/.49. Beaulieu-JonesBK.DenoisingAutoencodersforPhenotypeStratification

(DAPS):PreprintRelease.Zenodo.January2016.doi:10.5281/zenodo.46165.50. KatohK,MisawaK,KumaK,MiyataT.MAFFT:anovelmethodforrapid

multiplesequencealignmentbasedonfastFouriertransform.NucleicAcidsRes.2002;30(14):3059-3066.doi:10.1093/nar/gkf436.

51. RiceP,LongdenI,BleasbyA,etal.EMBOSS:theEuropeanMolecularBiologyOpenSoftwareSuite.TrendsGenet.2000;16(6):276-277.doi:10.1016/s0168-

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint

Page 17: Reproducible Computational Workflows with Continuous Analysis · computational biology experiments, which are scripted, should be straightforward. ... The practice of “open science”

9525(00)02024-2.52. BojSF,HwangC-I,BakerLA,etal.OrganoidModelsofHumanandMouse

DuctalPancreaticCancer.Cell.2015;160(1):324-338.doi:10.1016/j.cell.2014.12.021.

53. BalliD.UsingKallistoforexpressionanalysisofpublishedRNAseqdata.https://benchtobioinformatics.wordpress.com/2015/07/10/using-kallisto-for-gene-expression-analysis-of-published-rnaseq-data/.Published2015.AccessedAugust1,2016.

54. BrayNL,PimentelH,MelstedP,PachterL.Near-optimalprobabilisticRNA-seqquantification.NatBiotechnol.2016;34(5):525-527.doi:10.1038/nbt.3519.

55. RitchieME,PhipsonB,WuD,etal.limmapowersdifferentialexpressionanalysesforRNA-sequencingandmicroarraystudies.NucleicAcidsRes.2015;43(7):e47.doi:10.1093/nar/gkv007.

56. SmythGK.Linearmodelsandempiricalbayesmethodsforassessingdifferentialexpressioninmicroarrayexperiments.StatApplGenetMolBiol.2004;3:Article3.doi:10.2202/1544-6115.1027.

57. PimentelHJ,BrayN,PuenteS,MelstedP,PachterL.DifferentialanalysisofRNA-Seqincorporatingquantificationuncertainty.bioRxiv.2016.doi:10.1101/058164.

58. PérezF,GrangerBE.{IP}ython:aSystemforInteractiveScientificComputing.ComputSciEng.2007;9(3):21-29.doi:10.1109/MCSE.2007.53.

59. Jupyter.http://jupyter.org/.Published2016.AccessedJanuary8,2016.60. RStudio.RStudio:IntegrateddevelopmentenvironmentforR(Version

0.97.311).JWildlManage.2011;75(8):1753-1766.doi:10.1002/jwmg.232.61. BaumerB,Cetinkaya-RundelM,BrayA,LoiL,HortonNJ.RMarkdown:

IntegratingAReproducibleAnalysisToolintoIntroductoryStatistics.TechnolInnovStatEduc.2014;8(1):20.doi:10.5811/westjem.2011.5.6700.

62. FriedrichLeisch.Sweave:Dynamicgenerationofstatisticalreportsusingliteratedataanalysis.Compstat2002-ProcComputStat.2002;(69):575-580.doi:10.1.1.20.2737.

63. Beaulieu-JonesBK,GreeneCS.Semi-SupervisedLearningoftheElectronicHealthRecordwithDenoisingAutoencodersforPhenotypeStratification.bioRxiv.February2016.http://biorxiv.org/content/early/2016/02/18/039800.abstract.

64. SouilmiY,LancasterAK,JungJ-Y,etal.Scalableandcost-effectiveNGSgenotypinginthecloud.BMCMedGenomics.2015;8(1):64.doi:10.1186/s12920-015-0134-9.

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted August 11, 2016. ; https://doi.org/10.1101/056473doi: bioRxiv preprint