Enabling Reproducible NGS Analysis Through Automated Jupyter Pipelines Amanda Birmingham Senior Bioinformatics Engineer Center for Computational Biology & Bioinformatics, UCSD

Enabling Reproducible NGS Analysis Through Automated JupyterPipelinescompbio.ucsd.edu/wp-content/uploads/2016/10/20170206... · 2017-07-19 · Reproducible Research •Repeatability



Reproducible Research
• Repeatability & reproducibility are key to the scientific method
  ◦ In 1663, only Robert Boyle and Christiaan Huygens could produce a vacuum, and their findings didn't agree
• Informatics should be at the forefront of reproducible research
  ◦ Doing the same thing over and over is what computers do best!
  ◦ But has taken a long time for methods reports for computational work to become as good as those for wet lab work
  ◦ Ex: Proc Natl Acad Sci USA. 1986 Jun;83(11):3746-50
  ◦ Progress:
    § "Alignments were run"
    § "Alignments were run with BLAST"
    § "Alignments were run with BLASTN version 2.2.6 against human"
    § "Alignments were run with NCBI BLASTN v.2.2.9 using the command blastn -W 7 -q -1 -F F against the NCBI RefSeq release 80 human transcriptome"
• Parity with wet-lab methods shouldn't be the end of the road!
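
The most detailed methods statement in the progression above can be generated from the same values that build the command, rather than typed by hand afterwards. The following is a minimal sketch under that assumption: `AlignmentRun` and its fields are hypothetical names, the BLAST values are simply the ones quoted on the slide, and nothing is actually executed.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignmentRun:
    """Hypothetical record of one alignment step (not from the original talk)."""
    tool: str
    version: str
    args: List[str] = field(default_factory=list)
    database: str = ""

    def command(self) -> str:
        # Quote each argument so the recorded string can be pasted into a shell.
        return " ".join(shlex.quote(arg) for arg in self.args)

    def methods_sentence(self) -> str:
        # Build the methods text from the recorded values, not from memory.
        return (
            f"Alignments were run with {self.tool} v.{self.version} "
            f"using the command {self.command()} against {self.database}."
        )


run = AlignmentRun(
    tool="NCBI BLASTN",
    version="2.2.9",
    args=["blastn", "-W", "7", "-q", "-1", "-F", "F"],
    database="the NCBI RefSeq release 80 human transcriptome",
)
print(run.methods_sentence())
```

Because the sentence and the command come from one record, they cannot drift apart when a parameter changes.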


What Is Jupyter?
• What is Jupyter?
  ◦ "Open source, interactive data science and scientific computing across over 40 programming languages"
    § Grew out of the IPython project, which started in 2001 when Dr. Fernando Perez was procrastinating on his Physics PhD :)
  ◦ A "literate computing" environment, "weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece" -- Fernando Perez
• Computing platform is named "jupyter" because early languages were julia, python, and R
  ◦ Community-maintained kernels for other languages: Bash, C, C++, C#, Fortran, Go, Haskell, Javascript, Lisp, Mathematica, Matlab, Perl, PHP, Powershell, Ruby, SAS, Scala, Scheme, and many more
• Most well-known for a web-based "notebook" system
  ◦ Allows writing & running of code from browser environment
  ◦ Can mix in HTML, links, images, interactive controls, extensions

Jupyter logo courtesy of http://jupyter.org/
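
The interleaving of text, code, and results is visible in the notebook file itself, which is a JSON document holding an ordered list of cells. A minimal sketch using only the standard library, assuming the field names of notebook format version 4; the cell contents and the helper names `markdown_cell` and `code_cell` are made up for illustration.

```python
import json


def markdown_cell(text):
    """Narrative cell: plain Markdown text."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Code cell: source plus (initially empty) recorded outputs."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


# Hypothetical two-cell notebook: one line of narrative, one line of code.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        markdown_cell("## Read counts\nCount reads per sample."),
        code_cell("counts = {'sample_a': 120, 'sample_b': 95}\ncounts"),
    ],
}

# Round-trip through JSON, as saving and reopening a notebook file would.
notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)
```

Since outputs are stored next to the code that produced them, a saved notebook doubles as a record of a run.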


What Is Jupyter, Really?


Jupyter Notebooks: Friend or Foe?
• Are notebooks the key to reproducibility?
  ◦ Data Carpentry offers an entire workshop on "Reproducible Research using Jupyter Notebooks"
• Easy to save, modify, and extend
  ◦ Great for rerunning or tweaking previous data analyses
• CCBB delivers analyses as notebooks
  ◦ Report becomes more than a record: it is itself a tool!
• Notebooks' greatest strength is interactivity
  ◦ Between input and output
  ◦ Between (e.g.) Python and R
  ◦ Between narrative and code
  ◦ Between material and reader


(Inter-)Actively Dangerous
• Interactivity can also be a huge danger to reproducibility
• Humans are inconsistent
  ◦ We make unpredictable mistakes
  ◦ Thus, "interactive" = "bad" for repetitive tasks
    § Like primary NGS analysis pipelines
• Jupyter Notebooks can be inconsistent, too
  ◦ Changing code/variables in a notebook does NOT rerun cells that depend on that change
  ◦ In fact, doesn't even clear old outputs!
  ◦ Thus, "interactive" = "bad" for important records
    § Like experimental records (i.e., methods)
• Do we have to give up other advantages of Jupyter Notebooks when building pipelines and recording methods?

I NEEVR MAKE TYPOS!
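
The stale-state problem described above can be imitated without Jupyter. In this sketch a plain dictionary stands in for the kernel's shared namespace and each string stands in for a cell; the variable names and quality values are hypothetical.

```python
def run_cell(source, namespace):
    """Execute one cell's source in the shared namespace, as a kernel would."""
    exec(source, namespace)


threshold_cell = "min_quality = 20"
filter_cell = "kept = [q for q in (10, 25, 30) if q >= min_quality]"

# First pass: run both cells in order.
kernel = {}
run_cell(threshold_cell, kernel)
run_cell(filter_cell, kernel)
first_result = kernel["kept"]

# Edit the threshold cell and re-run ONLY that cell.
threshold_cell = "min_quality = 28"
run_cell(threshold_cell, kernel)
stale_result = kernel["kept"]  # the dependent cell was not re-run

# Re-running everything top to bottom in a fresh namespace gives the real answer.
fresh_kernel = {}
for source in (threshold_cell, filter_cell):
    run_cell(source, fresh_kernel)
fresh_result = fresh_kernel["kept"]

print(first_result, stale_result, fresh_result)
```

The notebook now shows a threshold of 28 next to a result computed with 20, which is exactly the kind of record that cannot be trusted as a methods section.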


Scripting Jupyter Notebooks
• No! We can have our cake and eat it, too :)
• Jupyter ships with the nbconvert package, which can read, write, and execute notebooks from Python
• An extension, nbparameterise (note British spelling), allows injection of new variable values
• nbformat and nbconvert (also built-in) can output notebooks and static HTML, respectively
• With these three pieces, we can script pipelines built from Jupyter Notebooks
  ◦ Notebooks give readability and reusability
  ◦ Script prevents human errors and speeds execution
  ◦ HTML output of notebooks provides read-only record of methods
• Entire approach takes less than one page of code
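
A minimal sketch of how the three packages might fit together, not the CCBB implementation. It assumes nbformat, nbconvert, and nbparameterise are installed, that the notebook defines its parameters as simple assignments in its first code cell (the nbparameterise convention), and that the installed nbparameterise accepts the `execute` keyword. The function names, file-naming scheme, and the example parameter are hypothetical.

```python
import os


def output_paths(notebook_path, run_name, output_dir):
    """Return (ipynb_path, html_path) for one executed copy of a notebook."""
    base = os.path.splitext(os.path.basename(notebook_path))[0]
    stem = os.path.join(output_dir, f"{run_name}_{base}")
    return stem + ".ipynb", stem + ".html"


def run_notebook(notebook_path, run_name, output_dir, **new_values):
    """Inject values, execute top to bottom, save notebook and HTML record."""
    # Imported here so output_paths() works even without Jupyter installed.
    import nbformat
    from nbconvert import HTMLExporter
    from nbconvert.preprocessors import ExecutePreprocessor
    from nbparameterise import (
        extract_parameters, parameter_values, replace_definitions)

    with open(notebook_path, encoding="utf-8") as handle:
        notebook = nbformat.read(handle, as_version=4)

    # Swap the first-cell assignments for this run's values.
    original = extract_parameters(notebook)
    updated = parameter_values(original, **new_values)
    notebook = replace_definitions(notebook, updated, execute=False)

    # Run every cell in order in a fresh kernel: no stale state, no typos.
    executor = ExecutePreprocessor(timeout=600, kernel_name="python3")
    executor.preprocess(notebook, {"metadata": {"path": output_dir}})

    ipynb_path, html_path = output_paths(notebook_path, run_name, output_dir)
    with open(ipynb_path, "w", encoding="utf-8") as handle:
        nbformat.write(notebook, handle)

    # Static HTML serves as the read-only record of methods.
    html_body, _resources = HTMLExporter().from_notebook_node(notebook)
    with open(html_path, "w", encoding="utf-8") as handle:
        handle.write(html_body)
    return ipynb_path, html_path
```

A driver script would call something like `run_notebook("pair_counting.ipynb", "run1", "results", min_quality=28)` once per pipeline stage, where the notebook name and the `min_quality` parameter are placeholders.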


Notebooks in the Wild
• A sample NGS pipeline using Jupyter Notebooks
  ◦ Goal: identify gene pairs with synergistic survival effects (positive or negative)
  ◦ Experimental system: Dual-gene knock-outs in human cell lines using CRISPR
  ◦ Read-out: number of instances of each CRISPR guide in final population, assessed by NGS

Pipeline diagram stages: Scaffold Trimming, Pair Filtration, Pair Counting, Count Visualization, Count Combination

Jupyter logo courtesy of http://jupyter.org/
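
To make the read-out concrete, here is a sketch of a pair-counting step on its own. It assumes each read has already been trimmed down to its two guide sequences; the sequences, guide names, and the exact-match rule are hypothetical and are not taken from the CCBB pipeline.

```python
from collections import Counter


def count_guide_pairs(read_pairs, guide_names):
    """Count reads per (guide, guide) pair, skipping reads with unknown guides."""
    counts = Counter()
    for first_seq, second_seq in read_pairs:
        first = guide_names.get(first_seq)
        second = guide_names.get(second_seq)
        if first is None or second is None:
            continue  # at least one sequence is not in the guide library
        counts[(first, second)] += 1
    return counts


# Hypothetical guide library: sequence -> guide name.
guide_names = {"ACGTACGT": "geneA_g1", "TTGGCCAA": "geneB_g1"}

# Hypothetical trimmed reads: (first guide sequence, second guide sequence).
read_pairs = [
    ("ACGTACGT", "TTGGCCAA"),
    ("ACGTACGT", "TTGGCCAA"),
    ("ACGTACGT", "GGGGGGGG"),  # second sequence is not a known guide
    ("TTGGCCAA", "ACGTACGT"),
]

pair_counts = count_guide_pairs(read_pairs, guide_names)
print(pair_counts)
```

In a notebook-based pipeline, a function like this would live in one stage's notebook, with the input file locations injected as parameters by the driver script.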


Conclusions
• Jupyter Notebooks are a fantastic tool for data analysis, but:
• Their twin goals of interactivity and reproducibility are often at odds
• Notebooks can be scripted to reduce error potential
  ◦ And notebook-based pipelines self-document nicely!
• CCBB has implemented a sample Jupyter-based pipeline for NGS data from dual CRISPR screens
  ◦ Pipeline is part of work with Drs. Prashant Mali & Trey Ideker, now in press at Nature Methods
  ◦ Code is available in the "CRISPR" section of CCBB's jupyter-genomics repository on GitHub
    § https://github.com/ucsd-ccbb/jupyter-genomics
• CCBB's Data Science Blog gives a further intro to notebook scripting
  ◦ http://ccbb.bio/outreach/data-science-blog/
• Reproducible data analysis is hard work, but worth the effort!

http://ccbb.bio


Acknowledgments
• Fernando Perez & the Jupyter Project!
• Dual CRISPR Team
  ◦ Mali lab
  ◦ Ideker lab
• CCBB Team
  ◦ Katie Fisch (Director)
  ◦ Roman Sasik
  ◦ Guorong Xu
  ◦ Brin Rosenthal
• Our funders
  ◦ UC San Diego Health Sciences
  ◦ CTRI Center for Accelerating Drug Development (CADD), Grant UL1TR001442