Enabling Reproducible NGS Analysis Through Automated Jupyter Pipelines
Amanda Birmingham
Senior Bioinformatics Engineer
Center for Computational Biology & Bioinformatics, UCSD
Reproducible Research
• Repeatability & reproducibility are key to the scientific method
◦ In 1663, only Robert Boyle and Christiaan Huygens could produce a vacuum, and their findings didn’t agree
• Informatics should be at the forefront of reproducible research
◦ Doing the same thing over and over is what computers do best!
◦ But it has taken a long time for methods reports for computational work to become as good as those for wet lab work
◦ Ex: Proc Natl Acad Sci USA. 1986 Jun;83(11):3746-50
◦ Progress:
§ “Alignments were run”
§ “Alignments were run with BLAST”
§ “Alignments were run with BLASTN version 2.2.6 against human”
§ “Alignments were run with NCBI BLASTN v.2.2.9 using the command blastn -W 7 -q -1 -F F against the NCBI RefSeq release 80 human transcriptome”
• Parity with wet-lab methods shouldn’t be the end of the road!
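One way to make the most detailed form of that methods statement routine is to generate it from a record of what was actually run. The sketch below is illustrative only: the `AlignmentRun` class and its field names are hypothetical, and the values are copied from the last example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentRun:
    """Hypothetical record of one alignment run (not part of any library)."""
    tool: str
    version: str
    command: str
    database: str

    def methods_sentence(self):
        # Build the fully specified sentence from the recorded fields.
        return (
            f"Alignments were run with {self.tool} v.{self.version} "
            f"using the command {self.command} against {self.database}."
        )


run = AlignmentRun(
    tool="NCBI BLASTN",
    version="2.2.9",
    command="blastn -W 7 -q -1 -F F",
    database="the NCBI RefSeq release 80 human transcriptome",
)
sentence = run.methods_sentence()
print(sentence)
```

Because the sentence is derived from the same record that drives the run, the report cannot silently drift from what was executed.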
What Is Jupyter?
• What is Jupyter?
◦ "Open source, interactive data science and scientific computing across over 40 programming languages"
§ Grew out of the IPython project, which started in 2001 when Dr. Fernando Perez was procrastinating on his Physics PhD :)
◦ A "literate computing" environment, "weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece" -- Fernando Perez
• Computing platform is named "jupyter" because early languages were Julia, Python, and R
◦ Community-maintained kernels for other languages: Bash, C, C++, C#, Fortran, Go, Haskell, Javascript, Lisp, Mathematica, Matlab, Perl, PHP, Powershell, Ruby, SAS, Scala, Scheme, and many more
• Most well-known for a web-based “notebook” system
◦ Allows writing & running of code from browser environment
◦ Can mix in HTML, links, images, interactive controls, extensions
Jupyter logo courtesy of http://jupyter.org/
What Is Jupyter, Really?
Jupyter Notebooks: Friend or Foe?
• Are notebooks the key to reproducibility?
◦ Data Carpentry offers an entire workshop on "Reproducible Research using Jupyter Notebooks"
• Easy to save, modify, and extend
◦ Great for rerunning or tweaking previous data analyses
• CCBB delivers analyses as notebooks
◦ Report becomes more than a record: it is itself a tool!
• Notebooks’ greatest strength is interactivity
◦ Between input and output
◦ Between (e.g.) Python and R
◦ Between narrative and code
◦ Between material and reader
(Inter-)Actively Dangerous
• Interactivity can also be a huge danger to reproducibility
• Humans are inconsistent
◦ We make unpredictable mistakes
(Slide image: “I NEEVR MAKE TYPOS!”)
◦ Thus, “interactive” = “bad” for repetitive tasks
§ Like primary NGS analysis pipelines
• Jupyter Notebooks can be inconsistent, too
◦ Changing code/variables in a notebook does NOT rerun cells that depend on that change
◦ In fact, doesn’t even clear old outputs!
◦ Thus, “interactive” = “bad” for important records
§ Like experimental records (i.e., methods)
• Do we have to give up other advantages of Jupyter Notebooks when building pipelines and recording methods?
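The stale-state problem described on this slide can be imitated without Jupyter at all. The following is a minimal sketch, assuming a notebook kernel behaves like a single shared namespace in which each cell is executed; the `run_cell` helper and the toy cells are invented for illustration.

```python
def run_cell(source, namespace):
    """Execute one 'cell' in a shared namespace, roughly as a kernel would."""
    exec(source, namespace)


# Two toy cells: the second depends on a variable set in the first.
cells = [
    "min_count = 10",
    "kept = [c for c in [4, 12, 30] if c >= min_count]",
]

# First pass: run the notebook top to bottom.
kernel = {}
for cell in cells:
    run_cell(cell, kernel)

# Interactive edit: change the first cell and rerun ONLY that cell.
cells[0] = "min_count = 20"
run_cell(cells[0], kernel)
stale = list(kernel["kept"])  # computed with the old threshold

# Scripted rerun: fresh namespace, every cell, in order.
fresh_kernel = {}
for cell in cells:
    run_cell(cell, fresh_kernel)
fresh = fresh_kernel["kept"]

print("after partial rerun:", stale)
print("after full rerun:   ", fresh)
```

The partial rerun leaves `kept` reflecting the old threshold, while the full rerun from a clean namespace does not, which is the behavior a scripted pipeline relies on.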
Scripting Jupyter Notebooks
• No! We can have our cake and eat it, too :)
• Jupyter ships with the nbconvert package, which can execute notebooks from Python
• An extension, nbparameterise (note British spelling), allows injection of new variable values
• nbformat (also built in) and nbconvert can read and write notebooks and export static HTML, respectively
• With these three pieces, we can script pipelines built from Jupyter Notebooks
◦ Notebooks give readability and reusability
◦ Script prevents human errors and speeds execution
◦ HTML output of notebooks provides read-only record of methods
• Entire approach takes less than one page of code
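The three pieces named on this slide can be combined roughly as follows. This is a sketch under stated assumptions, not the code from the talk: the function names `run_notebook` and `output_paths`, the `_run` file suffix, the 600-second timeout, and the `python3` kernel name are all illustrative choices. It assumes nbformat, nbconvert, and nbparameterise are installed and that the template notebook defines its parameters as simple assignments in its first code cell.

```python
from pathlib import Path


def output_paths(template_path, out_dir):
    """Return (executed .ipynb path, .html path) for a run; naming is hypothetical."""
    stem = Path(template_path).stem
    out_dir = Path(out_dir)
    return out_dir / f"{stem}_run.ipynb", out_dir / f"{stem}_run.html"


def run_notebook(template_path, out_dir, **new_values):
    """Inject values into a template notebook, execute it, and save .ipynb and .html."""
    # Imported here so this module still loads where the Jupyter stack is absent.
    import nbformat
    from nbconvert import HTMLExporter
    from nbconvert.preprocessors import ExecutePreprocessor
    from nbparameterise import (
        extract_parameters, parameter_values, replace_definitions,
    )

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ipynb_path, html_path = output_paths(template_path, out_dir)

    # Read the template as a version-4 notebook.
    nb = nbformat.read(str(template_path), as_version=4)

    # Swap in the new parameter values without executing yet.
    params = parameter_values(extract_parameters(nb), **new_values)
    nb = replace_definitions(nb, params, execute=False)

    # Execute every cell, top to bottom, in a fresh kernel.
    executor = ExecutePreprocessor(timeout=600, kernel_name="python3")
    executor.preprocess(nb, {"metadata": {"path": str(out_dir)}})

    # Keep the executed notebook and a static, read-only HTML record.
    nbformat.write(nb, str(ipynb_path))
    body, _resources = HTMLExporter().from_notebook_node(nb)
    html_path.write_text(body, encoding="utf-8")
    return ipynb_path, html_path
```

A call might look like `run_notebook("Scaffold Trimming.ipynb", "runs/demo", g_num_processors=4)`, where the notebook name and the parameter `g_num_processors` are placeholders for whatever the template actually defines.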
Notebooks in the Wild
• A sample NGS pipeline using Jupyter Notebooks
◦ Goal: identify gene pairs with synergistic survival effects (positive or negative)
◦ Experimental system: Dual-gene knock-outs in human cell lines using CRISPR
◦ Read-out: number of instances of each CRISPR guide in final population, assessed by NGS
[Pipeline diagram; step labels: Scaffold Trimming, Pair Filtration, Pair Counting, Count Visualization, Count Combination]
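A driver script for a multi-step pipeline like this one could be as small as the sketch below. It is illustrative only: `run_pipeline` and `demo_step` are hypothetical names, the step order simply follows the order in which the labels appear on the slide, and the real per-step runner (for example, one that executes a notebook) is passed in as a function.

```python
# Step labels from the slide; the order here is an assumption.
STEPS = [
    "Scaffold Trimming",
    "Pair Filtration",
    "Pair Counting",
    "Count Visualization",
    "Count Combination",
]


def run_pipeline(steps, shared_params, run_step):
    """Run each step in order with the same parameters; stop at the first failure.

    Returns a list of (step_name, status) pairs as a simple run log.
    """
    log = []
    for name in steps:
        try:
            run_step(name, **shared_params)
        except Exception as err:
            log.append((name, f"failed: {err}"))
            break  # later steps depend on earlier output, so do not continue
        log.append((name, "ok"))
    return log


# Stand-in runner for demonstration; a real one would execute a notebook.
calls = []


def demo_step(name, **params):
    calls.append((name, params["run_prefix"]))


log = run_pipeline(STEPS, {"run_prefix": "demo_run"}, demo_step)
for name, status in log:
    print(f"{name}: {status}")
```

Passing the same parameter dictionary to every step keeps run-level settings (here a made-up `run_prefix`) in one place, so no human has to retype them per notebook.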
Conclusions
• Jupyter Notebooks are a fantastic tool for data analysis, but:
• Their twin goals of interactivity and reproducibility are often at odds
• Notebooks can be scripted to reduce error potential
◦ And notebook-based pipelines self-document nicely!
• CCBB has implemented a sample Jupyter-based pipeline for NGS data from dual CRISPR screens
◦ Pipeline is part of work with Drs. Prashant Mali & Trey Ideker, now in press at Nature Methods
◦ Code is available in the “CRISPR” section of CCBB's jupyter-genomics repository on GitHub
§ https://github.com/ucsd-ccbb/jupyter-genomics
• CCBB’s Data Science Blog gives a further intro to notebook scripting
◦ http://ccbb.bio/outreach/data-science-blog/
• Reproducible data analysis is hard work, but worth the effort!
http://ccbb.bio
Acknowledgments
• Fernando Perez & the Jupyter Project!
• Dual CRISPR Team
◦ Mali lab
◦ Ideker lab
• CCBB Team
◦ Katie Fisch (Director)
◦ Roman Sasik
◦ Guorong Xu
◦ Brin Rosenthal
• Our funders
◦ UC San Diego Health Sciences
◦ CTRI Center for Accelerating Drug Development (CADD), Grant UL1TR001442