Building bioinformatic pipelines

6/20/2019

P. Zumbo

What is a pipeline?

A pipeline or workflow refers to a series of processing steps such that the output of each process is the input of the next, typically done to transform raw data into something more interpretable.

Why bother building pipelines?

1. Reproducibility
2. Data provenance
3. Automation
4. Transparency

Pipelines aid in reproducibility

Reproducibility = obtaining the same result* using the same code and data

* within reason (e.g., some aligners assign multi-mapping reads to a random location)

Data provenance contextualizes results

Provenance refers to the description of the origin of a piece of data
• The steps taken to arrive at a piece of data
• The software used
• The version of the software used
• The arguments supplied to the software used
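One way to capture these items from inside a script is to write a log record for every step just before running it. The following is a minimal sketch, not a standard tool: the helper name `run_logged`, the log file name `provenance.log`, and the placeholder version string are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: record when, with which version, and with which arguments a step ran.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
LOG=provenance.log   # assumed log file name

run_logged() {
    # $1 = version string supplied by the caller; remaining args = command line
    version=$1; shift
    printf '%s\tversion=%s\tcmd=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$version" "$*" >> "$LOG"
    "$@"
}

# Demonstration with a command that is always present. For a real tool the
# caller could pass something like "$($BOWTIE --version | head -n 1)".
run_logged "placeholder-version" echo "hello" > step.out
cat "$LOG"
```

Each line of the log then records one step, the software version, and the exact arguments, which covers the four items listed above.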

Automation: the amount of data keeps increasing

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Automation: some pipelines are complex

From: A review of bioinformatic pipeline frameworks. Brief Bioinform. 2016;18(3):530-536. doi:10.1093/bib/bbw020

Simple alignment pipeline with bowtie2

# align reads with bowtie2
bowtie2 -x ref.fa -U short_read.fq > aln-se.sam
# convert from sam to bam
samtools view -bS aln-se.sam > aln-se.bam
# sort bam file
samtools sort aln-se.bam > aln-se.sorted.bam

Simple sample script

#!/usr/bin/env bash
## tools
BOWTIE=/usr/local/bin/bowtie2 # v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools # v1.9
## reference genome
REFERENCE=/usr/local/ref/e_coli.fa
$BOWTIE -x $REFERENCE -U A.fastq.gz > A.sam
$SAMTOOLS view -bS A.sam > A.bam
$SAMTOOLS sort A.bam > A.sorted.bam

For loops

#!/usr/bin/env bash
BOWTIE=/usr/local/bin/bowtie2
SAMTOOLS=/usr/local/bin/samtools
REFERENCE=/usr/local/ref/e_coli.fa
# iterate over the glob directly rather than parsing the output of ls
for read in *.fastq.gz; do
    $BOWTIE -x $REFERENCE -U "$read" > "${read/.fastq.gz/.sam}"
    $SAMTOOLS view -bS "${read/.fastq.gz/.sam}" > "${read/.fastq.gz/.bam}"
    $SAMTOOLS sort "${read/.fastq.gz/.bam}" > "${read/.fastq.gz/.sorted.bam}"
done
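A dry run makes the naming scheme in the loop easier to check before any alignment is started. The sketch below prints the three commands per sample instead of executing them; the sample names, the `REFERENCE` placeholder, and the helper `plan_sample` are illustrative assumptions. It strips the suffix with the POSIX form `${name%.fastq.gz}`, which only matches at the end of the name, whereas `${read/.fastq.gz/.sam}` replaces the first match anywhere in it.

```shell
#!/bin/sh
# Dry run: print the commands the loop would execute for each input name.
# File names are hypothetical; nothing is aligned or written.
plan_sample() {
    read_file=$1
    sample=${read_file%.fastq.gz}      # strip the suffix once: A.fastq.gz -> A
    echo "bowtie2 -x REFERENCE -U $read_file > $sample.sam"
    echo "samtools view -bS $sample.sam > $sample.bam"
    echo "samtools sort $sample.bam > $sample.sorted.bam"
}

for f in A.fastq.gz B.fastq.gz; do
    plan_sample "$f"
done
```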

GNU parallel

tool for processing repetitive commands

parallel [options] [command [arguments]] ::: <files>

• ::: <files> or find <files> |
• The filename: {}
• The filename with the extension removed: {.} e.g. test.fa would become test
• --jobs, -j n
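Because `{.}` removes only the last extension, a doubly suffixed name such as `A.fastq.gz` becomes `A.fastq`, not `A`. The shell expansion `${name%.*}` behaves the same way and can be used to preview the result without GNU parallel installed. This is an analogy in plain shell, with a hypothetical helper name `strip_last_ext`, not a call into parallel itself.

```shell
#!/bin/sh
# Shell analog of parallel's {.}: drop the last extension only.
strip_last_ext() {
    printf '%s\n' "${1%.*}"
}

strip_last_ext test.fa        # prints: test
strip_last_ext A.fastq.gz     # prints: A.fastq
strip_last_ext A.fastq        # prints: A
```

Decompressing the reads first, so that the inputs end in `.fastq`, keeps `{.}.sam` from expanding to a name like `A.fastq.sam`.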

GNU parallel pipeline

THREADS=2
parallel --jobs $THREADS gunzip {} ::: *fastq.gz
parallel --jobs $THREADS $BOWTIE -x $REFERENCE -U {} ">" {.}.sam ::: *fastq
parallel --jobs $THREADS $SAMTOOLS view -bS {.}.sam ">" {.}.bam ::: *sam
parallel --jobs $THREADS $SAMTOOLS sort {.}.bam ">" {.}.sorted.bam ::: *bam

A brief history of make

• first introduced by Stuart Feldman in 1977 at Bell Labs
• build automation tool
• used to build executable programs and libraries from source code
• however, make is not limited to building binaries and libraries

Key features of make

• Dependency analysis
• Re-entrancy
• Parallelization
• Pattern rules / abstraction
• Audit trail

what is make?

make is a program that reads a makefile and that builds one or more files from zero or more other files that they depend on.

how does make do what it does?

make parses the makefile, builds a dependency tree (by determining the relationships between the inputs and outputs), and then traverses each branch of the tree, executing commands along the way.

what is a makefile?

a makefile is a text file which contains rules for how to create a set of target files.

what is a rule?

a rule tells make which series of commands to execute and what files must exist beforehand in order to create a set of targets from some input.

the general form of a rule is:

target … : dependency …
	command
	…
	…
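The decision make takes for each rule can be approximated in a few lines of shell: the commands run if the target is missing or if any dependency is newer than it. The sketch below covers only that timestamp check; the function name `needs_rebuild` and the file names are assumptions, and real make additionally brings the dependencies themselves up to date first.

```shell
#!/bin/sh
# Returns 0 (rebuild) if the target is missing or older than any dependency.
needs_rebuild() {
    target=$1; shift
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        [ "$dep" -nt "$target" ] && return 0
    done
    return 1
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch -t 202001010000 A.fastq.gz
needs_rebuild A.sam A.fastq.gz && echo "A.sam: build (target missing)"

touch -t 202001020000 A.sam
needs_rebuild A.sam A.fastq.gz || echo "A.sam: up to date"

touch -t 202001030000 A.fastq.gz
needs_rebuild A.sam A.fastq.gz && echo "A.sam: rebuild (input is newer)"
```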

a practical example: alignment

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa

all: A.sam

A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam

adding another step:

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa

all: A.bam

A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam

A.bam: A.sam
	$(SAMTOOLS) view -bS A.sam > A.bam

automatic variables

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1
SAMTOOLS=/usr/local/bin/samtools #v1.9
REFERENCE=/usr/local/ref/e_coli.fa

all: A.bam

# $< is the first dependency of the rule, $@ is its target
A.sam: A.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U $< > $@

A.bam: A.sam
	$(SAMTOOLS) view -bS $< > $@

using pattern rules: the percent sign

%: roughly equivalent to * in a Unix shell
- represents any number of any characters
- can be placed anywhere within pattern
- can only occur once

some valid uses: %.v   s%.o   wrapper_%
- characters other than % match literally within a filename
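The matching described above can be imitated with a shell `case` pattern, which also shows what the `%` captured (make calls this the stem). This is a rough sketch with a hypothetical helper `stem_of`; it assumes, as GNU make does, that `%` must match at least one character, and it ignores how make treats directory parts of a name.

```shell
#!/bin/sh
# Print the stem if NAME matches PATTERN (which contains one %), else return 1.
stem_of() {
    pattern=$1; name=$2
    prefix=${pattern%%\%*}     # literal text before the %
    suffix=${pattern#*\%}      # literal text after the %
    case $name in
        "$prefix"?*"$suffix") ;;   # % must cover at least one character
        *) return 1 ;;
    esac
    stem=${name#"$prefix"}
    stem=${stem%"$suffix"}
    printf '%s\n' "$stem"
}

stem_of '%.sam' A.sam                 # prints: A
stem_of 's%.o' src.o                  # prints: rc
stem_of 'wrapper_%' wrapper_align     # prints: align
stem_of '%.sam' A.bam || echo "no match"
```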

revisiting alignment…

FASTQFILES := $(wildcard *.fastq.gz)

all: $(FASTQFILES:.fastq.gz=.sorted.bam)

# use $< and $@ so that each rule works for any sample, not only A
%.sam: %.fastq.gz
	$(BOWTIE) -x $(REFERENCE) -U $< > $@

%.bam: %.sam
	$(SAMTOOLS) view -bS $< > $@

%.sorted.bam: %.bam
	$(SAMTOOLS) sort $< > $@

visualizing the dependency tree

[Figure: dependency tree. Nodes: default; Sample1.bam, Sample2.bam; Sample1.sam, Sample2.sam; Sample1.fastq.gz, Sample2.fastq.gz; makefile]

the -j switch

-j [jobs], --jobs[=jobs]: specifies the number of jobs (commands) to run simultaneously.
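The effect of running a fixed number of jobs at once can be sketched in plain shell with background jobs and `wait`. The example below is a crude batch version under stated assumptions: `process` is a hypothetical stand-in for one pipeline command, and the sample names are invented. Unlike make -j, it waits for a whole batch to finish before starting the next one and knows nothing about dependencies.

```shell
#!/bin/sh
JOBS=2                      # analogous to make -j 2
workdir=$(mktemp -d)

process() {                 # hypothetical stand-in for one alignment command
    echo "processed $1" > "$workdir/$1.out"
}

count=0
for sample in A B C D E; do
    process "$sample" &             # start the job in the background
    count=$((count + 1))
    if [ $((count % JOBS)) -eq 0 ]; then
        wait                        # batch is full: wait for it to finish
    fi
done
wait                                # wait for any remaining jobs
ls "$workdir"
```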

why make? the limits of a script:

1. linear execution
   • make -j
2. truncated files
   • .DELETE_ON_ERROR:
3. unable to resume
   • make
4. poor audit trail
   • make -nB > make.log
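Points 2 and 3 can be partly worked around inside a script, which shows what make provides for free. In the sketch below, each step writes to a temporary file that is renamed only on success, and a step is skipped when its output already exists. The helper name `run_step` and the file names are assumptions; the existence check is also cruder than make, which compares timestamps.

```shell
#!/bin/sh
# run_step OUTPUT COMMAND...: skip if OUTPUT exists; otherwise write to a
# temporary file and rename it only if COMMAND succeeds.
run_step() {
    output=$1; shift
    if [ -e "$output" ]; then
        echo "skip $output (already exists)"
        return 0
    fi
    tmp="$output.tmp.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$output"
    else
        status=$?
        rm -f "$tmp"                # never leave a truncated output behind
        echo "failed: $output not created" >&2
        return "$status"
    fi
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

run_step ok.txt echo "aligned"                       # creates ok.txt
run_step bad.txt sh -c 'echo partial; exit 1' \
    || echo "step failed, no partial output kept"    # bad.txt is not created
run_step ok.txt echo "aligned again"                 # skipped on rerun
```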

Limitations of make

• Wasn't designed for bioinformatic analyses
• Syntax requires understanding rule structure
• Lacks support for multiple outputs from a single command
• No support for multiple wildcards per name
• No built-in support for distributed computing

Ways to parallelize

• Single computer, single core
• Single computer, multiple cores
• Multiple computers, multiple cores

Image from: http://cloudcomputingnet.com/category/clouldcomputing/grid-computing/

Future trends

Singularity containers

Image from: https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/#foobox-3/0/Singularity-architecture_G-Kurtzer-e1477021972985.jpg

Many contemporary alternatives to make

https://github.com/pditommaso/awesome-pipeline

CWL

From: https://www.commonwl.org/user_guide/02-1st-example/

Pipelines are the tip of the iceberg concerning reproducibility

From: Experimenting with reproducibility in bioinformatics. Yang-Min Kim, Jean-Baptiste Poline, Guillaume Dumas. bioRxiv 143503; doi: https://doi.org/10.1101/143503
