BPIPE: a bioinformatics pipeline framework


Speaker: Mohamed Nadhir Djekidel (那弟尔), 2015/11/06

WHY WE NEED PIPELINES

➤ A bioinformatics analysis is generally a sequence of steps.

➤ Some analyses need a combination of tools (bowtie, samtools, etc.).

➤ Some tasks are repetitive (especially if we have many files).

➤ The script has to be edited by hand if a program crashes midway.

➤ Scripts are sometimes hard-coded and therefore not portable.

➤ …

MOTIVATIONS BEHIND BPIPE

➤ A dedicated programming language for defining and executing bioinformatics pipelines

➤ Little programming skill is needed

➤ Simple definition of tasks

➤ Easy restart of a job from the point of failure

➤ Easy parallelism and job sequence management

➤ Integration with cluster resource managers (SGE, PBS, LSF)

➤ Modular development of reusable pipeline stages

➤ Automatic logging

BPIPE'S ARCHITECTURE

➤ Bpipe language: based on Groovy, but knowing shell scripting is generally enough.

➤ The Bpipe job management tool: BASH shell + Java

➤ Log management: creates a .bpipe folder

➤ Communication with resource managers: sending jobs to the queue, etc.

BASIC BPIPE STRUCTURES

stage_one

stage_two
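A minimal sketch of this structure, reusing the stage names stage_one and stage_two from the slide. The echo commands are placeholders and are not taken from the original slides.

```groovy
// Each stage is a Groovy closure assigned to a name.
stage_one = {
    // exec runs a shell command; the echo is only a placeholder
    exec "echo 'running stage one'"
}

stage_two = {
    exec "echo 'running stage two'"
}

// The + operator chains stages so that they run one after the other.
Bpipe.run {
    stage_one + stage_two
}
```

Saved under a name such as pipeline.groovy (a hypothetical file name), the script would be started with `bpipe run pipeline.groovy`.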

CONVERT A SHELL SCRIPT TO BPIPE

Original BASH script

BPIPE Script
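The scripts shown on the slide are not available, so the following is only an illustrative sketch of such a conversion. The two shell commands, the index name genome and the sample.* file names are assumptions chosen for the example.

```groovy
// Hypothetical original BASH script (two commands run in sequence):
//   bowtie2 -x genome -U sample.fastq -S sample.sam
//   samtools view -bS sample.sam > sample.bam

// Each command becomes its own stage; file names are still hard-coded here.
align = {
    exec "bowtie2 -x genome -U sample.fastq -S sample.sam"
}

to_bam = {
    exec "samtools view -bS sample.sam > sample.bam"
}

// Run the stages in the same order as the original script.
Bpipe.run {
    align + to_bam
}
```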

DYNAMIC INPUT AND OUTPUT

Use the variables $input and $output instead of hard-coded file names
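As a sketch of what this looks like, here is a two-stage pipeline in which no file name is hard-coded. The bowtie2 and samtools commands and the index name genome are assumptions made for illustration.

```groovy
align = {
    // $input is the file passed to this stage,
    // $output is the output file name chosen by Bpipe
    exec "bowtie2 -x genome -U $input -S $output"
}

to_bam = {
    // the $input of this stage is the $output of the previous stage
    exec "samtools view -bS $input > $output"
}

Bpipe.run {
    align + to_bam
}
```

The input file is then given on the command line, for example `bpipe run pipeline.groovy sample.fastq` (both file names are hypothetical), so the same script can be reused for any sample.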

PARALLEL TASKS

Use square brackets [ ] to specify parallel tasks

(Diagram: step1 runs first, then step2 and step3 run in parallel.)
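A possible script for the first diagram, with step1 followed by step2 and step3 in parallel. Bpipe writes parallel stages as a list in square brackets; the echo commands are placeholders.

```groovy
step1 = { exec "echo step1" }
step2 = { exec "echo step2" }
step3 = { exec "echo step3" }

Bpipe.run {
    // step2 and step3 both start once step1 has finished
    step1 + [ step2, step3 ]
}
```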

(Diagram: step1 runs first, then two parallel branches run, made up of step2, step3, step4 and step5.)

PARALLEL TASKS - CONT

(Diagram: step1, then two parallel branches made up of step2, step3, step4 and step5, then step6.)

Step6 waits until both branches have finished.
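The diagram with a final merging stage could be written as follows. The sketch assumes that one branch is step2 followed by step3 and the other is step4 followed by step5; the extracted slide does not make the branch layout certain. The echo commands are placeholders.

```groovy
step1 = { exec "echo step1" }
step2 = { exec "echo step2" }
step3 = { exec "echo step3" }
step4 = { exec "echo step4" }
step5 = { exec "echo step5" }
step6 = { exec "echo step6" }

Bpipe.run {
    // Two branches run side by side; each branch is itself a sequence.
    // step6 starts only after both branches have completed.
    step1 + [ step2 + step3, step4 + step5 ] + step6
}
```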

RUN ON A CLUSTER

➤ Create a bpipe.config file in your working directory

➤ Select the SGE system and specify the configuration (optional)
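A minimal sketch of such a configuration file. The file uses Groovy syntax; the queue name is a placeholder for a hypothetical cluster and should be replaced with a local one.

```groovy
// bpipe.config, placed in the directory the pipeline is run from

// Submit the commands of each stage through Sun Grid Engine
executor = "sge"

// Optional: name of the queue to submit to (placeholder value)
queue = "all.q"
```

With this file in place the pipeline is started as usual with `bpipe run`, and the commands are sent to the queue instead of being run locally.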

PIPELINE REPORT

When the pipeline is run with the report option (bpipe run -r), a file index.html is generated in the doc folder

INPUT SPLIT

➤ Inputs can be grouped using patterns on the file names

➤ * is a general wildcard; the part it matches determines the ordering of files within a group

➤ % marks the part of the file name used for splitting the inputs into groups

Example
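Since the example from the slide is not available, here is a sketch for paired-end reads. The file names, the index name genome and the bowtie2 command are assumptions made for illustration.

```groovy
// Hypothetical input files:
//   sampleA_R1.fastq.gz  sampleA_R2.fastq.gz
//   sampleB_R1.fastq.gz  sampleB_R2.fastq.gz

align = {
    // $input1 and $input2 are the first and second file of the current group
    exec "bowtie2 -x genome -1 $input1 -2 $input2 -S $output"
}

Bpipe.run {
    // % matches the part that defines the group (sampleA, sampleB);
    // * matches the part that orders the files inside a group (1, 2).
    // align runs once per group, and the groups run in parallel.
    "%_R*.fastq.gz" * [ align ]
}
```

It would be run with all files on the command line, for example `bpipe run pipeline.groovy *.fastq.gz`.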

INPUT SPLIT - EXAMPLES

Input

The script

Default parameters

INPUT SPLIT - EXAMPLES

Pass individual files

Order alphabetically

Group files

CONTROLLING OUTPUT NAMING

Filter: keeps the same extension and inserts the filter name before it

file.csv → file.nocomments.csv

Transform: changes the extension

file.csv → file.xml

file.fast.gz → file_fast.zip
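A sketch of stages that would produce the names listed above. The csv2xml converter is hypothetical, and the last stage uses the transform(...) to(...) form with the patterns suggested by the third example on the slide.

```groovy
remove_comments = {
    // filter keeps the extension: file.csv -> file.nocomments.csv
    filter("nocomments") {
        exec "grep -v '^#' $input > $output"
    }
}

to_xml = {
    // transform replaces the extension: file.csv -> file.xml
    transform("xml") {
        // csv2xml stands for any converter; it is not a real tool name
        exec "csv2xml $input > $output"
    }
}

compress_report = {
    // transform ... to maps an input pattern to an output pattern:
    // file.fast.gz -> file_fast.zip
    transform(".fast.gz") to("_fast.zip") {
        exec "zip $output $input"
    }
}

// Only the two CSV stages are chained here; compress_report is shown
// for its naming rule and would be used in a separate pipeline.
Bpipe.run {
    remove_comments + to_xml
}
```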

CONTROLLING OUTPUT NAMING

Produce: produces an output file with the specified name
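A minimal sketch of produce; the file name hello.txt is an arbitrary choice for the example.

```groovy
hello = {
    // Inside the block, $output is exactly the name given to produce
    produce("hello.txt") {
        exec "echo 'Hello' > $output"
    }
}

Bpipe.run {
    hello
}
```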

RUNNING R CODE
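The slide content is not available here. The sketch below assumes Bpipe's R { } block, which passes an inline script to R, and a hypothetical input table whose values are plotted to a PNG file.

```groovy
plot_values = {
    // Name the output after the input, with a .png extension
    transform("png") {
        // $input and $output are substituted before the script is handed to R
        R {"""
            values <- read.table("$input")
            png("$output")
            plot(values)
            dev.off()
        """}
    }
}

Bpipe.run {
    plot_values
}
```

Because $ introduces a Bpipe variable inside the script text, a literal $ in the R code (for example in values$column) would need to be escaped.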

SOME COMMANDS

ADDING INFORMATION TO THE SCRIPT
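One way to attach information to a script is sketched below, assuming Bpipe's about and doc statements. The titles, description and author address are made-up values, and the bowtie2 command is only an illustration.

```groovy
// Pipeline-level information
about title: "Example alignment pipeline"

align = {
    // Stage-level information
    doc title: "Align reads",
        desc: "Aligns single-end reads to the reference index",
        author: "someone@example.com"

    exec "bowtie2 -x genome -U $input -S $output"
}

Bpipe.run {
    align
}
```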

USEFUL TUTORIALS

➤ Download bpipe: https://github.com/ssadedin/bpipe

➤ Documentation: http://docs.bpipe.org/

➤ A complete workshop: https://github.com/tucano/bpipe_workshop

➤ Paper : http://bioinformatics.oxfordjournals.org/content/28/11/1525.full


THANKS
