Model of a real workflow

A subset of the plasmodb pipeline

(in progress!)

And issues to discuss…

PlasmoDB workflow

P. falciparum standard genome
P. gallinaceum standard genome
P. vivax standard genome
P. yoelii standard genome
P. berghei standard genome
P. chabaudi standard genome
P. knowlesi standard genome
P. reichenowi standard genome
P. falciparum non-standard
synteny

Standard Genome Workflow

blastx NRDB genome
Splign (In: Pf, Pb, Py, Pv)
blastp NRDB proteins
molecular weight
isoelectric point
molecular weight min/max
psipred
run TMHMM
load TMHMM
taxonomy
SO
NRDB
Genome
TIGR TGI
Extract proteins
Extract genomic sequence
Copy proteins to cluster
Copy genomic seqs to cluster
Calculate translated protein (In: Pf, Pk)

Global steps (oval)
Subflows (double line)
Compile time include/exclude


NRDB

Copy from download site
Shorten defline
NRDB resource
Copy to cluster
Copy to cluster

Resources

acquire

unpack

ext db

Ext db rls

insert

Psipred

fix protein IDs for psipred
create psipred task dir
copy data dir to cluster
copy psipred protein file to cluster
start psipred on cluster
wait for cluster
copy psipred files from cluster
fix psipred file names
make Alg Inv
load psipred
create psipred data dir

BLAST

Create similarity dir
Start blast
Wait for cluster
Copy files from cluster
Extract IDs from blast result
Load subject subset
Load result

Optional step (runtime test)

Splign

run Splign
Extract subject sequence, alt defline
insert Splign
Extract query sequence, alt defline

Issues

Steps

• Subflows
  – Parameters
  – Constants
  – Interpolating variables
• Global steps
  – Steps that are only executed once by the whole workflow, even if in multiple subflows
  – Declare a namespace?
• Include/exclude
  – Compile time inclusion/exclusion
  – If not compiled in, flow passes right through
• Skip-able steps
  – Runtime exclusion, based on a dynamic test
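The step semantics listed above can be illustrated with a small Python sketch. It is not the workflow engine itself: the `Step` class, its field names, the `run` function, and the step names are all hypothetical, chosen only to show a global step running once, a compile-time exclusion passing straight through, and a runtime skip test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    is_global: bool = False   # executed once per workflow, even if several subflows list it
    included: bool = True     # compile-time include/exclude
    skip_test: Optional[Callable[[], bool]] = None  # runtime test; True means skip


def run(subflows: Dict[str, List[Step]]) -> List[str]:
    """Return the names of the steps that would execute, in order."""
    executed: List[str] = []
    globals_done = set()
    for flow_name, steps in subflows.items():
        for step in steps:
            if not step.included:
                continue  # not compiled in: flow passes right through
            if step.skip_test is not None and step.skip_test():
                continue  # runtime exclusion, based on a dynamic test
            if step.is_global:
                if step.name in globals_done:
                    continue  # already run by another subflow
                globals_done.add(step.name)
                executed.append(step.name)
            else:
                executed.append(f"{flow_name}.{step.name}")
    return executed


# Hypothetical example: one global step shared by two genome subflows.
copy_nrdb = Step("copyNrdbToCluster", is_global=True)
flows = {
    "Pf": [copy_nrdb, Step("blastp"), Step("calcTranslatedProtein")],
    "Pv": [copy_nrdb, Step("blastp"),
           Step("calcTranslatedProtein", included=False),
           Step("splign", skip_test=lambda: True)],
}
executed = run(flows)
print(executed)
```

In this sketch the global step is keyed by its bare name, which is one possible answer to the "declare a namespace?" question; a real design might key it differently.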

Step Values

• Avoid side effects in file system (ok in database)
  – All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  – Use Constants
• Avoid hand-coded values that change each build
  – Must be computed by step
  – E.g. blast Y= value
• External Db Rls values
  – Always pass external db rls spec, e.g.
    • Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this
• Table names
  – Want to be able to reuse these values across steps
  – Always use same format, e.g.:
    • Dots.ExternalNaSequence
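As a sketch of what "always use the same format" could look like in practice, the Python below validates the two example values from this slide. It assumes the external db rls spec is `name:version` split at the last colon, and that a table name is `Schema.Table`; both readings are inferred from the examples, and the function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ExtDbRls(NamedTuple):
    name: str
    version: str


def parse_ext_db_rls(spec: str) -> ExtDbRls:
    """Split 'name:version' at the last colon (assumed format)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRls(name.strip(), version.strip())


def parse_table_name(qualified: str) -> Tuple[str, str]:
    """Split 'Schema.Table' into its two parts (assumed format)."""
    schema, sep, table = qualified.partition(".")
    if not sep or not schema or not table or "." in table:
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return schema, table


rls = parse_ext_db_rls("Plasmodium Falciparum Chromosomes:2008-07-13")
table = parse_table_name("Dots.ExternalNaSequence")
print(rls, table)
```

Checking the format in one shared helper would let every step reject a malformed value before any database work starts.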

Cluster

• Wait for cluster step
  – Sends email
  – (takes list of email addresses as config. Maybe we should set up mailing list?)
• Followed by a waitForHuman step.
  – By default is in “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
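One way to read "orthogonal to other states and offline status" is that the wait-for-human condition is a separate flag rather than another value of the step state. The Python sketch below models that reading. The class, the `READY` state name, the `notify` helper, and the email addresses are hypothetical, and no email is actually sent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepStatus:
    state: str = "READY"          # hypothetical state name
    offline: bool = False         # offline status, independent of state
    wait_for_human: bool = False  # independent of both state and offline

    def can_run(self) -> bool:
        return self.state == "READY" and not self.offline and not self.wait_for_human


def notify(addresses: List[str], step_name: str) -> List[str]:
    """Build the messages a wait-for-cluster step might send (nothing is sent here)."""
    return [f"To: {a} | cluster job for {step_name} has finished" for a in addresses]


# The waitForHuman step starts blocked by default.
gate = StepStatus(wait_for_human=True)
messages = notify(["pilot@example.org"], "Pf.psipred")
blocked_before = gate.can_run()

# The pilot turns the flag off, and the step becomes runnable.
gate.wait_for_human = False
runnable_after = gate.can_run()
print(messages, blocked_before, runnable_after)
```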

Configuration

• Steps Configuration
  – Global
    • Commonly used properties
    • Not validated until runtime
  – Static
    • Defined per step class
    • Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file
• Distinguish between stable properties and mutable ones
  – Version numbers often change
    • Svn
• Pilot configuration?
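If the "cascading?" option were adopted, lookup could fall through from step-specific values to step-class values to global ones. The Python sketch below shows one possible arrangement using the standard library's `collections.ChainMap`. All property names, step names, and values are made up for illustration.

```python
from collections import ChainMap

# Hypothetical property sets, from most general to most specific.
global_props = {"clusterServer": "cluster.example.org", "nodeCount": "10"}
class_props = {"BlastStep": {"nodeCount": "20", "blastBinDir": "/usr/local/blast"}}
step_props = {"Pf.blastp": {"nodeCount": "40"}}


def config_for(step_name: str, step_class: str) -> ChainMap:
    """Step values override class values, which override global values."""
    return ChainMap(
        step_props.get(step_name, {}),
        class_props.get(step_class, {}),
        global_props,
    )


pf = config_for("Pf.blastp", "BlastStep")
pv = config_for("Pv.blastp", "BlastStep")
print(pf["nodeCount"], pv["nodeCount"], pv["clusterServer"])
```

A lookup of a missing key raises `KeyError`, so a pass over each step's required keys at startup would catch problems before runtime.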

File & Directory Structure

• Avoid side-effects
• Use explicit input/output params in xml file
• Move to a nested data directory structure?
  – Would use the namespace attribute, somehow

/files/cbil/data/cbil/Plasmodb/5.5/workflow/data/Seqfiles/
nrdb.fsa  Pvivax/
Seqfiles/  Psipred/  Assembly/
ESTs/  Initial/  Intermediate/

• Use path statement, e.g.:
  – ../
  – ../tmhmm
• Steps directories
  – Use nested structure for subflows?
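The path statements above (`../`, `../tmhmm`) could be resolved against a step's own directory inside the nested data tree. The Python sketch below shows one way to do that while refusing paths that escape the data root. The root `/workflow/data`, the `Pvivax/psipred` step directory, and the `resolve` function are hypothetical.

```python
import posixpath


def resolve(data_root: str, step_dir: str, rel: str) -> str:
    """Resolve a relative path statement against a step's directory under data_root."""
    root = posixpath.normpath(data_root)
    full = posixpath.normpath(posixpath.join(root, step_dir, rel))
    # Reject anything that climbs out of the data root.
    if full != root and not full.startswith(root + "/"):
        raise ValueError(f"{rel!r} escapes the data root {root!r}")
    return full


sibling = resolve("/workflow/data", "Pvivax/psipred", "../tmhmm")
parent = resolve("/workflow/data", "Pvivax/psipred", "../")
print(sibling, parent)
```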

GUI

• Should it run in the web context?
  – Security issues
  – Avoids having to have installed software
  – Would work from home
  – All members of team could see the flow
  – Somehow restrict editability
  – Could be posted on real site as documentation?
    • Overkill? Too detailed?
• Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
    • Expand/collapse?

Resource Pipeline

• Not worked out yet

• Needs to be handled by regular subflow

• Unpacks will need to be collapsed into a single unpack script

• Resources.xml file as needed by front end can be produced by a documentation run of the pipeline

• Does it need to be configured in xml, or would a properties file be good enough?

Documentation of the workflow

• Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by front end
    • Methods xml file
    • Resources xml file
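A minimal Python sketch of a documentation mode follows. It assumes each step carries a description alongside its action, and that the same step list is either executed or rendered to XML. The `methods` and `method` element names, the `DocStep` class, and the step descriptions are hypothetical, not the format the front end actually expects.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


class DocStep:
    def __init__(self, name: str, description: str, action: Callable[[], None]):
        self.name = name
        self.description = description
        self.action = action


def run_workflow(steps: List[DocStep], documentation: bool = False) -> Optional[str]:
    """Run the steps, or in documentation mode return an XML description instead."""
    if documentation:
        root = ET.Element("methods")
        for step in steps:
            method = ET.SubElement(root, "method", name=step.name)
            method.text = step.description
        return ET.tostring(root, encoding="unicode")
    for step in steps:
        step.action()
    return None


ran: List[str] = []
steps = [
    DocStep("blastp", "Protein similarity search against NRDB.", lambda: ran.append("blastp")),
    DocStep("tmhmm", "Transmembrane domain prediction.", lambda: ran.append("tmhmm")),
]
methods_xml = run_workflow(steps, documentation=True)
print(methods_xml)
```

A resources file could be produced the same way from whatever metadata the resource steps carry.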

Slides after this are notes, and other junk

Standard resources

taxonomy
EnzymeDB
SO
NRDB
dbEST [tax_id]
GO
GO Codes
Bibliographic Ref terms MO terms MO types MO InterPro MO Entry
Orthomcl phyletic
orthomcl

Plasmodb resources

IEDB epitopes
IEDB dbxrefs
NA Genbank dbrefs
AA Genbank dbrefs
pdb
Pdb index

P. falciparum resources

Zhang ESTs
Apicoplast
Florens 2002
Pf plastid
Florent ESTs
Pf mitochon
Watanabe Pf transcripts
Watanabe Pf ESTs
Pf GO Associations
Sanger IT SNPs
SU SNPs
Broad SNPs
Combined SNPs
DeRisi Oligos
Winzeler Genetic Var. array
DeRisi Dd2
DeRisi HB3
Winzeler Cell Cycle
DeRisi 3D7
Scripps Array
Winzeler Gametocyte
DeRisi Array 7282
MTC KI Array
Baum Meta data
Durasingh Meta data
GSE5247 Meta data
Cowman Meta data
Pfab Array
E-MEXP 449 Meta data
E-MEXP 439 Meta data
Plasmodb Gene ids
E-MEXP 128 Meta data
Waters Meta data
Waters Gametocyte Mass spec
Daily Meta data
GSE2265 Meta data
GSE8099 Meta data
interactome
Waters Female Gametes mass
Mutual info
Plasmo map
y2h
Sage tag Array design
Sage tag freqs
Pf chr Genbank refs
TIGR gene indexes
Baum Array data
Durasingh Array data
GSE5247 Array data
Cowman Array data
E-MEXP 449 Array data
E-MEXP 439 Array data
E-MEXP 128 Array data
Waters Array data
Daily Array data
GSE2265 Array data
GSE8099 Array data
Baum RAD anal
Durasingh RAD anal
GSE5247 RAD anal
Cowman RAD anal
E-MEXP 449 RAD anal
E-MEXP 439 RAD anal
E-MEXP 128 RAD anal
Waters RAD anal
Daily RAD anal
GSE2265 RAD anal
GSE8099 RAD anal
Waters male Gametes mass
Waters mixed Gametes mass
PASA Db refs
Hagai EC
Winzeler Db refs
Winzeler Lit refs
Predicted Protein structs
mr4
Cowman subcellular
Haldar subcellular
Merozoite peptides
Lasonder oocysts
Florens 2004
Broad SNP coverage
evigan
Lasonder Oocyst sporozoites
Entrez Dbrefs
Pubmed dbrefs
Broad bar code
Broad 3k genotyping
Lasonder salivary sporozoites

P. vivax resources

Watanabe Pv transcripts
Pv contigs
Watanabe Pv ESTs
Pv dbrefs
Pv GB dbrefs
Pv mitochon
Pv chromosomes
TIGR gene indexes
C. parvum
C. hominis

Synteny

start

End

Plasmo Toxo

Api

End

Start