Model of a real workflow, and issues to discuss…


Page 2

PlasmoDB workflow

• P. falciparum: standard genome
• P. gallinaceum: standard genome
• P. vivax: standard genome
• P. yoelii: standard genome
• P. berghei: standard genome
• P. chabaudi: standard genome
• P. knowlesi: standard genome
• P. reichenowi: standard genome
• P. falciparum: non-standard
• Synteny

Page 3

Standard Genome Workflow

• blastx NRDB genome
• Splign (in: Pf, Pb, Py, Pv)
• blastp NRDB proteins
• Molecular weight
• Isoelectric point
• Molecular weight min/max
• psipred
• Run TMHMM
• Load TMHMM
• taxonomy
• SO
• NRDB
• Genome
• TIGR TGI
• Extract proteins
• Extract genomic sequence
• Copy proteins to cluster
• Copy genomic seqs to cluster
• Calculate translated protein (in: Pf, Pk)

Legend:
• Global steps (oval)
• Subflows (double line)
• Compile-time include/exclude


Page 5

NRDB

• NRDB resource
• Copy from download site
• Shorten defline
• Copy to cluster (appears twice in the diagram)

Page 6

Resources

• acquire
• unpack
• ext db
• ext db rls
• insert

Page 7

Psipred

• fix protein IDs for psipred
• create psipred task dir
• copy data dir to cluster
• copy psipred protein file to cluster
• start psipred on cluster
• wait for cluster
• copy psipred files from cluster
• fix psipred file names
• make Alg Inv
• load psipred
• create psipred data dir

Page 8

BLAST

• Create Similarity dir
• Start blast
• Wait for cluster
• Copy files from cluster
• Extract IDs from Blast result
• Load Subject subset
• Load Result

Legend:
• Optional step (runtime test)

Page 9

Splign

• run Splign
• Extract subject sequence, alt defline
• insert Splign
• Extract query sequence, alt defline

Page 10

Discussion

Page 11

Graph file -- features --

• Workflow xml file

• Subflows
  – Parameters
  – Constants
  – Interpolating variables

• Global steps
  – Steps that are only executed once by the whole workflow, even if in multiple subflows
  – Declare a namespace?

• Include/exclude
  – Compile time inclusion/exclusion
  – If not compiled in, flow passes right through

• Skip-able steps
  – Runtime exclusion, based on a dynamic test
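
To make the subflow features above concrete, here is a minimal Python sketch of how a subflow call might be expanded. It is not the workflow system's actual implementation: the `$$var$$` placeholder syntax, the `StepTemplate` fields, and the step names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepTemplate:
    name: str                         # may contain $$var$$ placeholders (hypothetical syntax)
    is_global: bool = False           # global step: run once for the whole workflow
    include_if: Optional[str] = None  # hypothetical compile-time include tag


def interpolate(text, params):
    """Replace each $$var$$ placeholder with the subflow's parameter value."""
    return re.sub(r"\$\$(\w+)\$\$", lambda m: params[m.group(1)], text)


def compile_subflow(templates, params, enabled_tags, seen_globals):
    """Expand one subflow call into a list of concrete step names."""
    steps = []
    for t in templates:
        if t.include_if is not None and t.include_if not in enabled_tags:
            continue                  # not compiled in: flow passes right through
        name = interpolate(t.name, params)
        if t.is_global:
            if name in seen_globals:
                continue              # already scheduled by another subflow
            seen_globals.add(name)
        steps.append(name)
    return steps


# Hypothetical templates for a "standard genome" subflow.
TEMPLATES = [
    StepTemplate("loadNrdb", is_global=True),
    StepTemplate("$$organism$$_extractProteins"),
    StepTemplate("$$organism$$_splign", include_if="splign"),
]
```

Calling `compile_subflow` once per organism with a shared `seen_globals` set schedules `loadNrdb` only for the first organism, and drops the Splign step for any organism whose tag set omits `splign`.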

Page 12

Graph file -- sharing across projects --

• Live in svn: ApiCommonData/Load/lib/xml/workflow

• Found by system in $GUS_HOME/lib/xml/workflow

• Shared across all projects
  – Use include/exclude to specify project-specific functionality
  – Therefore, each build must be on its own branch, to avoid interference

Page 13

Graph file -- step values --

• Avoid side effects in file system (ok in database)
  – All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles

• Avoid hard-coded values
  – Use Constants

• Avoid hand-coded values that change each build
  – Must be computed by step
  – Eg blast Y= value

• External Db Rls values
  – Always pass external db rls spec, eg
    • Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this

• Table names
  – Want to be able to reuse these values across steps
  – Always use same format, eg:
    • Dots.ExternalNaSequence
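
The two value formats above are easy to check mechanically. The sketch below shows one way a step could split them; the function names are hypothetical, and it assumes the release version is everything after the last colon and the schema is everything before the first dot, which the slide does not state.

```python
def parse_ext_db_rls_spec(spec):
    """Split 'name:version' on the LAST colon (assumption: versions contain no colon)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return name, version


def parse_table_name(qualified):
    """Split 'Schema.Table' on the first dot."""
    schema, sep, table = qualified.partition(".")
    if not sep or not schema or not table:
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return schema, table
```

Rejecting malformed values up front means a step fails when it reads its parameters rather than partway through a load.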

Page 14

Graph file -- cluster --

• Wait for cluster step
  – Sends email
  – (takes list of email addresses as config. Maybe we should set up mailing list?)

• Followed by a waitForHuman step
  – By default is in “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
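
One way to read "orthogonal" is that the human gate is a separate flag rather than another value of the step state. The sketch below models that reading only; the `Step` class, the `READY` state name, and the field names are assumptions, not the workflow system's schema.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    state: str = "READY"          # hypothetical state name
    offline: bool = False         # offline status, independent of state
    wait_for_human: bool = False  # human gate, independent of both

    def can_run(self):
        """A step runs only when all three independent conditions allow it."""
        return self.state == "READY" and not self.offline and not self.wait_for_human


# The gate starts closed by default for a waitForHuman step.
gate = Step("waitForHuman", wait_for_human=True)
```

Clearing `gate.wait_for_human` is the sketch's equivalent of the pilot turning the gate off; the state and offline flag are untouched by that action.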

Page 15

Graph file -- resources pipeline --

• We still use a resources.xml file
  – Needed by the front end
    • Pubmed
    • Descriptions
    • Data sources and attributions

• Handled by a regular subflow

• Only one unpack step
  – Current multiple unpack steps need to be combined into a simple script

• Dedicated step classes:
  – ApiCommonData::Load::Step::AcquireExternalResource
  – ApiCommonData::Load::Step::UnpackExternalResource
  – ApiCommonData::Load::Step::InsertExternalDatabase
  – ApiCommonData::Load::Step::InsertExternalDatabaseRelease
  – ApiCommonData::Load::Step::InsertExternalResource
    • Are subclasses of ApiCommonData::Load::Step::AcquireExternalStep
    • Knows how to parse the resources.xml file

Page 16

Configuration files

• Steps Configuration
  – Global
    • Commonly used properties
    • Not validated until runtime
  – Static
    • Defined per step class
    • Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file

• Distinguish between stable properties and mutable ones
  – Version numbers often change
    • Svn?
    • Pilot configuration?
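
If the "Cascading?" option were adopted, lookup could fall through from the most specific level to the most general. The following is a minimal sketch of that idea using Python's `collections.ChainMap`; the three levels, the function name, and the property names are hypothetical.

```python
from collections import ChainMap


def resolve_config(global_props, class_props, step_props):
    """Return a view in which step values override class values, which override global values."""
    return ChainMap(step_props, class_props, global_props)


# Hypothetical property names, for illustration only.
cfg = resolve_config(
    {"clusterServer": "example-cluster", "blastVersion": "1"},  # global
    {"blastVersion": "2"},                                      # per step class
    {},                                                         # per step
)
```

A missing property raises `KeyError` only when it is looked up, which mirrors the "not validated until runtime" caveat above; validating required keys at startup would need a separate check.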

Page 17

Runtime File & Directory Structure

• Avoid side-effects

• Use explicit input/output params in xml file

• Move to a nested data directory structure?

/files/cbil/data/cbil/Plasmodb/5.5/workflow/data/
  Seqfiles/
    nrdb.fsa
  Pvivax/
    Seqfiles/
    Psipred/
    Assembly/
      ESTs/
      Initial/
      Intermediate/

  – Would use the namespace attribute, somehow

• Use path statement, eg:
  – ../
  – ../tmhmm

• Steps directories
  – Use nested structure for subflows?
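
As an illustration of the path-statement idea, the sketch below resolves a relative path such as `../tmhmm` against a directory derived from a step's namespace. How the namespace would actually map to a directory is still an open question on the slide; the function name, the `/workflow/data` root, and the `Pvivax/Psipred` namespace are assumptions.

```python
import posixpath


def resolve_data_path(data_root, namespace, relative):
    """Join root, namespace directory, and relative path, then collapse any '..' segments."""
    return posixpath.normpath(posixpath.join(data_root, namespace, relative))


# Hypothetical root and namespace.
tmhmm_dir = resolve_data_path("/workflow/data", "Pvivax/Psipred", "../tmhmm")
parent_dir = resolve_data_path("/workflow/data", "Pvivax/Psipred", "../")
```

Here `tmhmm_dir` is a sibling of the Psipred directory and `parent_dir` is the organism directory, so a step can name its inputs without hard-coding the full data path.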

Page 18

External Files Repository

• Do we need it?

• If so, what needs to be improved?

Page 19

Documentation of the workflow

• Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by front end
    • Methods xml file
    • Resources xml file
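
A documentation mode can be pictured as the same traversal with execution switched off. The sketch below is only that picture: the `(name, description, action)` tuple shape and the plain-text output are assumptions, and it does not produce the methods or resources xml files the front end expects.

```python
def run_workflow(steps, documentation_mode=False):
    """steps: list of (name, description, action) tuples, a hypothetical shape."""
    output = []
    for name, description, action in steps:
        if documentation_mode:
            output.append(f"{name}: {description}")  # describe the step, do not execute it
        else:
            action()                                 # normal mode: run the step
            output.append(name)
    return output
```

Because both modes walk the same step list, the generated documentation cannot drift from what the workflow actually runs.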

Page 20

GUI

• Should it run in the web context?
  – Security issues
  – Avoids having to have installed software
  – Would work from home
  – All members of team could see the flow
  – Somehow restrict editability
  – Could be posted on real site as documentation?
    • Overkill? Too detailed?

• Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
    • Expand/collapse?
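
The subflow summary could be as simple as a count of step states, which a GUI would then map to colors. A minimal sketch, assuming a plain mapping from step name to state; the state names and step names are hypothetical.

```python
from collections import Counter


def summarize_subflow(step_states):
    """Count how many steps inside a subflow are in each state."""
    return dict(Counter(step_states.values()))


# Hypothetical step names and states for one collapsed subflow node.
summary = summarize_subflow({
    "createPsipredTaskDir": "DONE",
    "startPsipredOnCluster": "DONE",
    "waitForCluster": "RUNNING",
})
```

A collapsed node could draw one colored segment per key in `summary`, sized by its count.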

Page 21

Mini-flows

• like mini-pipes, but for workflows…

Page 22

Slides after this are notes, and other junk

Page 23

Standard resources

• taxonomy
• EnzymeDB
• SO
• NRDB
• dbEST [tax_id]
• GO
• GO Codes
• Bibliographic Ref terms
• MO terms
• MO types
• MO
• InterPro
• MO Entry
• Orthomcl phyletic
• orthomcl

Page 24

Plasmodb resources

• IEDB epitopes
• IEDB dbxrefs
• NA Genbank dbrefs
• AA Genbank dbrefs
• pdb
• Pdb index

Page 25

P. falciparum resources

• Zhang ESTs
• Apicoplast
• Florens 2002
• Pf plastid
• Florent ESTs
• Pf mitochon
• Watanabe Pf transcripts
• Watanabe Pf ESTs
• Pf GO Associations
• Sanger IT SNPs
• SU SNPs
• Broad SNPs
• Combined SNPs
• DeRisi Oligos
• Winzeler Genetic Var. array
• DeRisi Dd2
• DeRisi HB3
• Winzeler Cell Cycle
• DeRisi 3D7
• Scripps Array
• Winzeler Gametocyte
• DeRisi Array 7282
• MTC KI Array
• Pfab Array
• Plasmodb Gene ids
• Meta data: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• Array data: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• RAD anal: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• Waters Gametocyte Mass spec
• Waters Female Gametes mass
• Waters male Gametes mass
• Waters mixed Gametes mass
• interactome
• Mutual info
• Plasmo map
• y2h
• Sage tag Array design
• Sage tag freqs
• Pf chr Genbank refs
• TIGR gene indexes
• PASA Db refs
• Hagai EC
• Winzeler Db refs
• Winzeler Lit refs
• Predicted Protein structs
• mr4
• Cowman subcellular
• Haldar subcellular
• Merozoite peptides
• Lasonder oocysts
• Florens 2004
• Broad SNP coverage
• evigan
• Lasonder Oocysts sporozoites
• Entrez Dbrefs
• Pubmed dbrefs
• Broad bar code
• Broad 3k genotyping
• Lasonder salivary sporozoites

Page 26

P. vivax resources

• Watanabe Pv transcripts
• Pv contigs
• Watanabe Pv ESTs
• Pv dbrefs
• Pv GB dbrefs
• Pv mitochon
• Pv chromosomes
• TIGR gene indexes

Page 27

Synteny

[Diagram: node labels C. parvum, C. hominis, Plasmo, Toxo, Api; two start/end marker pairs]