Model of a real workflow: a subset of the PlasmoDB pipeline (in progress!), and issues to discuss…
PlasmoDB workflow
• P. falciparum: Standard genome
• P. gallinaceum: Standard genome
• P. vivax: Standard genome
• P. yoelii: Standard genome
• P. berghei: Standard genome
• P. chabaudi: Standard genome
• P. knowlesi: Standard genome
• P. reichenowi: Standard genome
• P. falciparum: Non-standard
• synteny
Standard Genome Workflow
• blastx NRDB genome
• Splign (In: Pf, Pb, Py, Pv)
• blastp NRDB proteins
• molecular weight
• Isoelectric point
• molecular weight min/max
• psipred
• run TMHMM
• Load TMHMM
• taxonomy, SO, NRDB
• Genome
• TIGR TGI
• Extract proteins
• Extract genomic sequence
• Copy proteins to cluster
• Copy genomic seqs to cluster
• Calculate translated protein (In: Pf, Pk)
Legend
• Global steps (oval)
• Subflows (double line)
• Compile time Include/Exclude
Psipred
• fix protein IDs for psipred
• create psipred task dir
• copy data dir to cluster
• copy psipred protein file to cluster
• start psipred on cluster
• wait for cluster
• copy psipred files from cluster
• fix psipred file names
• make Alg Inv
• load psipred
• create psipred data dir
BLAST
• Create similarity dir
• Start blast
• Wait for cluster
• Copy files from cluster
• extract IDs from Blast result
• Load subject subset
• Load result
Legend
• Optional step (runtime test)
Steps
• Subflows
  – Parameters
  – Constants
  – Interpolating variables
• Global steps
  – Steps that are only executed once by the whole workflow, even if in multiple subflows
  – Declare a namespace?
• Include/exclude
  – Compile time inclusion/exclusion
  – If not compiled in, flow passes right through
• Skip-able steps
  – Runtime exclusion, based on a dynamic test
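The three step behaviours listed above (global steps, compile-time include/exclude, runtime skipping) can be illustrated with a minimal sketch. This is not the workflow engine itself: the `Step` class, the `loadTaxonomy` step name, and the function names are hypothetical, chosen only to show how the semantics might fit together. The Splign inclusion set (Pf, Pb, Py, Pv) is taken from the Standard Genome Workflow slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    is_global: bool = False                       # executed once per workflow
    included: bool = True                         # compile-time include/exclude
    skip_if: Optional[Callable[[], bool]] = None  # runtime (dynamic) test


def compile_flow(steps):
    """Drop excluded steps, so the flow passes right through them."""
    return [s for s in steps if s.included]


def run_flow(steps, done_globals, log):
    """Run compiled steps in order, honoring global and skip-able semantics."""
    for step in steps:
        if step.is_global and step.name in done_globals:
            log.append(f"{step.name}: already done (global)")
            continue
        if step.skip_if is not None and step.skip_if():
            log.append(f"{step.name}: skipped at runtime")
            continue
        step.action()
        if step.is_global:
            done_globals.add(step.name)
        log.append(f"{step.name}: ran")


def make_subflow(organism, ran):
    """Hypothetical per-organism subflow sharing one global step."""
    return [
        Step("loadTaxonomy", lambda: ran.append("loadTaxonomy"), is_global=True),
        Step(f"{organism}.splign", lambda: ran.append(f"{organism}.splign"),
             included=organism in {"Pf", "Pb", "Py", "Pv"}),
        Step(f"{organism}.blast", lambda: ran.append(f"{organism}.blast"),
             skip_if=lambda: False),
    ]


ran, log, done_globals = [], [], set()
for organism in ("Pf", "Pk"):
    run_flow(compile_flow(make_subflow(organism, ran)), done_globals, log)
print(ran)
```

Here the global step runs for the first subflow only, and Splign is compiled out of the Pk subflow entirely rather than being tested at runtime.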
Step Values
• Avoid side effects in file system (ok in database)
  – All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  – Use Constants
• Avoid hand-coded values that change each build
  – Must be computed by step
  – Eg blast Y= value
• External Db Rls values
  – Always pass external db rls spec, eg
    • Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this
• Table names
  – Want to be able to reuse these values across steps
  – Always use same format, eg:
    • Dots.ExternalNaSequence
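As a rough illustration of why fixed formats help, the sketch below validates the two value formats shown in the examples above. It assumes that an external db rls spec is "name:version" split at the last colon, and that a table name is "Schema.Table"; both readings are inferred from the two examples only, and the function and type names are hypothetical.

```python
from typing import NamedTuple


class ExtDbRlsSpec(NamedTuple):
    name: str
    version: str


class TableName(NamedTuple):
    schema: str
    table: str


def parse_ext_db_rls_spec(spec: str) -> ExtDbRlsSpec:
    """Split 'name:version' at the last colon (assumed format)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRlsSpec(name.strip(), version.strip())


def parse_table_name(value: str) -> TableName:
    """Split 'Schema.Table' into its two parts (assumed format)."""
    parts = value.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Schema.Table', got {value!r}")
    return TableName(parts[0], parts[1])


spec = parse_ext_db_rls_spec("Plasmodium Falciparum Chromosomes:2008-07-13")
table = parse_table_name("Dots.ExternalNaSequence")
print(spec.name, "|", spec.version)
print(table.schema, "|", table.table)
```

A step that receives a malformed value fails early with a clear message instead of passing it on to a later step.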
Cluster
• Wait for cluster step
  – Sends email (takes a list of email addresses as config; maybe we should set up a mailing list?)
• Followed by a waitForHuman step
  – By default it is in the “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
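One way to picture the cluster hand-off is sketched below: a notification goes out to the configured addresses, and the following step carries a wait-for-human flag that is independent of its ordinary state and of its offline status. The class, the "READY" state name, the `send` callback, and the example address are all hypothetical; no real mail is sent.

```python
from dataclasses import dataclass
from typing import Callable, List


def notify_cluster_done(step_name: str, recipients: List[str],
                        send: Callable[[str, str], None]) -> None:
    """Tell every configured address that the cluster work has finished."""
    message = f"Cluster work for step '{step_name}' has finished."
    for address in recipients:
        send(address, message)


@dataclass
class WaitForHumanStep:
    name: str
    state: str = "READY"         # ordinary step state (hypothetical name)
    offline: bool = False        # offline status
    wait_for_human: bool = True  # on by default, orthogonal to the two above

    def release(self) -> None:
        """Called by the pilot once the cluster output has been checked."""
        self.wait_for_human = False

    def can_run(self) -> bool:
        return (self.state == "READY" and not self.offline
                and not self.wait_for_human)


sent = []
notify_cluster_done("psipred", ["pilot@example.org"],
                    lambda to, msg: sent.append((to, msg)))
gate = WaitForHumanStep("checkPsipredOutput")
blocked = gate.can_run()   # still waiting for a human
gate.release()
released = gate.can_run()  # the pilot turned the flag off
print(blocked, released)
```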
Configuration
• Steps Configuration
  – Global
    • Commonly used properties
    • Not validated until runtime
  – Static
    • Defined per step class
    • Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file
• Distinguish between stable properties and mutable ones
  – Version numbers often change
    • Svn
• Pilot configuration?
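The "Cascading?" option is left open above. One possible reading, sketched here as an assumption rather than a decision, is a lookup that tries step-specific properties first, then the static properties of the step class, then the global ones. The property names and values are made up for illustration; `collections.ChainMap` is from the Python standard library.

```python
from collections import ChainMap

# Hypothetical property sets; none of these names come from the real pipeline.
GLOBAL_PROPS = {"clusterServer": "cluster.example.org", "blastVersion": "2.0"}
STATIC_PROPS = {"RunBlast": {"blastVersion": "2.2", "nodes": "10"}}


def step_config(step_class, step_props=None):
    """Cascade: step-specific, then per-class static, then global."""
    return ChainMap(step_props or {}, STATIC_PROPS.get(step_class, {}), GLOBAL_PROPS)


def require(config, key):
    """Properties are only checked when a step asks for them, i.e. at runtime."""
    if key not in config:
        raise KeyError(f"missing property: {key}")
    return config[key]


cfg = step_config("RunBlast", {"nodes": "20"})
print(require(cfg, "nodes"), require(cfg, "blastVersion"), require(cfg, "clusterServer"))
```

Keeping mutable values such as version numbers in the innermost layer would leave the stable global file untouched between builds.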
File & Directory Structure
• Avoid side-effects
• Use explicit input/output params in xml file
• Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/Seqfiles/
  nrdb.fsa
  Pvivax/
  Seqfiles/ Psipred/ Assembly/
  ESTs/ Initial/ Intermediate/
  – Would use the namespace attribute, somehow
• Use path statement, eg:
  – ../
  – ../tmhmm
• Steps directories
  – Use nested structure for subflows?
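A path statement such as `../tmhmm` could be resolved against the directory of the current step, as in the sketch below. The data root is the one shown above; the `Pvivax/Psipred` nesting and the function name are assumptions made for the example, and the check that the result stays under the data root is an added safeguard, not something the slides specify.

```python
import posixpath

DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"


def resolve(step_dir: str, path_statement: str) -> str:
    """Resolve a relative path statement against a step's data directory."""
    base = posixpath.join(DATA_ROOT, step_dir)
    resolved = posixpath.normpath(posixpath.join(base, path_statement))
    # Refuse paths that climb out of the workflow data directory.
    if resolved != DATA_ROOT and not resolved.startswith(DATA_ROOT + "/"):
        raise ValueError(f"{path_statement!r} escapes the data directory")
    return resolved


sibling = resolve("Pvivax/Psipred", "../tmhmm")
parent = resolve("Pvivax/Psipred", "../")
print(sibling)
print(parent)
```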
GUI
• Should it run in the web context?
  – Security issues
  – Avoids having to have installed software
  – Would work from home
  – All members of team could see the flow
  – Somehow restrict editability
  – Could be posted on real site as documentation?
• Overkill? Too detailed?
• Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
    • Expand/collapse?
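The summary a collapsed subflow node would display might be computed roughly as follows. The state names ("DONE", "RUNNING", "READY", "OFFLINE") are hypothetical placeholders; the sketch only shows counting step states per subflow and separating out the offline ones that a GUI would gray out.

```python
from collections import Counter


def summarize_subflow(step_states):
    """Count steps per state; offline steps are reported separately for graying out."""
    active = Counter(s for s in step_states.values() if s != "OFFLINE")
    offline = sorted(name for name, s in step_states.items() if s == "OFFLINE")
    return {"counts": dict(active), "offline": offline}


summary = summarize_subflow({
    "createSimilarityDir": "DONE",
    "startBlast": "DONE",
    "waitForCluster": "RUNNING",
    "loadResult": "READY",
    "loadSubjectSubset": "OFFLINE",
})
print(summary)
```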
Resource Pipeline
• Not worked out yet
• Needs to be handled by regular subflow
• Unpacks will need to be collapsed into a single unpack script
• Resources.xml file as needed by front end can be produced by a documentation run of the pipeline
• Does it need to be configured in xml, or would a properties file be good enough?
Documentation of the workflow
• Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by front end
    • Methods xml file
    • Resources xml file
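A documentation run could walk the same step list as a normal run but emit descriptions instead of executing anything. The sketch below assumes each step carries a text description; the `<methods>`/`<method>` element names are invented for illustration, since the format the front end expects is not given here.

```python
import xml.etree.ElementTree as ET


def run(steps, documentation=False):
    """Execute the steps, or in documentation mode describe them without running."""
    if not documentation:
        for _name, _description, action in steps:
            action()
        return None
    root = ET.Element("methods")  # hypothetical element names
    for name, description, _action in steps:
        method = ET.SubElement(root, "method", name=name)
        method.text = description
    return ET.tostring(root, encoding="unicode")


executed = []
steps = [
    ("runTMHMM", "Predict transmembrane domains with TMHMM.",
     lambda: executed.append("runTMHMM")),
    ("loadTMHMM", "Load TMHMM predictions into the database.",
     lambda: executed.append("loadTMHMM")),
]
doc = run(steps, documentation=True)
print(doc)
print(executed)
```

Because both modes share one step list, the generated documentation cannot drift away from what the workflow actually contains.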
Standard resources
taxonomy EnzymeDB SO NRDB dbEST [tax_id]
GO GO Codes
Bibliographic Ref terms MO terms MO types MO InterPro MO Entry
Orthomcl phyletic
orthomcl
P. falciparum resources
• Zhang ESTs
• Apicoplast
• Florens 2002
• Pf plastid
• Florent ESTs
• Pf mitochon
• Watanabe Pf transcripts
• Watanabe Pf ESTs
• Pf GO Associations
• Sanger IT SNPs
• SU SNPs
• Broad SNPs
• Combined SNPs
• DeRisi Oligos
• Winzeler Genetic Var. array
• DeRisi Dd2
• DeRisi HB3
• Winzeler Cell Cycle
• DeRisi 3D7
• Scripps Array
• Winzeler Gametocyte
• DeRisi Array 7282
• MTC KI Array
• Pfab Array
• Plasmodb Gene ids
• Meta data: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• Array data: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• RAD anal: Baum, Durasingh, GSE5247, Cowman, E-MEXP 449, E-MEXP 439, E-MEXP 128, Waters, Daily, GSE2265, GSE8099
• Waters Gametocyte Mass spec
• Waters Female Gametes mass
• Waters male Gametes mass
• Waters mixed Gametes mass
• interactome
• Mutual info
• Plasmo map
• y2h
• Sage tag Array design
• Sage tag freqs
• Pf chr Genbank refs
• TIGR gene indexes
• PASA Db refs
• Hagai EC
• Winzeler Db refs
• Winzeler Lit refs
• Predicted Protein structs
• mr4
• Cowman subcellular
• Haldar subcellular
• Merozoite peptides
• Lasonder oocysts
• Florens 2004
• Broad SNP coverage
• evigan
• Lasonder Oocysts sporozoites
• Entrez Dbrefs
• Pubmed dbrefs
• Broad bar code
• Broad 3k genotyping
• Lasonder salivary sporozoites
P. vivax resources
• Watanabe Pv transcripts
• Pv contigs
• Watanabe Pv ESTs
• Pv dbrefs
• Pv GB dbrefs
• Pv mitochon
• Pv chromosomes
• TIGR gene indexes