Model of a real workflow
And issues to discuss…
PlasmoDB workflow
P. falciparum standard genome
P. gallinaceum standard genome
P. vivax standard genome
P. yoelii standard genome
P. berghei standard genome
P. chabaudi standard genome
P. knowlesi standard genome
P. reichenowi standard genome
P. falciparum non-standard
synteny
Standard Genome Workflow
blastx NRDB genome
Splign (In: Pf, Pb, Py, Pv)
blastp NRDB proteins
molecular weight
isoelectric point
molecular weight min/max
psipred
run TMHMM
Load TMHMM
taxonomy
SO
NRDB
Genome
TIGR TGI
Extract proteins
Extract genomic sequence
Copy proteins to cluster
Copy genomic seqs to cluster
Global steps (oval)
Subflows (double line)
Compile-time include/exclude
Calculate translated protein (In: Pf, Pk)
NRDB
Copy from download site
Shorten defline
NRDB resource
Copy to cluster
Copy to cluster
Resources
acquire
unpack
ext db
Ext db rls
insert
Psipred
fix protein IDs for psipred
create psipred task dir
copy data dir to cluster
copy psipred protein file to cluster
start psipred on cluster
wait for cluster
copy psipred files from cluster
fix psipred file names
make Alg Inv
load psipred
create psipred data dir
BLAST
Create similarity dir
Start blast
Wait for cluster
Copy files from cluster
Extract IDs from blast result
Load subject subset
Load result
Optional step (runtime test)
Splign
run Splign
Extract subject sequence, alt defline
insert Splign
Extract query sequence, alt defline
Discussion
Graph file -- features --
• Workflow xml file
• Subflows
  – Parameters
  – Constants
  – Interpolating variables
• Global steps
  – Steps that are only executed once by the whole workflow, even if in multiple subflows
  – Declare a namespace?
• Include/exclude
  – Compile time inclusion/exclusion
  – If not compiled in, flow passes right through
• Skip-able steps
  – Runtime exclusion, based on a dynamic test
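The three kinds of step behavior listed above can be sketched in a few lines. This is a Python illustration only, not the workflow's actual implementation (the deck describes Perl step classes driven by an xml graph file). The `Step` fields `is_global`, `include_if` and `skip_if`, and the step and flag names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    is_global: bool = False                       # executed once for the whole workflow
    include_if: Optional[str] = None              # compile-time flag (hypothetical attribute)
    skip_if: Optional[Callable[[], bool]] = None  # runtime test (hypothetical attribute)


def compile_flow(steps, flags):
    """Compile-time include/exclude: excluded steps vanish, flow passes through."""
    return [s for s in steps if s.include_if is None or s.include_if in flags]


def run_flows(subflows, flags):
    """Walk every subflow in order and return the names of the steps executed."""
    executed = []
    done_globals = set()
    for flow_name, steps in subflows.items():
        for step in compile_flow(steps, flags):
            if step.is_global:
                # A global step runs once, even if several subflows reference it.
                if step.name not in done_globals:
                    done_globals.add(step.name)
                    executed.append(step.name)
                continue
            if step.skip_if is not None and step.skip_if():
                continue  # runtime exclusion based on a dynamic test
            executed.append(f"{flow_name}.{step.name}")
    return executed


nrdb = Step("nrdb", is_global=True)
subflows = {
    "Pfalciparum": [nrdb, Step("blastp"), Step("splign", include_if="splign")],
    "Pvivax": [nrdb, Step("blastp"), Step("loadSubjectSubset", skip_if=lambda: True)],
}
result = run_flows(subflows, flags={"splign"})
```

With the `splign` flag set, `nrdb` appears once in `result`, `splign` runs only in the subflow that declares it, and `loadSubjectSubset` is skipped at runtime.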
Graph file -- sharing across projects --
• Live in svn: ApiCommonData/Load/lib/xml/workflow
• Found by system in $GUS_HOME/lib/xml/workflow
• Shared across all projects
  – Use include/exclude to specify project specific functionality
  – Therefore, each build must be on its own branch, to avoid interference
Graph file -- step values --
• Avoid side effects in file system (ok in database)
  – All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  – Use Constants
• Avoid hand-coded values that change each build
  – Must be computed by step
  – E.g. blast Y= value
• External Db Rls values
  – Always pass external db rls spec, e.g.
    • Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this
• Table names
  – Want to be able to reuse these values across steps
  – Always use same format, e.g.:
    • Dots.ExternalNaSequence
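A fixed format for these two values makes them easy to validate once and reuse across steps. The sketch below is Python and purely illustrative; it assumes the external db rls spec is `name:version` (as in the example above) and that table names are `Schema.Table`. The function and class names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ExtDbRlsSpec(NamedTuple):
    name: str
    version: str


def parse_ext_db_rls_spec(spec: str) -> ExtDbRlsSpec:
    """Split 'name:version' on the last colon, so names may contain spaces."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRlsSpec(name, version)


def parse_table_name(qualified: str) -> Tuple[str, str]:
    """Split 'Schema.Table' and reject anything that is not exactly two parts."""
    parts = qualified.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return parts[0], parts[1]


spec = parse_ext_db_rls_spec("Plasmodium Falciparum Chromosomes:2008-07-13")
schema, table = parse_table_name("Dots.ExternalNaSequence")
```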
Graph file -- cluster --
• Wait for cluster step
  – Sends email
  – (takes list of email addresses as config. Maybe we should set up mailing list?)
• Followed by a waitForHuman step.
  – By default is in “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
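One way to read "orthogonal to other states and offline status" is that the wait-for-human marker is a separate flag rather than another value of the step state. The Python sketch below assumes that reading. The state names `READY` and `DONE`, the `outbox` list standing in for email, and the address are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStep:
    name: str
    state: str = "READY"          # hypothetical state name
    offline: bool = False
    wait_for_human: bool = False  # independent of state and offline

    def can_run(self) -> bool:
        return self.state == "READY" and not self.offline and not self.wait_for_human


def finish_cluster_wait(wait_step, human_step, recipients, outbox):
    """Mark the cluster wait as done and queue one notification per recipient."""
    wait_step.state = "DONE"
    for address in recipients:
        outbox.append((address, f"{wait_step.name} finished; {human_step.name} needs review"))


def release(step):
    """What the pilot would do: clear the flag so the step can run."""
    step.wait_for_human = False


wait = WorkflowStep("waitForCluster")
human = WorkflowStep("waitForHuman", wait_for_human=True)  # on by default for this step
outbox = []
finish_cluster_wait(wait, human, ["pilot@example.org"], outbox)
blocked = human.can_run()
release(human)
released = human.can_run()
```

Keeping the flag separate means taking a step offline and releasing it for a human are independent operations, which matches the bullet above.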
Graph file -- resources pipeline --
• We still use a resources.xml file
  – Needed by the front end
    • Pubmed
    • Descriptions
    • Data sources and attributions
• Handled by a regular subflow
• Only one unpack step
  – Current multiple unpack steps need to be combined into a simple script
• Dedicated step classes:
  – ApiCommonData::Load::Step::AcquireExternalResource
  – ApiCommonData::Load::Step::UnpackExternalResource
  – ApiCommonData::Load::Step::InsertExternalDatabase
  – ApiCommonData::Load::Step::InsertExternalDatabaseRelease
  – ApiCommonData::Load::Step::InsertExternalResource
    • Are subclasses of ApiCommonData::Load::Step::AcquireExternalStep
    • Knows how to parse the resources.xml file
Configuration files
• Steps Configuration
  – Global
    • Commonly used properties
    • Not validated until runtime
  – Static
    • Defined per step class
    • Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file
• Distinguish between stable properties and mutable ones
  – Version numbers often change
• Svn?
• Pilot configuration?
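If cascading configuration were adopted, one possible lookup order is step-specific values first, then per-class static defaults, then global properties. The Python sketch below illustrates that order with `collections.ChainMap`; it is an assumption about how cascading might work, not a description of the existing system, and every property name and value is hypothetical.

```python
from collections import ChainMap

# Hypothetical property sets; none of these names come from the real config files.
global_props = {"clusterServer": "cluster.example.org"}
class_defaults = {"psipred": {"nThreads": "1"}}
step_props = {"Pvivax.psipred": {"nThreads": "4"}}


def resolve(step_name, step_class):
    """Return a view where step values override class defaults, which override globals."""
    return ChainMap(
        step_props.get(step_name, {}),
        class_defaults.get(step_class, {}),
        global_props,
    )


def require(config, key):
    """Runtime validation: fail clearly when a property is missing at every level."""
    if key not in config:
        raise KeyError(f"missing required property {key!r}")
    return config[key]


pv = resolve("Pvivax.psipred", "psipred")
pf = resolve("Pfalciparum.psipred", "psipred")
```

Here `pv["nThreads"]` comes from the step level and `pf["nThreads"]` falls back to the class default, while both see the same global `clusterServer`.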
Runtime File & Directory Structure
• Avoid side-effects
• Use explicit input/output params in xml file
• Move to a nested data directory structure?
/files/cbil/data/cbil/Plasmodb/5.5/workflow/data/Seqfiles/
nrdb.fsa
Pvivax/
Seqfiles/ Psipred/ Assembly/
ESTs/ Initial/ Intermediate/
  – Would use the namespace attribute, somehow
• Use path statement, e.g.:
  – ../
  – ../tmhmm
• Steps directories
  – Use nested structure for subflows?
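Relative path statements such as `../tmhmm` could be resolved against a per-subflow directory derived from a namespace. The Python sketch below shows one way this might work. It assumes the data root is the `.../workflow/data` directory from the listing above and that a namespace maps directly to a subdirectory; the function name and the namespaces used are hypothetical.

```python
import posixpath

# Data root taken from the listing above; treating it as the root is an assumption.
DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"


def resolve_path(namespace: str, relative: str) -> str:
    """Resolve a path statement relative to the directory of a (hypothetical) namespace."""
    base = posixpath.join(DATA_ROOT, namespace)
    path = posixpath.normpath(posixpath.join(base, relative))
    # Refuse statements that climb out of the data root.
    if path != DATA_ROOT and not path.startswith(DATA_ROOT + "/"):
        raise ValueError(f"{relative!r} escapes the data root")
    return path


tmhmm_dir = resolve_path("Pvivax/Psipred", "../tmhmm")
shared_nrdb = resolve_path("Pvivax", "../Seqfiles/nrdb.fsa")
```

`posixpath` is used instead of `os.path` so the sketch behaves the same on any platform.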
External Files Repository
• Do we need it?
• If so, what needs to be improved?
Documentation of the workflow
• Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by front end
• Methods xml file
• Resources xml file
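A documentation mode can be pictured as the same traversal of the steps with the action replaced by a description. The Python sketch below is a minimal illustration under that assumption; the `Step` fields, the step names and the output format are hypothetical, and the real mode would emit whatever the methods and resources xml files require.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    description: str
    action: Callable[[], None]


def run_workflow(steps: List[Step], documentation_mode: bool = False) -> List[str]:
    """Run every step, or in documentation mode only collect their descriptions."""
    docs = []
    for step in steps:
        if documentation_mode:
            docs.append(f"{step.name}: {step.description}")  # nothing is executed
        else:
            step.action()
    return docs


calls = []
steps = [
    Step("runTMHMM", "Predict transmembrane domains", lambda: calls.append("runTMHMM")),
    Step("loadTMHMM", "Load TMHMM results", lambda: calls.append("loadTMHMM")),
]
docs = run_workflow(steps, documentation_mode=True)
calls_after_docs = list(calls)
run_workflow(steps)
```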
GUI
• Should it run in the web context?
  – Security issues
  – Avoids having to have installed software
  – Would work from home
  – All members of team could see the flow
  – Somehow restrict editability
  – Could be posted on real site as documentation?
• Overkill? Too detailed?
• Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
• Gray out paths that are offline
• Expand/collapse?
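The summary a collapsed subflow node would need is essentially a count of its steps per state, with offline steps reported separately so they can be grayed out. A minimal Python sketch, assuming each step is a `(name, state, offline)` tuple; the state names and the `OFFLINE` bucket are hypothetical.

```python
from collections import Counter


def summarize_subflow(steps):
    """Count steps per state; offline steps go into their own bucket."""
    counts = Counter()
    for _name, state, offline in steps:
        counts["OFFLINE" if offline else state] += 1
    return dict(counts)


psipred_subflow = [
    ("fixProteinIds", "DONE", False),
    ("startPsipred", "RUNNING", False),
    ("loadPsipred", "READY", False),
    ("makeAlgInv", "READY", True),
]
summary = summarize_subflow(psipred_subflow)
```

A GUI could map each key of `summary` to a color segment on the subflow node.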
Mini-flows
• like mini-pipes, but for workflows…
Slides after this are notes, and other junk
Standard resources
taxonomy
EnzymeDB
SO
NRDB
dbEST [tax_id]
GO
GO Codes
BibliographicRef terms
MO terms
MO types
MO
InterPro
MO Entry
OrthoMCL phyletic
orthomcl
Plasmodb resources
IEDB epitopes
IEDB dbxrefs
NA Genbank dbrefs
AA Genbank dbrefs
pdb
Pdb index
P.falciparum resources
Zhang ESTs
Apicoplast
Florens 2002
Pf plastid
Florent ESTs
Pf mitochon
Watanabe Pf transcripts
Watanabe Pf ESTs
Pf GO Associations
Sanger IT SNPs
SU SNPs
Broad SNPs
Combined SNPs
DeRisi Oligos
Winzeler Genetic Var. array
DeRisi Dd2
DeRisi HB3
Winzeler Cell Cycle
DeRisi 3D7
Scripps Array
Winzeler Gametocyte
DeRisi Array 7282
MTC KI Array
Baum Meta data
Durasingh Meta data
GSE5247 Meta data
Cowman Meta data
Pfab Array
E-MEXP 449 Meta data
E-MEXP 439 Meta data
Plasmodb Gene ids
E-MEXP 128 Meta data
Waters Meta data
Waters Gametocyte Mass spec
Daily Meta data
GSE2265 Meta data
GSE8099 Meta data
interactome
Waters Female Gametes mass
Mutual info
Plasmo map
y2h
Sage tag Array design
Sage tag freqs
Pf chr Genbank refs
TIGR gene indexes
Baum Array data
Durasingh Array data
GSE5247 Array data
Cowman Array data
E-MEXP 449 Array data
E-MEXP 439 Array data
E-MEXP 128 Array data
Waters Array data
Daily Array data
GSE2265 Array data
GSE8099 Array data
Baum RAD anal
Durasingh RAD anal
GSE5247 RAD anal
Cowman RAD anal
E-MEXP 449 RAD anal
E-MEXP 439 RAD anal
E-MEXP 128 RAD anal
Waters RAD anal
Daily RAD anal
GSE2265 RAD anal
GSE8099 RAD anal
Waters male Gametes mass
Waters mixed Gametes mass
PASA Db refs
Hagai EC
Winzeler Db refs
Winzeler Lit refs
Predicted Protein structs
mr4
Cowman subcellular
Haldar subcellular
Merozoite peptides
Lasonder oocysts
Florens 2004
Broad SNP coverage
evigan
Lasonder Oocysts sporozoites
Entrez Dbrefs
Pubmed dbrefs
Broad bar code
Broad 3k genotyping
Lasonder salivary sporozoites
P. vivax resources
Watanabe Pv transcripts
Pv contigs
Watanabe Pv ESTs
Pv dbrefs
Pv GB dbrefs
Pv mitochon
Pv chromosomes
TIGR gene indexes
C. parvum
C. hominis
Synteny
start
End
Plasmo Toxo
Api
End
Start