The Taverna Workflow Management Software Suite: Past, Present, Future. Prof Carole Goble CBE FREng FBCS CITP, The University of Manchester, UK; Software Sustainability Institute UK. http://www.taverna.org.uk http://www.mygrid.org.uk

DESCRIPTION

Carole Goble at VPH meeting in Sheffield

  • 1. The Taverna Workflow Management Software Suite: Past, Present, Future. Prof Carole Goble CBE FREng FBCS CITP, The University of Manchester, UK; Software Sustainability Institute UK. http://www.taverna.org.uk http://www.mygrid.org.uk

2. More of what we generally do! Prof Carole Goble CBE FREng FBCS CITP, The University of Manchester, UK; Software Sustainability Institute UK. http://www.taverna.org.uk http://www.mygrid.org.uk
3. e-Science, Computational Science, Scientific Computing. Support global scientific collaboration, enable large scale resource, tools and results sharing, assist scientific processing, avoid unnecessary repeated work. Accelerate scientific discovery, improve scientific productivity, stimulate technological innovation. Cope with scales and speed of scientific innovation and data.
4. Data-centric Computation. Scientific workflows over Distributed Cyber-Infrastructure. Data sharing. Social. Methods libraries and catalogues for all types of scientific artefacts and all types of scientists. Knowledge Management. Metadata, semantics, digital exchange, preservation, publishing. Software Engineering. Software sustainability, software and data policy, training. Products. Methods. Applications: Systems Biology, Chemistry, Astro-Physics, Astronomy, Biology, Social Science, Library, Digital Preservation, Biodiversity, Public Health.
5. Computer Science, Software Engineering, Scientific Informatics, Computational Science. THEORY, PRACTICE, APPLICATION. Fundamental, applied. PRODUCT (Open Source), PRINCIPLE, Science USE CASE.
6. Long Tail. Little science: Self-organising groups. Disconnected, independent, distributed scientists. Disconnected, independent, distributed resources. Open in the wild. Organised science: Organised groups. Clubs of scientists. Organised, planned and in-house resources. Closed and well behaved services.
7.
VPH-Share, Models of Human Physiology, Eagle Genomics, Next Generation Sequencing based Patient Diagnostics, Astronomy & HelioPhysics, Document Preservation, Digitisation, Systems Biology, OpenTox Project, Chemistry Development Kit, Drug Toxicity, Ecological Niche Modelling, Population Modelling, Metagenomics, Phylogenetics. Data cleaning. Data movement. Data retrieval and annotation. Data analysis. Data mining, knowledge management. Data curation and data warehouse population. Data visualisation. Parameter sweeps over simulations. Drug discovery, small molecules, targets, compounds. OpenPHACTS.
8. BioSTIF. Inputs: data, parameters, configurations. Outputs. Workflow in a nutshell: Orchestrate series of automated / interactive steps. Process pipelines. Analytic and synthesis procedures. Repetitive code-run sweeps. Housekeeping tasks. Process data at scale. Auto documentation. Mix in-house & public resources, native hosting. Chain and choreograph components. Handle interoperability. Bridge resources. Shield operational complexity and change. Services & Resources. Infrastructures.
9. Taverna Workflow Management. http://www.taverna.org.uk Dataflow: Computational Lambda Calculus with a monad extension*. Simple control flows, iterations over collections. Data type agnostic, domain independent. Data movement, monitoring, staging, reference. Custom (VO Tables), XML, JSON. Mixed steps: Services, codes & command line tools. SOAP + REST Web Services. Scripts: R. In-workflow programming: Beanshell scripting. Codes: Java, libraries, HPC, Grid and ~Cloud platforms etc. Nested workflows. Interactions and Batch.
*Turi et al. Taverna Workflows: Syntax and Semantics. e-Science 2007: 441-448; Sroka et al. A formal semantics for the Taverna 2 workflow model. J. Comput. Syst. Sci. 76(6): 490-508 (2010)
10.
Computational Lambda Calculus. Visual Programming. Process mining. Adaptive & parallel computing. Cloud computing. SOA, Semantic Web Services. Data integration, data quality. Semantic representation and linked data. Reporting & tracking, credit propagation. Workflow reusability, quality, discovery. Security, monitoring, fault detection. AI planning, re-run analysis, auto-planning, auto-repair, auto-composition, auto-annotation, service discovery, service matching, auto-substitution. E.Science laboris. Tools. Standards. Services.
11. Weeks -> Hours. Surprise predicted result tested in lab. DAXX Gene. Genetic differences between breeds. Noyes, PNAS 2011 108(22) 9304-9309. BioDiversity Invasive Species Modelling. American Horseshoe Crabs in the Baltic. Trypanosomiasis resistance in African Cattle. Software as a Service / (Cloud) Appliance. Analytic bottleneck. Repetitive, unbiased, accurate record, taming data, transparency, avoiding shortcuts. Interactive steps. Dev.: Years -> Weeks. Runs: Weeks -> Hours. Generalised ENM data mapping and overlaying pipelines. Workflow-based Computation.
12.
VPH-Share @neurIST Aneurysm Morphology Workflow (#SummerSchool, 24-Jun-13). [Figure: Patients, Patient Avatar, Disease Simulation Workflow, Systemic Factors, Gene Expression Profile, Patient Avatar updated, RISK. Patient data fields: Patient Pseudoidentifier (PID), Demographics, Height, Weight, Vital Signs, Heart Rate, Blood Pressure, Flow Rate, Transient Pressure, Aneurysm Properties, Tissue Properties, Wall Thickness, Risk Factors, Medical Images, Medications. Output profiles: Aneurysm Rupture Profile, Morphology Profile, Haemodynamic Profile, Mechanobiological Profile, Prediction Uncertainty.] [Susheel Varma] http://www.vph-share.eu/
13. Morphological, hemodynamic and structural analyses have been linked to aneurysm genesis, growth and rupture. Evidence indicating differences in morphology and flow between ruptured and unruptured aneurysms has been shown for reduced patient cohorts. Structural wall mechanics has been used to justify the growth and remodelling happening at the aneurysm level. [Figure labels: Confidence in physical measures; + images; + BC, material; + BC, material; Morphological analysis; Direct diagnostic power; + Morphological descriptors; Structural descriptors; Hemodynamic descriptors; Haemodynamic analysis; Structural analysis.] Practically, morphological characterizations might currently have the highest predictive capabilities with respect to the other analyses.
Morphological Workflow [Susheel Varma]
14. Medical image from imaging equipment. @neurIST morphological descriptors: Complex indices (Zernike moment invariants). Basic size indices describing aneurysm sac: depth, neck. Morphological Analysis Workflow [Susheel Varma]
15. Implementation in VPH-Share. The @neurIST morphological workflow specification in Taverna [Susheel Varma]
16. Biodiversity: marine monitoring and health assessment, ecological niche modelling. Data Intensive Science. Collaborative Science. Pilumnus hirtellus. Enclosed sea problem (Ready et al., 2010). Sarah Bourlat
17. Ecological Niche Modeling.
Step 1: Explorative modeling
-Use unfiltered data
-Use fixed parameters: Mahalanobis distance
-Native projections
-Test the model, distribution of points, number of points
Step 2: Deep modeling
-Filtering environmentally unique points with BioClim algorithm
-ENM with Support Vector Machine and Maximum Entropy
-Parameter optimization (if necessary) on the model test results
-2 masks (model generate, model project)
Analytical cycle: Data discovery. Data assembly, cleaning, and refinement. Ecological Niche Modeling. Statistical analysis.
Pilumnus hirtellus. Enclosed sea problem (Ready et al., 2010).
The workflows work over large geographical, taxonomic, and environmental scales, incl. terrestrial ecosystems. Baltic species invasions of various crabs/sea creatures. Interactions of different forest insects and trees.
18. Ecological Niche Modeling. BioSTIF
19. www.biovel.eu Ecological Niche Modeling Workflow (ENM)
20. Data and Parameter Sweeps: data, configuration, parameters, steps.
21. Hosted installation. Local installations.
22. Taverna: a Knowledge Discovery Framework.
Asthma sputum inflammatory phenotypes, a transcriptome analysis, Saeedeh Maleki-Dizaji, Chris Newby, Rachid Berair, Rod Smallwood, Chris Brightling 2014 (to be submitted)
A systematic approach to a transcriptome analysis to asthma sputum inflammatory phenotypes, ISMB 2014.
The Battle of the Sexes starts in the oviduct: modulation of oviductal transcriptome by X and Y-bearing spermatozoa: Almiñana C, Caballero I, Heath PR, Maleki-Dizaji S, Parrilla I, Cuello C, Gil MA, Vazquez JL, Vazquez JM, Roca J, Martinez EA, Holt WV and Fazeli A. Submitted to BMC Genomics 2014 (In Press)
transcription regulation network involving E2F6, IRF7 and STAT1, Thomas R.J. Lovewell, Andrew J.G. McDonagh, Andrew G Messenger, Saeedeh Maleki-Dizaji, Mimoun Azzouz and Rachid Tazi-Ahnini formation submitted to PNAS, 2014
Kiran, M., Bicak, M., Maleki-Dizaji, S., Holcombe, M.
FLAME: A Platform for High Performance Computing of Complex Systems. Journal of Acta Physica Polonica 2011.
Maleki-Dizaji S, Holcombe M, Rolfe MD, Fisher P, Green J, Poole RK, Graham AI, A Systematic Approach to Understanding Escherichia coli Responses to Oxygen: From Microarray Raw Data to Pathways and Published Abstracts, Online J Bioinformatics, (1):51-59, 2009 [Saeedeh Maleki-Dizaji]
23. [Architecture diagram labels: Application, Runtime, Middleware, Resources/Codes/Services, Infrastructures, Repositories, Execution, Activity Plug-ins, Application, Scufl, Runtime, Middleware, Resources/Codes/Services, Platforms, Repositories, Taverna Desktop Workbench, Taverna Online Web Tool, Portals and Applications, Engine, Server, Player, Cmd line, Provenance, Third Party Servers, BioSTIF, Workflows & workflow components, PROV, OPM, Data, Provenance, Registries.]
24. Taverna Workflow Management. Open extensibility. Plug-in framework. Command line tool. Data Services: VOTables for AstroTaverna. Optimisations: e.g. Holl. model parameter sweeps. Infrastructures: Grid, HPC, Web Services. Domains: CDK, BioMart, VOTable. Commodities: Excel Spreadsheets, Open Refine, R. Plug into other frameworks & platforms. Portals: Scratchpads. Interactive platforms: iPython Notebook. Wfms: KNIME Node, Galaxy tool, Kepler Actor. Third party applications: Taverna Online, XworX, OGC chainer.
25. Taverna Online: 3rd party app. Dr Vadim Surpin and Vitaly Sharanutsa, Institute for Information Transmission Problems of Russian Academy of Sciences (IITP RAS). An online, in-browser application for assembling and running Taverna Workflows over a HPC platform. http://onlinehpc.com/site/main
26. Interoperability: Data format/identity mismatches. Service interface handling. Components: Well described, behaved, curated, annotated modularised workflow modules. Semantic annotations, prescribed failover, formats, provenance. Organised into common families.
27. Taverna Directions. Access: Framework to access and leverage heterogeneous legacy applications, services, datasets and codes.
Shielding from complexity. Customise: Rapid development: Flexibility, Extensibility, Adaptability, Reuse. Reusable Workflow Components. Process: Automated plumbing + Interaction. Systematic, repetitive and unbiased analysis and processing and error handling. Ensembles, comparisons, what ifs. Access: Cloud and Scale, Registries. Standards data formats, programmatic interfaces. Adapting to change. Security. Governance of components. Process: Seamless, pluggable wf as a service. Scale. Adaptability. Specific-Generic tension. Easier development, user experience. Workflow commodities, Research Objects. Design practices for reuse. Credit. Executable interactive notebooks. Provenance. A tool for reproducibility. Report. Embed: Workflows in common applications. Integration into reporting & publishing. Underpin integrative platforms. Service based science and science as a service.
28. Fix on demand. Notify as needed. Monitor for decay. Workflow/Service Monitors. 3rd Party Monitors. Workflow analytics. Detect and Repair. QUASAR toolkit. [Zhao et al. Why workflows break. e-Science 2012]
29. The Execution Provenance Gap. [Figure labels: Data tracking, Summarisation, Labelling, Distillations, Selective tracking, Filtering, Big, Fine grain, 1 White box, One System, Special tools, Collection, A Big Graph, What do I cite?, What did I do?, N Black boxes, Many Systems, My Lab Book, Analytics, Smart in situ, Presentation, Why am I citing?]
Pinar Alper, Khalid Belhajjame, Carole A. Goble, Pinar Karagoz: Enhancing and abstracting scientific workflow provenance for data publishing.
EDBT/ICDT Workshops 2013: 313-318
Sarah Cohen Boulakia, Jiuqiang Chen, Paolo Missier, Carole A. Goble, Alan R. Williams, Christine Froidevaux: Distilling structure in Taverna scientific workflows: a refactoring approach. BMC Bioinformatics 15(S-1): S12 (2014)
http://provenanceweek.dlr.de
30. Tracking Provenance. File Stores, Lab Books, Repositories. Granularity. Scales. Blackbox. Hybrid.
31. Research Objects. Bundles and relates multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms. Descriptive reproducibility. Exchange, Releasing paradigm for publishing. http://www.researchobject.org/
32. Flexibility. Review, Revise/Discard. Scale. Deploy into tools. Comparison. Personal, Group, Production. Research, Reporting. Harden.
33. http://nbviewer.ipython.org/github/myGrid/DataHackLeiden/blob/alan/Player_example.ipynb https://www.youtube.com/watch?v=QVQwSOX5S08
34. Archiving, Publishing, Component Libraries, Preserving, Recording, Storing, Exchanging, Versioning, Sharing. PACKS
35. SEEK4Science. Sharing and interlinking Methods, Models, Data. Data, Model, Article, External Databases, Metadata.
36. Virtual Liver Network. BMBF Großprojekt. ~45 organisations, ~70 groups. Multiscale rep. of the liver, clinical impact, general public portal. Same key requirements: yellow pages, exchange of all sops/data/models, sharing rights. Different biology: Multiscale data, Multiscale models, Imaging. Different project structure: Hierarchies (A, A1, A1.2), Regional groups of groups. Flexibility, extensibility, open sourceness of SEEK key.
37.
simulate models; project mgt, access control, reporting, citation, governance & policies; yellow pages of peers, projects, experts; catalogue and link data, models, samples, specimens, sops, experiments, publications using standards; curate & annotate data and models using standards; access, link to and deposit in public data and model repositories; manage, store and exchange different types and scales of data; integrate local and project tools and data systems; scaled-out collection & processing
38. experimentalists, modellers, X-informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders, social scientists, librarians, curators. Social Computation: Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists. e-Laboratories, collaboratories, VREs, repositories. An ego-system.
39. Computer Scientist, Software Engineer, Social Engineer
40. Knowledge Computation. Accurate, intelligible and comparable descriptions. Data interoperability. Machine readable metadata. Semantic technologies, Ontologies, Linked Data, Data schema.
41. Semantic Description. Describing and linking data in terms of shared concepts, relationships and identifiers. [Figure legend: Data, object property, data property, subClassOf. Ontology labels: Person, Organization, Place, State, name, birthdate, bornIn, worksFor, state, name, phone, name, livesIn, City, Event, ceo, location, organizer, nearby, startDate, endDate, title, isPartOf, postalCode.]
Column 1          Column 2   Column 3    Column 4       Column 5
Bill Gates        Oct 1955   Microsoft   Seattle        WA
Mark Zuckerberg   May 1984   Facebook    White Plains   NY
Larry Page        Mar 1973   Google      East Lansing   MI
[Taheriyan et al adapted]
42. Curation. Knowledge Ramps. Populous. http://www.rightfield.org.uk Katy Wolstencroft
43.
Pathways, Pharmacological Activities, Biological Processes, Transcripts, Pathological Processes, Diseases, Genes, Proteins, Interactions, Clinical Drug Applications, Indications, Drugs, Compounds. Pharmacological data for drug discovery combining public and private datasets. Pre-competitive silo-breaking for competitive analytics.
44. Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency