2016-09-04 BioExcel SIG, ECCB, Amsterdam Advances in Scientific Workflow Environments Carole Goble, Stian Soiland-Reyes The University of Manchester [email protected] http://esciencelab.org.uk/

Page 1: Advances in Scientific Workflow Environments

2016-09-04 BioExcel SIG, ECCB, Amsterdam

Advances in Scientific Workflow Environments

Carole Goble, Stian Soiland-Reyes
The University of Manchester

[email protected]
http://esciencelab.org.uk/

Page 2: Advances in Scientific Workflow Environments

What is a Workflow?
• Orchestrating multiple computational tasks
• Managing the control and data flow between them
• In a world that is homogeneous or heterogeneous

• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls

BioExcel: Biomolecular recognition
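
To make "orchestrating tasks and managing the data flow between them" concrete, here is a minimal sketch in Python. The three task functions and the `DEPENDS_ON` table are hypothetical stand-ins, not part of any system named in these slides; the standard-library `graphlib` module (Python 3.9+) supplies the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each receives the outputs of its upstream tasks
# as keyword arguments named after those tasks.
def fetch():
    return [3, 1, 2]

def clean(fetch):
    return sorted(fetch)

def summarise(clean):
    return {"n": len(clean), "max": max(clean)}

TASKS = {"fetch": fetch, "clean": clean, "summarise": summarise}
# Data flow: task name -> names of the tasks whose outputs it consumes.
DEPENDS_ON = {"fetch": [], "clean": ["fetch"], "summarise": ["clean"]}

def run_workflow(tasks, depends_on):
    """Run every task once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on[name]}
        results[name] = tasks[name](**inputs)
    return results

results = run_workflow(TASKS, DEPENDS_ON)
```

A real workflow system adds what this sketch leaves out: remote execution, access control, failure handling and provenance capture.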

Page 3: Advances in Scientific Workflow Environments

What is a Workflow?

Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns

Scaling – compute cycles
– Make use of computational infrastructure & handle large data

Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities

Provenance – reporting
– Capture, report and utilize log and data lineage auto-documentation
– Traceable evolution, audit, transparency
– Compare

With thanks to Bertram Ludascher: WORKS 2015 Keynote

Findable
Accessible
Interoperable
Reusable
(Reproducible)

Page 4: Advances in Scientific Workflow Environments

https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/

Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes

Page 5: Advances in Scientific Workflow Environments

Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture.

[Susheel Varma] http://www.vph-share.eu/

http://taverna.org.uk

Page 6: Advances in Scientific Workflow Environments

Galaxy https://usegalaxy.org/

Page 7: Advances in Scientific Workflow Environments

Marine metagenomics

Workflow Driven

+ Bespoke Scripts

[Rob Finn]

Page 8: Advances in Scientific Workflow Environments

Open PHACTS
https://www.knime.org/

BioExcel workflow

https://www.openphacts.org/

Targets

Pharmacological queries
target, compound and pathway data

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115460

Page 9: Advances in Scientific Workflow Environments

Scripts, Ensemble toolkit, execution patterns

http://www.extasy-project.org/

Page 10: Advances in Scientific Workflow Environments

http://www.myexperiment.org

WF Zoo

Page 11: Advances in Scientific Workflow Environments
Page 12: Advances in Scientific Workflow Environments

Workflow Patterns, templates

Data wrangling & analytics

Simulations

Instrument pipelines ++

http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd

Page 13: Advances in Scientific Workflow Environments

Workflow Patterns, templates

Data wrangling & analytics

Simulations

Instrument pipelines ++

Garijo et al, Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351

Page 14: Advances in Scientific Workflow Environments

Workflow Patterns, templates
• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop interaction
• Computational steering
• In situ workflows – multiple tasks, same box, within fixed time
– data locality
– human-in-the-loop
– capture provenance

Data wrangling & analytics

Simulations

Instrument pipelines ++
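
The "tunable parameters" and "simulation sweeps" pattern can be sketched as one run per combination of parameter values. This is only an illustration: `simulate`, its parameters and the grid values are hypothetical stand-ins for a real long-running code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def simulate(temperature, steps):
    """Stand-in for a long-running simulation code (hypothetical)."""
    return temperature * steps

def sweep(grid, max_workers=4):
    """Run `simulate` once per combination of parameter values."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*grid.values())]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda params: simulate(**params), combos))
    return list(zip(combos, outputs))

# Three temperatures x two step counts = six independent runs.
runs = sweep({"temperature": [280, 300, 320], "steps": [10, 100]})
best_params, best_value = max(runs, key=lambda run: run[1])
```

Because the runs are independent, the same structure maps onto a cluster or cloud back end by swapping the executor; the comparison step at the end is where ensembles are usually analysed.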

Page 15: Advances in Scientific Workflow Environments

Traction + Examples
Reuse behaviours

Exploratory vs Production
Different kinds of user / deployment

Developer – User Ratios

Biologist
Developer
Computational Scientist

Embed in Application

Embed in platform

Embed in infrastructure

Page 18: Advances in Scientific Workflow Environments

Existing computational research workflow systems

https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

Page 19: Advances in Scientific Workflow Environments

“Multi-scale” WFMS
• Workflow Management System
– Its design and reporting environment
– Its execution environment
• The tasks
– tools, codes and services and their execution environments
• Stack layer
– App level, infrastructure level

Page 20: Advances in Scientific Workflow Environments

Component making

Tasks loosely coupled through files
• execute on geographically distributed clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)

DAIC
Distributed Area/Instrument Computing

“Multi-scale” WFMS

Tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same box, within fixed time

HPC

Interoperability
Portability
Granularity
Maintenance

Page 21: Advances in Scientific Workflow Environments

Workflow Environment Ecosystem

Page 22: Advances in Scientific Workflow Environments

Copernicus workflow engine for parallel adaptive molecular dynamics

• Peer-to-peer distributed computing platform
– high-level parallelization of statistical sampling problems
• Consolidation of heterogeneous compute resources
• Automatic resource matching of jobs against compute resources
• Automatic fault tolerance of distributed work
• Workflow execution engine to define a problem (reporting) and trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the workflow execution engine

Free Energy Workflow using GROMACS

http://copernicus-computing.org/
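
The two "automatic" bullets above, resource matching and fault tolerance, can be illustrated generically. The sketch below is not Copernicus code or its scheduling algorithm; the worker and job records, the `execute` callback and the retry limit are all assumptions made for the example.

```python
from collections import deque

def match(job, workers):
    """Return the first worker with enough cores for the job, or None."""
    for worker in workers:
        if worker["cores"] >= job["cores"]:
            return worker
    return None

def run_all(jobs, workers, execute, max_attempts=3):
    """Run jobs on matching workers, re-queueing any job whose execution fails."""
    queue = deque((job, 1) for job in jobs)
    done, failed = {}, []
    while queue:
        job, attempt = queue.popleft()
        worker = match(job, workers)
        if worker is None:
            failed.append(job["name"])  # no available resource is large enough
            continue
        try:
            done[job["name"]] = execute(job, worker)
        except RuntimeError:
            if attempt < max_attempts:
                queue.append((job, attempt + 1))
            else:
                failed.append(job["name"])
    return done, failed

# Hypothetical resources and jobs; "sample-b" fails on its first attempt only.
workers = [{"name": "laptop", "cores": 4}, {"name": "cluster", "cores": 64}]
jobs = [{"name": "sample-a", "cores": 2}, {"name": "sample-b", "cores": 32},
        {"name": "too-big", "cores": 128}]
attempts = {}

def execute(job, worker):
    attempts[job["name"]] = attempts.get(job["name"], 0) + 1
    if job["name"] == "sample-b" and attempts[job["name"]] == 1:
        raise RuntimeError("simulated node failure")
    return worker["name"]

done, failed = run_all(jobs, workers, execute)
```

Statistical sampling problems suit this style because each sample is an independent job, so a lost job can simply be run again elsewhere.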

Page 23: Advances in Scientific Workflow Environments

COMPSs/PyCOMPSs: Programmer Productivity framework

• Sequential programming
– Parallelisation and distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from underlying infrastructure
– Portability
• Standard Programming Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
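
The idea of writing sequential-looking code while a runtime detects dependencies and parallelises can be sketched with the Python standard library. This is not the PyCOMPSs API or its runtime; the `task` decorator below is a hypothetical toy built on `concurrent.futures`, shown only to illustrate the programming model.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def task(func):
    """Make calls to `func` asynchronous; Future arguments become dependencies."""
    def submit(*args):
        def run():
            # Waiting on upstream futures here is what orders dependent tasks.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return submit

@task
def square(x):
    return x * x

@task
def add(a, b):
    return a + b

# Reads like sequential code; the two square() calls are independent
# of each other and may run at the same time.
left = square(3)
right = square(4)
total = add(left, right)
answer = total.result()  # explicit synchronisation point
```

The user code never mentions threads, nodes or data transfers, which is the "infrastructure unaware" point on the slide; a production runtime would also move data and schedule across machines.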

Page 24: Advances in Scientific Workflow Environments

Shield the user/programmer

Exposure to the infrastructure

System Design

Resource provisioning

Adaptive/dynamic workflows

Manage/minimize data transfers

Smart parallelism

Code staging

Data staging

Fail-over

Human in the loop

OS/R Guarantees

Service Guarantees

Page 25: Advances in Scientific Workflow Environments

Stop Press!

GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications

Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **

Page 26: Advances in Scientific Workflow Environments

Stop Press!

GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications

Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **

Work close to a problem-specific ad-hoc data model

Domain Specific Language "programming-lite" scripts

• wire with declarative "makefile"-like DAG

Plus

• procedural scripting and expressions in languages like Javascript and Python

Nextflow, SnakeMake, Common Workflow Language
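
A declarative "makefile"-like DAG can be sketched as rules that name their input and output files, with the execution plan derived by working backwards from the requested target. The rule table, file names and command templates below are hypothetical, and the code is not the syntax of Nextflow, SnakeMake or the Common Workflow Language.

```python
RULES = {
    # output file: (input files, command template); all names are hypothetical
    "aligned.bam": (["reads.fastq", "genome.fa"], "align {inputs} > {output}"),
    "variants.vcf": (["aligned.bam", "genome.fa"], "call {inputs} > {output}"),
    "report.html": (["variants.vcf"], "report {inputs} > {output}"),
}

def plan(target, rules, existing, seen=None):
    """Return the commands needed to build `target`, dependencies first."""
    seen = set() if seen is None else seen
    if target in existing or target in seen:
        return []
    if target not in rules:
        raise KeyError(f"no rule produces {target!r}")
    seen.add(target)
    inputs, template = rules[target]
    commands = []
    for dep in inputs:
        commands += plan(dep, rules, existing, seen)
    commands.append(template.format(inputs=" ".join(inputs), output=target))
    return commands

commands = plan("report.html", RULES, existing={"reads.fastq", "genome.fa"})
```

The wiring is declarative (the rule table), while anything procedural lives inside the commands, which mirrors the "DAG plus scripting" split described above.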

Page 27: Advances in Scientific Workflow Environments

GUIs Are Essential
take-up by the user base

Page 28: Advances in Scientific Workflow Environments

Workflowising script software eco-systems
prime example: provenance

ASAP
• common, interoperable provenance recording
– W3C PROV

ASAP
• YesWorkflow.org
– Annotations in script yield workflow view

ASAP
• Library profilers
– noWorkflow
• runtime provenance recorders
– Sumatra, RDataTracker
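
How annotations in a script can yield a workflow view is easy to show with a toy extractor. The sketch below reads `@begin`, `@end`, `@in` and `@out` comment tags in the style used by YesWorkflow, but it is not the YesWorkflow tool: the embedded script, the block names and the flat (non-nested) parsing are simplifying assumptions.

```python
import re

# A script held as text; only its comment annotations are read, it is never run.
SCRIPT = '''
# @begin clean_data
# @in raw_table
# @out clean_table
rows = [r for r in raw if r]
# @end clean_data

# @begin fit_model
# @in clean_table
# @out model
model = sum(rows) / len(rows)
# @end fit_model
'''

TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")

def extract_blocks(source):
    """Collect each annotated block with the data names it reads and writes."""
    blocks, current = [], None
    for line in source.splitlines():
        found = TAG.search(line)
        if not found:
            continue
        tag, name = found.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end":
            blocks.append(current)
            current = None
        elif current is not None:
            current[tag].append(name)
    return blocks

def data_edges(blocks):
    """Link a block to every block that consumes one of its outputs."""
    return [(a["name"], data, b["name"])
            for a in blocks for b in blocks
            for data in a["out"] if data in b["in"]]

blocks = extract_blocks(SCRIPT)
edges = data_edges(blocks)
```

The resulting edges are a prospective view of the script: what it is declared to do, independent of any particular run.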

Page 29: Advances in Scientific Workflow Environments

Provenance
the link between computation and results

W3C PROV model standard

record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging

Metadata propagation – where was the physical sample collected, and who should be attributed?

Task-based abstractions: simplifying provenance using motifs and tool annotations
“Free energy calculation” rather than 5 steps including preparation of PDB files and GROMACS execution
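
As a rough illustration of provenance as the link between computation and results, the sketch below stores a run as plain tuples whose relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`) come from the W3C PROV data model, then walks them to find what a result was derived from. The file, activity and agent identifiers are hypothetical, and a real deployment would use a PROV serialisation rather than tuples.

```python
# (relation, subject, object): used(activity, entity),
# wasGeneratedBy(entity, activity), wasAssociatedWith(activity, agent).
RECORDS = [
    ("used", "prepare_pdb", "structure.pdb"),
    ("wasGeneratedBy", "topology.top", "prepare_pdb"),
    ("used", "run_gromacs", "topology.top"),
    ("wasGeneratedBy", "free_energy.dat", "run_gromacs"),
    ("wasAssociatedWith", "run_gromacs", "alice"),
]

def lineage(entity, records):
    """Return every entity that `entity` was derived from, nearest first."""
    generated_by = {e: a for rel, e, a in records if rel == "wasGeneratedBy"}
    used = {}
    for rel, activity, e in records:
        if rel == "used":
            used.setdefault(activity, []).append(e)
    ancestors, frontier = [], [entity]
    while frontier:
        current = frontier.pop(0)
        # An entity with no recorded generating activity is a raw input.
        for source in used.get(generated_by.get(current), []):
            if source not in ancestors:
                ancestors.append(source)
                frontier.append(source)
    return ancestors

sources = lineage("free_energy.dat", RECORDS)
```

The same walk supports several uses listed above, such as carrying attribution from inputs to outputs or deciding which intermediate data can be regenerated rather than kept.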

Page 30: Advances in Scientific Workflow Environments

Provenance
the link between workflow variants and workflow reuse and repurpose

W3C PROV model standard?

record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
carry attributions
compute design credits
versioning, forking, cloning

Nested workflows
functions by stealth

Copy and paste fragmentation
Designing for reuse
Find and Go

Software practices
Systematic reuse

Guidelines for persistently identifying software using DataCite
https://epubs.stfc.ac.uk/work/24058274

https://www.force11.org/software-citation-principles

Page 31: Advances in Scientific Workflow Environments

ASAP Wfms for FAIR Science

Automate: workflows, programs and services folks already use or want to use

Scale: Enable computational productivity

Abstract: Enable human productivity

Provenance: Record and use

Provenance

Reproducibility

Portability

Reuse

Usability

Understanding

Validation

Workflow Plugged in Code

Reporting Comparison

Interoperability

Thanks to Bertram Ludascher

Page 33: Advances in Scientific Workflow Environments

Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations

● Task-specific “mini-workflow” fragments
– e.g. using Gromacs, CPMD, HADDOCK
● Packaged
– EGI VM images and Docker containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (Open Nebula, Open Stack)
– public (e.g. Amazon AWS)

Page 34: Advances in Scientific Workflow Environments

BioExcel Use cases

● Genomics
● Ensembl Molecular simulations
● Free Energy simulations
● Multiscale modelling of molecular basis for odor and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening

Page 35: Advances in Scientific Workflow Environments

Finding valid pathways through free-energy landscapes: implementation of the “string of swarms” method using Copernicus as a workflow manager, and GROMACS as a compute engine.

Page 36: Advances in Scientific Workflow Environments

Workflow Interoperability
• Common format for bioinformatics tool & workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps

• Develop your pipeline on your local computer (optionally with Docker)

• Execute on your research cluster or in the cloud

• Deliver to users via workbenches

• EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
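
Reasoning about file formats, as in the "FASTQ Sanger is a type of FASTQ" example, amounts to walking an "is a kind of" hierarchy when checking whether one step's output may feed another step's input. The hierarchy below is a tiny hand-written stand-in with illustrative labels; it is not the EDAM ontology or its identifiers.

```python
# A tiny, hand-written "is a kind of" hierarchy; labels are illustrative,
# not actual EDAM identifiers.
PARENT = {
    "FASTQ-sanger": "FASTQ",
    "FASTQ-illumina": "FASTQ",
    "FASTQ": "Sequence record format",
    "FASTA": "Sequence record format",
}

def is_a(fmt, expected, parent=PARENT):
    """True if `fmt` equals `expected` or is one of its specialisations."""
    while fmt is not None:
        if fmt == expected:
            return True
        fmt = parent.get(fmt)
    return False

def can_connect(output_format, accepted_format):
    """A step output may feed an input that accepts the same or a broader format."""
    return is_a(output_format, accepted_format)

ok = can_connect("FASTQ-sanger", "FASTQ")      # specific output, broader input
too_narrow = can_connect("FASTQ", "FASTQ-sanger")  # broad output, specific input
```

The check is deliberately one-directional: a tool that needs the Sanger encoding cannot safely accept an arbitrary FASTQ file.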

Page 37: Advances in Scientific Workflow Environments

Workflow Research Object Bundle
researchobject.org

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003

application/vnd.wf4ever.robundle+zip
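
A bundle of this kind is a ZIP archive that carries its media type and a manifest alongside the workflow and data. The sketch below builds one in memory with the standard library. The layout (a leading uncompressed `mimetype` entry and a `.ro/manifest.json` listing aggregated resources) and the manifest keys are written from memory of the Research Object Bundle specification and should be treated as assumptions to check against it; the file names and contents are hypothetical.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"

def make_bundle(files):
    """Return the bytes of a ZIP bundle holding `files` (path -> bytes)."""
    # Manifest keys are an assumption based on the RO Bundle specification.
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + path} for path in files],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The media type goes first and uncompressed so tools can sniff it.
        bundle.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, content in files.items():
            bundle.writestr(path, content)
    return buffer.getvalue()

data = make_bundle({"workflow.cwl": b"cwlVersion: v1.0\n",
                    "results/out.txt": b"42\n"})
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
```

Keeping the workflow, its inputs and outputs, and the manifest in one archive is what lets the bundle travel as a single citable, exchangeable object.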

Page 38: Advances in Scientific Workflow Environments

Generic Grid middleware

Workflow bus: provide services for
1) Interoperability and integration, 2) composition, 3) provenance, 4) Enactment, 5) Human in the loop computing

Taverna Kepler Triana VLAMG

Sub workflow 1

Sub workflow 2

Sub workflow 3

Scientific experiment: a meta workflow

Sub workflow 4

Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam

Page 39: Advances in Scientific Workflow Environments

2007

2015

Page 40: Advances in Scientific Workflow Environments

http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/

Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin (UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse (EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti (Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)

Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou

Sign up ASAP!

Page 41: Advances in Scientific Workflow Environments

Bonus Slides