Scientific Workflows: what do we have, what do we miss?


Paolo Romano

IRCCS AOU San Martino – IST,

Genova, Italy

([email protected], skype: p.romano)

Talk outline

Aims of data integration in Life Sciences

A methodology for the automation of data retrieval and analysis processes

Workflow Management Systems

Issues related to: automatic composition, execution performance, workflow reuse

22 June 2013 2 Scientific Workflows: what do we miss?

Biomedical databases


Accessible on-line by means of human-centered interfaces

Don’t share interface, data contents and structure, encoding

Don’t interoperate

Oblige researchers to “cut & paste” data

May have huge size

DB size

Some figures

European Nucleotide Archive: 195,241,608 sequences, 292,078,866,691 bases

UniProtKB: 12,347,303 sequences, 3,974,018,240 AAs

PRIDE: 111,219,191 spectra

IntAct: 229,082 interactions

ArrayExpress: ~16,000 experiments, ~450,000 hybridizations

Next-Generation Sequencing: 16Gb / experiment!



1000 Genomes Project

An international collaboration aimed at building a detailed map of human genome variability.

Pilot phase: identification of 95% of the variations present in at least 1% of the population for three ethnic groups (Oct 28, 2010).

Data: ~4.9 Tbases (~3 Gbases/individual)

Found: 15M mutations, 1M deletions/insertions, 20K major variants

The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Published online in Nature on 28 October 2010.

DOI:10.1038/nature09534 http://www.1000genomes.org/



Impossible without bioinformatics

Unmanageable without automation of processes



Data integration: aims

Data integration and automation of retrieval and analysis processes are needed for:

o Achieving a precise and comprehensive vision of available information

o Carrying out queries and analyses involving many databases and software tools automatically

o Analysing huge quantities of data efficiently

o Implementing effective data mining


What is a Workflow

“A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition)

Aim:

Implementing data analysis processes in standardized environments

Main advantages:

efficiency: being automatic procedures, workflows free researchers from repetitive tasks and support “good practices”,

reproducibility: analyses may be replicated over time, easily and effectively,

reuse: both intermediate results and workflows may be reused,

traceability: the workflow is enacted in an environment that allows results to be traced back.


An experiment: prediction of the structure of a protein by homology


Manual

Researchers carrying out the analysis need to know:

Which tools and dbs are needed, where they reside, and how to use them

In which order they must be used

How to transfer data between them

How to reconcile the semantics of data used by services


Automatic

In an automated procedure, software must:

Know which tool/db is able to carry out a given task (e.g. aligning sequences, retrieving protein structure data)

Find real implementations (e.g. BLAST, provided by NCBI)

Link services into a workflow that achieves the desired task

Transfer data appropriately between services
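The requirements for an automated procedure (knowing which service performs a task, finding an implementation, linking services, and transferring data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular WfMS works: the registry, the task names, and the two toy services are hypothetical stand-ins for real tools and databases.

```python
from typing import Callable, Dict, List

# Hypothetical registry: abstract task name -> concrete implementation.
# In a real system the implementations would be remote services.
REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(task: str):
    """Decorator that records which function implements a given task."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[task] = func
        return func
    return wrap


@register("sequence_similarity_search")
def toy_similarity_search(sequence: str) -> str:
    # Stand-in for a real similarity search: returns a made-up identifier.
    return "HIT_" + sequence[:4].upper()


@register("structure_retrieval")
def toy_fetch_structure(hit_id: str) -> str:
    # Stand-in for a lookup in a structure database.
    return f"structure-of-{hit_id}"


def run_workflow(tasks: List[str], data: str) -> str:
    """Find an implementation for each task and pass data along the chain."""
    for task in tasks:
        if task not in REGISTRY:
            raise LookupError(f"no service registered for task {task!r}")
        data = REGISTRY[task](data)
    return data


result = run_workflow(
    ["sequence_similarity_search", "structure_retrieval"], "mkvlaagi"
)
```

The researcher names tasks only; which implementation runs, and how its output reaches the next step, is resolved by the software.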


Workflow for CABRI Network Services

Methodology: components

o Define XML languages with controlled vocabularies

o Archive data in XML formats

o Make use of Web Services for data exchange between services

o Associate data and analyses with the proper items of an ontology of bioinformatics data, data types, and tasks

o Encode processes as workflows


Both industrial and academic WfMS are available and their use for Life Sciences is now widespread.

Biopipe, an add-on for BioPerl

GPipe, an extension of Pise

Taverna (EBI), a component of the myGrid platform

Pegasys (University of British Columbia)

EGene (Universidade de São Paulo)

Wildfire (Bioinformatics Institute, Singapore)

Pipeline Pilot (SciTegic)

BioWBI, Bioinformatic Workflow Builder Interface (IBM)



Various software types and different standards

Software | Type | Standard | License | URL
Taverna Workbench | Stand-alone | XScufl | Open source | http://taverna.sourceforge.net/
Biopipe | Software library | Pipeline XML | Open source | http://www.gmod.org/biopipe/
ProGenGrid | Stand-alone | NA | NA | http://datadog.unile.it/progen
DiscoveryNet | Stand-alone | DPML | Commercial | http://www.discovery-on-the.net/
Kepler | Stand-alone | MoML | Open source | http://kepler-project.org/
GPipe | Web interface, local services | GPipe XML | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
EGene | Stand-alone | NA | Open source | http://www.lbm.fmvz.usp.br/egene/
BioWMS | Web interface, remote services | XPDL | Public use | http://litbio.unicam.it:8080/biowms/
BioWEP | Portal | XScufl, XPDL | Open source | http://bioinformatics.istge.it/biowep/
BioWBI | Web interface, local services | Proprietary | Commercial | http://www.alphaworks.ibm.com/tech/biowbi
Pegasys | Stand-alone | Pegasys DAG | Open source | http://bioinformatics.ubc.ca/pegasys/
Wildfire | Stand-alone | GEL | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
Triana | Stand-alone | Triana Workflow Language | Open source | http://www.trianacode.org/
Pipeline Pilot | Stand-alone | Proprietary | Commercial | http://www.scitegic.com/
FreeFluo | Software library | WSFL and XScufl | Open source | http://freefluo.sourceforge.net/
Biomake | Software library | NA | Open source | http://skam.sourceforge.net/


Taverna Workbench

Taverna Workbench is the best known and most widely adopted WfMS in life sciences

Developed in the context of the myGrid platform

Univ. Manchester and EBI are the main developers

Open source at SourceForge.net

It allows users to:

Build and execute workflows for complex analyses

… by getting access to remote and local services

… displaying results in various formats

… describing data through an ad-hoc ontology

Requirements: Java plus Windows / Mac / Linux

Open source: http://taverna.sourceforge.net/

Current version: 2.4

WfMS: some current issues

WfMS are increasingly used for data integration and analysis in biomedical research.

Here, we highlight some of the current issues:

Automatic composition of workflows

Performance

Reproducibility and reuse


Automatic composition

Researchers only care about scientific results!

Building workflows may be a burden

Various skills are required, and GUIs do not solve the problem

Workflow composition should be much simpler, and become semi-automatic



Automatic selection of best services

Automated service identification and composition

Adapters for different data formats

Automatic conversion of formats

Ontology of methods, tools and data types

Integration with repositories

Controlled Language Interface





A trade-off is required between rich semantic annotations and design complexity.

Semantic-based solutions are available for controlled sets of services.

Beyond Taverna

The myGrid team developed tools for identifying services and supporting the reuse of workflows

BioCatalogue

Annotated catalogue of Web Services for Life Science

myExperiment

Repository of workflows for Life Science, enabled by social networking features


EDAM (EMBRACE Data and Methods) Ontology

It allows the definition of all:

Data analysis tasks for bioinformatics

Data types

Possible relations between tasks and data types (I/O)

Transformations between equivalent data (format)

Transformations between related data (through elaboration, e.g.: triplet → AA, gene symbol → sequence)

Fundamental in order to:

Validate data flow and elaborations

Support automatic workflow composition
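A minimal sketch of how input/output annotations of this kind could support both data-flow validation and automatic composition. The operation table below is invented for illustration: its labels imitate ontology terms but are not taken from EDAM, and the breadth-first search is just one simple way to find a chain of operations.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical annotations: operation -> (input data type, output data type).
OPERATIONS: Dict[str, Tuple[str, str]] = {
    "symbol_to_sequence": ("Gene symbol", "Sequence"),
    "align": ("Sequence", "Alignment"),
    "build_tree": ("Alignment", "Phylogenetic tree"),
}


def validate_flow(steps: List[str], input_type: str) -> bool:
    """Check that each step accepts the data type produced by the previous one."""
    current = input_type
    for step in steps:
        expected, produced = OPERATIONS[step]
        if expected != current:
            return False
        current = produced
    return True


def compose(input_type: str, output_type: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of operations between two types."""
    queue = deque([(input_type, [])])
    seen = {input_type}
    while queue:
        current, path = queue.popleft()
        if current == output_type:
            return path
        for name, (expected, produced) in OPERATIONS.items():
            if expected == current and produced not in seen:
                seen.add(produced)
                queue.append((produced, path + [name]))
    return None  # no chain of annotated operations connects the two types


plan = compose("Gene symbol", "Phylogenetic tree")
```

Validation needs no service to be executed: the annotations alone reveal that a step is fed the wrong data type.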


EDAM Ontology

EDAM (EMBRACE Data and Methods)

Topic: the context of the analysis, i.e. the domain of a study or an experiment

Operation: the task carried out

Data: a data type used in bioinformatics

Format: a format used for encoding some data

http://edamontology.sourceforge.net/


Topic

Topic: "A general bioinformatics subject or category, such as a field of study, data, processing, analysis or technology."

"Biological data resources", "Nucleic acid analysis", "Protein analysis", "Sequence analysis", "Structure analysis", "Phylogenetics", "Proteomics", "Data handling", "Chemoinformatics", "Transcriptomics", "Literature and reference", "Ontologies, nomenclature and classification", "Immunoinformatics", "Genetics", "Systems biology", "Ecoinformatics", "Genomics"


Operation

Operation: "A function or process performed by a tool; what is done, but not (typically) how or in what context."

"Alignment", "Analysis and processing", "Annotation", "Classification", "Comparison", "Editing", "Mapping and assembly", "Modelling and simulation", "Optimisation and refinement", "Plotting and rendering", "Prediction, detection and recognition", "Search and retrieval", "Validation and standardisation"


Data

Data: "A type of data in common use in bioinformatics." Includes: Core data, Identifier, Parameter, Report

"Alignment", "Article", "Biological model", "Classification", "Codon usage table", "Data index", "Data reference", "Experimental measurement", "Gene expression profile", "Image", "Map", "Matrix", "Microarray data", "Molecular interaction", "Molecular property", "Ontology", "Ontology concept", "Pathway or network", "Phylogenetic raw data", "Phylogenetic tree", "Reaction data", "Schema", "Secondary structure", "Sequence", "Sequence motif", "Sequence profile", "Structural (3D) profile", "Structure", "Workflow"


Format and Identifier

Format: "A specific layout for encoding a specific type of data in a computer file or memory."

"Binary", "Format (typed)", "HTML", "RDF", "Text", "XML"

Identifier: "A label that identifies (typically uniquely) something such as data, a resource or a biological entity."

"Accession", "Identifier (hybrid)", "Identifier (typed)", "Identifier with metadata", "Name"


Performance

Researchers want the best possible results in the shortest possible time!

No matter which database, site, or computer is used

Distributed nature of data sources (network issues, e.g. timeouts and unavailability of sites)

Large data volumes (reduced data transfer)

Complex data analysis (implying HPC/cloud)


Optimization of performance

Optimization

Runtime error detection

Task-level failure recovery

Evaluation of alternative services

Task dependency analysis & flow parallelization

Parallelization on cluster or HPC architecture
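Task dependency analysis, flow parallelization, and task-level failure recovery can be illustrated together with the Python standard library. The sketch below is an assumption-laden toy, not the scheduler of any real WfMS: the task names are hypothetical, threads stand in for a cluster or HPC back end, and "recovery" is reduced to a simple retry.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Set


def parallel_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into levels whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    levels: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


def execute(
    deps: Dict[str, Set[str]],
    actions: Dict[str, Callable[[Dict[str, Any]], Any]],
    retries: int = 2,
) -> Dict[str, Any]:
    """Run each level in parallel, retrying a failed task up to `retries` times."""
    results: Dict[str, Any] = {}

    def attempt(name: str) -> Any:
        last_error = None
        for _ in range(retries + 1):
            try:
                return actions[name](results)
            except Exception as exc:  # task-level failure: try again
                last_error = exc
        raise RuntimeError(f"task {name!r} failed after retries") from last_error

    with ThreadPoolExecutor() as pool:
        for level in parallel_levels(deps):
            values = list(pool.map(attempt, level))  # wait for the whole level
            results.update(zip(level, values))
    return results


# Hypothetical workflow: two independent retrievals, then a merge, then a report.
DEPS = {
    "fetch_a": set(),
    "fetch_b": set(),
    "merge": {"fetch_a", "fetch_b"},
    "report": {"merge"},
}
LEVELS = parallel_levels(DEPS)
```

Here `LEVELS` places the two retrievals in the same level, so they can run concurrently, while the merge and the report each wait for their predecessors.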




Alternative services

SRS by Web Services (SWS) provides access to public SRS implementations by selecting the most up-to-date, working site for any given database
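The idea of choosing the most up-to-date working site can be sketched as follows. This is not the actual SWS selection logic, which is not described here: the `Mirror` record, its `release` number, and the `is_alive` probe are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Mirror:
    name: str                     # hypothetical site label
    release: int                  # database release hosted by the site
    is_alive: Callable[[], bool]  # probe; a real one would contact the site


def pick_mirror(mirrors: List[Mirror]) -> Optional[Mirror]:
    """Return the working mirror with the highest release, or None."""
    for mirror in sorted(mirrors, key=lambda m: m.release, reverse=True):
        try:
            if mirror.is_alive():
                return mirror
        except OSError:
            # Treat network-level errors (including timeouts) as "unavailable".
            continue
    return None


def timed_out() -> bool:
    raise TimeoutError("simulated timeout")


mirrors = [
    Mirror("site-a", 115, timed_out),
    Mirror("site-b", 114, lambda: True),
    Mirror("site-c", 110, lambda: True),
]
chosen = pick_mirror(mirrors)
```

In this toy run the newest site is unreachable, so the selection falls back to the next most recent release that responds.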

Reproducibility and reuse

Reproducibility of analyses in life sciences is fundamental!

Dependency on the current contents of databases

Dependency on the current status and variability of tools

NB! Perfect reproducibility in-silico is impossible!

Reuse of intermediate results and procedures


Reproducibility & reuse

Reproducibility and reuse of results

State of databases and tools

Prospective provenance data

Retrospective provenance data

Reuse of intermediate results

Caching




Prospective provenance: workflow structural model; dependencies on services, databases, or software libraries; system dependencies

Retrospective provenance: observations from run-time events, i.e. data produced and consumed and services accessed
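The two kinds of provenance, together with caching of intermediate results, can be illustrated with a small recorder. This is a sketch under assumptions rather than a description of any existing provenance system: the class name, the record fields, and the requirement that inputs be JSON-serializable are all choices made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


class ProvenanceRecorder:
    def __init__(self, workflow_spec: Dict[str, Any]) -> None:
        # Prospective provenance: the workflow structure and its dependencies.
        self.prospective = workflow_spec
        # Retrospective provenance: what actually happened at run time.
        self.retrospective: List[Dict[str, Any]] = []
        self._cache: Dict[str, Any] = {}

    def call(self, service: str, func: Callable[[Any], Any], data: Any) -> Any:
        """Invoke a service, reusing a cached result for identical input."""
        payload = json.dumps([service, data], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        from_cache = key in self._cache
        if not from_cache:
            self._cache[key] = func(data)
        self.retrospective.append({
            "service": service,
            "input": data,
            "output": self._cache[key],
            "from_cache": from_cache,
            "time": datetime.now(timezone.utc).isoformat(),
        })
        return self._cache[key]


def toy_reverse(sequence: str) -> str:
    # Hypothetical stand-in for a remote analysis service.
    return sequence[::-1]


recorder = ProvenanceRecorder({"steps": ["reverse"], "depends_on": ["toy_reverse"]})
first = recorder.call("toy_reverse", toy_reverse, "ACGT")
second = recorder.call("toy_reverse", toy_reverse, "ACGT")  # served from cache
```

The cache key here ignores the state of databases and tools; a cache meant to support reproducibility would also need to include, for example, the database release in the key.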

In collaboration with

Paolo MISSIER School of Computing Sciences, Newcastle University, UK

[email protected]

Thanks!

