Presentation given on June 22, 2013, in Nice, at the CIBB 2013 International Workshop. In collaboration with Paolo Missier, University of Newcastle upon Tyne, UK
Scientific Workflows: what do we have, what do we miss?
Paolo Romano
IRCCS AOU San Martino – IST,
Genova, Italy
([email protected], skype: p.romano)
Talk outline
Aims of data integration in Life Sciences
A methodology for the automation of data retrieval and analysis processes
Workflow Management Systems
Issues related to: automatic composition, execution performance, workflow reuse
22 June 2013 2 Scientific Workflows: what do we miss?
Biomedical databases
Accessible on-line by means of human-centered interfaces
Don’t share interface, data contents and structure, encoding
Don’t interoperate
Oblige researchers to “cut & paste” data
May have huge size
DB size
Some figures:
o European Nucleotide Archive: 195,241,608 sequences, 292,078,866,691 bases
o UniProtKB: 12,347,303 sequences, 3,974,018,240 AAs
o PRIDE: 111,219,191 spectra
o IntAct: 229,082 interactions
o ArrayExpress: ~16,000 experiments, ~450,000 hybridizations
Next-Generation Sequencing: 16Gb / experiment!
1000 Genomes Project
An international collaboration aimed at building a detailed map of human genome variability.
Pilot phase: identification of 95% of variations present in at least 1% of the population for three ethnic groups (Oct 28, 2010).
Data: ~4.9 Tbases (~3 Gbases/individual)
Found: 15M mutations, 1M deletions/insertions, 20K major variants
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Published online in Nature on 28 October 2010.
DOI:10.1038/nature09534 http://www.1000genomes.org/
Impossible without bioinformatics
Unmanageable without automation of processes
Data integration: aims
Data integration and automation of retrieval and analysis processes are needed for:
o Achieving a precise and comprehensive view of available information
o Carrying out queries and analyses involving many databases and software tools automatically
o Carrying out analyses of huge quantities of data efficiently
o Implementing effective data mining
What is a Workflow
“A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition)
Aim:
Implementing data analysis processes in standardized environments
Main advantages:
o efficiency: being automatic procedures, workflows free researchers from repetitive tasks and support “good practices”,
o reproducibility: analyses may be replicated over time, easily and effectively,
o reuse: both intermediate results and workflows may be reused,
o traceability: the workflow is enacted in an environment that allows results to be traced back.
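As a rough illustration of the definition above, a workflow can be sketched as an ordered list of named steps whose intermediate results are kept for reuse and tracing. This is a minimal sketch, not the design of any particular WfMS: the `Workflow` class and the toy steps `clean` and `gc_content` are hypothetical names introduced here only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Workflow:
    """An ordered list of (name, function) steps; each step consumes the previous output."""
    steps: List[Tuple[str, Callable[[Any], Any]]]
    # Output of every step, kept so results can be reused and traced back.
    intermediate: Dict[str, Any] = field(default_factory=dict)

    def run(self, data: Any) -> Any:
        self.intermediate.clear()
        for name, func in self.steps:
            data = func(data)
            self.intermediate[name] = data
        return data


# Hypothetical toy steps standing in for real retrieval and analysis services.
def clean(seq: str) -> str:
    return seq.strip().upper()


def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


wf = Workflow(steps=[("clean", clean), ("gc_content", gc_content)])
result = wf.run("  atgc\n")
print(result, wf.intermediate)
```

Running the same workflow twice on the same input gives the same result, which is the reproducibility property in miniature; real workflows also depend on external databases and tools, so this holds only for pure local steps.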
An experiment
Prediction of the structure of a protein by homology
Manual
Researchers carrying out the analysis need to know:
o Which tools and dbs are needed, where they reside, and how to use them
o In which order they must be used
o How to transfer data between them
o How to reconcile semantics of data used by services
Automatic
In an automated procedure, software must:
o Know which tool/db is able to carry out a given task (e.g. aligning sequences, retrieving protein structure data)
o Find real implementations (e.g. BLAST, provided by NCBI)
o Link services in a workflow that achieves the desired task
o Transfer data appropriately between services
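The four requirements above can be sketched with a small lookup table that maps abstract tasks to concrete implementations. This is only a sketch under stated assumptions: the `REGISTRY` contents, the stub services `blast_stub` and `pdb_stub`, and the task names are all hypothetical and do not call any real service.

```python
from typing import Dict, List

# Hypothetical registry: abstract task -> available implementations.
# The "run" callables are local stubs, not calls to real services.
REGISTRY: Dict[str, List[dict]] = {
    "align_sequence": [
        {"name": "blast_stub", "run": lambda seq: f"hits_for({seq})"},
    ],
    "retrieve_structure": [
        {"name": "pdb_stub", "run": lambda hits: f"structure_from({hits})"},
    ],
}


def find_implementation(task: str) -> dict:
    """Know which tool can carry out a task, and find a real implementation."""
    candidates = REGISTRY.get(task, [])
    if not candidates:
        raise LookupError(f"no service registered for task {task!r}")
    return candidates[0]


def run_pipeline(tasks: List[str], data: str) -> str:
    """Link services in order and transfer each output to the next service."""
    for task in tasks:
        service = find_implementation(task)
        data = service["run"](data)
    return data


output = run_pipeline(["align_sequence", "retrieve_structure"], "MKT")
print(output)
```

Keeping the task names separate from the implementations is what would let a system swap one service for another without changing the workflow itself.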
Workflow for CABRI Network Services
Methodology: components
o Define XML languages with controlled vocabularies
o Archive data in XML formats
o Make use of Web Services for data exchange between services
o Associate data and analyses with the proper items of an ontology of bioinformatics data, data types, and tasks
o Encode processes as workflows
Workflow Management Systems
Both industrial and academic WfMS are available and their use for Life Sciences is now widespread.
o Biopipe, an add-on for bioperl
o GPipe, an extension of Pise
o Taverna (EBI), a component of myGrid platform
o Pegasys (University of British Columbia)
o EGene (Universidade de São Paulo)
o Wildfire (Bioinformatics Institute, Singapore)
o Pipeline Pilot (SciTegic)
o BioWBI, Bioinformatic Workflow Builder Interface (IBM)
Workflow Management Systems
Various software types and different standards
Software | Type | Standard | License | URL
Taverna Workbench | Stand-alone | XScufl | Open source | http://taverna.sourceforge.net/
Biopipe | Software library | Pipeline XML | Open source | http://www.gmod.org/biopipe/
ProGenGrid | Stand-alone | NA | NA | http://datadog.unile.it/progen
DiscoveryNet | Stand-alone | DPML | Commercial | http://www.discovery-on-the.net/
Kepler | Stand-alone | MoML | Open source | http://kepler-project.org/
GPipe | Web interface, local services | GPipe XML | Open source | http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
EGene | Stand-alone | NA | Open source | http://www.lbm.fmvz.usp.br/egene/
BioWMS | Web interface, remote services | XPDL | Public use | http://litbio.unicam.it:8080/biowms/
BioWEP | Portal | XScufl, XPDL | Open source | http://bioinformatics.istge.it/biowep/
BioWBI | Web interface, local services | Proprietary | Commercial | http://www.alphaworks.ibm.com/tech/biowbi
Pegasys | Stand-alone | Pegasys DAG | Open source | http://bioinformatics.ubc.ca/pegasys/
Wildfire | Stand-alone | GEL | Open source | http://wildfire.bii.a-star.edu.sg/wildfire/
Triana | Stand-alone | Triana Workflow Language | Open source | http://www.trianacode.org/
Pipeline Pilot | Stand-alone | Proprietary | Commercial | http://www.scitegic.com/
FreeFluo | Software library | WSFL and XScufl | Open source | http://freefluo.sourceforge.net/
Biomake | Software library | NA | Open source | http://skam.sourceforge.net/
Taverna Workbench
Taverna Workbench is the best known and most widely adopted WfMS in life sciences
o Developed in the context of the myGrid platform
o Univ. Manchester and EBI are the main developers
o Open source at SourceForge.net
It allows users to:
o Build and execute workflows for complex analyses
o … by getting access to remote and local services
o … displaying results in various formats
o … describing data through an ad-hoc ontology
Requirements: Java plus Windows / Mac / Linux
Open source: http://taverna.sourceforge.net/
Current version: 2.4
WfMS: some current issues
WfMS are increasingly used for data integration and analysis in biomedical research.
Here, we highlight some of the current issues:
o Automatic composition of workflows
o Performance
o Reproducibility and reuse
Automatic composition
Researchers only care about scientific results!
Building workflows may be a burden
Various skills are required, and GUIs do not solve this
Workflow composition should be much simpler, and become semi-automatic
Automatic composition
o Automatic selection of best services
o Automated service identification and composition
o Adapters for different data formats
o Automatic conversion of formats
o Ontology of methods, tools and data types
o Integration with repositories
o Controlled Language Interface
A trade-off is required between rich semantic annotations and design complexity.
Semantic-based solutions are available for controlled sets of services.
Beyond Taverna
The myGrid team developed tools for identifying services and supporting the reuse of workflows
BioCatalogue
Annotated catalogue of Web Services for Life Science
MyExperiment
Repository of workflows for Life Science, enabled by social networking features
EDAM Ontology
EDAM (EMBRACE Data and Methods) Ontology
It allows defining all:
o Data analysis tasks for bioinformatics
o Data types
o Possible relations between tasks and data types (I/O)
o Transformations between equivalent data (format)
o Transformations between related data (through elaboration, e.g.: triplet → AA, gene symbol → sequence)
It is fundamental in order to:
o Validate data flow and elaborations
o Support automatic workflow composition
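One way to see how typed inputs and outputs could support automatic composition is a search over data types: each service is an edge from its input type to its output type, and a workflow is a path. The sketch below uses breadth-first search; the `SERVICES` table, its service names and its type labels are hypothetical and are not taken from EDAM itself.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical services: name -> (input data type, output data type).
SERVICES: Dict[str, Tuple[str, str]] = {
    "symbol_to_sequence": ("gene symbol", "nucleotide sequence"),
    "translate": ("nucleotide sequence", "protein sequence"),
    "align": ("protein sequence", "alignment"),
}


def compose(source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of services
    turning data of type `source` into data of type `target`."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        dtype, path = queue.popleft()
        if dtype == target:
            return path
        for name, (inp, out) in SERVICES.items():
            if inp == dtype and out not in seen:
                seen.add(out)
                queue.append((out, path + [name]))
    return None  # no chain of services connects the two types


chain = compose("gene symbol", "alignment")
print(chain)
```

A real composer would also have to rank alternative chains and handle services with several inputs, which this single-input sketch deliberately ignores.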
EDAM Ontology
EDAM (EMBRACE Data and Methods)
o Topic: context of the analysis, i.e. the domain of a study or an experiment
o Operation: task carried out
o Data: a data type used in bioinformatics
o Format: a format used for encoding some data
http://edamontology.sourceforge.net/
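The four branches above suggest a simple annotation record per service, which is enough to check whether the output of one service can feed another. The following is a minimal sketch: the `Annotation` fields, the `link_status` helper and the three example services are hypothetical, and the labels are plain strings rather than real EDAM identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical per-service annotation using the four EDAM branches."""
    topic: str
    operation: str
    input_data: str
    input_format: str
    output_data: str
    output_format: str


def link_status(producer: Annotation, consumer: Annotation) -> str:
    """Validate the data flow between two services."""
    if producer.output_data != consumer.input_data:
        return "incompatible"
    if producer.output_format != consumer.input_format:
        return "needs format conversion"
    return "compatible"


retrieval = Annotation("Sequence analysis", "Search and retrieval",
                       "Identifier", "Text", "Sequence", "XML")
aligner_text = Annotation("Sequence analysis", "Alignment",
                          "Sequence", "Text", "Alignment", "Text")
aligner_xml = Annotation("Sequence analysis", "Alignment",
                         "Sequence", "XML", "Alignment", "Text")

print(link_status(retrieval, aligner_text))
print(link_status(retrieval, aligner_xml))
print(link_status(aligner_text, retrieval))
```

The middle outcome, same data type but different format, is the case where a format adapter would be inserted between the two services.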
Topic
"A general bioinformatics subject or category, such as a field of study, data, processing, analysis or technology."
"Biological data resources", "Nucleic acid analysis", "Protein analysis", "Sequence analysis",
"Structure analysis", "Phylogenetics", "Proteomics", "Data handling",
"Chemoinformatics", "Transcriptomics", "Literature and reference",
"Ontologies, nomenclature and classification", "Immunoinformatics",
"Genetics", "Systems biology", "Ecoinformatics", "Genomics"
Operation
"A function or process performed by a tool; what is done, but not (typically) how or in what context."
"Alignment", "Analysis and processing", "Annotation", "Classification",
"Comparison", "Editing", "Mapping and assembly", "Modelling and simulation",
"Optimisation and refinement", "Plotting and rendering",
"Prediction, detection and recognition", "Search and retrieval",
"Validation and standardisation"
Data
"A type of data in common use in bioinformatics."
Includes: Core data, Identifier, Parameter, Report
"Alignment", "Article", "Biological model", "Classification", "Codon usage table",
"Data index", "Data reference", "Experimental measurement", "Gene expression profile",
"Image", "Map", "Matrix", "Microarray data", "Molecular interaction",
"Molecular property", "Ontology", "Ontology concept", "Pathway or network",
"Phylogenetic raw data", "Phylogenetic tree", "Reaction data", "Schema",
"Secondary structure", "Sequence", "Sequence motif", "Sequence profile",
"Structural (3D) profile", "Structure", "Workflow"
Format and Identifier
Format: "A specific layout for encoding a specific type of data in a computer file or memory."
"Binary", "Format (typed)", "HTML", "RDF", "Text", "XML"
Identifier: "A label that identifies (typically uniquely) something such as data, a resource or a biological entity."
"Accession", "Identifier (hybrid)", "Identifier (typed)", "Identifier with metadata", "Name"
Performance
Researchers want the best possible results in the shortest possible time!
No matter which database, site, or computer is used
o Distributed nature of data sources (network issues, e.g. timeouts and unavailability of sites)
o Large data volumes (reduced data transfer)
o Complex data analysis (implying HPC/cloud)
Optimization of performance
Optimization:
o Runtime error detection
o Task-level failure recovery
o Evaluation of alternative services
o Task dependency analysis & flow parallelization
o Parallelization on cluster or HPC architecture
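Task dependency analysis and flow parallelization can be illustrated with a topological sort that releases, at each round, every task whose prerequisites are complete. The sketch below assumes Python 3.9 or later for the standard `graphlib` module; the task graph `DEPENDENCIES` and the placeholder `run_task` are hypothetical stand-ins for real service calls.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical task graph: task -> set of tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "fetch_sequence": set(),
    "fetch_structure": set(),
    "align": {"fetch_sequence"},
    "model": {"align", "fetch_structure"},
}


def run_task(name: str) -> str:
    # Placeholder for a real (possibly remote) service call.
    return f"{name}:done"


def run_parallel(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Run tasks in batches; tasks in the same batch are independent
    of each other and are submitted to the thread pool together."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            list(pool.map(run_task, ready))  # wait for the whole batch
            sorter.done(*ready)
            batches.append(ready)
    return batches


batches = run_parallel(DEPENDENCIES)
print(batches)
```

Waiting for a whole batch before starting the next one is a simplification; a scheduler could start each task as soon as its own prerequisites finish.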
Alternative services: SRS by Web Services (SWS) provides access to public SRS implementations by selecting the most up-to-date, working site for any given database
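The selection rule described above (most up-to-date among the working sites) can be sketched as follows. This is not the SWS interface: the `Site` record, the `select_site` function, the example.org addresses and the injected availability check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class Site:
    url: str            # hypothetical mirror address
    last_update: date   # date of the database copy held by this site


def select_site(sites: List[Site],
                is_up: Callable[[Site], bool]) -> Optional[Site]:
    """Return the working site with the most recent database copy,
    or None when no site is reachable."""
    working = [s for s in sites if is_up(s)]
    return max(working, key=lambda s: s.last_update, default=None)


mirrors = [
    Site("http://mirror-a.example.org/srs", date(2013, 6, 1)),
    Site("http://mirror-b.example.org/srs", date(2013, 5, 1)),
    Site("http://mirror-c.example.org/srs", date(2013, 4, 1)),
]
# Availability is injected so the sketch runs offline; here mirror A is down.
down = {"http://mirror-a.example.org/srs"}
chosen = select_site(mirrors, is_up=lambda s: s.url not in down)
print(chosen.url)
```

In practice the availability check would be a network probe with a timeout, which is where runtime error detection and task-level failure recovery come in.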
Reproducibility and reuse
Reproducibility of analysis in life sciences is fundamental!
o Dependency on current contents of databases
o Dependency on the current status and variability of tools
NB! Perfect reproducibility in-silico is impossible!
Reuse of intermediate results and procedures
Reproducibility & reuse
Reproducibility and reuse of results:
o State of databases and tools
o Prospective provenance data
o Retrospective provenance data
o Reuse of intermediate results
o Caching
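Caching of intermediate results can be sketched by keying each result on the step name, the tool version and the inputs, so that a change in the state of a tool invalidates the cached value. This is a minimal in-memory sketch; `cache_key`, `run_cached` and the toy `reverse_complement` step are hypothetical, and the inputs are assumed to be JSON-serializable.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}
calls = {"count": 0}  # counts real executions, for demonstration only


def cache_key(step: str, version: str, inputs: dict) -> str:
    """Stable key derived from the step name, tool version and inputs."""
    payload = json.dumps(
        {"step": step, "version": version, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(step: str, version: str, inputs: dict,
               func: Callable[..., Any]) -> Any:
    key = cache_key(step, version, inputs)
    if key not in _cache:
        _cache[key] = func(**inputs)  # executed only on a cache miss
    return _cache[key]


def reverse_complement(seq: str) -> str:
    calls["count"] += 1
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


first = run_cached("revcomp", "1.0", {"seq": "ATGC"}, reverse_complement)
second = run_cached("revcomp", "1.0", {"seq": "ATGC"}, reverse_complement)
print(first, second, calls["count"])
```

Including a database release identifier in the key, alongside the tool version, would extend the same idea to the state of databases.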
Prospective provenance: workflow structural model; dependencies on services, databases, or software libraries; system dependencies
Retrospective provenance: observations from run-time events, i.e. data produced and consumed and services accessed
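The distinction between the two kinds of provenance can be made concrete with two record types: one describing the plan before execution, the other logging what happened during a run. This is only a sketch; the class names, the field names, the `toy_db` dependency and the toy services are hypothetical and do not follow any provenance standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ProspectiveProvenance:
    """What is known before execution: structure and dependencies."""
    workflow: str
    steps: List[str]
    dependencies: Dict[str, str]  # e.g. database or library -> version


@dataclass
class RunEvent:
    """What is observed at run time for one step."""
    step: str
    consumed: Any
    produced: Any
    timestamp: str


def run_with_provenance(
    plan: ProspectiveProvenance,
    services: Dict[str, Callable[[Any], Any]],
    data: Any,
) -> Tuple[Any, List[RunEvent]]:
    events: List[RunEvent] = []
    for step in plan.steps:
        produced = services[step](data)
        events.append(RunEvent(step, data, produced,
                               datetime.now(timezone.utc).isoformat()))
        data = produced
    return data, events


plan = ProspectiveProvenance(
    workflow="toy_workflow",
    steps=["upper", "length"],
    dependencies={"toy_db": "hypothetical release label"},
)
final, log = run_with_provenance(plan, {"upper": str.upper, "length": len}, "acgt")
print(final, [(e.step, e.consumed, e.produced) for e in log])
```

Storing both records together with the results is what would allow a later reader to tell whether a different outcome comes from a changed plan or from a changed environment.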
In collaboration with
Paolo MISSIER School of Computing Sciences, Newcastle University, UK
Thanks!