
Assessing Galaxy's ability to express scientific workflows in bioinformatics


Introduction
Biological Sequence Analysis
Scientific workflow management systems
The Galaxy framework
Conclusion
Bibliography
References

Assessing Galaxy’s ability to express scientific workflows in bioinformatics

Peter van Heusden and Alan Christoffels

South African National Bioinformatics Institute
University of the Western Cape

Bellville, South Africa

10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013

Peter van Heusden and Alan Christoffels Assessing Galaxy’s ability to express scientific workflows in bioinformatics


What is bioinformatics?

Bioinformatics is the discipline of solving problems in biology and medicine using computational resources.

Within bioinformatics, biological sequence analysis (BSA) describes those analyses that “infer biological information from sequence alone” (Durbin, 1998).

The cost of biological sequence analysis has two parts:

1. Cost of acquiring sequence
2. Cost of analysing sequence


Cost of acquiring sequence

[Figure: cost of acquiring sequence] (Wetterstrand, 2013)

Cost of analysing sequence

The “sudden reliance on computation has created an ‘informatics crisis’ for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging” (Goecks et al., 2010)

As the cost of sequencing plummets, analysis faces two challenges:

1. Growing data volume demands more sophisticated computational approaches
2. Translating biological questions into computational workflows remains difficult


How do we do bioinformatics?

Given a set of protein sequences from species A, which genes from species B produce similar proteins, and where are these genes located on the genome of B?

Analysis proceeds (Stevens et al., 2001) using:

1. Collections of data objects
2. Transformers that generate new collections (e.g. transform a collection of proteins into a collection of genome regions that they match)
3. Filters (e.g. discard low quality matches to the genome)
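The three building blocks above can be sketched in a few lines of Python. This is only an illustration: the `Match` record, the `toy_aligner` stand-in and its data are invented here and do not come from any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Match:
    """Hypothetical record: one protein-to-genome match."""
    protein_id: str
    region: str    # location on the genome of species B
    score: float   # match quality; higher is better


def transform(proteins: Iterable[str],
              aligner: Callable[[str], List[Match]]) -> List[Match]:
    """Transformer: turn a collection of proteins into a collection of matches."""
    return [m for p in proteins for m in aligner(p)]


def keep_good(matches: Iterable[Match], min_score: float) -> List[Match]:
    """Filter: discard low quality matches."""
    return [m for m in matches if m.score >= min_score]


# Toy stand-in for a real aligner; the hits are invented for illustration.
TOY_HITS = {
    "protA1": [Match("protA1", "chr1:100-400", 0.95),
               Match("protA1", "chr2:50-90", 0.20)],
    "protA2": [Match("protA2", "chr3:700-990", 0.80)],
}


def toy_aligner(protein_id: str) -> List[Match]:
    return TOY_HITS.get(protein_id, [])


proteins = ["protA1", "protA2"]                      # collection
regions = keep_good(transform(proteins, toy_aligner), min_score=0.5)
```

The question from the slide then reduces to one expression: transform the protein collection, then filter the result.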


How do we do bioinformatics? (2)

Data collections typically exist as (compressed) files

Bioinformatics tools typically are command line executables that accept and generate files (often using ad-hoc formats)

Scripting languages (Perl, Python) are used to compose workflows; APIs are often used for reading and writing file formats

1. Workflow enactment often involves manual steps and is closely tied to the execution environment
2. The workflow is neither easily reproducible nor reusable


Scientific workflow management systems

Scientific workflow management systems (SciWMS) have been proposed as an alternative to current script-based approaches to analysis workflows.

SciWMSs “provide a high-level declarative way of specifying what a particular in silico experiment modelled by a workflow is set to achieve, not how it will be executed.” (Taverna project, 2009)

Workflow descriptions resemble dataflow languages (McPhillips et al., 2009)
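A minimal sketch of the "what, not how" idea, assuming Python 3.9+ for the standard `graphlib` module. The workflow is written as data (steps and their dependencies), and a small enactor derives the execution order from the dataflow. The step names and the `enact` function are hypothetical and are not taken from Taverna, Galaxy or any other SciWMS.

```python
from graphlib import TopologicalSorter

# Declarative description: each step lists the steps whose outputs it needs.
workflow = {
    "load":   {"needs": [],                 "run": lambda: ["ACGT", "AC", "ACGTACGT"]},
    "filter": {"needs": ["load"],           "run": lambda seqs: [s for s in seqs if len(s) >= 4]},
    "count":  {"needs": ["filter"],         "run": lambda seqs: len(seqs)},
    "total":  {"needs": ["filter"],         "run": lambda seqs: sum(len(s) for s in seqs)},
    "report": {"needs": ["count", "total"], "run": lambda n, t: f"{n} sequences, {t} bases"},
}


def enact(spec):
    """Run every step after its dependencies; the order is derived, not written."""
    order = TopologicalSorter({name: step["needs"] for name, step in spec.items()})
    results = {}
    for name in order.static_order():
        step = spec[name]
        results[name] = step["run"](*(results[dep] for dep in step["needs"]))
    return results


results = enact(workflow)
```

Because `count` and `total` depend only on `filter`, an enactor is free to run them in either order or in parallel; the description itself does not say.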


The promise of SciWMSs

In addition to workflow specification, SciWMSs sometimes offer:

Types that model objects of scientific domain

Recording of provenance of data objects

Execution of scientific workflows on diverse computing environments (desktop, cluster, grid, cloud)
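As an illustration of what recording provenance can mean in practice, the sketch below wraps a tool call and stores which tool ran, with which parameters, on which inputs (identified by content hash). The record layout and the `run_with_provenance` helper are assumptions made for this example; no particular SciWMS stores provenance in this form.

```python
import hashlib


def digest(data: bytes) -> str:
    """Identify a data object by the SHA-256 hash of its content."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(tool_name, func, inputs: dict, params: dict):
    """Apply func to the inputs and return (output, provenance record)."""
    output = func(*inputs.values(), **params)
    record = {
        "tool": tool_name,
        "params": params,
        "inputs": {name: digest(data) for name, data in inputs.items()},
        "output": digest(output),
    }
    return output, record


def reverse_complement(seq: bytes, upper: bool = True) -> bytes:
    """Toy tool: reverse complement of a DNA sequence."""
    table = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")
    out = seq.translate(table)[::-1]
    return out.upper() if upper else out


out, prov = run_with_provenance(
    "reverse_complement", reverse_complement, {"seq": b"AACG"}, {"upper": True}
)
```

Given such records, a later reader can ask which tool and parameters produced a data object, and whether an input has changed since.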


SciWMSs for bioinformatics

Many workflow systems have been proposed for use in bioinformatics: Taverna, Kepler, Triana, Bioopera, Mobyle, BiosFlow, bpipe

Some workflow features are also available in Galaxy


Support for workflow patterns
Scientific Data Modelling
Workflow representation and use

What is Galaxy

Galaxy emerged in 2004/5 as a web interface to bioinformatics tools and data

Galaxy is becoming a common platform through which to “publish” tools and data

More than 30 known public Galaxy servers

36 000 users on main public Galaxy server, 0.8 PB of data


Galaxy as an open-source project

Galaxy consists of c. 250 000 lines of (mostly Python) code

Core team includes 15 developers spread across 4 different institutes

Development is open source and “out in the open” with code hosted on BitBucket, development planning on Trello and mailing lists


Galaxy I


Galaxy II


Galaxy workflow management features

Galaxy allows composition of workflows defined as series of tasks and related dataflow

Allows execution of workflows on local machine or via various job schedulers

Data objects generated in Galaxy have associated provenance information


Limitations of Galaxy as a SciWMS

Limited support for scientific workflow patterns

Type refers to format of data items

Provenance is recorded as attribute of data files

Workflows are not first class objects

Analysis view focuses on individual datasets

Execution engine schedules tasks (with limited support for task collections)

Galaxy can be enriched by drawing on prior research on SciWMSs


Scientific workflow patterns

Analysis of scientific workflows has yielded a set of design patterns used in workflows (Yildiz et al., 2009)

Galaxy workflow language supports sequential dataflow, parallel split and synchronisation

Tool definition language has recently been extended to support multiple instances of task (not workflow) execution with a priori runtime knowledge

Tool authors can signal that input to a tool can be split for parallel execution

No interface between workflow authors and multiple instance support

Support for cancelling an individual task but not an entire workflow

No support for triggering a new thread of activity (restart)
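The patterns named above (parallel split, multiple instances of one task, synchronisation) can be illustrated with the Python standard library. This sketch does not use Galaxy's tool definition language; `split_fasta` and `gc_count` are hypothetical helpers, and the input is a toy FASTA string.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text: str):
    """Split FASTA text into one chunk per record."""
    return [">" + rec for rec in text.split(">") if rec.strip()]


def gc_count(record: str) -> int:
    """Task run once per chunk: count G and C bases in the sequence lines."""
    seq = "".join(record.splitlines()[1:])
    return sum(seq.count(base) for base in "GC")


fasta = ">s1\nGGCA\n>s2\nATAT\n>s3\nCCGG\nGC\n"

chunks = split_fasta(fasta)                        # parallel split
with ThreadPoolExecutor(max_workers=3) as pool:
    per_chunk = list(pool.map(gc_count, chunks))   # multiple instances of one task
total_gc = sum(per_chunk)                          # synchronisation
```

The number of task instances is known as soon as the input has been split, which corresponds to the "a priori runtime knowledge" case.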


Scientific workflow patterns (2)

No support for exclusive choice (e.g. execute a different dataflow path based on different input)

No support for sub-workflows

Galaxy workflow language is “abstraction hating” (Green and Petre, 1996)

Leads to workflow diagrams resembling a bowl of spaghetti for anything but the simplest cases


The Galaxy type system

Galaxy types represent file types

File type does not map simply to semantics

Collection types are not supported, although some types are “splittable” to allow parallel task execution

Workflow parameters are not supported via the type system

Cannot guarantee that a workflow is well-formed

Provenance recording is coarse-grained

What will happen if we update a single element of an input data collection?
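To make "well-formed" concrete: if each step declared the type it consumes and the type it produces, a workflow could be checked before it runs. The sketch below checks a linear chain only. The `Step` class and the type names are illustrative domain types, not Galaxy datatypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    consumes: str   # type of the step's single input
    produces: str   # type of the step's single output


def check_chain(steps):
    """Return type errors for a linear chain of steps (empty list if well-formed)."""
    errors = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            errors.append(
                f"{upstream.name} produces {upstream.produces} "
                f"but {downstream.name} consumes {downstream.consumes}"
            )
    return errors


good = [
    Step("translate", "NucleotideSequence", "ProteinSequence"),
    Step("align", "ProteinSequence", "Alignment"),
]
bad = [good[1], good[0]]   # same steps in the wrong order
```

Here `check_chain(good)` returns an empty list, while `check_chain(bad)` reports the mismatch without executing anything.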


Science questions vs execution plans

Type system could model scientific domain objects (e.g. protein and nucleotide sequences) but ...

Bioinformatics tools do not support standard formats, or support standard formats with quirks

Not clear what information to save from tool output

Experienced bioinformaticists want the opportunity to review “raw output” to explore factors that underpin confidence in analysis

Need to support both recording and reporting of workflow output

Both recording the “raw” output trace and reporting provenance of scientific domain objects are necessary features for a SciWMS


Workflow execution in Galaxy

Internally, workflows are expanded into collections of tasks at execution time

Tasks are executed by backend classes: either locally or via a scheduler

Execution parameters can be set by “dynamic job runners”

Allows e.g. resource requirements of a job to be signalled to the scheduler

Configured using a combination of XML and Python code maintained by the Galaxy administrator

Workflow execution leaves no visible trace in the user interface

At runtime, execution shows individual jobs running

Data objects are grouped by “history”, not associated with a workflow

No support for re-execution of part of a workflow
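The idea behind a dynamic rule can be sketched as a plain function that maps a job's properties to resource requests. This is not Galaxy's rule-function signature or configuration format: the function name, the queue names and the thresholds are all hypothetical.

```python
GIB = 1024 ** 3


def choose_destination(tool_id: str, input_bytes: int) -> dict:
    """Hypothetical rule: pick a queue and memory request from tool and input size."""
    if tool_id == "assembler" or input_bytes > 10 * GIB:
        return {"queue": "bigmem", "memory_gb": 64}
    if input_bytes > 1 * GIB:
        return {"queue": "cluster", "memory_gb": 8}
    return {"queue": "local", "memory_gb": 2}
```

A rule of this kind sees one job at a time, which matches the point above that execution is planned per task rather than per workflow.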


Scope for workflow optimisation

Workflows are dataflow graphs (Johnston et al., 2004)

Knowledge of inputs and types can be used to plan execution efficiently, e.g. pipeline tasks and exploit opportunities for streaming

Collections of data objects and parameter sets can be exploited for automatic parallel enactment of tasks and sub-workflows

Data collections and workflows provide structures for nesting of provenance information

Knowledge of data provenance could support lifecycle management of data products: kept for re-use or discarded as “intermediate products”
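Streaming between pipelined tasks can be illustrated with Python generators: each record flows through all steps before the next one is read, so no intermediate file or list is materialised. This is a sketch of the general technique under that assumption, not a description of how Galaxy executes tools. The step names are hypothetical.

```python
def read_records(lines):
    """Yield (header, sequence) pairs from FASTA lines, one record at a time."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def min_length(records, n):
    """Filter step: pass on only records with at least n bases."""
    return (rec for rec in records if len(rec[1]) >= n)


def lengths(records):
    """Transformer step: replace each sequence by its length."""
    return ((header, len(seq)) for header, seq in records)


lines = iter([">a", "ACGT", "AC", ">b", "AC", ">c", "ACGTACGT"])
pipeline = lengths(min_length(read_records(lines), 5))   # nothing runs yet
result = list(pipeline)                                  # records stream through
```

An enactor that knows the steps accept and produce record streams could connect them this way automatically instead of writing each intermediate collection to disk.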


Conclusion

Bioinformatics faces an “informatics crisis” as the cost to generate sequence has decreased while the cost to compose or reproduce analysis has remained high

Galaxy has emerged as a popular interface to bioinformatics tools and data with workflow management features

Insight from prior research on SciWMSs suggests areas for enhancement:

Support for additional workflow patterns

Extension of type system with support for biological types, collections and parameter sets

Improvement of workflow execution through treating workflows as first class objects with associated optimisation of execution and provenance storage

Currently being pursued as a research agenda at SANBI


Thanks

Workflows for biological sequence analysis are discussed by the “Pipelines collaboration”

Research on SciWMS supported by the MRC and Prof Christoffels

Professor Alan Christoffels


Bibliography I

R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Apr. 1998. ISBN 9780521629713.

J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), 2010.

T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.

W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, Mar. 2004.

T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, May 2009.

R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics. Bioinformatics, 17(2):180–188, Feb. 2001.

Taverna project. Why use workflows?, 2009. URL http://www.taverna.org.uk/introduction/why-use-workflows/.


Bibliography II

K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program (GSP), 2013. URL http://www.genome.gov/sequencingcosts/.

U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages 13:1–13:10, New York, NY, USA, 2009. ACM.
