Paul Fisher

University of Manchester

An Introduction to Web Services and Scientific Workflows

Overview

• Current analysis techniques
• Issues with manual analyses
• Web Services
• Workflows as Scientific protocols
• Workflow sharing, re-use, and repurposing in myExperiment
• Service discovery with Feta and BioCatalogue

• Later – Hands-on practical session

Manual analysis techniques

• Nucleic Acids Research (2009) - over 1170 databases

• Specialist software applications

• Navigating between software resources

• Cut and Paste of data

• Screen scraping of Web pages

• Scripting in Perl / Java / Python / C++

[insert another language here]

Manual Methods of data analysis

Issues in analysis techniques

Navigating through hyperlinks

No explicit methods

Human error

Tedious and repetitive

Implicit methods

Huge amounts of data

200+ Genes

Region on chromosome

Microarray

1000+ Genes

How do I look at ALL the genes systematically?

Hypothesis-Driven Analyses

200 genes

Pick the genes involved in immunological process

40 genes

Pick the genes that I am most familiar with

2 genes

Biased view

‘Cherry Pick’ genes

Issues with current approaches

• Scale of analysis task overwhelms researchers – lots of data

• User bias and premature filtering of datasets – cherry picking

• Hypothesis-Driven approach to data analysis

• Constant changes in data - problems with re-analysis of data

• Implicit methodologies (hyper-linking through web pages)

• Error proliferation from any of the listed issues – notably human error

Solution: Automate

Automate using the Two W’s

• Web Services
– Technology and standard for exposing code and data resources by a means that can be consumed by a third party remotely
– Describes how to interact with it, e.g. service parameters
• Workflows
– General technique for describing and executing a process
– Describes what you want to do, including the services to use

Web Services

[Diagram: a Client Application exchanges HTTP Requests and HTTP Responses with a Remote Application, using SOAP messages and a WSDL description]

Web Service Description Language

• Web Service Description Language (WSDL) is used to provide a computer program with enough information on how to execute or provide data to a remote resource

• XML based language

• Can be used for most industry programming languages including Perl, C++, Java

• Tells external programs how to call a remote service – exposes function calls
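As a rough illustration of what a WSDL document exposes, the sketch below parses a small, made-up WSDL fragment with the Python standard library and lists the operations it declares. The names `SeqFetchService`, `SeqFetchPortType`, `fetchSequence`, and `listDatabases`, and the namespace `http://example.org/seqfetch`, are assumptions invented for this example; they do not describe a real service.

```python
import xml.etree.ElementTree as ET

# Namespace used by WSDL 1.1 documents.
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Hypothetical WSDL fragment: all names and the target namespace are invented.
EXAMPLE_WSDL = """<definitions name="SeqFetchService"
             targetNamespace="http://example.org/seqfetch"
             xmlns="http://schemas.xmlsoap.org/wsdl/">
  <portType name="SeqFetchPortType">
    <operation name="fetchSequence"/>
    <operation name="listDatabases"/>
  </portType>
</definitions>
"""


def list_operations(wsdl_text):
    """Return {portType name: [operation names]} for a WSDL 1.1 document."""
    root = ET.fromstring(wsdl_text)
    operations = {}
    for port_type in root.findall(f"{{{WSDL_NS}}}portType"):
        names = [
            op.get("name")
            for op in port_type.findall(f"{{{WSDL_NS}}}operation")
        ]
        operations[port_type.get("name")] = names
    return operations


print(list_operations(EXAMPLE_WSDL))
```

A client toolkit reads the same information to generate the function calls a program can make against the remote service.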

Programmatic Interfaces to Services (Web Services not Web Sites)

[Diagram: Your Script, Your Workflow, and Your Application connect to a Service Registry and to Web Services (SeqFetch Service, BLAT Service, BLAST Service, GO Service), each described by an Interface Description Document (WSDL or WADL)]

What types of service?

• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows

Workflows

• Collection of tasks chained together to perform one overall operation – e.g. the ‘morning ritual’ workflow

1. Get up

2. Have a wash

3. Get dressed

4. Eat breakfast

5. Clean teeth

6. Go to lectures

• High level description of your experiment
– Inputs, programs, outputs (and intermediate inputs and outputs)

• Workflow is the model of experiment
– Methods section in your publication
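The idea of a workflow as a chain of tasks, with its inputs, intermediate values, and outputs all recorded, can be sketched in a few lines of Python. The step functions below are placeholders invented for illustration; in a real workflow each step would typically be a service call.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def run_workflow(steps: List[Step], workflow_input: str):
    """Run the steps in order, feeding each output into the next step.

    Returns the final output and a log of (step name, input, output),
    i.e. the intermediate inputs and outputs of the experiment.
    """
    log = []
    value = workflow_input
    for name, task in steps:
        result = task(value)
        log.append((name, value, result))
        value = result
    return value, log


# Hypothetical steps standing in for real analysis tools.
steps = [
    ("trim whitespace", str.strip),
    ("upper-case", str.upper),
    ("reverse", lambda s: s[::-1]),
]

final_output, log = run_workflow(steps, "  acgt \n")
for name, step_input, step_output in log:
    print(f"{name}: {step_input!r} -> {step_output!r}")
```

The log is what makes the method explicit: it records which step ran on which data, in which order.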

What is a Workflow?

• Workflows provide a general technique for describing and enacting a process

• Describes what you want to do, and how you want to do it

• Specifies how bioinformatics processes fit together

• Processes are represented as web services

[Diagram: Sequence in → RepeatMasker Web Service (remove repeats) → GenScan Web Service (find genes) → Blast Web Service (find orthologues) → Predicted genes out]

Kepler

Triana

BPEL

Ptolemy II

Taverna

Pipeline Pilot

The Taverna Workflow Workbench

What is Taverna?

“Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments”. – Someone (sometime)

OR

“Allows you to build and run workflows”. – Paul Fisher (2009)

• Access to local and remote resources and analysis tools
• Automation of data flow between services
• Iteration over large data sets
… and so on
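Iteration over a data set can be pictured as applying a single-item service to every element of a list. The sketch below is plain Python and does not use any Taverna API; `gc_content` is a stand-in, invented for this example, for a remote analysis service.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in one sequence (stand-in for a service)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def iterate(service, items):
    """Apply a single-item service to every item in a list."""
    return [service(item) for item in items]


sequences = ["GGCC", "ATAT", "ACGT"]
print(iterate(gc_content, sequences))
```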

http://www.mygrid.org.uk/

Taverna Workflow Workbench

Who uses Taverna?

• Over 60,000 downloads
• Systems biology
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics

NOT FOR BIOLOGISTS!!!!

Prof. Andy Brass

Designed for informaticians – computer savvy people

What do Scientists use Taverna for?

• Data gathering and annotating
– Distributed data and knowledge

• Building models and knowledge management
– Populating SBML or hypothesis generation

• Data analysis
– Distributed analysis tools and high throughput

Data Gathering

• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt

Lots of outputs!!

Annotation Pipelines

• Genome annotation pipelines
• Workflow assembles evidence for predicted genes / potential functions

• Human expert can ‘review’ this evidence before submission to the genome database

• Data warehouse pipelines
• e-Fungi – model organism warehouse
• ISPIDER – proteomics warehouse

• Annotating the up/down regulated genes in a microarray experiment

Systems Biology Model Construction

Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences to create and manipulate Systems Biology Markup Language (SBML) models.

Read enzyme names from SBML

Query maxdLoad2 using enzyme names

Calculate colours based on gene expression level

Create new SBML model with new colour nodes

Integration of microarray data with SBML
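The "calculate colours based on gene expression level" step could look something like the sketch below. The colour scheme (white for unchanged, red for up-regulated, blue for down-regulated, saturating at a log2 fold change of ±2) and the enzyme names are assumptions made for this example; the slides do not say which scheme the workflow used.

```python
def expression_to_colour(log2_fold_change: float, limit: float = 2.0) -> str:
    """Map a log2 fold change to a hex colour string.

    Assumed scheme (not from the slides): white = unchanged,
    red = up-regulated, blue = down-regulated, saturating at +/- limit.
    """
    clipped = max(-limit, min(limit, log2_fold_change))
    intensity = abs(clipped) / limit        # 0.0 (unchanged) to 1.0 (saturated)
    fade = round(255 * (1.0 - intensity))   # 255 = white, 0 = full colour
    if clipped >= 0:
        r, g, b = 255, fade, fade
    else:
        r, g, b = fade, fade, 255
    return f"#{r:02x}{g:02x}{b:02x}"


# Hypothetical expression levels keyed by enzyme name.
expression = {"enzyme_1": 2.0, "enzyme_2": 0.0, "enzyme_3": -5.0}
colours = {name: expression_to_colour(level) for name, level in expression.items()}
print(colours)
```

The resulting colour per enzyme is what would be written back onto the matching nodes of the new SBML model.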

Data Analysis

• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Trypanosomiasis in Africa

Andy Brass

Steve Kemp

+ many others

A Systematic Strategy for Large-Scale Analysis of Genotype-Phenotype Correlations: Identification of candidate genes involved in African Trypanosomiasis

Fisher et al., (2007) Nucleic Acids Research doi:10.1093/nar/gkm623

• Explicitly discusses the methods we used for the Trypanosomiasis use case

• Discusses the results for Daxx and shows the mutation

• Sharing of workflows for re-use, re-purposing

Trichuris muris

• Mouse whipworm infection - parasite model of the human parasite Trichuris trichiura

• Understanding Phenotype
– Comparing resistant vs. susceptible strains – Microarrays

• Understanding Genotype
– Mapping quantitative traits – Classical genetics QTL (regions of chromosome)
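One way to picture the systematic combination of genotype and phenotype evidence is as a set intersection: every gene in the QTL region is kept, every differentially expressed gene is kept, and only then are the two lists compared. The gene identifiers, positions, fold changes, and threshold below are made up for illustration, and the sketch is a simplification rather than the published workflow.

```python
def genes_in_region(gene_positions, chromosome, start, end):
    """Return every gene located inside the given chromosome region."""
    return {
        gene
        for gene, (chrom, position) in gene_positions.items()
        if chrom == chromosome and start <= position <= end
    }


def differentially_expressed(log2_fold_changes, threshold=1.0):
    """Return every gene whose absolute log2 fold change meets the threshold."""
    return {
        gene
        for gene, change in log2_fold_changes.items()
        if abs(change) >= threshold
    }


# Hypothetical data: gene -> (chromosome, base-pair position).
gene_positions = {
    "geneA": ("17", 1_200_000),
    "geneB": ("17", 3_400_000),
    "geneC": ("5", 900_000),
    "geneD": ("17", 2_000_000),
}
# Hypothetical microarray result: gene -> log2 fold change.
log2_fold_changes = {"geneA": 1.8, "geneB": 0.2, "geneC": -2.1, "geneD": -1.3}

qtl_genes = genes_in_region(gene_positions, "17", 1_000_000, 3_000_000)
expressed_genes = differentially_expressed(log2_fold_changes)
candidates = qtl_genes & expressed_genes
print(sorted(candidates))
```

No gene is dropped because it is unfamiliar; a gene leaves the candidate list only when the data exclude it.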

Recycling, Reuse, Repurposing

Here’s the Science!

• Identified a candidate gene (Daxx) for Trypanosomiasis resistance.

• Manual analysis on the microarray and QTL data failed to identify this gene as a candidate.

• Unbiased analysis. Confirmed by the wet lab.

Here’s the e-Science!

• Trypanosomiasis mouse workflow reused without change in Trichuris muris infection in mice

• Identified biological pathways involved in sex dependence

• A previous two-year manual study of candidate genes had failed to do this.

Workflows now being run over Colitis/ Inflammatory Bowel Disease in Mice (without change)

Was the Workflow Approach Successful?

• Scale of analysis task overwhelms researchers – lots of data
– Handled by computers

• User bias and premature filtering of datasets – cherry picking
– All data processed systematically

• Hypothesis-Driven approach to data analysis
– Computers know nothing of hypotheses and so process the data independent of any prior judgments

• Constant changes in data - problems with re-analysis of data
– Saved workflow can be re-run at any point, over new data sets

• Implicit methodologies (hyper-linking through web pages)
– Methodology has been captured in the workflow itself

Social Networking for Scientists

Recycling, Reuse, Repurposing

http://www.myexperiment.org/

• Share

• Search

• Re-use

• Re-purpose

• Execute

• Communicate

• Record

Sharing Experiments

• myGrid supports the in silico experimental process for individual scientists

• How do you share your results/experiments/experiences with your
– Research group
– Collaborators
– Scientific community

• How do you compare your results with others produced by e.g. Kepler / Triana?

Just Enough Sharing….

• myExperiment can provide a central location for workflows from one community/group

• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow

Remote Execution of Workflows

Service Discovery

Finding Services

There are over 3500 distributed services. How do we find an appropriate one?

• We need to annotate services by their functions (and not their names!)

• The services might be distributed, but a registry of service descriptions can be central and queried

• Annotated with terms from the myGrid ontology
• Questions we can ask: Find me all the services that perform a multiple sequence alignment and accept protein sequences in FASTA format as input
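A query of this kind can be sketched as a filter over annotated service descriptions. The registry entries, field names, and annotation terms below are invented for illustration; they are not the Feta or BioCatalogue API, nor actual terms from the myGrid ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """A service annotated by what it does, not by what it is called."""
    name: str
    task: str
    input_type: str
    input_format: str


# Hypothetical central registry of descriptions for distributed services.
REGISTRY = [
    ServiceDescription("svc_alpha", "multiple sequence alignment", "protein sequence", "FASTA"),
    ServiceDescription("svc_beta", "multiple sequence alignment", "DNA sequence", "FASTA"),
    ServiceDescription("svc_gamma", "similarity search", "protein sequence", "FASTA"),
    ServiceDescription("svc_delta", "multiple sequence alignment", "protein sequence", "GenBank"),
]


def find_services(registry, task, input_type, input_format):
    """Return the names of services whose annotations match all three terms."""
    return [
        service.name
        for service in registry
        if service.task == task
        and service.input_type == input_type
        and service.input_format == input_format
    ]


print(find_services(REGISTRY, "multiple sequence alignment", "protein sequence", "FASTA"))
```

The service names carry no meaning here on purpose: the match is made on the annotations alone.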

myGrid Ontology

Logically separated into two parts:
• Service ontology
– Physical and operational features of web services
• Domain ontology
– Vocabulary for core bioinformatics data, data types and their relationships

Ontology developed in OWL

myGrid ontology

• Example: BLAST (from the DDBJ)
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
• biological sequence
• database name
• blast program
– Outputs: Blast Report

Feta Search Result

BioCatalogue (joint Manchester-EBI)

[Diagram: service descriptions are seeded, refined, and validated through Curation by Developers, Curation by Experts, Curation by the Community, and Automated Curation]

Summary

Taverna workflows:
• Combine local and remote resources and analysis tools
• Automate multi-step processes
• Iterate over large data sets

myExperiment:
• Provides reusable protocols for in silico science
• Enables sharing of workflows and expertise
• Provides an alternative way of running workflows
– Not everyone who runs workflows wants to build workflows or see workflows running

Acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

http://www.mygrid.org.uk

http://www.myexperiment.org