Paul Fisher
University of Manchester
An Introduction to Web Services and Scientific Workflows
Overview
• Current analysis techniques
• Issues with manual analyses
• Web Services
• Workflows as scientific protocols
• Workflow sharing, re-use, and repurposing in myExperiment
• Service discovery with Feta and BioCatalogue
• Later – Practical session for hands-on
Manual analysis techniques
• Nucleic Acids Research (2009) - over 1170 databases
• Specialist software applications
• Navigating between software resources
• Cut and Paste of data
• Screen scraping of Web pages
• Scripting in Perl / Java / Python / C++
[insert another language here]
Manual Methods of data analysis
Issues in analysis techniques
Navigating through hyperlinks
No explicit methods
Human error
Tedious and repetitive
Implicit methods
Huge amounts of data
200+ Genes
Region on chromosome
Microarray
1000+ Genes
How do I look at ALL the genes systematically?
Hypothesis-Driven Analyses
200 genes
Pick the genes involved in immunological process
40 genes
Pick the genes that I am most familiar with
2 genes
Biased view
‘Cherry Pick’ genes
Issues with current approaches
• Scale of analysis task overwhelms researchers – lots of data
• User bias and premature filtering of datasets – cherry picking
• Hypothesis-Driven approach to data analysis
• Constant changes in data - problems with re-analysis of data
• Implicit methodologies (hyper-linking through web pages)
• Error proliferation from any of the listed issues – notably human error
Solution: Automate
Automate using the Two W’s
• Web Services
– Technology and standard for exposing code and data resources by a means that can be consumed remotely by a third party
– Describes how to interact with it, e.g. service parameters
• Workflows
– General technique for describing and executing a process
– Describes what you want to do, including the services to use
Web Services
[Diagram: Client Application ↔ Remote Application, exchanging HTTP Request / HTTP Response; labels: SOAP, WSDL]
Web Service Description Language
• Web Service Description Language (WSDL) is used to provide a computer program with enough information on how to execute or provide data to a remote resource
• XML based language
• Can be used for most industry programming languages including Perl, C++, Java
• Tells external programs how to call a remote service – exposes function calls
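A minimal sketch of what a SOAP call described by a WSDL looks like on the wire, using only the Python standard library. The service namespace, the operation name and the parameter name are hypothetical, chosen for illustration; nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/seqfetch"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Build a SOAP 1.1 request envelope for one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_soap_call(message: bytes):
    """Recover the operation name and its parameters from an envelope."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_NS}}}Body")[0]  # first element inside the Body
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


if __name__ == "__main__":
    message = build_soap_request("getSequence", {"accession": "X12345"})
    print(message.decode("utf-8"))
    print(read_soap_call(message))
```

In practice a client library would read the WSDL and generate this envelope for you; the point is only that the function call and its parameters travel as XML inside an HTTP request.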
Programmatic Interfaces to Services (Web Services not Web Sites)
[Diagram: Your Script, Your Workflow and Your Application; Service Registry; Web Services (SeqFetch Service, BLAT Service, BLAST Service, GO Service); Interface Description Document (WSDL, WADL)]
What types of service?
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
Workflows
• Collection of tasks chained together to perform one overall operation – e.g. the ‘morning ritual’ workflow
1. Get up
2. Have a wash
3. Get dressed
4. Eat breakfast
5. Clean teeth
6. Go to lectures
• High level description of your experiment
– Inputs, programs, outputs (and intermediate inputs and outputs)
• Workflow is the model of experiment
– Methods section in your publication
What is a Workflow?
• Workflows provide a general technique for describing and enacting a process
• Describes what you want to do, and how you want to do it
• Specifies how bioinformatics processes fit together
• Processes are represented as web services
[Diagram: Sequence → RepeatMasker Web Service (remove repeats) → GenScan Web Service (find genes) → Blast Web Service (find orthologues) → Predicted Genes out]
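The chain above can be sketched as a list of named steps, each feeding its output to the next. The three functions are toy stand-ins (assumptions for illustration), not the real RepeatMasker, GenScan or Blast services; in a real workflow each step would be a web service call.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any) -> Dict[str, Any]:
    """Run steps in order, feeding each output to the next step.

    Returns every intermediate output keyed by step name.
    """
    outputs: Dict[str, Any] = {}
    for name, step in steps:
        data = step(data)
        outputs[name] = data
    return outputs


# Toy stand-ins for the remote services (hypothetical behaviour).
def remove_repeats(sequence: str) -> str:
    # Mask one fixed repeat motif with N characters.
    return sequence.replace("ATATAT", "NNNNNN")


def find_genes(sequence: str) -> List[str]:
    # Pretend every unmasked stretch is a predicted gene.
    return [segment for segment in sequence.split("N") if segment]


def find_orthologues(genes: List[str]) -> Dict[str, int]:
    # Pretend the "hit" for each gene is simply its length.
    return {gene: len(gene) for gene in genes}


workflow: List[Step] = [
    ("Remove repeats", remove_repeats),
    ("Find genes", find_genes),
    ("Find orthologues", find_orthologues),
]

if __name__ == "__main__":
    for name, output in run_workflow(workflow, "GATTACAATATATCCGG").items():
        print(name, "->", output)
```

Keeping the intermediate outputs is deliberate: they are part of the record of the experiment, much like a methods section.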
Kepler
Triana
BPEL
Ptolemy II
Taverna
Pipeline Pilot
The Taverna Workflow Workbench
What is Taverna?
“Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments”. – Someone (sometime)
OR
“Allows you to build and run workflows”. – Paul Fisher (2009)
• Access to local and remote resources and analysis tools
• Automation of data flow between services
• Iteration over large data sets
… and so on
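A rough sketch of what "iteration over large data sets" can mean in practice: a service written for a single item is applied to every item of a list, however deeply nested. The helper name and the GC-content "service" are assumptions for illustration, not Taverna code.

```python
from typing import Any, Callable


def call_with_iteration(service: Callable[[Any], Any], value: Any) -> Any:
    """Apply a single-item service to one item, or to every item of a list."""
    if isinstance(value, list):
        return [call_with_iteration(service, item) for item in value]
    return service(value)


def gc_content(sequence: str) -> float:
    """Toy single-item service: fraction of G and C bases in a sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


if __name__ == "__main__":
    print(call_with_iteration(gc_content, "ATGC"))
    print(call_with_iteration(gc_content, ["GGCC", "ATAT", "ATGC"]))
```

The service itself never needs to know whether it was handed one sequence or thousands.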
http://www.mygrid.org.uk/
Taverna Workflow Workbench
Who uses Taverna?
• Over 60,000 downloads
• Systems biology
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
NOT FOR BIOLOGISTS!!!!
Prof. Andy Brass
Designed for informaticians – computer savvy people
What do Scientists use Taverna for?
• Data gathering and annotating
– Distributed data and knowledge
• Building models and knowledge management
– Populating SBML or hypothesis generation
• Data analysis
– Distributed analysis tools and high throughput
Data Gathering
• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
Lots of outputs!!
Annotation Pipelines
• Genome annotation pipelines
• Workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database
• Data warehouse pipelines
• e-Fungi – model organism warehouse
• ISPIDER – proteomics warehouse
• Annotating the up/down regulated genes in a microarray experiment
Systems Biology Model Construction
Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences to create and manipulate Systems Biology Markup Language (SBML) models.
1. Read enzyme names from SBML
2. Query maxdLoad2 using enzyme names
3. Calculate colours based on gene expression level
4. Create new SBML model with new colour nodes
Integration of microarray data with SBML
Data Analysis
• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Trypanosomiasis in Africa
Andy Brass, Steve Kemp + many others
A Systematic Strategy for Large-Scale Analysis of Genotype-Phenotype Correlations: Identification of candidate genes involved in African Trypanosomiasis
Fisher et al., (2007) Nucleic Acids Research doi:10.1093/nar/gkm623
• Explicitly discusses the methods we used for the Trypanosomiasis use case
• Discussion of the results for Daxx and shows mutation
• Sharing of workflows for re-use, re-purposing
Trichuris muris
• Mouse whipworm infection – parasite model of the human parasite Trichuris trichiura
• Understanding phenotype
– Comparing resistant vs. susceptible strains – microarrays
• Understanding genotype
– Mapping quantitative traits – classical genetics QTL (regions of chromosome)
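One simple way to combine the two evidence sources systematically, rather than cherry-picking, is to keep every gene that both lies in the QTL region and changes expression on the microarray. This is an illustrative sketch only, not the published workflow; the gene identifiers and the log2 fold-change threshold are assumptions.

```python
from typing import Dict, Iterable, List


def candidate_genes(
    qtl_genes: Iterable[str],
    expression_changes: Dict[str, float],
    min_abs_log2_fold: float = 1.0,  # assumed cut-off, chosen for illustration
) -> List[str]:
    """Return genes that lie in the QTL region and also change expression."""
    changed = {
        gene
        for gene, log2_fold in expression_changes.items()
        if abs(log2_fold) >= min_abs_log2_fold
    }
    return sorted(set(qtl_genes) & changed)


if __name__ == "__main__":
    qtl = ["geneA", "geneB", "geneC"]  # hypothetical genes in the QTL region
    microarray = {"geneB": 2.1, "geneC": 0.2, "geneD": -3.0}  # hypothetical log2 fold changes
    print(candidate_genes(qtl, microarray))
```

Every gene is treated the same way, so no candidate is dropped because the analyst happens not to recognise it.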
Recycling, Reuse, Repurposing
Here’s the Science!
• Identified a candidate gene (Daxx) for Trypanosomiasis resistance.
• Manual analysis on the microarray and QTL data failed to identify this gene as a candidate.
• Unbiased analysis. Confirmed by the wet lab.
Here’s the e-Science!
• Trypanosomiasis mouse workflow reused without change in Trichuris muris infection in mice
• Identified biological pathways involved in sex dependence
• Previous manual two year study of candidate genes had failed to do this.
Workflows now being run over Colitis/ Inflammatory Bowel Disease in Mice (without change)
Was the Workflow Approach Successful?
• Scale of analysis task overwhelms researchers – lots of data
– Handled by computers
• User bias and premature filtering of datasets – cherry picking
– All data processed systematically
• Hypothesis-Driven approach to data analysis
– Computers know nothing of hypotheses and so process the data independent of any prior judgments
• Constant changes in data - problems with re-analysis of data
– Saved workflow can be re-run at any point, over new data sets
• Implicit methodologies (hyper-linking through web pages)
– Methodology has been captured in the workflow itself
Social Networking for Scientists
Recycling, Reuse, Repurposing
http://www.myexperiment.org/
• Share
• Search
• Re-use
• Re-purpose
• Execute
• Communicate
• Record
Sharing Experiments
• myGrid supports the in silico experimental process for individual scientists
• How do you share your results/experiments/experiences with your
– Research group
– Collaborators
– Scientific community
• How do you compare your results with others produced by e.g. Kepler / Triana?
Just Enough Sharing….
• myExperiment can provide a central location for workflows from one community/group
• myExperiment allows you to say
– Who can look at your workflow
– Who can download your workflow
– Who can modify your workflow
– Who can run your workflow
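The four "who can" questions above can be modelled as one set of users per action. This is a hypothetical sketch of the idea, not myExperiment's actual data model or API; the class, method and user names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

ACTIONS = ("view", "download", "modify", "run")


@dataclass
class SharedWorkflow:
    """A workflow with one set of permitted users per action."""

    title: str
    owner: str
    permissions: Dict[str, Set[str]] = field(
        default_factory=lambda: {action: set() for action in ACTIONS}
    )

    def grant(self, user: str, action: str) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.permissions[action].add(user)

    def allowed(self, user: str, action: str) -> bool:
        # The owner may always do everything; others need an explicit grant.
        return user == self.owner or user in self.permissions.get(action, set())


if __name__ == "__main__":
    workflow = SharedWorkflow(title="QTL analysis", owner="paul")
    workflow.grant("andy", "view")
    print(workflow.allowed("andy", "view"), workflow.allowed("andy", "modify"))
```

Separating the actions is what makes "just enough sharing" possible: a collaborator can be allowed to run a workflow without being able to change it.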
Remote Execution of Workflows
Service Discovery
Finding Services
There are over 3500 distributed services. How do we find an appropriate one?
• We need to annotate services by their functions (and not their names!)
• The services might be distributed, but a registry of service descriptions can be central and queried
• Annotated with terms from the myGrid ontology
• Questions we can ask: Find me all the services that perform a multiple sequence alignment and accept protein sequences in FASTA format as input
myGrid Ontology
Logically separated into two parts:
• Service ontology
– Physical and operational features of web services
• Domain ontology
– Vocabulary for core bioinformatics data, data types and their relationships
Ontology developed in OWL
myGrid ontology
• Example: BLAST (from the DDBJ)
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
• biological sequence
• database name
• blast program
– Outputs: Blast Report
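A minimal sketch of a central registry queried by function rather than by name, using the BLAST annotation above as one entry. The second entry, the class and the function names are hypothetical; this is not the Feta API, and the plain strings stand in for real ontology terms.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ServiceAnnotation:
    name: str
    task: str
    method: str
    resources: Tuple[str, ...]
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


REGISTRY = [
    # Taken from the BLAST example on the slide.
    ServiceAnnotation(
        name="BLAST (DDBJ)",
        task="Alignment",
        method="Similarity Search Algorithm",
        resources=("DNA/Protein sequence databases",),
        inputs=("biological sequence", "database name", "blast program"),
        outputs=("Blast Report",),
    ),
    # Hypothetical entry, added only so the query has something to exclude.
    ServiceAnnotation(
        name="ExampleSeqFetch",
        task="Retrieval",
        method="Database lookup",
        resources=("sequence database",),
        inputs=("accession number",),
        outputs=("biological sequence",),
    ),
]


def find_services(
    registry: List[ServiceAnnotation],
    task: Optional[str] = None,
    accepts: Optional[str] = None,
) -> List[ServiceAnnotation]:
    """Return services that perform `task` and accept `accepts` as an input."""
    matches = []
    for service in registry:
        if task is not None and service.task != task:
            continue
        if accepts is not None and accepts not in service.inputs:
            continue
        matches.append(service)
    return matches


if __name__ == "__main__":
    for service in find_services(REGISTRY, task="Alignment", accepts="biological sequence"):
        print(service.name)
```

A real registry would match against ontology terms and their subclasses rather than exact strings, which is what lets a query for "alignment" also find more specific kinds of alignment service.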
Feta Search Result
BioCatalogue (Joint Manchester-EBI)
[Diagram: Curation by Experts, Curation by the Community, Curation by Developers and Automated Curation, linked by seed / refine / validate steps]
Summary
Taverna workflows:
• Combine local and remote resource and analysis tools
• Automate multi-step processes
• Iterate over large data sets
myExperiment
• Provides reusable protocols for in silico science
• Enables sharing of workflows and expertise
• Provides an alternative way of running workflows
– Not everyone who runs workflows wants to build workflows or see workflows running
Acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
• Funding EPSRC, Wellcome Trust.
http://www.mygrid.org.uk
http://www.myexperiment.org