06.10.2007

Taverna: A Workbench for the Design and Execution of Scientific Workflows

Katy Wolstencroft, myGrid

University of Manchester

What is Taverna?

Taverna enables interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments

• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets

What is a workflow?

• Business process workflows – mouse tracking systems
  – Tasks, schedules, dependencies (on staff time), and costs

• Scientific workflows – on in silico data
  – Data throughput, dependencies (on analysis results)
  – Input, algorithm, output
  – Flow of information, scheduling of order, collection of results, intermediate results and provenance

• High-level description of your experiment
• Workflow is the model of the experiment
  – Methods section in your publication
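
As a rough illustration of the "input, algorithm, output" view above, a scientific workflow can be sketched as an ordered list of steps whose intermediate results are recorded as provenance. This is a minimal Python sketch under stated assumptions, not Taverna's own model or API: the names `Step`, `Workflow`, `reverse_complement` and `gc_fraction` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One unit of analysis: a name plus a function from input to output."""
    name: str
    func: Callable[[Any], Any]


@dataclass
class Workflow:
    """Runs steps in order and records every intermediate result."""
    steps: list[Step]
    provenance: list[dict] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        for step in self.steps:
            result = step.func(data)
            # Provenance: which step ran, what it consumed, what it produced.
            self.provenance.append(
                {"step": step.name, "input": data, "output": result}
            )
            data = result  # the output of one step feeds the next
        return data


def reverse_complement(seq: str) -> str:
    """Hypothetical analysis step: reverse-complement a lowercase DNA string."""
    pairs = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_fraction(seq: str) -> float:
    """Hypothetical analysis step: fraction of g/c bases in the sequence."""
    return (seq.count("g") + seq.count("c")) / len(seq)


workflow = Workflow(
    [Step("reverse_complement", reverse_complement), Step("gc_fraction", gc_fraction)]
)
result = workflow.run("acatttctac")

for record in workflow.provenance:
    print(record)
```

The provenance list plays the role of the "methods section": it states which step produced which intermediate result, in the order the steps ran.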

Taverna and myGrid

• myGrid: a suite of components designed to support in silico experiments in biology
• Taverna workbench – main user interface
• Semantic service discovery components
• myGrid ontology for bioinformatics services
• Provenance components
• myGrid provenance ontology

OMII-UK

• University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006

• OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators.

• A guarantee of development and support

[Screenshot of the Taverna workbench. Panel labels: workflow diagram, available services, tree view of workflow structure]

Who Provides the Services?

• Open domain services and resources
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
  – NCBI, DDBJ, EBI …
• Enforce NO common data model
• Quality Web Services considered desirable

What types of service?

• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows

Who uses Taverna?

~38800 downloads
Users worldwide

• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
• Data integration

• ISMB07 – 6 posters, 2 demos, 1 BOF, 1 tutorial

What do Scientists use Taverna for?

• Data gathering and annotating
  – Distributed data and knowledge
• Data analysis
  – Distributed analysis tools and
• Data mining and knowledge management
  – Hypothesis generation and modelling

Data Gathering

• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Case Study – Graves Disease

• Autoimmune disease that causes hyperthyroidism
• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone

• Original myGrid Case Study

Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004

Graves Disease

The experiment:

• Analysing microarray data to determine genes differentially expressed between Graves Disease patients and healthy controls
• Characterising these genes (and any proteins encoded by them) in an annotation pipeline
• From the Affymetrix probeset identifier, extract information about genes encoded in this region
• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement

Annotation Pipeline

Evidence includes:

• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic pathways
• Gene Ontology terms

Output 1

id, strand, position, -, gact-pos, aa-pos
rs11557924,1,31930994,coding,140,47
rs11557921,1,31930996,coding,142,48
rs17855850,1,31931855,coding,1001,334
rs483638,1,31931855,coding,1001,334
rs17854926,1,31931904,coding,1050,350
rs35770727,1,31931925,coding,1071,357
rs34963839,1,31931940,coding,1086,362
rs33998554,-1,31932054,coding,1200,400
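
The output above is comma-separated, so it can be read back with ordinary tools. The sketch below is an illustration only, and it rests on assumptions: the column names `region`, `cds_pos` and `aa_pos` are readings of the unlabelled and abbreviated headers, and the relation "amino-acid position = coding position divided by three, rounded up" is inferred from the values shown, not stated in the slides.

```python
import csv
import io

# The rows shown in Output 1, with assumed (hypothetical) column names.
OUTPUT_1 = """\
id,strand,position,region,cds_pos,aa_pos
rs11557924,1,31930994,coding,140,47
rs11557921,1,31930996,coding,142,48
rs17855850,1,31931855,coding,1001,334
rs483638,1,31931855,coding,1001,334
rs17854926,1,31931904,coding,1050,350
rs35770727,1,31931925,coding,1071,357
rs34963839,1,31931940,coding,1086,362
rs33998554,-1,31932054,coding,1200,400
"""


def codon_index(cds_pos: int) -> int:
    """1-based codon number containing a 1-based coding-sequence position."""
    return (cds_pos + 2) // 3  # integer form of ceil(cds_pos / 3)


rows = list(csv.DictReader(io.StringIO(OUTPUT_1)))

# Collect any SNP whose reported amino-acid position disagrees with the
# position implied by its coding-sequence offset.
mismatches = [
    row["id"]
    for row in rows
    if codon_index(int(row["cds_pos"])) != int(row["aa_pos"])
]

print(len(rows), "SNPs read;", len(mismatches), "inconsistent")
```

A consistency check of this kind is the sort of small, repeatable step that fits naturally inside a larger annotation workflow.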

Utopia

Shared Semantics

• Taverna and Utopia share the same semantic model
• Based on the myGrid ontology and the Feta semantic discovery components
• Services in workflows are described using terms from the myGrid ontology
• Service descriptions are queried using Feta
  – Service function, inputs/outputs

An Automatic Annotation Pipeline

• Genome annotation pipelines – workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database

Collaboration with the Bergen Center for Computational Science (computational biology unit) – Gene Prediction in Algal Viruses, a case study. Presented at NETTAB 2005: http://www.nettab.org/2005/docs/NETTAB2005_LanzenPoster.pdf

User Interaction During the workflow

Data Analysis

• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge

Trypanosomiasis in Africa

Andy Brass, Steve Kemp, Paul Fisher

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Trypanosomiasis Study

• A form of sleeping sickness in cattle – known as n’gana
• Caused by Trypanosoma brucei
• Some cattle breeds more resistant than others
• What are the differences between resistant and susceptible cattle?
• Can we breed cattle resistant to n’gana infection?

Trypanosomiasis Study

Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays

Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL

Need to access microarray data, genomic sequence information, pathway databases AND integrate the results

Genotype Phenotype

?

Microarray + QTL

Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping

Genes captured in microarray experiment and present in QTL region

Key:

A – Retrieve genes in QTL region

B – Annotate genes with external database Ids

C – Cross-reference Ids with KEGG gene ids

D – Retrieve microarray data from MaxD database

E – For each KEGG gene get the pathways it’s involved in

F – For each pathway get a description of what it does

G – For each KEGG gene get a description of what it does
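
Steps A to G in the key can be sketched as a chain of lookups. The Python below is only an illustration of the data flow: every identifier and table in it is an invented placeholder standing in for the remote services (QTL region lookup, ID cross-referencing, KEGG, the MaxD database) that the real workflow calls, and `candidate_report` is a hypothetical name.

```python
# Every identifier and lookup table here is an invented placeholder,
# standing in for the remote services a real workflow would call.
QTL_GENES = ["gene1", "gene2", "gene3"]                                # A
EXTERNAL_IDS = {"gene1": "ext:1", "gene2": "ext:2", "gene3": "ext:3"}  # B
KEGG_IDS = {"ext:1": "kegg:g1", "ext:2": "kegg:g2"}                    # C
MICROARRAY_GENES = {"gene2", "gene3", "gene4"}                         # D
GENE_PATHWAYS = {                                                      # E
    "kegg:g1": ["path:p1"],
    "kegg:g2": ["path:p1", "path:p2"],
}
PATHWAY_DESCRIPTIONS = {                                               # F
    "path:p1": "placeholder pathway one",
    "path:p2": "placeholder pathway two",
}
GENE_DESCRIPTIONS = {                                                  # G
    "kegg:g1": "placeholder gene one",
    "kegg:g2": "placeholder gene two",
}


def candidate_report(qtl_genes, microarray_genes):
    """Annotate genes that are both in the QTL region and on the microarray."""
    report = []
    for gene in qtl_genes:
        # Keep only genes captured in the microarray experiment as well.
        if gene not in microarray_genes:
            continue
        # B then C: external database ID, then KEGG gene ID (may be missing).
        kegg_id = KEGG_IDS.get(EXTERNAL_IDS.get(gene))
        if kegg_id is None:
            continue
        # E and F: pathways for the gene, each with its description.
        pathways = {
            pathway: PATHWAY_DESCRIPTIONS[pathway]
            for pathway in GENE_PATHWAYS.get(kegg_id, [])
        }
        report.append(
            {
                "gene": gene,
                "kegg_id": kegg_id,
                "description": GENE_DESCRIPTIONS[kegg_id],  # G
                "pathways": pathways,
            }
        )
    return report


report = candidate_report(QTL_GENES, MICROARRAY_GENES)
for entry in report:
    print(entry)
```

Because every gene passes through the same lookups, no candidate is dropped on the basis of prior expectation; genes fall out only when a lookup has no answer.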

Results

• Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
• Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate.
• Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33

Why was the Workflow Approach Successful?

• Workflow analysed each piece of data systematically
  – Eliminated the user bias and premature filtering of datasets and results that lead to single-sided, expert-driven hypotheses
• The size of the QTL and amount of the microarray data made a manual approach impractical
• Workflows capture exactly where data came from and how it was analysed
• Workflow output produced a manageable amount of data for the biologists to interpret and verify
  – “make sense of this data” -> “does this make sense?”

Trichuris muris (mouse whipworm) infection

(parasite model of the human parasite Trichuris trichiura)

• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

• Manual experimentation: Two year study of candidate genes, processes unidentified

• Workflows: trypanosomiasis cattle experiment was reused without change.

• Analysis of the resulting data by a biologist found the processes in a couple of days.

Joanne Pennock, Richard Grencis, University of Manchester

Prime Minister's Office Thailand Center of Excellence for Life Sciences (TCELS)

Pharmacogenomics project

Wasun Chantratita, Project director

2003->2006

Pharmacogenomics

• Heavy use of R-Statistics for clinical data analysis
• Association study of Nevirapine-induced skin rash in Thai population
• A systemic (bodywide) allergic reaction with a characteristic rash
  – 100 cases: rash
  – 100 cases: no rash controls
  – 10,000 SNPs significantly associated with rash
  – Pathway analysis and systems biology
  – Prioritising SNPs
  – Functional studies
  – Diagnostic tools

Data Mining and Knowledge Integration

My BioAID: Hypothesis Construction from the Literature

Marco Roos and Scott Marshall, Adaptive Information Disclosure, Faculty of Science, University of Amsterdam

• Combines text mining with ontology modelling
• AIDA: text mining and machine learning toolbox

Start with a Proto-Ontology

Small amount of information about a topic of interest

e.g. review article about histones and disease

Mine the literature for more…

The workflow produces OMIM tagged diseases which can be used to enrich the proto-ontology automatically in RDF

TaWeka - Taverna to Weka

• Luna De Ferrari and Igor Goryanin
  – University of Edinburgh
• Combines Taverna workflows with Weka data mining tools
• Produces classifiers for biological knowledge
  – Predict circadian expression of genes in plants
  – Support curation of metabolic pathways
• Promotes sharing and reuse of classifiers because both the data and knowledge workflow are tracked

Sharing Experiments

• myGrid supports the in silico experimental process for individual scientists
• How do you share your results/experiments/experiences with your
  – Research group
  – Collaborators
  – Scientific community
• How do you compare your results with others produced by e.g. Kepler / Triana?

Collaborative, Social Bookmarking

Content Sharing

Application Execution

Social Recommendations

Collaborative, Social Tagging

Summary

Taverna
• Allows interoperation between local and remote resources
• Allows automated access to, or analysis of, sets of data
• Helps with data integration
• Is extensible and open source – for application embedding

myExperiment
• Allows sharing across particular communities
• Provides a central location for publishing/finding useful workflows

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

http://www.mygrid.org.uk

http://www.myexperiment.org

• Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33
