06.10.2007
Taverna: A Workbench for the Design and Execution of Scientific Workflows
Katy Wolstencroft
myGrid
University of Manchester
What is Taverna?
Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments
• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets
What is a workflow?
• Business Process workflows – mouse tracking systems
– Tasks, schedules, dependencies (on staff time), and costs
• Scientific Workflows – on in silico data
– Data throughput, dependencies (on analysis results)
– Input, algorithm, output
– Flow of information, scheduling of order, collection of results, intermediate results and provenance
• High level description of your experiment
• Workflow is the model of experiment
– Methods section in your publication
Taverna and myGrid
• myGrid: a suite of components designed to support in silico experiments in biology
• Taverna workbench – main user interface
• Semantic service discovery components
• myGrid Ontology for bioinformatics services
• Provenance components
• myGrid provenance ontology
OMII-UK
• University of Manchester (myGrid) joined with the Universities of Edinburgh (OGSA-DAI) and Southampton (OMII phase 1) in March 2006
• OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its international collaborators.
• A guarantee of development and support
[Figure panels: workflow diagram; available services; tree view of workflow structure]
Who Provides the Services?
• Open domain services and resources.
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable
What types of service?
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
Who uses Taverna?
~38800 downloads
Users worldwide
• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Genotype/Phenotype studies
• Health Informatics
• Astronomy
• Chemoinformatics
• Data integration
• ISMB07 – 6 posters, 2 demos, 1 BOF, 1 tutorial
What do Scientists use Taverna for?
• Data gathering and annotating
– Distributed data and knowledge
• Data analysis
– Distributed analysis tools and
• Data mining and knowledge management
– Hypothesis generation and modelling
Data Gathering
• Collecting evidence from lots of places
• Accessing local and remote databases, extracting info and displaying a unified view to the user
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Case Study – Graves Disease
• Autoimmune disease that causes hyperthyroidism
• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone
• Original myGrid Case Study
Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004
Graves Disease
The experiment:
• Analysing microarray data to determine genes differentially expressed in Graves Disease patients and healthy controls
• Characterising these genes (and any proteins encoded by them) in an annotation pipeline
• From the Affymetrix probeset identifier, extract information about genes encoded in this region.
• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement
Annotation Pipeline
Evidence includes:
• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic Pathways
• Gene Ontology terms
Output 1
id, strand, position, -, gact-pos, aa-pos
rs11557924,1,31930994,coding,140,47
rs11557921,1,31930996,coding,142,48
rs17855850,1,31931855,coding,1001,334
rs483638,1,31931855,coding,1001,334
rs17854926,1,31931904,coding,1050,350
rs35770727,1,31931925,coding,1071,357
rs34963839,1,31931940,coding,1086,362
rs33998554,-1,31932054,coding,1200,400
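A minimal sketch of how the SNP output above might be loaded and grouped for downstream steps. The column name `region` (for the header shown as "-"), the underscore spellings of the other headers, and the function names are assumptions made for illustration; they are not part of the workflow itself.

```python
import csv
import io

# Rows copied from the output above; header names are normalised for
# illustration ("region" stands in for the column headed "-").
SNP_CSV = """id,strand,position,region,gact_pos,aa_pos
rs11557924,1,31930994,coding,140,47
rs11557921,1,31930996,coding,142,48
rs17855850,1,31931855,coding,1001,334
rs483638,1,31931855,coding,1001,334
rs17854926,1,31931904,coding,1050,350
rs35770727,1,31931925,coding,1071,357
rs34963839,1,31931940,coding,1086,362
rs33998554,-1,31932054,coding,1200,400
"""


def parse_snps(text):
    """Parse the CSV text into a list of dicts with numeric fields converted."""
    snps = []
    for row in csv.DictReader(io.StringIO(text)):
        snps.append({
            "id": row["id"],
            "strand": int(row["strand"]),
            "position": int(row["position"]),
            "region": row["region"],
            "gact_pos": int(row["gact_pos"]),
            "aa_pos": int(row["aa_pos"]),
        })
    return snps


def snps_by_residue(snps):
    """Group SNP ids by the amino-acid position they fall in."""
    grouped = {}
    for snp in snps:
        grouped.setdefault(snp["aa_pos"], []).append(snp["id"])
    return grouped


def codon_consistent(snp):
    """Check that aa_pos equals the nucleotide position divided by 3, rounded up."""
    return (snp["gact_pos"] + 2) // 3 == snp["aa_pos"]


if __name__ == "__main__":
    snps = parse_snps(SNP_CSV)
    for aa_pos, ids in sorted(snps_by_residue(snps).items()):
        print(aa_pos, ids)
```

Grouping by amino-acid position makes it easy to spot residues hit by more than one SNP, which is the kind of evidence the annotation pipeline collects per gene.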
Utopia
Shared Semantics
• Taverna and Utopia share the same semantic model
• Based on the myGrid ontology and the Feta semantic discovery components
• Services in workflows are described using terms from the myGrid ontology
• Service descriptions are queried using Feta
– Service function, inputs/outputs
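As a rough illustration of querying service descriptions by function and by inputs/outputs, the sketch below filters a small in-memory registry. It is not the Feta API: the `ServiceDescription` class, the `find_services` function, the service names, and the term strings are all hypothetical placeholders, and the matching is exact-string only, with no ontology reasoning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical service annotation: a function term plus input/output terms."""
    name: str
    function: str
    inputs: frozenset
    outputs: frozenset


# Toy registry; every entry is invented for illustration.
REGISTRY = [
    ServiceDescription("toy_blast", "similarity_search",
                       frozenset({"protein_sequence"}),
                       frozenset({"alignment_report"})),
    ServiceDescription("toy_translate", "translation",
                       frozenset({"dna_sequence"}),
                       frozenset({"protein_sequence"})),
    ServiceDescription("toy_fetch", "retrieval",
                       frozenset({"accession"}),
                       frozenset({"dna_sequence"})),
]


def find_services(registry, function=None, consumes=None, produces=None):
    """Return names of services matching every criterion that is not None."""
    matches = []
    for service in registry:
        if function is not None and service.function != function:
            continue
        if consumes is not None and consumes not in service.inputs:
            continue
        if produces is not None and produces not in service.outputs:
            continue
        matches.append(service.name)
    return matches


if __name__ == "__main__":
    # Which services could supply a protein sequence to the next workflow step?
    print(find_services(REGISTRY, produces="protein_sequence"))
```

A real semantic query would also match more general or more specific ontology terms; the exact-match filter here only shows the shape of the question being asked.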
An Automatic Annotation Pipeline
• Genome annotation pipelines – workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database
Collaboration with the Bergen Center for Computational Science (computational biology unit) – Gene Prediction in Algal Viruses, a case study. Presented at NETTAB 2005: http://www.nettab.org/2005/docs/NETTAB2005_LanzenPoster.pdf
User Interaction During the workflow
Data Analysis
• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge
Trypanosomiasis in Africa
Steve Kemp, Andy Brass, Paul Fisher
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Trypanosomiasis Study
• A form of sleeping sickness in cattle – known as n’gana
• Caused by Trypanosoma brucei
• Some cattle breeds more resistant than others
• What are the differences between resistant and susceptible cattle?
• Can we breed cattle resistant to n’gana infection?
Trypanosomiasis Study
Understanding Phenotype
• Comparing resistant vs susceptible strains – Microarrays
Understanding Genotype
• Mapping quantitative traits – Classical genetics QTL
Need to access microarray data, genomic sequence information, pathway databases AND integrate the results
Genotype Phenotype
?
Microarray + QTL
Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping
Genes captured in microarray experiment and present in QTL region
Key:
A – Retrieve genes in QTL region
B – Annotate genes with external database IDs
C – Cross-reference IDs with KEGG gene IDs
D – Retrieve microarray data from MaxD database
E – For each KEGG gene get the pathways it’s involved in
F – For each pathway get a description of what it does
G – For each KEGG gene get a description of what it does
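The steps in the key can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the published workflow: every identifier, lookup table, and function name below is invented, and plain dictionaries stand in for the remote services (the QTL gene lookup, ID cross-referencing, MaxD, and KEGG).

```python
# Toy stand-ins for the remote resources; all identifiers are invented.
QTL_GENES = {"toy_qtl_region": ["geneA", "geneB", "geneC"]}            # step A
EXTERNAL_IDS = {"geneA": "ext:001", "geneB": "ext:002",
                "geneC": "ext:003"}                                    # step B
KEGG_IDS = {"ext:001": "kegg:g1", "ext:002": "kegg:g2"}                # step C
MICROARRAY_KEGG_GENES = {"kegg:g1", "kegg:g9"}                         # step D
GENE_PATHWAYS = {"kegg:g1": ["path:p1", "path:p2"],
                 "kegg:g2": ["path:p2"]}                               # step E
PATHWAY_DESCRIPTIONS = {"path:p1": "toy pathway one",
                        "path:p2": "toy pathway two"}                  # step F
GENE_DESCRIPTIONS = {"kegg:g1": "toy gene one",
                     "kegg:g2": "toy gene two"}                        # step G


def candidate_pathways(region):
    """Run steps A-G for one QTL region and return annotated candidates."""
    genes = QTL_GENES[region]                                          # A
    external = [EXTERNAL_IDS[g] for g in genes if g in EXTERNAL_IDS]   # B
    kegg_genes = [KEGG_IDS[e] for e in external if e in KEGG_IDS]      # C
    expressed = MICROARRAY_KEGG_GENES                                  # D
    # Keep only genes present in the QTL region AND seen on the microarray.
    shared = [k for k in kegg_genes if k in expressed]
    report = []
    for kegg_gene in shared:
        for pathway in GENE_PATHWAYS.get(kegg_gene, []):               # E
            report.append((
                kegg_gene,
                GENE_DESCRIPTIONS[kegg_gene],                          # G
                pathway,
                PATHWAY_DESCRIPTIONS[pathway],                         # F
            ))
    return report


if __name__ == "__main__":
    for row in candidate_pathways("toy_qtl_region"):
        print(row)
```

Because every gene is pushed through the same steps, no candidate is dropped early on the basis of an expert's expectations; filtering happens only at the explicit intersection of QTL and microarray evidence.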
Results
• Identified a pathway for which its correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
• Manual analysis on the microarray and QTL data had failed to identify this gene as a candidate.
• Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33
Why was the Workflow Approach Successful?
• Workflow analysed each piece of data systematically
– Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses
• The size of the QTL and amount of the microarray data made a manual approach impractical
• Workflows capture exactly where data came from and how it was analysed
• Workflow output produced a manageable amount of data for the biologists to interpret and verify
– “make sense of this data” -> “does this make sense?”
Trichuris muris (mouse whipworm) infection
(parasite model of the human parasite, Trichuris trichiura)
• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
• Manual experimentation: Two year study of candidate genes, processes unidentified
• Workflows: trypanosomiasis cattle experiment was reused without change.
• Analysis of the resulting data by a biologist found the processes in a couple of days.
Joanne Pennock, Richard Grencis
University of Manchester
Prime Minister's Office Thailand Center of Excellence for Life Sciences (TCELS)
Pharmacogenomics project
Wasun Chantratita, Project director
2003->2006
Pharmacogenomics
• Heavy use of R-Statistics for clinical data analysis
• Association study of Nevirapine-induced skin rash in Thai Population
• A systemic (bodywide) allergic reaction with a characteristic rash
– 100 Cases: rash
– 100 Cases: no rash controls
– 10,000 SNP significantly associated with rash
– Pathway analysis and systems biology
– Prioritising SNPs
– Functional studies
– Diagnostic tools
Data Mining and Knowledge Integration
My BioAID: Hypothesis Construction from the Literature
Marco Roos and Scott Marshall, Adaptive Information Disclosure, Faculty of Science, University of Amsterdam
• Combines text mining with ontology modelling
• AIDA: text mining and machine learning toolbox
Start with a Proto-Ontology
Small amount of information about a topic of interest
e.g. review article about histones and disease
Mine the literature for more…
The workflow produces OMIM tagged diseases which can be used to enrich the proto-ontology automatically in RDF
TaWeka - Taverna to Weka
• Luna De Ferrari and Igor Goryanin
– University of Edinburgh
• Combines Taverna workflows with Weka data mining tools
• Produces classifiers for biological knowledge
– Predict circadian expression of genes in plants
– Support curation of metabolic pathways
• Promotes sharing and reuse of classifiers because both the data and knowledge workflow are tracked
Sharing Experiments
• myGrid supports the in silico experimental process for individual scientists
• How do you share your results/experiments/experiences with your
– Research group
– Collaborators
– Scientific community
• How do you compare your results with others produced by e.g. Kepler / Triana?
Collaborative, Social Bookmarking
Content Sharing
Application Execution
Social Recommendations
Collaborative, Social Tagging
Summary
Taverna
• Allows interoperation between local and remote resources
• Allows automated access to, or analysis of, sets of data
• Helps with data integration
• Is extensible and open source – for application embedding
MyExperiment
• Allows sharing across particular communities
• Provides a central location for publishing/finding useful workflows
myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
• Funding EPSRC, Wellcome Trust.
http://www.mygrid.org.uk
http://www.myexperiment.org
• Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33
Links