Web Services, Workflows & Taverna
Superglue for the Semantic Web
Tom Oinn – EMBL-EBI,
http://mygrid.org.uk http://taverna.sf.net
Who are we?
myGrid
An EPSRC funded ‘eScience Pilot Project’
Based across multiple sites in the UK
Taverna
A tethered spin-off of the myGrid project
Aimed at producing powerful tools to complement the basic research work
EBI Hinxton Campus
What is Taverna?
Allows scientists to graphically construct complex processes in the form of workflows
What is a workflow?
Set of activities that make up a process
Definitions of how data moves between these activities
The user specifies what to do but not how to do it
Insulates users from the complexity of distributed computing
Looks a bit like this…
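In code terms, the definition above (a set of activities plus definitions of how data moves between them) might be sketched as follows. This is a minimal Python illustration under stated assumptions, not Taverna's own workflow language or engine: the function `run_workflow`, the link tuple layout and the toy activities are all hypothetical names chosen for the example.

```python
def run_workflow(activities, links, inputs):
    """Run a toy workflow (hypothetical model, not Taverna's engine).

    activities: name -> callable taking keyword arguments
    links:      (source name, target activity, target input name) tuples
    inputs:     workflow input name -> value
    """
    values = dict(inputs)      # every value produced so far, keyed by its source name
    pending = set(activities)  # activities that have not run yet
    while pending:
        # An activity is ready once every link feeding it has a value.
        ready = [
            name for name in pending
            if all(src in values for src, tgt, _ in links if tgt == name)
        ]
        if not ready:
            raise ValueError("unsatisfied or cyclic data links: %s" % sorted(pending))
        for name in ready:
            kwargs = {port: values[src] for src, tgt, port in links if tgt == name}
            values[name] = activities[name](**kwargs)
            pending.remove(name)
    return values


# The user states WHAT to do (activities) and how data flows (links),
# but never the execution order.
activities = {
    "reverse": lambda seq: seq[::-1],
    "upper": lambda seq: seq.upper(),
    "join": lambda left, right: left + "|" + right,
}
links = [
    ("sequence", "reverse", "seq"),
    ("sequence", "upper", "seq"),
    ("reverse", "join", "left"),
    ("upper", "join", "right"),
]
result = run_workflow(activities, links, {"sequence": "acgt"})
print(result["join"])
```

The execution order is derived from the data links alone, which is one way to read "the user specifies what to do but not how to do it".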
myGrid, Taverna and WBS
One of several early adopters of Taverna
Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
Workflows written by life scientists not computer scientists
Following slides stolen at the last minute from Hannah Tipney at Manchester!
Williams-Beuren Syndrome (WBS)
Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
Multisystem phenotype: muscular, nervous, circulatory systems
Characteristic facial features
Unique cognitive profile
Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
Outgoing personality, friendly nature, ‘charming’
[Figure: Physical Map. Chr 7 (~155 Mb), band 7q11.23 (~1.5 Mb). Duplicated blocks labelled A-cen, B-cen, C-cen, A-mid, B-mid, C-mid, A-tel, B-tel, C-tel. Genes shown: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/EIF4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14. Legend: STAG3, PMS2L (Block A); FKBP6T, POM121, NOLR1 (Block C); GTF2IP, NCF1P, GTF2IRD2P (Block B). Clones CTA-315H11 and CTB-51J22, with a gap marked.]
Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
Williams-Beuren Syndrome Microdeletion

[Figure: analysis pipeline, listed here from the diagram labels]
GenBank Accession No, GenBank Entry, Seqret, Nucleotide seq (Fasta)

Nucleotide sequence
GenScan: coding sequence
sixpack: 6 ORFs
transeq: amino acid translation
prettyseq: translation/sequence file, good for records and publications
restrict: restriction enzyme map
cpgreport: CpG island locations and %
RepeatMasker: repetitive elements
ncbiBlastWrapper: Blastn vs nr, est databases

Protein sequence
epestfind: identifies PEST seq
pepcoil: predicts coiled-coil regions
pepstats: MW, length, charge, pI, etc.
pscan: identifies FingerPRINTS
SignalP, TargetP, PSORTII: predict cellular location
InterPro: identifies functional and structural domains/motifs
Pepwindow? Octanol?: hydrophobic regions
BlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
URL inc GB identifier

Identify regulatory elements in genomic sequence
Query nucleotide sequence
RepeatMasker
BLASTwrapper
Sort for appropriate sequences only
RepeatMasker
TF binding prediction
Promoter prediction
Regulation element prediction
Experiment
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Analysis via ‘Cut and Paste’
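To make one of the pasted-between steps concrete, the sketch below shows roughly what a translation step of the kind labelled transeq and sixpack in the pipeline computes: the protein translation of a nucleotide sequence in all six reading frames. It is a plain Python illustration using the standard genetic code, not the EMBOSS implementation; the function names are assumptions made for this example, and ORF detection is left out.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translations of three forward frames, then three reverse-complement frames."""
    strands = (seq.upper(), reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


# First 20 bases of the GenBank excerpt shown on the slide.
for frame in six_frames("acatttctaccaacagtgga"):
    print(frame)
```

Doing this by hand for every tool and every sequence is the cut-and-paste burden that the workflows on the following slide remove.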
Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
The Biological Results
[Figure: clones CTA-315H11 and CTB-51J22, genes ELN and WBSCR14, and clones RP11-622P13, RP11-148M21 and RP11-731K22]
314,004 bp extension
All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28
Four workflow cycles totalling ~10 hours
The gap was correctly closed and all known features identified
Different Kinds of Services
Pure web services are not always the solution
Abstraction Level? Typing? Description? Data Volumes?
Taverna employs a hybrid architecture which includes web services amongst other components
Complex Invocation Patterns
E.g. Soaplab, which has a typical factory pattern:
‘create job’, ‘set parameter’, ‘run task’, ‘wait’, ‘get results’, ‘destroy task’.
Multiple web service calls per conceptual operation
Handled in Taverna by embedding this invocation pattern within a Soaplab processor.
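The idea of folding a multi-call lifecycle into one conceptual operation can be sketched as below. This is an illustration only: `FakeJobService` is an in-memory stand-in written for this example, and its method names simply mirror the steps quoted on the slide. They are not Soaplab's actual interface, and `invoke` is not Taverna's Soaplab processor.

```python
class FakeJobService:
    """In-memory stand-in for a factory-style remote service (hypothetical API)."""

    def __init__(self):
        self.jobs = {}
        self.calls = []   # records each call, to show how many one operation needs
        self._next_id = 0

    def create_job(self):
        self.calls.append("create job")
        self._next_id += 1
        job_id = "job-%d" % self._next_id
        self.jobs[job_id] = {"params": {}, "done": False, "results": None}
        return job_id

    def set_parameter(self, job_id, name, value):
        self.calls.append("set parameter")
        self.jobs[job_id]["params"][name] = value

    def run_task(self, job_id):
        self.calls.append("run task")
        job = self.jobs[job_id]
        # Toy "analysis": report the length of the submitted sequence.
        job["results"] = {"length": len(job["params"]["sequence"])}
        job["done"] = True

    def wait(self, job_id):
        self.calls.append("wait")
        return self.jobs[job_id]["done"]

    def get_results(self, job_id):
        self.calls.append("get results")
        return self.jobs[job_id]["results"]

    def destroy_task(self, job_id):
        self.calls.append("destroy task")
        del self.jobs[job_id]


def invoke(service, **parameters):
    """One conceptual operation that hides the six-step lifecycle from the caller."""
    job_id = service.create_job()
    try:
        for name, value in parameters.items():
            service.set_parameter(job_id, name, value)
        service.run_task(job_id)
        if not service.wait(job_id):
            raise RuntimeError("job did not complete")
        return service.get_results(job_id)
    finally:
        # Always clean up the server-side job, even if a step above failed.
        service.destroy_task(job_id)


service = FakeJobService()
results = invoke(service, sequence="acgtacgt")
print(results)
print(service.calls)
```

The workflow author sees a single step with inputs and outputs; the create, run, wait and destroy calls stay inside the wrapper.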
Large Data Sets
No explicit limit to message size in WS specs but…
Most common toolkits equally terrible at handling large data.
WS Standards for bulk data transfer insufficiently mature or lacking interoperability.
Transfer references across WS calls, transfer actual data ‘out of band’
More info from Jon later, handled in Taverna via a Styx Grid Service plugin.
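The "references across calls, data out of band" idea can be illustrated with a small sketch. Everything here is an assumption made for the example: `DataStore` stands in for whatever out-of-band channel carries the bytes, and `count_bases_service` stands in for a remote service. It does not show the Styx Grid Service protocol or the Taverna plugin.

```python
import uuid


class DataStore:
    """Stand-in for an out-of-band data channel (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store the data and hand back a short reference to it."""
        ref = "ref:" + uuid.uuid4().hex
        self._blobs[ref] = data
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]


def count_bases_service(message, store):
    """A toy 'service': its request and response messages carry only references."""
    data = store.get(message["input_ref"])          # bulk data fetched out of band
    summary = ("length=%d" % len(data)).encode()
    return {"output_ref": store.put(summary)}


store = DataStore()
big_sequence = b"acgt" * 250_000                     # 1,000,000 bytes of sequence data
request = {"input_ref": store.put(big_sequence)}
response = count_bases_service(request, store)

# The message stays tiny no matter how large the data set is.
print(len(str(request)), "characters in the request message")
print(store.get(response["output_ref"]))
```

The service message size is then independent of the data volume, which sidesteps the toolkit limits described above.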
Service Description
WS standards fail to address the description of a service.
Registries: UDDI is an old standard and predates work on semantic description
BioMoby and myGrid include Semantic Description and Discovery components.
Search for services by task, by input or by past involvement in another workflow
Essential for AI assisted workflow construction
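A search by task, by input or by past workflow involvement could look roughly like the sketch below. It is a toy registry under stated assumptions: the `ServiceDescription` fields, the annotation strings and the workflow labels are invented for the example, and plain string matching stands in for the ontology-based descriptions that BioMoby and myGrid actually use.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical annotation record for one service."""
    name: str
    task: str
    input_type: str
    used_in: FrozenSet[str] = frozenset()   # workflows the service has appeared in


def find_services(registry, task=None, input_type=None, used_in=None):
    """Return names of services matching every criterion that was supplied."""
    matches = []
    for service in registry:
        if task is not None and service.task != task:
            continue
        if input_type is not None and service.input_type != input_type:
            continue
        if used_in is not None and used_in not in service.used_in:
            continue
        matches.append(service.name)
    return matches


# Service names come from the analysis pipeline; the annotations are illustrative.
registry = [
    ServiceDescription("transeq", "translation", "nucleotide sequence", frozenset({"B"})),
    ServiceDescription("RepeatMasker", "repeat masking", "nucleotide sequence", frozenset({"A", "B"})),
    ServiceDescription("pepcoil", "coiled-coil prediction", "protein sequence", frozenset({"C"})),
    ServiceDescription("pepstats", "protein statistics", "protein sequence", frozenset({"C"})),
]

print(find_services(registry, input_type="protein sequence"))
print(find_services(registry, used_in="A"))
print(find_services(registry, task="translation", input_type="nucleotide sequence"))
```

Given such descriptions, a tool can propose the services that accept the output of the previous step, which is the basis for assisted workflow construction.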
Multiple Service Types
BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)
Taverna Demo
There should be a live demo of the Workflow Workbench here…
Obtaining Taverna
Taverna is available under the LGPL from our project site on Sourceforge.net: http://taverna.sourceforge.net
Release 1.0 as of the 20th Jan 2005 (after twelve beta releases)
Includes online and downloadable user manual, examples etc.
Support via project mailing lists
myGrid and WBS People!
Core
Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
Robin McEntire (GSK)
Collaborators
Keith Decker
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net