Service Discovery in myGrid and the BioCatalogue, a Life Science Service Registry
Katy Wolstencroft
myGrid
University of Manchester
Lots of Resources
NAR 2008 – over 1000 databases
Taverna Workflow Workbench
• Design and execution of workflows
• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets
• Part of the myGrid project
Who Uses Taverna?
Access to 3500+ public service operations
• 55,000+ sourceforge downloads
• 10,000+ downloads of v1.7
• 40+ downloads per day
• Ranked 148 sourceforge activity (11 Nov 2008)
• 350+ known organisations
• 17 known commercial
• 1000+ active users at any one time
• Users throughout UK, USA, Europe, SE Asia and South America
• Netherlands Bioinformatics Centre
• Genome Canada Bioinformatics Platform
• BioMOBY
• US iPlant Consortium
• US FLOSS social science program
• RENCI
• French SIGENAE farm animals project
• ThaiGrid
• CARMEN Neuroscience project
• SPINE consortium
• EU Enfin, EMBRACE, BioSapian, Casimir
• EU SysMO Consortium
• NEBC The NERC Environmental Bioinformatics Centre
• Bergen Centre for Computational Biology
• Max-Planck institute for Plant Breeding Research
• Genoa Cancer Research Centre
• AstroGrid
• caBIG/caGRID
What do Scientists use Taverna for?
• Data gathering, annotation and model building
• Data analysis from distributed tools
• Data mining and knowledge management
– Hypothesis generation and modelling and Text mining
• Data curation and warehouse population
• Parameter sweeps and simulation
• Systems biology model building
• Proteomics
• Sequence analysis
• Protein structure prediction
• Gene/protein annotation
• Microarray data analysis
• QTL studies
• QSAR studies
• Chemoinformatics
• Medical image analysis
• Public Health care epidemiology
• Heart model simulations
• High throughput screening
• Phenotype studies
• Phylogeny
• Statistical analysis
• Text mining
• Astronomy, Music, Meteorology
Create and run workflows
Create and manage services as components
API Consumer
Share, discover and reuse workflows
Manage the metadata needed and generated
RDF, OWL
Discover and reuse services
Feta
Open Source Workflow Environment for Scientists
Workflow Reuse
• Workflows allow high throughput experiments and automation
• Workflows are encapsulations of experiments
• Workflows developed for one experiment can be reused for others
• Easier to share, reuse and repurpose
The METHODS section of a scientific publication
Recycling, Reuse, Repurposing
• Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle
• Paul meets Jo. Jo is investigating mouse Whipworm infection.
• Jo reuses one of Paul’s workflows without change.
• Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite.
• Previously a manual two year study by Jo had failed to do this.
Where are the Services From?
• Over 3500 services available
• Major Service Providers
– European Bioinformatics Institute
– DNA DataBank of Japan
– NCBI – USA
• ‘Boutique’ Services
– Individual research labs producing public data sets
– Specialist tools for niche experiments
• We are not service providers
What types of services?
• HTML
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
• ….coming soon – REST, Matlab
Variable or non-existent documentation or help
Taverna in an ‘open’ world
Advantages
• Connection to lots of resources
• Flexible system
• Can adapt to new technologies
Disadvantages
• Services are developed for other purposes
• We can’t control how they work
• We have to deal with the heterogeneity
Finding Services
When using services, scientists need to:
• Find them – in distributed locations, produced by different host institutions
• Interpret them – what do the services do - what experiments can they perform using them?
• Know how to invoke them – what data and initial parameters do they need to supply?
Metadata from a WSDL
<wsdl:message name="getGlimmersResponse">
  <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
</wsdl:message>
<wsdl:message name="aboutServiceRequest"/>
<wsdl:message name="getGlimmersRequest">
  <wsdl:part name="in0" type="xsd:string"/>
  <wsdl:part name="in1" type="xsd:string"/>
  <wsdl:part name="in2" type="xsd:string"/>
  <wsdl:part name="in3" type="xsd:string"/>
  <wsdl:part name="in4" type="xsd:string"/>
  <wsdl:part name="in5" type="xsd:string"/>
  <wsdl:part name="in6" type="xsd:string"/>
  <wsdl:part name="in7" type="xsd:int"/>
  <wsdl:part name="in8" type="xsd:string"/>
Pathport Web service from the Virginia Bioinformatics Institute
http://pathport.vbi.vt.edu/services/wsdls/beta/glimmer.wsd
Name of the service
Uninformative names for parameters
What kind of string?
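A minimal sketch of how little a client can learn from these message definitions alone. It parses an abbreviated copy of the fragment with Python's standard library; the wrapping `<wsdl:definitions>` root and the helper name `list_message_parts` are additions made for illustration, not part of the Pathport service.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Abbreviated copy of the fragment above. The <wsdl:definitions> root is an
# addition so that the namespace prefixes resolve when parsing.
FRAGMENT = """
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:message name="getGlimmersResponse">
    <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
  </wsdl:message>
  <wsdl:message name="getGlimmersRequest">
    <wsdl:part name="in0" type="xsd:string"/>
    <wsdl:part name="in1" type="xsd:string"/>
    <wsdl:part name="in7" type="xsd:int"/>
  </wsdl:message>
</wsdl:definitions>
"""


def list_message_parts(wsdl_text):
    """Return {message name: [(part name, part type), ...]}."""
    root = ET.fromstring(wsdl_text)
    parts = {}
    for message in root.findall(f"{{{WSDL_NS}}}message"):
        parts[message.get("name")] = [
            (part.get("name"), part.get("type"))
            for part in message.findall(f"{{{WSDL_NS}}}part")
        ]
    return parts


parts = list_message_parts(FRAGMENT)
for name, fields in parts.items():
    print(name, fields)
```

Everything recoverable is a positional name and an XML Schema type; nothing says whether a given string is a sequence, a database name or a file format.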
Semantics and Web Services
• SAWSDL – Semantic Annotations for WSDL working group
• Virtually no uptake by bioinformatics service providers
• Doesn’t address non-WSDL services
Adding Semantics – Annotating Services
Find services by their function instead of their name
• The services might be distributed, but a registry of service descriptions can be central and queried
• We need to annotate services with semantics
In myGrid, we use the Feta Semantic Discovery tool
and a semantic annotation tool – and expert curation
myGrid Ontology
Logically separated into two parts:
• Service ontology
– Physical and operational features of (web) services
• Domain ontology (Semantic Content Model)
– Annotation vocabulary for core bioinformatics data, data types and their relationships
Service Ontology
• Models services from the point of view of the scientist
– Where is it?
– How many inputs/outputs?
– Who hosts it?
• Invocation details are hidden by the Taverna workbench
• Differs from related initiatives in this respect
Domain Ontology
• Informatics: captures the key concepts of data, data structures, databases and metadata.
• Bioinformatics: The domain-specific data sources (e.g. the model organism sequencing databases), and domain-specific algorithms for searching and analyzing data (e.g. the sequence alignment algorithm, clustalw).
• Molecular biology: Concepts include, for example, protein sequence and nucleic acid sequence.
• Formats: A hierarchy describing bioinformatics file formats. For example, fasta format for sequence data, or phylip format for phylogenetic data
• Tasks: A hierarchy describing the generic tasks a service operation can perform. Examples include retrieving, displaying, and aligning.
Example Service Annotation
• Example: BLAST from the DDBJ
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
  • biological sequence (and format)
  • database name (and format)
  • blast program (and format)
– Outputs: Blast Report
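One way to picture such an annotation is as a plain record that a registry can filter on. The sketch below is illustrative only: the `ServiceAnnotation` class, its field names and `find_by_task` are hypothetical and are not the Feta data model; the values are taken from the BLAST example above.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceAnnotation:
    """Hypothetical record mirroring the annotation facets listed above."""
    name: str
    provider: str
    task: str
    method: str
    resources: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


blast = ServiceAnnotation(
    name="BLAST",
    provider="DDBJ",
    task="Alignment",
    method="Similarity Search Algorithm",
    resources=["DNA/Protein sequence databases"],
    inputs=["biological sequence", "database name", "blast program"],
    outputs=["Blast Report"],
)


def find_by_task(registry, task):
    """Return annotations whose task matches, ignoring case."""
    return [s for s in registry if s.task.lower() == task.lower()]


matches = find_by_task([blast], "alignment")
print([s.name for s in matches])
```

Searching on the task facet finds the service by what it does, without the user needing to know it is called BLAST.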
myGrid Ontology
First version of the ontology ~ 2002
Originally developed in DAML+OIL
Now developed in OWL and a version exported to RDFS
Number of classes in the ontology ~750
Domain and service ontology used by myGrid users and developers of myGrid related plugins
Service ontology also used by BioMoby
W3C compliant with respect to ontology modelling
How do we use the ontology?
Two methods of service description
1. Decision Support - querying
• Composite matches to ontology terms
• Multiple terms are used to query the annotations
2. Decision Making - reasoning
• Single description – whole service model
• Enables automated detection of service mismatches
• Enables possibility of automated addition of services
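The two uses can be sketched with a toy class hierarchy. This is a simplified illustration in plain Python, not an OWL reasoner and not how Feta is implemented: the `PARENT` table, the example services and the function names are all hypothetical.

```python
# Hypothetical fragment of a class hierarchy: child term -> parent term.
PARENT = {
    "protein sequence": "biological sequence",
    "nucleic acid sequence": "biological sequence",
    "aligning": "task",
    "retrieving": "task",
}


def is_a(term, ancestor):
    """True if term equals ancestor or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


# Hypothetical annotated services.
SERVICES = [
    {"name": "protein_aligner", "task": "aligning", "input": "protein sequence"},
    {"name": "dna_fetcher", "task": "retrieving", "input": "nucleic acid sequence"},
]


def query(services, **terms):
    """Querying: names of services whose annotation satisfies every term."""
    return [
        s["name"]
        for s in services
        if all(is_a(s.get(facet), wanted) for facet, wanted in terms.items())
    ]


def compatible(produced, accepted):
    """Mismatch check: the produced type must be a kind of the accepted type."""
    return is_a(produced, accepted)


print(query(SERVICES, task="aligning", input="biological sequence"))
print(compatible("biological sequence", "protein sequence"))
```

Combining several terms narrows the result set, while the subsumption check flags a link where an upstream service produces something more general than the downstream service accepts.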
Curation Sweatshop
Steady increase in numbers of services and workflows
Users able to find annotated services
BUT
Time-consuming and expensive
More and more services built daily
SO
Should we encourage service providers to add value?
Should we get users involved?
Collaboration between University of Manchester and EBI
Drawing on 6 years’ experience in Taverna of semantic annotation of services using RDF and OWL ontologies
Drawing on experience at EBI in service provision
Drawing on experience of social curation and networking from myExperiment
First pilot December 2008
Getting the Minimum
Community annotation
• Must be easy and quick
• Must allow partial descriptions
• Multiple annotations of the same service
• What is the minimum information to enable
– service discovery
– service invocation
Grading Services
• Bronze – enough to locate the service. Example of service invocation
• Silver
• Gold
• Platinum – full description. All properties annotated – including dependencies between them – reliability metrics – AND CHECKED AND VERIFIED BY A CURATOR
Automatic Annotation
• Inferring service descriptions from workflows
• Gathering usage data
– How many workflows use this service
• Gathering reliability data - monitoring
– When is this service available
– How many times does it fail
• Helps with “shopping” for services
– People who used this service also used this service
– Top 10 services
– Services that do the same things
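The usage and reliability figures listed above are simple to derive once workflows and monitoring results are available as data. The following sketch assumes hypothetical inputs (`WORKFLOWS`, `CHECKS`) and function names; it does not reflect how BioCatalogue or myExperiment store this information.

```python
from collections import Counter

# Hypothetical data: workflow name -> services it calls.
WORKFLOWS = {
    "wf1": ["blast", "clustalw"],
    "wf2": ["blast", "clustalw", "biomart"],
    "wf3": ["blast", "biomart"],
}

# Hypothetical monitoring log: (service, did the check succeed?).
CHECKS = [("blast", True), ("blast", True), ("blast", False), ("clustalw", True)]


def usage_counts(workflows):
    """How many workflows use each service (each workflow counted once)."""
    counts = Counter()
    for services in workflows.values():
        counts.update(set(services))
    return counts


def also_used(workflows, service):
    """Services that appear in the same workflows as the given service."""
    co_used = Counter()
    for services in workflows.values():
        unique = set(services)
        if service in unique:
            co_used.update(unique - {service})
    return co_used


def failure_rate(checks, service):
    """Fraction of failed checks, or None if the service was never checked."""
    results = [ok for name, ok in checks if name == service]
    if not results:
        return None
    return results.count(False) / len(results)


print(usage_counts(WORKFLOWS).most_common(10))  # basis for a "Top 10" list
print(also_used(WORKFLOWS, "clustalw"))
print(failure_rate(CHECKS, "blast"))
```

None of this needs manual curation, which is the appeal: the counts accumulate as a side effect of people building and running workflows.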
Annotation Provenance
• Who said what about what?
• Harvesting community annotation
• Verifying and augmenting by a curator
• ‘Trust’ Models
• Annotation versions
– In a workflow context
– As stand alone services
Feta Model
[Diagram labels: Semantic Content Model, Service Model, Curation Model, Quantitative Content, Tags, Ontologies, Functional, Provenance, Operational, Operational Metrics, Conditions of Use, Social Standing]
BioCatalogue Service Profile
[Diagram labels: A.N. Other, Curation, Quantitative, Service Model, Semantic Content Model, Execution, Host, Service Profile, Finding, WSDL, WADL, S-A.N. Other, SAWSDL, SA-REST, Analytics, Ranking, Browse/Shop, Search, Customised, Service, Workflow]
Annotation Process
BioCatalogue: The pilot
Features:
• User Registration
• Service Registration
• Search
• Annotation
• Notification
• Integration with myExperiment
For More Information
• BioCatalogue website http://www.biocatalogue.org/
• BioCatalogue wiki http://www.biocatalogue.org/wiki
• myGrid website http://www.mygrid.org.uk/
myGrid Team
[Diagram labels: Services, Interface, Neutral, Functional, Conditions of Use, Operational, Social Standing, Operational Metrics, Provenance, Multiply described, Third Party, Aggregated Feeds, Monitoring, Multiple Sources, Multiple Versions, Dynamic, Multiple Instances, Discovery, Interoperability, Composition, Reuse, Trusted Authorities, Policies, Ranking]