39
my Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Embed Size (px)

Citation preview

Page 1: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

myGrid and Taverna:Now and in the Future

Dr. K. Wolstencroft

University of Manchester

Page 2: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Background

• myGrid middleware components to support in silico experiments in biology

• Originally designed to support bioinformatics

chemoinformatics

health informatics

medical imaging

integrative biology

Page 3: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

History

EPSRC funded UK eScience Program Pilot Project

Page 4: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

myGrid in OMII-UK

OGSA-DAI

myGrid

OMII Stack

March 2006

10 DevelopersDedicated design, implementation, testing and support team – moving towards production quality software

Page 5: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Lots of Resources

NAR 2006 – over 850 databases

Page 6: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

The User Community

Bioinformatics is an open Community• Open access to data• Open access to resources• Open access to tools• Open access to applications

Global in silico biological research

Page 7: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

The User Community Problems

• Everything is Distributed

– Data, Resources and Scientists

• Heterogeneous data • Very few standards

– I/O formats, data representation, annotation – Everything is a string!

Integration of data and interoperability of resources is difficult

Page 8: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

ID MURA_BACSU STANDARD; PRT; 429 AA.DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASEDE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINEDE ENOLPYRUVYL TRANSFERASE) (EPT).GN MURA OR MURZ.OS BACILLUS SUBTILIS.OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;OC BACILLUS.KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).FT CONFLICT 374 374 S -> A (IN REF. 3).SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32; MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI

Page 9: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

myGrid Approach - Workflows

General technique for describing and enacting a process

describes what you want to do, not how you want to do it

Simple language specifies how bioinformatics processes fit together – processes are web services

- High level workflow diagram separated from any lower level coding – therefore, you don’t have to be a coder to build workflows

RepeatMasker

Web service

GenScanWeb Service

BlastWeb Service

Sequence Predicted Genes out

Page 10: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Freefluo Workflow enactor

Scufl + Workflow Object Model

Processor Processor

PlainWeb

Service

Soaplab

Processor

LocalApp

Processor

Enactor

TavernaWorkbench

Processor

BioMOBY

Processor

SeqHound

Processor

BioMART

SCUFL

Application data flow layerScufl graph + service introspection

Execution flow layer List management; implicit iteration mechanism; MIME & semantic type decoration; fault management; service alternates

Processor invocation layer

Workflow Execution

Page 11: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Taverna Workflow Components

Scufl Simple Conceptual Unified Flow LanguageTaverna Writing, running workflows & examining resultsSOAPLAB Makes applications available

Freefluo Workflow engine to run workflows

Freefluo

SOAPLABWeb Service

Any Application

Web Service e.g. DDBJ BLAST

Page 12: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

What Services we Support

Page 13: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

User Interaction Handling

• Interaction Service and corresponding Taverna processor allows a workflow to call out to an expert human user

• Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline

Collaboration with the University of Bergen

Ref: Poster, Nettab 2005

• R for numerical analysis (microarray informatics amongst others)

Page 14: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

What shall I do when a service fails?

• Most services are owned by other people• No control over service failure• Some are research level

Workflows are only as good as the services they connect!

To help - Taverna can:• Notify failures• Instigate retries• Set criticality• Substitute services

Page 15: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

myGrid Users

• ~20000 downloads

• Users in US, Singapore, UK, Europe, Australia

• Systems biology• Proteomics• Gene/protein annotation• Microarray data analysis• Medical image analysis

Page 16: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Trypanosomiasis Study

Resistance to trypanosomiasis in cattle in Kenya

Andy Brass, Paul Fisher – University of Manchester

•Form of Sleeping sickness in cattle –

Known as n’gana

•Caused by Trypanosoma brucei

Page 17: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Study involves

Microarray data

QTL

SNPs

Metabolic pathway analysis

Need to access microarray data, genomic sequence information, pathway databases AND integrate the results

Page 18: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester
Page 19: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Workflow Reuse

Addisons Disease

SNP design

Protein annotation

Microarray analysis

myGrid Workflow Repository

http://workflows.mygrid.org.uk/repository

Page 20: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Scufl Workflows + Taverna Workflow Workbench

OGSA-Distributed Query Processing

Results management

LSID

mIR

e-Science coordination e-Science mediator

e-Science process patterns

e-Science events

Notification service

Components designed to work together

myGrid information model

Metadata & provenance management using semantics

KAVE

Service management

Publication and Discovery using semantics

Feta

Pedro

Ontology

Portal & Application tools

Page 21: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Data Management

• Workflows can generate vast amount of data - how can we manage and track it?

• Data AND metadata AND experiment provenance• LSIDs - to identify objects• Semantic Web technologies (RDF, Ontologies)

– To store knowledge provenance

• Taverna workflow workbench & plugins– Ensure automated recording

Page 22: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

KAVE Data and metadata management

• Life Science Identifiers (LSIDs)• Information Model• File management• Support for custom database

building• Provenance metadata capture

using RDF• SRB integration• OGSA-DAI integration

urn:data:f2

urn:data:f2

urn:data1urn:data1

urn:data2urn:data2

urn:compareinvocation3urn:compareinvocation3

urn:data12

urn:data12

Blast_report

[input]

[output]

[input]

[distantlyDerivedFrom]

SwissProt_seq

[instanceOf]

Sequence_hit

[hasHits]

urn:hit2….

urn:hit2….

urn:hit1…urn:hit1…

urn:hit50…..

urn:hit50…..

[instanceOf]

[similar_sequence_to]

Data generated by services/workflows

Concepts

[ ]

[performsTask]

Find similar sequence[contains]

Services

urn:data:3urn:data:3

urn:hit8….

urn:hit8….

urn:hit5…urn:hit5…

urn:hit10…..

urn:hit10…..

[contains]

[instanceOf]

urn:BlastNInvocation3urn:BlastNInvocation3

urn:invocation5urn:invocation5urn:data:f1

urn:data:f1

[output]

New sequence

Missed sequence

[hasName] [hasName

]

literalsDatumCollection

[type]

LSDatum

[type]Properties

[instanceOf]

[output]

[output]

[directlyDerivedFrom]

Page 23: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Provenance Browsing in Taverna

New in Taverna 1.4

Page 24: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Feta Semantic Discovery

Over 3000 services!

Find services by their function

Questions we can ask:

Find me all the services that perform a multiple sequence alignment And accepts protein sequences in FASTA format as input

Page 25: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Upper level ontology

Task ontology

Informatics ontology

Molecular Biology ontology

Bioinformatics ontology

Web Service ontology

Specialises

Contributes to

sequence

biological_sequence

protein_sequence

nucleotide_sequence

DNA_sequence

protein_structure_feature

BLASTp service

Similarity Search Service

BLAST service

InterProScan service

myGrid Ontology

Page 26: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Feta Architecture

Feta EngineService

Semantic Discovery

Taverna Workbench

Feta G

UI C

lient

DL ReasonerOntology Editor

Ontologist

User Classification

- In RDF(S) - Build myGrid Domain Ontology

Obtain descriptions

Obtain Classification

3

3

4

Feta Descriptions

Feta Descriptions

Feta Descriptions

Page 27: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Annotations

• Feta has been available for ~1 year• Not yet in the release• Need critical mass of services before release

• Annotation experiments with users and domain experts

• Domain expert annotations much better – – hiring a full-time annotation – see the myGrid website for details

Page 28: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Gene annotation pipeline workflow Integration and visualisation of GD annotation workflow results

Provenance Record

Custom Data Model

Input

Result

Results Integration

Smarter workflow design incorporating visualisation VBI collaboration

Page 29: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Utopia

SeqVista

Visualisation

Page 30: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

New Plans for Taverna 2.0

Page 31: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Evolving challenges

• Long running data intensive workflows• Manipulation of confidential or otherwise protected

information• Use with classical grid systems• Interaction with users during workflows

Page 32: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Development

• Development of Taverna 2.0– reworking of the processor model to include duel

execution semantics incorporating data and control flow

– enhanced support for long-running workflows• fully distributed workflow enactment and authoring • User steering

– large scale data transfer

Page 33: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Enhanced Processor Model

• Modular dispatcher mechanism– Dynamic service binding– Recursive invocation– Data filter implementation– Retry, failover, back-off behaviours

• Transparent third party data transfers• High throughput stream handling with implicit iteration

semantics

Page 34: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

3rd Party Data Transfers

• Allows ‘in place’ referencing of data – Large data sets no longer round-trip between workflow engine

and data provider– Allows restricted access to sensitive data

• Automatic de-reference when a reference type is linked to a value type within a workflow.– Connecting a grid service to a web service

Page 35: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Streaming Data

• Allow execution of downstream workflow stages on partially complete results from upstream.

Service 1 Service 2 Service 3

Non streaming (Taverna 1), entire iteration must complete at each stage

Streamed data, Service 2 starts operating on partial results from Service 1

Page 36: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Recursive Invocation

• Dispatcher allowing recursive invocation to be plugged into per operation semantics.

Test Forcompletion

Invokeoperation

ModifyInput Set

GatherResult Set

Return Result

ReceiveInput

Page 37: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Future Direction

• Enhancements to the Workflow Core• Enhancements to user interface and experience• Expanded use of semantic web technologies

• Code remains open source and always will

Page 38: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Latest News

• See plans for Taverna 2.0 on myGrid wiki• Taverna development is user-driven

– Please keep in touch and tell us what you would like to see by the myGrid mailing lists: Taverna Users, Taverna Hackers

• Bioinformatics curator for service annotation

Details on the myGrid website

Page 39: My Grid and Taverna: Now and in the Future Dr. K. Wolstencroft University of Manchester

Acknowledgements

• The myGrid group – Past and Present• OMII-uk

• Carole Goble• Pinar Alper• Tom Oiin• Antoon Goderis• Matthew Gamble• Daniele Turi