Workflows on the Cloud: Scaling for National Service
Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble (University of Manchester, UK)
Madhu Donepudi, Nick James (Eagle Genomics Ltd, UK)


Presentation at BOSC2012 by Wolstencroft K - Workflows on the Cloud: scaling for national service


Page 1: Wolstencroft K - Workflows on the Cloud: scaling for national service

Workflows on the Cloud: Scaling for National Service

Katy Wolstencroft, Robert Haines, Helen Hulme, Mike Cornell, Shoaib Sufi, Andy Brass, Carole Goble
University of Manchester, UK

Madhu Donepudi, Nick James
Eagle Genomics Ltd, UK

Page 2

Motivation: Workflows for Diagnostics

NHS genetic testing, e.g. colon disease
  Annotation of SNPs in patient data, ready for interpretation by a clinician.

Diagnostic Testing Today
  Purify DNA.
  PCR exons of relevant genes (MLH1, MSH2, MSH6).
  Sequence, identify variants, and classify them (pathogenic, not pathogenic, unknown significance, etc.).
  Write a report to the clinician.

Diagnostic Testing Tomorrow (or later today)
  Uses whole genome sequencing.

Next Gen Seq data

Variation data

ANNOTATE, FILTER, DISPLAY

New problem: How do we classify all the variants that we discover?

Page 3

Taverna Workflows

Sophisticated analysis pipelines
A set of services to analyse or manage data (either local or remote)
Workflows run through the workbench or via a server
Automation of data flow through services
Control of service invocation
Iteration over data sets
Provenance collection
Extensible and open source

Page 4

Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W729-32.

Freely available, open source
Current version: 2.4
80,000+ downloads across versions
Part of the myGrid Toolkit
Taverna: http://www.taverna.org.uk/
Windows / Mac OS X / Linux / Unix

Page 5

SNP annotation

Annotation task
  Location, gene, transcript
  Present in public databases (dbSNP, etc.)
  Frequency in e.g. 1000 Genomes data
  Conservation data (cross-species)

Workflows are good for collecting and integrating data from a variety of sources into one place

Page 6

Variant Classification

[Diagram: classification of a SNP by its effect]
SNP
  Synonymous: effects on splicing
  Missense (non-synonymous): effects on function or splicing
  Premature stop: nonsense codon
  Nonsense: base insertion, causing a frameshift

Page 7

SNP Filtering / Triage

Which SNPs are the most important?
Reduction of 80K data points to those with potential clinical significance.

Criteria
  Reduce to a (disease-)specific gene list
  Sense < Missense < Stop codon, etc.
  Based on prediction tool scores
  Frequency in population (based on 1000 Genomes data, etc.); high frequency implies non-deleterious
  Conservation across species (implies that a change is deleterious)
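The triage criteria above can be sketched as a small filter. This is only an illustration, not the workflow used in the project: the `Variant` fields, the `SEVERITY` ranking, the 1% frequency cut-off and the example frequencies are all assumptions made for the example, and prediction tool scores are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical ranking that mirrors "Sense < Missense < Stop codon".
SEVERITY = {"synonymous": 0, "missense": 1, "stop_gained": 2}


@dataclass
class Variant:
    gene: str
    consequence: str        # expected to be a key of SEVERITY
    population_freq: float  # allele frequency, e.g. from 1000 Genomes data
    conserved: bool         # True if the position is conserved across species


def severity(variant):
    """Unknown consequence terms are treated as the lowest severity."""
    return SEVERITY.get(variant.consequence, 0)


def triage(variants, gene_list, max_freq=0.01, min_severity=1):
    """Keep rare, sufficiently severe variants in the gene list, most severe first."""
    kept = [
        v for v in variants
        if v.gene in gene_list
        and v.population_freq <= max_freq   # high frequency implies non-deleterious
        and severity(v) >= min_severity
    ]
    # Order: higher severity, then conserved positions, then rarer variants.
    return sorted(kept, key=lambda v: (-severity(v), not v.conserved, v.population_freq))


# Invented example values, for illustration only.
example = [
    Variant("MLH1", "missense", 0.001, True),
    Variant("MLH1", "synonymous", 0.0005, True),
    Variant("MSH2", "stop_gained", 0.0001, False),
    Variant("MSH6", "missense", 0.2, True),
]
print([v.gene for v in triage(example, {"MLH1", "MSH2", "MSH6"})])  # ['MSH2', 'MLH1']
```

Returning a sorted list rather than a bare filter keeps the most likely candidates at the top, which is the point of a triage step.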

Page 8

Workflow Provenance

Record inferences in clinical decisions
What were the parameters used to build the dataset?
What versions of databases, genome assembly, and machine were used?
Where does each piece of evidence for/against pathogenicity originate?
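One way to picture the provenance questions above is as a record attached to each classified variant. The sketch below is not Taverna's provenance model; the class names, fields and placeholder values are all hypothetical and only show how parameters, versions and evidence origins could be kept together.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Evidence:
    source: str                   # where the evidence came from, e.g. a database name
    source_version: str           # version of that database or tool
    statement: str                # what the evidence says
    supports_pathogenicity: bool  # True = for, False = against


@dataclass
class ProvenanceRecord:
    variant_id: str
    genome_assembly: str
    parameters: dict              # parameters used to build the dataset
    evidence: list = field(default_factory=list)

    def add_evidence(self, item):
        self.evidence.append(item)

    def sources_for(self, supports):
        """List the sources of evidence for (True) or against (False) pathogenicity."""
        return [e.source for e in self.evidence if e.supports_pathogenicity == supports]

    def to_json(self):
        """Serialise the whole record so it can be stored next to the results."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values only; no real identifiers or versions are implied.
record = ProvenanceRecord("variant-1", "example-assembly", {"max_freq": 0.01})
record.add_evidence(Evidence("population-frequency", "example-release", "common in population", False))
record.add_evidence(Evidence("conservation", "example-release", "conserved across species", True))
print(record.sources_for(True))  # ['conservation']
```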

Page 9

Infrastructure Requirements

Execute analysis workflows
Accessible to clinicians and genetic testers
Cope with expanding demands on compute
Provide a secure environment
Collect provenance

Page 10

Architecture overview

[Diagram: a web interface (input SNPs, results) in front of a workflow engine orchestrator; the orchestrator drives several Taverna Server instances, e-Hive and other engines; each engine runs its application-specific tools and Web Services; storage (S3), a cache (S3) and Ensembl (MySQL) hold the data.]

All user interaction via web interface

User data stored in the Cloud

Data for all tools and Web Services stored in the Cloud

Unified access to different workflow engines with our common REST API

Tools and Web Services for each workflow are installed together for easy replication

Page 11

Workflow engine orchestration

Orchestrator is workflow executor agnostic
Uses common API to:
  List workflows
  Configure runs
  Start runs
  Manage current runs
    Status
    Progress
  Delete runs

[Diagram: the workflow engine orchestrator, with a cache, exposes a common REST API; a Taverna interface and an e-Hive interface map it onto the engine-specific APIs of Taverna and e-Hive.]
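The executor-agnostic design can be sketched as one abstract interface with an adapter per engine. The code below is an assumption-laden illustration, not the project's REST API: `WorkflowEngine`, `InMemoryEngine` and `Orchestrator` are hypothetical names, the in-memory engine merely stands in for a Taverna or e-Hive adapter, and progress reporting is omitted.

```python
import itertools
from abc import ABC, abstractmethod


class WorkflowEngine(ABC):
    """Operations every engine-specific interface is assumed to provide."""

    @abstractmethod
    def list_workflows(self): ...

    @abstractmethod
    def configure_run(self, workflow, inputs): ...

    @abstractmethod
    def start_run(self, run_id): ...

    @abstractmethod
    def status(self, run_id): ...

    @abstractmethod
    def delete_run(self, run_id): ...


class InMemoryEngine(WorkflowEngine):
    """Stand-in for an engine adapter; it keeps runs in a dict instead of calling a server."""

    def __init__(self, workflows):
        self._workflows = list(workflows)
        self._runs = {}
        self._ids = itertools.count(1)

    def list_workflows(self):
        return list(self._workflows)

    def configure_run(self, workflow, inputs):
        if workflow not in self._workflows:
            raise KeyError(workflow)
        run_id = next(self._ids)
        self._runs[run_id] = {"workflow": workflow, "inputs": dict(inputs), "status": "configured"}
        return run_id

    def start_run(self, run_id):
        self._runs[run_id]["status"] = "running"

    def status(self, run_id):
        return self._runs[run_id]["status"]

    def delete_run(self, run_id):
        del self._runs[run_id]


class Orchestrator:
    """Routes common-API calls to whichever engine owns the run."""

    def __init__(self, engines):
        self._engines = dict(engines)

    def list_workflows(self):
        return {name: engine.list_workflows() for name, engine in self._engines.items()}

    def configure_run(self, engine_name, workflow, inputs):
        # A run handle is the pair (engine name, engine-local run id).
        return (engine_name, self._engines[engine_name].configure_run(workflow, inputs))

    def start_run(self, run):
        self._engines[run[0]].start_run(run[1])

    def status(self, run):
        return self._engines[run[0]].status(run[1])

    def delete_run(self, run):
        self._engines[run[0]].delete_run(run[1])
```

Callers only ever hold a run handle and the orchestrator, so adding another engine means writing one more adapter rather than changing the web interface.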

Page 12

Additional Taverna Functionality

Integration with Cloud infrastructure
  AWS first
  Read/write files securely to S3
  Start and stop Cloud instances if required
Tool and Web Service scaling
  Self-scaling
Released as part of Taverna 3

Page 13

The user’s view

Curated set of workflows
  Designed, built and tested by domain experts
  Quality assurance tested (if appropriate)
Workflows are presented as applications
  The workflows themselves are hidden
  Configured and run via a web interface
All user data stored securely in the Cloud
  User separation

Workflows as a Service

Page 14

Web interface: Overview

Upload input data
Configure workflow runs with
  Input parameters
  Uploaded data
  Reused output data
Start workflow runs
Monitor workflow runs
View results preview
Download complete results

Page 15

Web interface: Getting started

Page 16

Web interface: Creating a Run

Page 17

Web interface: Checking run progress

Page 18

A Typical Workflow

Parse files from SNP calling machines
Annotate SNPs
Predict effects (BioMart, VEP, PolyPhen)
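The three stages of this typical workflow can be pictured as a chain of functions. This is a toy sketch only: the tab-separated input layout is assumed, and `lookup` and `predictor` are hypothetical callables standing in for queries to services such as BioMart, VEP or PolyPhen; no real service is called.

```python
import csv
import io


def parse_snp_calls(text):
    """Parse tab-separated lines of chrom, pos, ref, alt (an assumed layout)."""
    snps = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, pos, ref, alt = row
        snps.append({"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt})
    return snps


def annotate(snps, lookup):
    """Attach a gene name to each SNP using the supplied lookup(chrom, pos)."""
    return [{**snp, "gene": lookup(snp["chrom"], snp["pos"])} for snp in snps]


def predict_effects(snps, predictor):
    """Attach a predicted effect to each SNP using the supplied predictor(snp)."""
    return [{**snp, "effect": predictor(snp)} for snp in snps]


def run_workflow(text, lookup, predictor):
    """Chain the stages: parse, annotate, then predict effects."""
    return predict_effects(annotate(parse_snp_calls(text), lookup), predictor)
```

Passing the lookup and predictor in as arguments keeps each stage replaceable, much as a workflow engine lets one service be swapped for another without touching the rest of the pipeline.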

Page 19

Workflow as a Service

The workflow IS the service
Run restricted sets of Taverna workflows in the cloud
Connects to other cloud-based resources: storage, tools, etc.
Users can tweak parameters, but not design their own workflows
Web portal access for scientists
Data passed by reference instead of by file
Pay as you go: cheap at the point of use
Elastic and available now

Page 20

Acknowledgements/Partners

University of Manchester
Eagle Genomics
Technology Strategy Board 100932 - Cloud Analytics for Life Sciences
National Health Service
Amazon Web Services