Genome Informatics 2013 – P.Missier Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central Paolo Missier School of Computing Newcastle University, UK Genome Informatics Lisbon, Dec. 5, 2013 With thanks to Simon Woodman, Jacek Cała

Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Clinical genomics informatics talk -- Lisbon, Dec. 5, 2013

Page 1: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Paolo Missier

School of Computing

Newcastle University, UK

Genome Informatics

Lisbon, Dec. 5, 2013

With thanks to Simon Woodman, Jacek Cała

Page 2: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Cloud e-genome - Motivation

• 2 year pilot project

• Funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC)

• Nov. 2013: Cloud resources from Azure for Research Award

  • 1 year’s worth of data/network/computing resources

Aim:

• To translate genetic testing by whole-exome sequencing into clinical practice

Objectives:

• Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale

• Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians

Page 3: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Approach and testbed

• Technical approach:

  • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)

  • Port current NGS pipelines (scripts) to cloud-based workflow technology

  • Implement user tools for clinical diagnosis as cloud apps (SaaS)

• Testbed and scale:

  • Neurological patients from the North-East of England, focus on rare diseases

  • Initial testing on about 300 sequences

  • 2500-3000 sequences expected within 12 months

Page 4: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Key technical requirements

• Scalability

  • In the rate and number of patient sequence submissions

  • In the density of sequence data (from whole exome to whole genome)

• Flexibility, traceability, comparability across versions

  • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)

  • Trace each version and its executions

  • Ability to compare results obtained using different pipelines and reason about the differences

• Openness

  • Simplify the process of adding:

    • New variant analysis tools

    • New statistical methods for variant filtering, selection, and ranking

    • Integration with third party databases

Provenance

Page 5: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Outline

• Current pipelines

• The e-Science Central workflow management system

  • Home-grown system, cloud-based

  • SaaS: “Science as a Service”

  • Provenance-aware

• Porting the pipelines to e-Science Central

  • Expected benefits

  • Strategy and issues

• Role of provenance

  • “Where do these variants come from?”

  • “Why do these results differ?”

Page 6: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Current pipelines – top level view

• Shell scripts control a suite of tools

  • Deployed on a local HPC cluster

• Loadable modules and overall system maintained by dedicated staff

20 compute nodes

48/96GB RAM / 250GB disk

19TB usable storage space

Gigabit Ethernet

Page 7: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Pipeline – breakdown view

1. Alignment - BWA

2. Cleaning - Picard

3. Sequence recalibration - GATK

4. Coverage - bedTools

5. Variant calling - GATK

6. Variant recalibration - GATK

• Filtering - GATK

7. Annotation - in-house tools, wANNOVAR
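The numbered steps above can be written down as data, which is roughly what a workflow specification does. The sketch below is only an illustration: the stage names and tools come from the slide, while the `Stage` class, the `run_pipeline` function and the strictly linear ordering are assumptions (the slide does not say whether Coverage is a side branch, and the unnumbered Filtering step is left out). No real tool is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    order: int   # step number as printed on the slide
    name: str    # what the step does
    tool: str    # tool named on the slide


# Stage names and tools are from the slide; the linear order is an assumption.
PIPELINE = [
    Stage(1, "Alignment", "BWA"),
    Stage(2, "Cleaning", "Picard"),
    Stage(3, "Sequence recalibration", "GATK"),
    Stage(4, "Coverage", "bedTools"),
    Stage(5, "Variant calling", "GATK"),
    Stage(6, "Variant recalibration", "GATK"),
    Stage(7, "Annotation", "in-house / wANNOVAR"),
]


def run_pipeline(sample_id, stages=PIPELINE):
    """Walk the stages in order and return a log line per stage.

    This only records what would run; it does not call BWA, Picard or GATK.
    """
    log = []
    for stage in sorted(stages, key=lambda s: s.order):
        log.append(f"{sample_id}: step {stage.order} {stage.name} ({stage.tool})")
    return log


def tools_used(stages=PIPELINE):
    """Return the distinct tools the pipeline depends on, sorted by name."""
    return sorted({s.tool for s in stages})
```

Keeping the stage list as data rather than as shell script lines is what makes it easy to swap a tool or reorder a step and then rerun.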

Page 8: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Why move to the cloud?

Standard benefits of virtualization, including:

• Economy: No capital expenses. Affordable operations expenses

• Elasticity: respond to bursts in submissions without upfront costs

• Scale out vs scale up

  • New workflows can be deployed on new nodes on demand using VMs

  • Workflow engine may exploit parallelism within the pipelines

• Virtual storage with no real size limitations

• Possible issues

  • Security, privacy

    • Addressed at policy level

Page 9: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Why port to workflow?

• Programming:

  • Workflows provide better abstraction in the specification of pipelines

  • Workflows are directly executable by the enactment engine

  • Easier to understand, share, and maintain over time

  • Flexible – relatively easy to introduce variations

• System: minimal installation/deployment requirements

  • Fewer dedicated technical staff hours required

  • Automated dependency management, packaging, deployment

  • Extensible by wrapping new tools

  • Exploits data parallelism when possible

• Execution monitoring, provenance collection

  • Persistent trace serves as evidence for data

  • Amenable to automated analysis

Page 10: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Porting – in progress

Page 11: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Porting - alignment

Page 12: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Porting - alignment

Page 13: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Porting – Sequence recalibration

Page 14: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

e-Science Central middleware

• Portability across platforms and configurations

• Single computer

• Cluster of local servers

• Public cloud providers (Microsoft Azure, but also AWS)

• A platform for our academic research

• Scalability, data management, cloud computing, medical data management

Store, manage and process data

• Middleware on top of commodity cloud storage and compute resources

Page 15: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

e-Science Central: scalable, extensible

• Client APIs / integrating other components

• New workflow blocks easily programmed (in Java)

• Integrate multiple runtime environments

  • R, Octave, Java, JavaScript, (Perl)

• Workflow tasks mapped to multiple workers

[Figure: e-Science Central deployment on Azure. Labelled components: web browser and rich client app; Web UI and REST API; e-Science Central main server with JMS queue (<<Azure VM>>); e-SC db backend (<<Azure VM>>); e-SC blob store (Azure Blob store); three workflow engines (<<worker role>>). Labelled flows: workflow invocations, e-SC control data, workflow data.]
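The idea of mapping workflow invocations to several workers through a queue can be sketched in a few lines. This is a minimal in-process stand-in, not the e-Science Central API: a `queue.Queue` plays the role of the JMS queue and threads play the role of the worker-role workflow engines. The function name `run_invocations` and the invocation identifiers are hypothetical.

```python
import queue
import threading


def run_invocations(invocations, n_workers=3):
    """Hand each workflow invocation to whichever worker is free next."""
    tasks = queue.Queue()       # stand-in for the JMS queue
    results = {}
    lock = threading.Lock()     # protects the shared results dict

    def worker(worker_id):
        while True:
            item = tasks.get()
            if item is None:    # sentinel: no more work for this worker
                tasks.task_done()
                return
            # A real engine would execute the workflow here.
            with lock:
                results[item] = f"done by worker-{worker_id}"
            tasks.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for inv in invocations:
        tasks.put(inv)
    for _ in threads:           # one sentinel per worker
        tasks.put(None)
    tasks.join()                # wait until every queued item is processed
    for t in threads:
        t.join()
    return results
```

Adding capacity then amounts to raising `n_workers`, which is the scale-out behaviour the slide attributes to adding worker nodes.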

Page 16: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

e-Science Central is provenance-aware

Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)

Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)

Why does provenance matter?

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To describe one’s experiment to others, for understanding / reuse

• To provide evidence in support of scientific claims

• To enable post hoc process analysis - debugging, improvement, evolution

Page 17: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Role of provenance in Cloud e-genome

• Ultimately, provenance is evidence in support of clinical diagnosis

  1. Why do these variants appear in the output list?

  2. Why have you concluded they are disease-causing?

• Requires ability to trace variants through workflow execution

  • Simple scripting lacks this functionality

• Includes tracing of user decision processes (still experimental)
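Tracing an output back through the workflow execution can be illustrated with a tiny provenance store. The sketch assumes each derived data item records the step that generated it and the items it was derived from; the record layout, the item names and the `lineage` function are hypothetical and do not reflect how e-Science Central stores provenance.

```python
# Hypothetical provenance records: item -> generating step and source items.
RECORDS = {
    "annotated_variants": {
        "generated_by": "Annotation",
        "derived_from": ["called_variants"],
    },
    "called_variants": {
        "generated_by": "Variant calling",
        "derived_from": ["recalibrated_reads"],
    },
    "recalibrated_reads": {
        "generated_by": "Sequence recalibration",
        "derived_from": ["aligned_reads"],
    },
    "aligned_reads": {
        "generated_by": "Alignment",
        "derived_from": ["raw_reads", "reference_genome"],
    },
}


def lineage(item, records):
    """Return (step, item) pairs from `item` back to the original inputs.

    Items with no record are treated as external inputs.
    """
    trail = []
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = records.get(current)
        if record is None:
            trail.append(("input", current))
            continue
        trail.append((record["generated_by"], current))
        # Reverse so that sources are visited in their listed order.
        stack.extend(reversed(record["derived_from"]))
    return trail
```

Calling `lineage("annotated_variants", RECORDS)` walks back through annotation, variant calling, recalibration and alignment to the raw reads and reference genome, which is the kind of answer question 1 on the slide asks for.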

Page 18: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Comparing results across pipeline configurations

• Run pipeline version V1 → variant list VL1

• V1 → V2: replace BWA version, modify Annovar configuration parameters

• Run pipeline version V2 → variant list VL2

• How do VL1 and VL2 differ, and why?

  • DDIFF (data differencing)

  • PDIFF (provenance differencing)
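Data differencing between the two variant lists can be sketched as plain set operations. This is an assumption about what a basic DDIFF step might look like, not the project's implementation: variants are represented here as hypothetical `(chromosome, position, ref, alt)` tuples, and the function name `ddiff` is illustrative.

```python
def ddiff(vl1, vl2):
    """Compare two variant lists and report what each run found.

    Variants are hashable tuples such as (chromosome, position, ref, alt).
    """
    s1, s2 = set(vl1), set(vl2)
    return {
        "only_in_vl1": sorted(s1 - s2),   # lost when moving from V1 to V2
        "only_in_vl2": sorted(s2 - s1),   # new in V2
        "in_both": sorted(s1 & s2),       # unaffected by the change
    }
```

A difference of this kind says what changed between VL1 and VL2 but not why; explaining the cause is the job the slide assigns to PDIFF.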

Page 19: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

PDIFF - overview

[Figure: two workflows, WA and WB]

Page 20: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

The corresponding provenance traces

Page 21: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Delta graph computed by PDIFF

PDIFF helps determine the impact of variations in the pipeline
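To give a feel for how comparing provenance can localise the cause of a difference, here is a deliberately simplified sketch. It is not the PDIFF algorithm, which works on provenance graphs and produces a delta graph. It assumes each run's trace has been flattened to a hypothetical mapping from step name to the tool, version and parameters that were recorded, and it simply reports the steps where the two runs disagree. All names and values are placeholders.

```python
def trace_delta(trace_a, trace_b):
    """List the steps whose recorded configuration differs between two runs.

    Each trace maps a step name to a dict describing how that step ran.
    A step present in only one trace is reported with None on the other side.
    """
    delta = []
    for step in sorted(set(trace_a) | set(trace_b)):
        a, b = trace_a.get(step), trace_b.get(step)
        if a != b:
            delta.append((step, a, b))
    return delta


# Placeholder traces for pipeline versions V1 and V2 (values are illustrative).
TRACE_V1 = {
    "Alignment": {"tool": "BWA", "version": "version-A", "params": {}},
    "Variant calling": {"tool": "GATK", "version": "version-X", "params": {}},
    "Annotation": {"tool": "Annovar", "version": "version-P", "params": {"config": "c1"}},
}
TRACE_V2 = {
    "Alignment": {"tool": "BWA", "version": "version-B", "params": {}},
    "Variant calling": {"tool": "GATK", "version": "version-X", "params": {}},
    "Annotation": {"tool": "Annovar", "version": "version-P", "params": {"config": "c2"}},
}
```

With these placeholder traces, `trace_delta(TRACE_V1, TRACE_V2)` reports the Alignment and Annotation steps, matching the V1 to V2 changes described earlier (BWA version, Annovar configuration).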

Page 22: Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Summary

• Cloud e-genome has just begun (Sept. 2013)

• Variant calling and interpretation for clinical use

• Whole-exome sequence processing on a cloud infrastructure

  • Windows Azure – project sponsor

• Currently porting existing pipelines to e-Science Central workflow

• Provenance as a key element of supporting evidence

• Scalability testing to start in Sept 2014