Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central

Genome Informatics 2013 – P. Missier

Paolo Missier

School of Computing

Newcastle University, UK

Genome Informatics

Lisbon, Dec. 5, 2013

With thanks to Simon Woodman, Jacek Cała

Cloud e-Genome - Motivation

• 2-year pilot project

• Funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC)

• Nov. 2013: Cloud resources from Azure for Research Award

  • 1 year’s worth of data/network/computing resources

Aim:

• To translate genetic testing by whole-exome sequencing into clinical practice

Objectives:

• Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale

• Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians

Approach and testbed

• Technical approach:

  • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)

  • Port current NGS pipelines (scripts) to cloud-based workflow technology

  • Implement user tools for clinical diagnosis as cloud apps (SaaS)

• Testbed and scale:

  • Neurological patients from the North-East of England, focus on rare diseases

  • Initial testing on about 300 sequences

  • 2500-3000 sequences expected within 12 months

Key technical requirements

• Scalability

  • In the rate and number of patient sequence submissions

  • In the density of sequence data (from whole exome to whole genome)

• Flexibility, Traceability, Comparability across versions

  • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)

  • Trace each version and its executions

  • Ability to compare results obtained using different pipelines and reason about the differences

• Openness

  • Simplify the process of adding:

    • New variant analysis tools

    • New statistical methods for variant filtering, selection, and ranking

  • Integration with third-party databases

Provenance
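The requirement to "trace each version and its executions" can be illustrated with a small sketch. It is only one possible approach, not how e-Science Central does it: the pipeline configuration is reduced to a canonical JSON string and hashed, so that two runs share a version identifier only when tools and parameters match exactly. The names `pipeline_fingerprint` and `ExecutionLog`, and the configuration layout, are hypothetical.

```python
import hashlib
import json


def pipeline_fingerprint(config):
    """Return a stable identifier for a pipeline configuration.

    Keys are sorted so that the same configuration always hashes to the
    same value, regardless of the order in which it was written down.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExecutionLog:
    """Hypothetical log that groups run identifiers by pipeline version."""

    def __init__(self):
        self.runs = {}  # fingerprint -> list of run ids

    def record(self, config, run_id):
        fingerprint = pipeline_fingerprint(config)
        self.runs.setdefault(fingerprint, []).append(run_id)
        return fingerprint
```

With such a log, results from runs that share a fingerprint are directly comparable, while results filed under different fingerprints call for the kind of comparison the later slides describe.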

Outline

• Current pipelines

• The e-Science Central workflow management system

  • Home-grown system, cloud-based

  • SaaS: “Science as a Service”

  • Provenance-aware

• Porting the pipelines to e-Science Central

  • Expected benefits

  • Strategy and issues

• Role of Provenance

  • “Where do these variants come from?”

  • “Why do these results differ?”

Current pipelines – top level view

• Shell scripts control a suite of tools

  • Deployed on a local HPC cluster

  • Loadable modules and overall system maintained by dedicated staff

• Cluster: 20 compute nodes; 48/96GB RAM / 250GB disk; 19TB usable storage space; Gigabit Ethernet

Pipeline – breakdown view

1. Alignment - BWA

2. Cleaning - Picard

3. Sequence recalibration - GATK

4. Coverage - BEDTools

5. Variant calling - GATK

6. Variant recalibration - GATK

   Filtering - GATK

7. Annotation - in-house tools, wANNOVAR
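The breakdown above can be written down as a small dependency graph, which is roughly what a workflow system needs in order to schedule the steps. The sketch below is illustrative only. The step-to-tool mapping comes from the slide, but the dependency edges are an assumption: the slide does not say which step feeds which, so here coverage and variant calling are both assumed to consume the recalibrated alignments, and filtering is assumed to follow variant recalibration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Step -> tool, as listed on the slide.
TOOLS = {
    "alignment": "BWA",
    "cleaning": "Picard",
    "sequence_recalibration": "GATK",
    "coverage": "BEDTools",
    "variant_calling": "GATK",
    "variant_recalibration": "GATK",
    "filtering": "GATK",
    "annotation": "in-house tools, wANNOVAR",
}

# Step -> steps it depends on. ASSUMPTION: these edges are a guess made
# for illustration; the slide only gives the numbered list of steps.
DEPENDS_ON = {
    "alignment": set(),
    "cleaning": {"alignment"},
    "sequence_recalibration": {"cleaning"},
    "coverage": {"sequence_recalibration"},
    "variant_calling": {"sequence_recalibration"},
    "variant_recalibration": {"variant_calling"},
    "filtering": {"variant_recalibration"},
    "annotation": {"filtering"},
}


def execution_order(depends_on):
    """Return one valid order in which the steps can be executed."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step in execution_order(DEPENDS_ON):
        print(f"{step}: {TOOLS[step]}")
```

Under these assumed edges, coverage and variant calling do not depend on each other, so an engine would be free to run them side by side.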

Why move to the cloud?

Standard benefits of virtualization, including:

• Economy: no capital expenses, affordable operating expenses

• Elasticity: respond to bursts in submissions without upfront costs

• Scale out vs scale up

  • New workflows can be deployed on new nodes on demand using VMs

  • Workflow engine may exploit parallelism within the pipelines

• Virtual storage with no real size limitations

• Possible issues

  • Security, privacy

  • Addressed at policy level

Why port to workflow?

• Programming:

  • Workflows provide better abstraction in the specification of pipelines

  • Workflows directly executable by enactment engine

  • Easier to understand, share, and maintain over time

  • Flexible: relatively easy to introduce variations

• System: minimal installation/deployment requirements

  • Fewer dedicated technical staff hours required

  • Automated dependency management, packaging, deployment

  • Extensible by wrapping new tools

  • Exploits data parallelism when possible

• Execution monitoring, provenance collection

  • Persistent trace serves as evidence for data

  • Amenable to automated analysis
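As a rough illustration of "exploits data parallelism when possible": samples from different patients are independent of one another, so the same processing step can be applied to all of them concurrently. The sketch below uses Python's standard thread pool as a stand-in for a workflow engine. `process_sample` is a hypothetical placeholder and does not run any real tool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    """Hypothetical stand-in for running the pipeline on one sample.

    A real implementation would invoke the wrapped tools; here we only
    return the name of the output the run would produce.
    """
    return f"{sample_id}.vcf"


def process_all(sample_ids, max_workers=4):
    """Process independent samples concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_ids))


if __name__ == "__main__":
    print(process_all(["sample-1", "sample-2", "sample-3"]))
```

The point of the sketch is that the pipeline author only states what happens to one sample. How many samples run at once is left to the engine.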

Porting – in progress

Porting - alignment

Porting – Sequence recalibration

e-Science Central middleware

• Portability across platforms and configurations

• Single computer

• Cluster of local servers

• Public cloud providers (Microsoft Azure, but also AWS)

• A platform for our academic research

• Scalability, data management, cloud computing, medical data management

Store, manage and process data

• Middleware on top of commodity cloud storage and compute resources

e-Science Central: scalable, extensible

• Client APIs / integrating other components

• New workflow blocks easily programmed (in Java)

• Integrate multiple runtime environments

  • R, Octave, Java, JavaScript, (Perl)

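• Workflow tasks mapped to multiple workers

[Figure: e-Science Central architecture on Azure. A web browser and rich client apps connect to the e-Science Central main server (Azure VM) through the Web UI and the REST API. The main server uses the e-SC db backend and the e-SC blob store (Azure Blob store), and passes workflow invocations through a JMS queue to several workflow engines (Azure worker roles). Flows shown: workflow invocations, e-SC control data, workflow data.]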
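The mapping of workflow invocations to multiple workers can be sketched with a shared queue from which several workers pull work. This is an analogy only: the in-process `queue.Queue` stands in for the JMS queue, threads stand in for the workflow engines, and the function name `run_workers` and the `engine-N` labels are made up for the example.

```python
import queue
import threading


def run_workers(invocations, n_workers=3):
    """Dispatch invocations to workers through a shared queue.

    Returns a dict mapping each invocation to the worker that handled it.
    """
    tasks = queue.Queue()
    handled_by = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            with lock:
                handled_by[item] = name
            tasks.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"engine-{i}",))
        for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for invocation in invocations:
        tasks.put(invocation)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return handled_by
```

Because workers only ever talk to the queue, adding capacity means starting more workers. The submitting side does not change, which is the property that makes scaling out straightforward.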

e-Science Central is provenance-aware

Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)

Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)

Why does provenance matter?

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To describe one’s experiment to others, for understanding / reuse

• To provide evidence in support of scientific claims

• To enable post hoc process analysis: debugging, improvement, evolution

Role of provenance in Cloud e-Genome

• Ultimately, provenance is evidence in support of clinical diagnosis

  1. Why do these variants appear in the output list?

  2. Why have you concluded they are disease-causing?

• Requires ability to trace variants through workflow execution

  • Simple scripting lacks this functionality

• Includes tracing of user decision processes (still experimental)
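To make "trace variants through workflow execution" concrete, the sketch below records, for each activity, which data items it used and which it generated, and then walks those records backwards from an output. It is a minimal illustration and not the e-Science Central provenance model. The `Trace` class, its method names, and the file names in the usage example are all hypothetical.

```python
from collections import defaultdict


class Trace:
    """Minimal provenance trace: activities, their inputs, their outputs."""

    def __init__(self):
        self.generated_by = {}        # data item -> activity that produced it
        self.used = defaultdict(set)  # activity -> data items it consumed

    def record(self, activity, inputs, outputs):
        """Record one activity execution."""
        self.used[activity].update(inputs)
        for item in outputs:
            self.generated_by[item] = activity

    def lineage(self, item):
        """Return the activities that contributed to item, nearest first."""
        steps, frontier, seen = [], [item], set()
        while frontier:
            current = frontier.pop(0)
            activity = self.generated_by.get(current)
            if activity is None or activity in seen:
                continue  # a raw input, or an activity already visited
            seen.add(activity)
            steps.append(activity)
            frontier.extend(sorted(self.used[activity]))
        return steps


if __name__ == "__main__":
    trace = Trace()
    trace.record("align", ["reads.fastq"], ["aligned.bam"])
    trace.record("call", ["aligned.bam"], ["variants.vcf"])
    trace.record("annotate", ["variants.vcf"], ["annotated.vcf"])
    print(trace.lineage("annotated.vcf"))
```

A plain shell script produces the same output files but leaves no such record, which is why question 1 above is hard to answer after the fact.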

Comparing results across pipeline configurations

Run pipeline version V1 → Variant list VL1

V1 → V2: replace BWA version, modify Annovar configuration parameters

Run pipeline version V2 → Variant list VL2

How do VL1 and VL2 differ, and why?

• DDIFF (data differencing): compares Variant list VL1 with Variant list VL2

• PDIFF (provenance differencing): compares the provenance of the two runs
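Data differencing on two variant lists can be approximated with plain set operations. The sketch below is not the DDIFF implementation. It assumes each variant is identified by a (chromosome, position, reference allele, alternate allele) tuple, a common convention but an assumption here, and the function name `ddiff` is chosen only to echo the slide.

```python
def ddiff(vl1, vl2):
    """Split two variant lists into shared and run-specific variants.

    Each variant is a hashable key such as (chrom, pos, ref, alt).
    """
    set1, set2 = set(vl1), set(vl2)
    return {
        "only_in_VL1": sorted(set1 - set2),
        "only_in_VL2": sorted(set2 - set1),
        "common": sorted(set1 & set2),
    }


if __name__ == "__main__":
    # Made-up variants, for illustration only.
    vl1 = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
    vl2 = [("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")]
    for group, variants in ddiff(vl1, vl2).items():
        print(group, variants)
```

This tells you which variants changed between V1 and V2 but not why they changed. Answering that requires looking at how each list was produced.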

PDIFF - overview

[Figure: two workflow versions, WA and WB, compared by PDIFF]

The corresponding provenance traces

Delta graph computed by PDIFF

PDIFF helps determine the impact of variations in the pipeline
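The idea of a delta between two executions can be shown on a much simpler structure than the graphs PDIFF operates on. The sketch below is a deliberately simplified stand-in, not the PDIFF algorithm: each trace is flattened to a dictionary of per-step records, and the delta lists the steps whose records differ. The record fields (`tool`, `version`, `params`) and the function name `trace_delta` are assumptions made for the example.

```python
def trace_delta(trace_a, trace_b):
    """Report per-step differences between two simplified traces.

    Each trace maps a step name to a record such as
    {"tool": ..., "version": ..., "params": {...}}.
    """
    delta = {}
    for step in sorted(set(trace_a) | set(trace_b)):
        rec_a, rec_b = trace_a.get(step), trace_b.get(step)
        if rec_a == rec_b:
            continue  # identical step: not part of the delta
        if rec_a is None:
            delta[step] = {"status": "added", "fields": sorted(rec_b)}
        elif rec_b is None:
            delta[step] = {"status": "removed", "fields": sorted(rec_a)}
        else:
            fields = sorted(
                key for key in set(rec_a) | set(rec_b)
                if rec_a.get(key) != rec_b.get(key)
            )
            delta[step] = {"status": "changed", "fields": fields}
    return delta
```

For the V1 to V2 change described earlier, such a delta would flag the alignment step (different BWA version) and the annotation step (different Annovar parameters) and leave the untouched steps out.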

Summary

• Cloud e-Genome has just begun (Sept. 2013)

• Variant calling and interpretation for clinical use

• Whole-exome sequence processing on a cloud infrastructure

  • Windows Azure (project sponsor)

• Currently porting existing pipelines to e-Science Central workflow

• Provenance as a key element of supporting evidence

• Scalability testing to start in Sept 2014
