Invited cloud-e-Genome project talk at 2015 NGS Data Congress

NGS Data Congress, London, June 2015. P. Missier

Scalable WES Processing And Variant Interpretation With Provenance Recording Using Workflow On The Cloud

Paolo Missier, Jacek Cała, Yaobo Xu,

Eldarina Wijaya, Ryan Kirby

School of Computing Science and Institute of Genetic Medicine

Newcastle University, Newcastle upon Tyne, UK

NGS Data Congress

London, June 15th, 2015

The Cloud-e-Genome project at Newcastle

Objectives:

1. NGS data processing:

• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud

With an aim to achieve:

• Cost control
• Scalability
• Flexibility of design and of maintenance

2. Traceable variant interpretation:

• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes

With an aim to:

• Ensure accountability through traceability
• Enable analytics over past patient cases

Project:

• 2-year pilot project: 2013-2015
• Funded by the UK’s National Institute for Health Research (NIHR)
• Cloud resources from an Azure for Research Award

Part I: data processing

Objectives:

• Design and implement a flexible WES/WGS pipeline
• Using workflow technology (high-level programming)
• Providing scalable deployment over a public cloud

Scripted NGS data processing pipeline

1. Alignment: aligns the sample sequence to the hg19 reference genome using the BWA aligner
2. Cleaning and duplicate elimination: Picard tools
3. Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
4. Coverage: computes coverage of each read
5. Variant calling: operates on multiple samples simultaneously and splits samples into chunks; the haplotype caller detects both SNVs and longer indels
6. Variant recalibration: attempts to reduce the false positive rate from the caller
7. VCF subsetting by filtering, e.g. non-exomic variants
8. Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations

Scripts to workflow - Design

Steps: Design, Cloud Deployment, Execution, Analysis

Theoretical advantages of using a workflow programming model:

• Better abstraction
• Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools

Workflow Design

# Create output and temporary directories for Picard
echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p "$PICARD_OUTDIR"
mkdir -p "$PICARD_TEMP"

# Clean the sorted BAM file
echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT="$SORTED_BAM_FILE" OUTPUT="$SORTED_BAM_FILE_CLEANED"

# Remove duplicate reads
echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT="$SORTED_BAM_FILE_CLEANED" OUTPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" \
  METRICS_FILE="$PICARD_LOG" REMOVE_DUPLICATES=true ASSUME_SORTED=true

# Add read group information
echo "Adding read group information to bam file..."
$Picard_AddRG INPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" OUTPUT="$SORTED_BAM_FILE_NODUPS" \
  RGID="$READ_GROUP_ID" RGPL=illumina RGSM="$SAMPLE_ID" \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

# Index the final BAM file
echo "Indexing bam files..."
samtools index "$SORTED_BAM_FILE_NODUPS"

“Wrapper” blocks

Utility blocks
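To illustrate what a “wrapper” block could look like, here is a minimal Python sketch, not the e-Science Central block API. The `WrapperBlock` class, the `picard_clean_sam` and `picard_no_dups` launcher names, and the file names are all hypothetical stand-ins for the `$Picard_CleanSam` and `$Picard_NoDups` variables in the script. The block only builds the command line; it does not run the tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WrapperBlock:
    """Hypothetical workflow block that wraps one command-line tool."""
    name: str
    launcher: List[str]  # command prefix; stands in for e.g. $Picard_CleanSam
    fixed_args: Dict[str, str] = field(default_factory=dict)

    def command(self, **ports: str) -> List[str]:
        """Build the argument vector from fixed settings plus input/output ports."""
        args = {**self.fixed_args, **ports}
        return self.launcher + [f"{key}={value}" for key, value in args.items()]


# Launcher names below are placeholders, not real executables.
clean_sam = WrapperBlock("Picard-CleanSam", launcher=["picard_clean_sam"])
no_dups = WrapperBlock(
    "Picard-NoDups",
    launcher=["picard_no_dups"],
    fixed_args={"REMOVE_DUPLICATES": "true", "ASSUME_SORTED": "true"},
)

cmd = clean_sam.command(INPUT="sample.sorted.bam", OUTPUT="sample.cleaned.bam")
print(cmd)
```

Keeping the tool-specific settings inside the block means that the workflow only has to connect input and output ports, which is one way to read the "extensible by wrapping new tools" advantage.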

Workflow design

Conceptual:

Actual:

Anatomy of a complex parallel dataflow

e-Science Central: simple dataflow model…

Sample-split: parallel processing of samples in a batch

Anatomy of a complex parallel dataflow

… with hierarchical structure

Phase II, top level

Chromosome-split: parallel processing of each chromosome across all samples

Phase III

Sample-split: parallel processing of samples

Implicit parallelism in the pipeline

[Figure: three pipeline stages applied to Sample 1 … Sample n]

• Stage I, per-sample parallel processing: align-clean-recalibrate-coverage
• Chromosome split
• Stage II, per-chromosome parallel processing: variant calling, recalibration
• Stage III, per-sample parallel processing: variant filtering, annotation

How does the workflow design exploit this parallelism?
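The three-stage pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's workflow code: the stage functions are placeholders that only manipulate file-name strings, the chromosome list is shortened, and a thread pool stands in for the pool of workflow engines.

```python
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = ["chr1", "chr2", "chrX"]  # shortened, illustrative list


def stage1_align_clean_recalibrate(sample):
    # Placeholder for align-clean-recalibrate-coverage on one sample.
    return {"sample": sample, "bam": f"{sample}.recal.bam"}


def split_by_chromosome(per_sample):
    # Regroup: one work item per chromosome, holding every sample's data for it.
    return {c: [f"{s['bam']}:{c}" for s in per_sample] for c in CHROMOSOMES}


def stage2_call_variants(item):
    # Placeholder for multi-sample variant calling and recalibration on one chromosome.
    chrom, regions = item
    return chrom, [f"{r}.vcf" for r in regions]


def stage3_filter_annotate(sample, calls):
    # Placeholder for per-sample filtering and annotation across all chromosomes.
    return sample, [v for _, vcfs in calls for v in vcfs if v.startswith(sample + ".")]


def run_pipeline(samples, workers=3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_sample = list(pool.map(stage1_align_clean_recalibrate, samples))  # Stage I
        by_chrom = split_by_chromosome(per_sample)                            # chromosome split
        calls = list(pool.map(stage2_call_variants, by_chrom.items()))        # Stage II
        annotated = list(pool.map(lambda s: stage3_filter_annotate(s, calls), samples))  # Stage III
    return dict(annotated)


result = run_pipeline(["S1", "S2"])
print(result["S1"])
```

Each `pool.map` call acts as a synchronisation point: Stage II cannot start until every sample has finished Stage I, because each chromosome's work item needs data from all samples.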

Parallel processing over a batch of exomes

Cloud Deployment


• Scalability

• Fewer installation/deployment requirements, staff hours required

• Automated dependency management, packaging

• Configurable to make most efficient use of a cluster

Workflow on Azure Cloud – modular configuration

[Architecture diagram] Components shown:

• Clients: web browser, rich client app
• <<Azure VM>> e-Science Central main server: REST API, Web UI, JMS queue
• <<Azure VM>> e-SC db backend
• Azure Blob store: e-SC blob store
• <<worker role>> Workflow engines (three shown)

Data flows shown: workflow invocations, e-SC control data, workflow data

Workflow engines module configuration: 3 nodes, 24 cores

Modular architecture indefinitely scalable!
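The queue-based design can be illustrated with a small Python sketch in which a thread-safe in-process queue stands in for the JMS queue and threads stand in for workflow engines. The function names and the `sample-N` invocation labels are hypothetical; the real system dispatches workflow invocations between separate machines.

```python
import queue
import threading


def engine(name, invocations, done):
    # Each engine repeatedly takes the next invocation off the shared queue.
    while True:
        try:
            workflow = invocations.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        done.append((name, workflow))  # list.append is thread-safe in CPython


def run(n_engines, workflows):
    invocations = queue.Queue()
    for w in workflows:
        invocations.put(w)  # all invocations are queued before the engines start
    done = []
    threads = [
        threading.Thread(target=engine, args=(f"engine-{i}", invocations, done))
        for i in range(n_engines)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


done = run(3, [f"sample-{i}" for i in range(10)])
print(sorted(w for _, w in done))
```

Because engines pull work rather than having it pushed to them, adding an engine requires no change to the server side, which is the property the modular configuration relies on.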

Scripts to workflow


3. Execution

• Runtime monitoring

• Provenance collection

Performance

3 workflow engines perform better than our HPC benchmark on larger sample sizes

Technical configuration for the 3-VM experiments:

HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space

Azure workflow engines: D13 VMs, each with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, running Ubuntu 14.04

Scalability

There is little incentive to grow the VM pool beyond 6 engines

Cost

Again, a 6-engine configuration achieves near-optimal cost per sample
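The trade-off behind cost per sample can be made concrete with a short calculation. The sketch below assumes a simple model in which only engine VM time is billed; the function name and all input values are made up for illustration and are not measurements from the talk.

```python
def cost_per_sample(n_engines, hourly_rate, wall_clock_hours, n_samples):
    """Engine VM cost of one batch run divided by the number of samples in the batch."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return n_engines * hourly_rate * wall_clock_hours / n_samples


# Made-up inputs for illustration only; not figures from the experiments.
six_engines = cost_per_sample(n_engines=6, hourly_rate=1.0, wall_clock_hours=10.0, n_samples=24)
twelve_engines = cost_per_sample(n_engines=12, hourly_rate=1.0, wall_clock_hours=6.0, n_samples=24)
print(six_engines, twelve_engines)
```

Under this model, doubling the number of engines lowers the cost per sample only if the wall-clock time more than halves, so a pool with diminishing speed-up becomes more expensive per sample as it grows.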

Lessons learnt


• Better abstraction
• Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools

• Scalability
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make most efficient use of a cluster

• Runtime monitoring
• Provenance collection

• Reproducibility
• Accountability

Part II: SVI - Simple, traceable variant interpretation

Objectives:

• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians

• Maintain history of past investigations for analytical purposes

• Ensure accountability through traceability

• Enable analytics over past patient cases

A database of patient cases and investigations

Cases:

Investigations

Provenance of variant identification

• A provenance graph is generated for each investigation

• It accounts for the filtering process for each variant listed in the result

• It enables analytics over provenance graphs across many investigations, for example: “which variants were identified independently in different cases, and how do they correlate with phenotypes?”

Work in progress!
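The example question can be sketched as a query over a flattened summary of past investigations. This is an illustration only: the record layout, the case and variant identifiers, and the function name are hypothetical, and a real provenance graph would carry far more detail about the filtering steps than this summary does.

```python
from collections import defaultdict

# Hypothetical, simplified records: one entry per investigation.
investigations = [
    {"case": "case-A", "phenotype": "phenotype-1", "variants": {"var-1", "var-2"}},
    {"case": "case-B", "phenotype": "phenotype-1", "variants": {"var-2", "var-3"}},
    {"case": "case-C", "phenotype": "phenotype-2", "variants": {"var-3"}},
]


def variants_seen_in_multiple_cases(records):
    """Map each variant found in more than one case to the phenotypes it co-occurs with."""
    cases = defaultdict(set)
    phenotypes = defaultdict(set)
    for record in records:
        for variant in record["variants"]:
            cases[variant].add(record["case"])
            phenotypes[variant].add(record["phenotype"])
    return {v: sorted(phenotypes[v]) for v in cases if len(cases[v]) > 1}


shared = variants_seen_in_multiple_cases(investigations)
print(shared)
```

In this toy data, `var-2` recurs only with one phenotype while `var-3` recurs across two, which is the kind of distinction the analytics are meant to surface.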

Summary

What we are delivering to NIHR:

1. WES/WGS data processing to annotated variants

• Scalable, Cloud-based
• High-level
• Low cost per sample

2. Variant interpretation:

• Simple
• Targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations

Technology
