Clinical genomics informatics talk -- Lisbon, Dec. 5, 2013
Genome Informatics 2013 – P. Missier
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Paolo Missier
School of Computing
Newcastle University, UK
Genome Informatics
Lisbon, Dec. 5, 2013
With thanks to Simon Woodman, Jacek Cała
Cloud e-genome - Motivation
• 2-year pilot project
• Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC)
• Nov. 2013: Cloud resources from Azure for Research Award
  • 1 year’s worth of data/network/computing resources
Aim:
• To translate genetic testing by whole-exome sequencing into clinical practice
Objectives:
• Cost, Scalability: Demonstrate the cost-effectiveness of whole-exome data processing pipelines at population scale
• Usability: Demonstrate a user-facing tool for variant interpretation and genetic diagnosis by clinicians
Approach and testbed
• Technical Approach:
  • Move bulk processing from a dedicated HPC cluster to a cloud infrastructure (IaaS)
  • Port current NGS pipelines (scripts) to cloud-based workflow technology
  • Implement user tools for clinical diagnosis as cloud apps (SaaS)
• Testbed and scale:
  • Neurological patients from the North-East of England, focus on rare diseases
  • Initial testing on about 300 sequences
  • 2500-3000 sequences expected within 12 months
Key technical requirements
• Scalability
  • In the rate and number of patient sequence submissions
  • In the density of sequence data (from whole exome to whole genome)
• Flexibility, Traceability, Comparability across versions
  • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters)
  • Trace each version and its executions
  • Ability to compare results obtained using different pipelines and reason about the differences
• Openness
  • Simplify the process of adding:
    • New variant analysis tools
    • New statistical methods for variant filtering, selection, and ranking
    • Integration with third party databases
[Slide annotation: Provenance]
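One way to "trace each version and its executions" is to derive an identifier from the pipeline configuration itself. The sketch below is only an illustration under stated assumptions: the configuration layout, the placeholder version strings, and the helper names `pipeline_fingerprint` and `record_execution` are hypothetical and are not part of e-Science Central.

```python
import hashlib
import json

def pipeline_fingerprint(config: dict) -> str:
    """Return a short, stable hash identifying one pipeline configuration."""
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical configurations; versions and parameters are placeholders.
v1 = {"aligner": {"tool": "BWA", "version": "x.y"}, "annotation": {"min_quality": 20}}
v2 = {"aligner": {"tool": "BWA", "version": "x.z"}, "annotation": {"min_quality": 20}}

executions = []  # one record per run, tagged with the configuration fingerprint

def record_execution(config: dict, sample_id: str) -> dict:
    """Append a run record so results can later be grouped by pipeline version."""
    entry = {"pipeline": pipeline_fingerprint(config), "sample": sample_id}
    executions.append(entry)
    return entry
```

Two runs with the same fingerprint used the same tools and parameters, so any difference in their results has to come from the input data.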
Outline
• Current pipelines
• The e-Science Central workflow management system
  • Home-grown system, cloud-based
  • SaaS: “Science as a Service”
  • Provenance-aware
• Porting the pipelines to e-Science Central
  • Expected benefits
  • Strategy and issues
• Role of Provenance
  • “Where do these variants come from?”
  • “Why do these results differ?”
Current pipelines – top level view
• Shell scripts control a suite of tools
  • Deployed on a local HPC cluster
  • Loadable modules and overall system maintained by dedicated staff
20 compute nodes
48/96GB RAM / 250GB disk
19TB usable storage space
Gigabit Ethernet
Pipeline – breakdown view
1. Alignment - BWA
2. Cleaning - Picard
3. Sequence recalibration - GATK
4. Coverage - bedTools
5. Variant calling - GATK
6. Variant recalibration - GATK
Filtering - GATK
7. Annotation - In house, WAnnovar
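The stage list above can be written down as data, which is the first step towards expressing it as a workflow. This is a minimal sketch, not the project's scripts: it assumes the stages run strictly one after another (the slide's diagram may branch, for example at the coverage step), and each stage is a stand-in function that only records its own name rather than invoking the real tool.

```python
from typing import Callable, List, Tuple

# Stage names and tools as listed on the slide.
STAGES: List[Tuple[str, str]] = [
    ("alignment", "BWA"),
    ("cleaning", "Picard"),
    ("sequence recalibration", "GATK"),
    ("coverage", "bedTools"),
    ("variant calling", "GATK"),
    ("variant recalibration", "GATK"),
    ("filtering", "GATK"),
    ("annotation", "in-house / WAnnovar"),
]

def make_stub(stage: str, tool: str) -> Callable[[List[str]], List[str]]:
    """Build a placeholder stage that appends its label to the running log."""
    def run(log: List[str]) -> List[str]:
        return log + [f"{stage} ({tool})"]
    return run

def run_pipeline(sample_id: str, stages: List[Tuple[str, str]] = STAGES) -> List[str]:
    """Apply every stage in order and return the log of what was executed."""
    log = [f"input:{sample_id}"]
    for stage, tool in stages:
        log = make_stub(stage, tool)(log)
    return log
```

Because the stage list is plain data, swapping a tool or reordering steps is a one-line change, and that is the kind of variation the porting effort aims to make easy.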
Why move to the cloud?
Standard benefits of virtualization, including:
• Economy: No capital expenses. Affordable operations expenses
• Elasticity: respond to bursts in submissions without upfront costs
• Scale out vs scale up
  • New workflows can be deployed on new nodes on demand using VMs
  • Workflow engine may exploit parallelism within the pipelines
• Virtual storage with no real size limitations
• Possible issues
  • Security, privacy
  • Addressed at policy level
Why port to workflow?
• Programming:
  • Workflows provide better abstraction in the specification of pipelines
  • Workflows directly executable by enactment engine
  • Easier to understand, share, and maintain over time
  • Flexible – relatively easy to introduce variations
• System: minimal installation/deployment requirements
  • Fewer dedicated technical staff hours required
  • Automated dependency management, packaging, deployment
  • Extensible by wrapping new tools
  • Exploits data parallelism when possible
• Execution monitoring, provenance collection
  • Persistent trace serves as evidence for data
  • Amenable to automated analysis
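"Exploits data parallelism when possible" can be illustrated with independent patient samples, which can be processed side by side. The sketch below uses Python's standard thread pool as a stand-in for a workflow engine. The function `process_sample` is a hypothetical placeholder for a whole per-sample pipeline run, and nothing here reflects how e-Science Central actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def process_sample(sample_id: str) -> str:
    """Placeholder for one per-sample pipeline run; returns a result label."""
    return f"{sample_id}.variants"

def process_batch(sample_ids: List[str], max_workers: int = 4) -> List[str]:
    """Run process_sample over all samples concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(process_sample, sample_ids))
```

Samples do not depend on one another, so adding workers raises throughput without changing the per-sample logic.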
Porting – in progress
Porting - alignment
Porting - alignment
Porting – Sequence recalibration
e-Science Central middleware
• Portability across platforms and configurations
  • Single computer
  • Cluster of local servers
  • Public cloud providers (Microsoft Azure, but also AWS)
• A platform for our academic research
  • Scalability, data management, cloud computing, medical data management
Store, manage and process data
• Middleware on top of commodity cloud storage and compute resources
e-Science Central: scalable, extensible
• Client APIs / integrating other components
• New workflow blocks easily programmed (in Java)
• Integrate multiple runtime environments
  • R, Octave, Java, JavaScript, (Perl)
• Workflow tasks mapped to multiple workers
[Figure: e-Science Central deployment on Azure. Components shown: web browser and rich client app; Web UI and REST API; e-Science Central main server (Azure VM); JMS queue carrying workflow invocations; three workflow engines (worker roles); e-SC db backend (Azure VM) with e-SC control data; e-SC blob store (Azure Blob store) with workflow data.]
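"Workflow tasks mapped to multiple workers" suggests a queue-and-worker pattern: a server posts workflow invocations to a queue and several engines pull from it. The sketch below imitates that pattern with an in-process queue and threads. It is an analogy only, it does not use JMS or any e-Science Central API, and the names `worker` and `run_engines` are hypothetical.

```python
import queue
import threading
from typing import List, Tuple

def worker(name: str, invocations: queue.Queue, results: queue.Queue) -> None:
    """Pull invocations until a None sentinel arrives, recording who ran what."""
    while True:
        job = invocations.get()
        if job is None:  # sentinel: no more work for this engine
            break
        results.put((name, job))

def run_engines(jobs: List[str], n_workers: int = 3) -> List[Tuple[str, str]]:
    """Dispatch jobs to n_workers engines through a shared queue."""
    invocations: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"engine-{i}", invocations, results))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        invocations.put(job)
    for _ in threads:
        invocations.put(None)  # one sentinel per engine
    for t in threads:
        t.join()
    done = []
    while not results.empty():
        done.append(results.get())
    return done
```

The sender never addresses a particular engine, so adding engines requires no change on the server side. That decoupling is what makes scaling out by adding worker nodes straightforward.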
E-Science Central is Provenance-aware
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
Why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis - debugging, improvement, evolution
Role of provenance in Cloud e-genome
• Ultimately, provenance is evidence in support of clinical diagnosis
  1. Why do these variants appear in the output list?
  2. Why have you concluded they are disease-causing?
• Requires ability to trace variants through workflow execution
  • Simple scripting lacks this functionality
• Includes tracing of user decision processes (still experimental)
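Tracing a variant list back through a workflow execution amounts to walking a recorded trace from an output towards its inputs. The sketch below assumes a very simple trace model in which each step lists the data items it read and wrote. The `Step` class, the file names, and the `lineage` helper are hypothetical and do not represent the e-Science Central provenance format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Step:
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]

def lineage(target: str, trace: List[Step]) -> List[str]:
    """Return the names of the steps that contributed to target, in run order."""
    produced_by = {out: step for step in trace for out in step.outputs}
    pending, seen, found = [target], set(), []
    while pending:
        step = produced_by.get(pending.pop())
        if step is None or step.name in seen:
            continue  # raw input, or a step already visited
        seen.add(step.name)
        found.append(step)
        pending.extend(step.inputs)
    order = {s.name: i for i, s in enumerate(trace)}
    return [s.name for s in sorted(found, key=lambda s: order[s.name])]

# Hypothetical trace of one execution; file names are placeholders.
trace = [
    Step("alignment", ("reads.fastq",), ("aligned.bam",)),
    Step("coverage", ("aligned.bam",), ("coverage.txt",)),
    Step("variant calling", ("aligned.bam",), ("raw.vcf",)),
    Step("annotation", ("raw.vcf",), ("variants.tsv",)),
]
```

Here `lineage("variants.tsv", trace)` returns the alignment, variant calling, and annotation steps but not coverage, which answers "why do these variants appear in the output list?" at the level of processing steps.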
Comparing results across pipeline configurations
Run pipeline version V1 → Variant list VL1
V1 → V2:
  • Replace BWA version
  • Modify Annovar configuration parameters
Run pipeline version V2 → Variant list VL2
Variant list VL1 vs. Variant list VL2: ??
  • DDIFF (data differencing)
  • PDIFF (provenance differencing)
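Data differencing between VL1 and VL2 can be pictured as a set comparison. The sketch below is an assumption-laden illustration rather than the DDIFF function used in the project: it represents each variant as a hypothetical `(chromosome, position, ref, alt)` tuple and reports which variants are unique to each list.

```python
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt), assumed layout

def ddiff(vl1: Iterable[Variant], vl2: Iterable[Variant]) -> Dict[str, List[Variant]]:
    """Split two variant lists into shared and list-specific variants."""
    s1, s2 = set(vl1), set(vl2)
    return {
        "only_in_vl1": sorted(s1 - s2),
        "only_in_vl2": sorted(s2 - s1),
        "common": sorted(s1 & s2),
    }
```

A data diff shows what changed between the two outputs but not why, and provenance differencing is meant to fill that gap.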
PDIFF - overview
[Figure labels: WA, WB]
The corresponding provenance traces
Delta graph computed by PDIFF
PDIFF helps determine the impact of variations in the pipeline
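The idea of a delta between two provenance traces can be sketched as a backward walk from the final output that collects the steps whose recorded configuration differs. This is a simplified stand-in, not the PDIFF algorithm itself: the trace layout (a dict per step with `inputs` and `config`), the placeholder versions and parameters, and the function name `pdiff_sketch` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Trace = Dict[str, dict]  # step name -> {"inputs": [step names], "config": {...}}

def pdiff_sketch(trace_a: Trace, trace_b: Trace, output_step: str) -> List[Tuple[str, str]]:
    """Walk both traces backwards from output_step and report differing steps."""
    delta, pending, seen = [], [output_step], set()
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        a, b = trace_a.get(name), trace_b.get(name)
        if a is None or b is None:
            delta.append((name, "present in only one trace"))
            continue
        if a["config"] != b["config"]:
            delta.append((name, "configuration differs"))
        pending.extend(sorted(set(a["inputs"]) | set(b["inputs"])))
    return sorted(delta)

# Hypothetical traces mirroring the V1 -> V2 change: new BWA version, new Annovar parameters.
trace_v1 = {
    "annotation": {"inputs": ["variant calling"], "config": {"tool": "Annovar", "params": "A"}},
    "variant calling": {"inputs": ["alignment"], "config": {"tool": "GATK"}},
    "alignment": {"inputs": [], "config": {"tool": "BWA", "version": "old"}},
}
trace_v2 = {
    "annotation": {"inputs": ["variant calling"], "config": {"tool": "Annovar", "params": "B"}},
    "variant calling": {"inputs": ["alignment"], "config": {"tool": "GATK"}},
    "alignment": {"inputs": [], "config": {"tool": "BWA", "version": "new"}},
}
```

On these two traces the sketch flags alignment and annotation and leaves variant calling out, which is the kind of answer "why do these results differ?" calls for.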
Summary
• Cloud e-genome has just begun (Sept. 2013)
• Variant calling and interpretation for clinical use
• Whole-exome sequence processing on a cloud infrastructure
  • Windows Azure – project sponsor
• Currently porting existing pipelines to e-Science Central workflow
• Provenance as a key element of supporting evidence
• Scalability testing to start in Sept 2014