Dipping into Guacamole - Meetupfiles.meetup.com/20073565/Guacamole Big Data Genetics...Guacamole Defines abstractions for working with aligned short reads on Apache Spark Two somatic

Dipping into Guacamole

Tim O’Donnell & Ryan Williams

NYC Big Data Genetics Meetup

Aug 11, 2016

Who we are: Hammer Lab

● Computational lab in the

department of Genetics

and Genomic Sciences at

Mount Sinai

● Principal investigator:

Jeff Hammerbacher

● Focus on informatics for

cancer immunotherapy

● Software developed at

github.com/hammerlab

http://github.com/hammerlab

http://github.com/hammerlab

Guacamole● Defines abstractions for working with aligned

short reads on Apache Spark

● Two somatic callers implemented:

○ Somatic Joint Caller makes use of multiple DNA and RNA

samples from the same patient

○ Assembly Caller performs local reassembly for improved

indel detection

● Calls variants on 250X whole genome

tumor/normal samples in 30 minutes on our 100

node Hadoop cluster

● Callers are still primitive, validation and tuning is

ongoing

Talk overview

● Background on DNA, RNA, and sequencing

● Guacamole infrastructure

● Variant calling on Guacamole

● Preliminary results

● Final thoughts

Image source: https://midhathsblog.wordpress.com/2012/04/24/a-complex-mechanism-preventing-mutation-of-important-cells/

Applications of cancer sequencing

● Understand the biology and find new drug targets

● Identify subtypes with similar prognosis or treatment

response

● Understand the non-cancer cells surrounding the tumor

● Trace how metastases spread

● Clinical

○ Diagnose

○ Select targeted therapies

Therapeutic vaccine

Image source: http://www.ncbi.nlm.nih.gov/pubmed/24777245

DNA, RNA, and sequencing

Image source: https://en.wikipedia.org/wiki/DNA#/media/File:DNA_Structure%2BKey%2BLabelled.pn_NoBB.png

Image source: http://www-tc.pbs.org/wgbh/nova/assets/img/ten-great-advances-evolution/image-05-large.jpg

Small variation

Next generation sequencing

● Pyrosequencing (454)

● Sequencing by ligation (SOLiD)

● Semiconductor (Ion Torrent)

● Bridge amplification + reversible terminator (Illumina)

● Single molecule real time (PacBio)

● Nanopore sequencing (Oxford Nanopore)

Image source: https://www.sec.gov/Archives/edgar/data/913275/000095012306014236/x27170e425.htm

Solexa / Illumina

Image source:

Base calling

Read mapping

Somatic Variant Calling

Agreement on somatic variant calls across tools is surprisingly poor

Krøigård, A.B. et al., 2016. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and

Targeted Deep Sequencing Data. Plos One, 11(3), p.e0151664. Available at:

http://dx.plos.org/10.1371/journal.pone.0151664.

Guacamole infrastructure

Apache Spark

● Generalization of MapReduce

● Much more expressive than MapReduce

● Can work on data iteratively in memory

https://www.mapr.com/blog/getting-started-spark-web-ui

Existing work

● GATK uses a custom built-in mapreduce framework, but

cannot take advantage of co-localizing compute and data

● The ADAM project is also working on tools similar to

Guacamole

https://software.broadinstitute.org/gatk/guide/article?id=1988

Guacamole infrastructure

● Distributed implementations of functionality used by

most variant callers

● High level: build “pileups”, call variants

Pileups

A pileup gives the read bases aligned to a particular genomic

position

A C T G G G C A A A A A A C T T G G C

A C T G G G C A A A A A A C T T G

C T G G G C A A A A A A T T T G G C

C A A A T T T G G C

Reference Sequence

Rea

ds

A typical Guacamole analysis

1. Partition the genome

2. Partition reads according to (1)

3. Build pileups at each site

4. Apply user-supplied function at each pileup

5. Write output

Coverage

Beware of the long tail! Max depth is often >100k

Step 1: Partition the genome● Partition the genome into intervals, balancing the number of reads

overlapping each partition

● Each interval will correspond to one Spark partition

Chromosome 1 Chromosome 2 Chromosome …

Partition NPartition 1 Partition 2 …

Extremely high depth (>500k) regions are removed

● All-to-all shuffle of reads based on the genomic partition in Step 1

● A copy of each read goes to each partition it overlaps

Read 2

Step 2: Partition reads

Chromosome 1, positions 1000-1500

Read 1 Read 3

Partition 1 Partition 2

(Partition 1, Read 1)(Partition 1, Read 2)

(Partition 2, Read 2)(Partition 2, Read 3)

Reads overlapping multiple partitions are duplicated

…

…

…

…

…

…

Step 3: Each partition streams through reads to generate pileups

Partition 1A C T G G G C A A A A A A C T T G G C

A C T G G G C A A A A A A C T T G

C T G G G C A A A A A A T T T G G C

C A A A T T T G G C

C T T G G C

Pileup at chr1:1209

Sorted reads

Read at chr1:1200Read at chr1:1201Read at chr1:1209Read at chr1:1213...

f(chr1:1209, [A,A,C])

User-supplied function f is called on each pileup

Implementation notes

● Read partitioning is accomplished with Spark’s

repartitionAndSortWithinPartitions to send

(partition ID, read) pairs to each spark partition and sort

by start position

● We never materialize pileups into RDDs. Just stream

through reads and build pileups on the fly

Other features of the framework

● Sliding windows: for each site, expose all reads within

some margin in each direction. Used instead of Pileups

by more sophisticated callers

● “Regions”: generalization of reads. Things such as

variants align to the genome just like a read and can be

worked with similarly

Variant calling on Guacamole

Somatic joint caller motivation

For many studies we have

multiple tumor samples

from the same patient

● DNA and RNA

● Metastases

● Pre and post-treatment,

relapse, etc.Example: the Australian Ovarian Cancer Study performed whole genome and RNA sequencing on multiple tumor samples from 12 patients

Somatic joint caller

● Calls SNVs and small indels on any number of tumor and

normal samples from the same patient

● Also considers RNA-seq when available

● Likelihood and filters are similar to Mutect

Mixtures

● Across samples, select a consensus variant allele

● For each sample, define a reference mixture and a variant mixture

Likelihoods

● For each sample, using the bases sequenced at the site

and their quality scores, compute two likelihoods:○ Likelihood of data given the reference mixture○ Likelihood of the data given the variant mixture

Priors and posteriors

Posterior Likelihood Prior

Remember Bayes Rule?

A very simple prior

● Given no data, how surprised would we be to find a

somatic variant?

Triggering

● For each sample we compare the posterior probabilities

of the variant and reference mixtures

● We also make a “pooled” sample by combining all the

tumor DNA reads across samples, and compute the same

quantities for that

● If the variant mixture has a higher probability in any

sample or in the pooled sample, we trigger a variant call

Incorporating RNA as a change of prior

● RNA alone should not trigger a variant call (e.g. RNA

editing)

● RNA support should make us more likely to call a variant

● Currently implement as a change of prior

Filters

● Likelihood computation is naive: assumes read bases are

independent of each other given the underlying mixture

● This is not true, as there are many well documented

sequencing artifacts

● Similarly to Mutect, we apply filters to calls to deal with

these artifacts

● Currently implemented filters:○ Strand bias○ Multiallelic sites○ Insufficient normal○ Poor mapping quality

Assembly calling

● We also have an experimental germline caller that does

assembly similar to GATK HaplotypeCaller

● Currently testing on Illumina Platinum genomes

● We hope to combine this with joint calling eventually to

make something comparable to Mutect 2

Results

Testing cluster

Hardware

Nodes 100

Cores 2400

Memory 12.5 TB

Storage 3.6 PB

Software

Spark 1.6.1

Hadoop 2.6.0-cdh5.5.1

OS CentOS 7.2.1511

Guacamole speed

Guacamole

2 WGS samples (DREAM Synth4) 22 minutes

3 Whole Genome Samples (AOCS-034)

31 minutes

10 Whole Exome Samples (PT189) 52 minutes

Mutect

2 WGS samples(DREAM Synth4)chromosome 1

158 minutes

We compare to chromosome 1 single-node runtime because Mutect is typically parallelized by chromosome, and chromosome 1 takes the longest

Genome partitioning

DREAM Synth4 dataset

Usually drop about 10 kilobases of sequence due to extremely high depth

Benchmarks

Validated Non-validated

TCGA validated calls (sensitivity only) Australian ovarian cancer study

patient 034

DREAM SYNTH1

We are collecting variant calling benchmarks at

http://github.com/hammerlab/variant-calling-benchmarks



TCGA: Source of validated variants?

● A small fraction of TCGA

samples have targeted

sequencing validation of variant

calls

● These can be used to measure

recall, but not precision, of a

caller

Sensitivity on TCGA validated variants

● Thresholds

chosen by eye

performed poorly

● Tuning is difficult

using only recall

DREAM SMC Challenge

● Public contest for somatic variant calling

● Synthetic data for a number of “patients” with

increasingly difficult mutations to call○ Eventually there will be validation performed on real data

Hand picked thresholds perform poorly

Model-based optimization on synth1

Features

● Raw likelihood

● Difference of ref and alt

likelihood

● Variant allele fractions

● Allele depths

● Strand bias

Methodology

● Logistic Regression

● 1:1 train/test split

Random forest does better

Australian Ovarian Cancer Study

● Whole genome sequencing of two tumors from a patient

in the Australian Ovarian Cancer Study

● Does the joint caller find new variants by pooling the

samples?

Calls in agreement with published set were triggered by both individual and pooled samples

Probably indicates our pooled trigger is over-eager

Calls not present in the published set were triggered by a mix

Final thoughts

Open questions

● Should we stick with hard filters or switch to a

data-driven model?

● How to integrate joint calling with assembly-based

calling?

● Validation

Spark takeaways

● If you have experience working with distributed systems,

you may be able to use Spark effectively on your problem

● It is not magic, and will not keep you from having to think

hard about problems like skew

● Ease of debugging has been poor, but is improving

● Finding the right settings for memory sizes, number of

cores per node, etc. can be hard. These parameters

change for each new dataset we run on

Thanks!

Documents

Dipping into Guacamole - Meetupfiles.meetup.com/20073565/Guacamole Big Data Genetics...Guacamole Defines abstractions for working with aligned short reads on Apache Spark Two somatic