Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Dipping into Guacamole
Tim O’Donnell & Ryan Williams
NYC Big Data Genetics Meetup
Aug 11, 2016
Who we are: Hammer Lab
● Computational lab in the
department of Genetics
and Genomic Sciences at
Mount Sinai
● Principal investigator:
Jeff Hammerbacher
● Focus on informatics for
cancer immunotherapy
● Software developed at
github.com/hammerlab
Guacamole● Defines abstractions for working with aligned
short reads on Apache Spark
● Two somatic callers implemented:
○ Somatic Joint Caller makes use of multiple DNA and RNA
samples from the same patient
○ Assembly Caller performs local reassembly for improved
indel detection
● Calls variants on 250X whole genome
tumor/normal samples in 30 minutes on our 100
node Hadoop cluster
● Callers are still primitive, validation and tuning is
ongoing
Talk overview
● Background on DNA, RNA, and sequencing
● Guacamole infrastructure
● Variant calling on Guacamole
● Preliminary results
● Final thoughts
Image source: https://midhathsblog.wordpress.com/2012/04/24/a-complex-mechanism-preventing-mutation-of-important-cells/
Applications of cancer sequencing
● Understand the biology and find new drug targets
● Identify subtypes with similar prognosis or treatment
response
● Understand the non-cancer cells surrounding the tumor
● Trace how metastases spread
● Clinical
○ Diagnose
○ Select targeted therapies
Therapeutic vaccine
Image source: http://www.ncbi.nlm.nih.gov/pubmed/24777245
DNA, RNA, and sequencing
Image source: https://en.wikipedia.org/wiki/DNA#/media/File:DNA_Structure%2BKey%2BLabelled.pn_NoBB.png
Image source: http://www-tc.pbs.org/wgbh/nova/assets/img/ten-great-advances-evolution/image-05-large.jpg
Small variation
Next generation sequencing
● Pyrosequencing (454)
● Sequencing by ligation (SOLiD)
● Semiconductor (Ion Torrent)
● Bridge amplification + reversible terminator (Illumina)
● Single molecule real time (PacBio)
● Nanopore sequencing (Oxford Nanopore)
Image source: https://www.sec.gov/Archives/edgar/data/913275/000095012306014236/x27170e425.htm
Solexa / Illumina
Image source:
Base calling
Read mapping
Somatic Variant Calling
Agreement on somatic variant calls across tools is surprisingly poor
Krøigård, A.B. et al., 2016. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and
Targeted Deep Sequencing Data. Plos One, 11(3), p.e0151664. Available at:
http://dx.plos.org/10.1371/journal.pone.0151664.
Guacamole infrastructure
Apache Spark
● Generalization of MapReduce
● Much more expressive than MapReduce
● Can work on data iteratively in memory
https://www.mapr.com/blog/getting-started-spark-web-ui
Existing work
● GATK uses a custom built-in mapreduce framework, but
cannot take advantage of co-localizing compute and data
● The ADAM project is also working on tools similar to
Guacamole
https://software.broadinstitute.org/gatk/guide/article?id=1988
Guacamole infrastructure
● Distributed implementations of functionality used by
most variant callers
● High level: build “pileups”, call variants
Pileups
A pileup gives the read bases aligned to a particular genomic
position
A C T G G G C A A A A A A C T T G G C
A C T G G G C A A A A A A C T T G
C T G G G C A A A A A A T T T G G C
C A A A T T T G G C
Reference Sequence
Rea
ds
A typical Guacamole analysis
1. Partition the genome
2. Partition reads according to (1)
3. Build pileups at each site
4. Apply user-supplied function at each pileup
5. Write output
Coverage
Beware of the long tail! Max depth is often >100k
Step 1: Partition the genome● Partition the genome into intervals, balancing the number of reads
overlapping each partition
● Each interval will correspond to one Spark partition
Chromosome 1 Chromosome 2 Chromosome …
Partition NPartition 1 Partition 2 …
Extremely high depth (>500k) regions are removed
● All-to-all shuffle of reads based on the genomic partition in Step 1
● A copy of each read goes to each partition it overlaps
Read 2
Step 2: Partition reads
Chromosome 1, positions 1000-1500
Read 1 Read 3
Partition 1 Partition 2
(Partition 1, Read 1)(Partition 1, Read 2)
(Partition 2, Read 2)(Partition 2, Read 3)
Reads overlapping multiple partitions are duplicated
…
…
…
…
…
…
Step 3: Each partition streams through reads to generate pileups
Partition 1A C T G G G C A A A A A A C T T G G C
A C T G G G C A A A A A A C T T G
C T G G G C A A A A A A T T T G G C
C A A A T T T G G C
C T T G G C
Pileup at chr1:1209
Sorted reads
Read at chr1:1200Read at chr1:1201Read at chr1:1209Read at chr1:1213...
f(chr1:1209, [A,A,C])
User-supplied function f is called on each pileup
Implementation notes
● Read partitioning is accomplished with Spark’s
repartitionAndSortWithinPartitions to send
(partition ID, read) pairs to each spark partition and sort
by start position
● We never materialize pileups into RDDs. Just stream
through reads and build pileups on the fly
Other features of the framework
● Sliding windows: for each site, expose all reads within
some margin in each direction. Used instead of Pileups
by more sophisticated callers
● “Regions”: generalization of reads. Things such as
variants align to the genome just like a read and can be
worked with similarly
Variant calling on Guacamole
Somatic joint caller motivation
For many studies we have
multiple tumor samples
from the same patient
● DNA and RNA
● Metastases
● Pre and post-treatment,
relapse, etc.Example: the Australian Ovarian Cancer Study performed whole genome and RNA sequencing on multiple tumor samples from 12 patients
Somatic joint caller
● Calls SNVs and small indels on any number of tumor and
normal samples from the same patient
● Also considers RNA-seq when available
● Likelihood and filters are similar to Mutect
Mixtures
● Across samples, select a consensus variant allele
● For each sample, define a reference mixture and a variant mixture
Likelihoods
● For each sample, using the bases sequenced at the site
and their quality scores, compute two likelihoods:○ Likelihood of data given the reference mixture○ Likelihood of the data given the variant mixture
Priors and posteriors
Posterior Likelihood Prior
Remember Bayes Rule?
A very simple prior
● Given no data, how surprised would we be to find a
somatic variant?
Triggering
● For each sample we compare the posterior probabilities
of the variant and reference mixtures
● We also make a “pooled” sample by combining all the
tumor DNA reads across samples, and compute the same
quantities for that
● If the variant mixture has a higher probability in any
sample or in the pooled sample, we trigger a variant call
Incorporating RNA as a change of prior
● RNA alone should not trigger a variant call (e.g. RNA
editing)
● RNA support should make us more likely to call a variant
● Currently implement as a change of prior
Filters
● Likelihood computation is naive: assumes read bases are
independent of each other given the underlying mixture
● This is not true, as there are many well documented
sequencing artifacts
● Similarly to Mutect, we apply filters to calls to deal with
these artifacts
● Currently implemented filters:○ Strand bias○ Multiallelic sites○ Insufficient normal○ Poor mapping quality
Assembly calling
● We also have an experimental germline caller that does
assembly similar to GATK HaplotypeCaller
● Currently testing on Illumina Platinum genomes
● We hope to combine this with joint calling eventually to
make something comparable to Mutect 2
Results
Testing cluster
Hardware
Nodes 100
Cores 2400
Memory 12.5 TB
Storage 3.6 PB
Software
Spark 1.6.1
Hadoop 2.6.0-cdh5.5.1
OS CentOS 7.2.1511
Guacamole speed
Guacamole
2 WGS samples (DREAM Synth4) 22 minutes
3 Whole Genome Samples (AOCS-034)
31 minutes
10 Whole Exome Samples (PT189) 52 minutes
Mutect
2 WGS samples(DREAM Synth4)chromosome 1
158 minutes
We compare to chromosome 1 single-node runtime because Mutect is typically parallelized by chromosome, and chromosome 1 takes the longest
Genome partitioning
DREAM Synth4 dataset
Usually drop about 10 kilobases of sequence due to extremely high depth
Benchmarks
Validated Non-validated
TCGA validated calls (sensitivity only) Australian ovarian cancer study
patient 034
DREAM SYNTH1
We are collecting variant calling benchmarks at
http://github.com/hammerlab/variant-calling-benchmarks
TCGA: Source of validated variants?
● A small fraction of TCGA
samples have targeted
sequencing validation of variant
calls
● These can be used to measure
recall, but not precision, of a
caller
Sensitivity on TCGA validated variants
● Thresholds
chosen by eye
performed poorly
● Tuning is difficult
using only recall
DREAM SMC Challenge
● Public contest for somatic variant calling
● Synthetic data for a number of “patients” with
increasingly difficult mutations to call○ Eventually there will be validation performed on real data
Hand picked thresholds perform poorly
Model-based optimization on synth1
Features
● Raw likelihood
● Difference of ref and alt
likelihood
● Variant allele fractions
● Allele depths
● Strand bias
Methodology
● Logistic Regression
● 1:1 train/test split
Random forest does better
Australian Ovarian Cancer Study
● Whole genome sequencing of two tumors from a patient
in the Australian Ovarian Cancer Study
● Does the joint caller find new variants by pooling the
samples?
Calls in agreement with published set were triggered by both individual and pooled samples
Probably indicates our pooled trigger is over-eager
Calls not present in the published set were triggered by a mix
Final thoughts
Open questions
● Should we stick with hard filters or switch to a
data-driven model?
● How to integrate joint calling with assembly-based
calling?
● Validation
Spark takeaways
● If you have experience working with distributed systems,
you may be able to use Spark effectively on your problem
● It is not magic, and will not keep you from having to think
hard about problems like skew
● Ease of debugging has been poor, but is improving
● Finding the right settings for memory sizes, number of
cores per node, etc. can be hard. These parameters
change for each new dataset we run on
Thanks!