CloudBreak: A MapReduce Algorithm for Genomic Structural Variation Detection
Chris Whelan & Kemal Sonmez
Oregon Health & Science University
March 5, 2013
Overview
Background
Current Approaches
MapReduce Framework for SV Detection
Cloudbreak Algorithm
Results
Ongoing Work
Background - High Throughput Sequencing
High Throughput (Illumina) Sequencing produces millions of paired short (∼100bp) reads of DNA from an input sample
The challenge: use these reads to find characteristics of the DNA sample relevant to disease or phenotype
The approach: in resequencing experiments, align short reads to a reference genome for the species and find the differences
Sequencing errors, diploid genomes, and hard-to-map repetitive sequences can make this difficult
Need high coverage (e.g. 30X) to detect all single nucleotide polymorphisms (SNPs); this results in large data sets (100GB of compressed raw data for a human genome)
Structural Variations
Structural variations are harder to detect than SNPs: deletions, insertions, inversions, duplications, etc.
Generally events that affect more than 40 or 50 bases
The majority of variant bases in a normal individual genome are due to structural variations (primarily insertions and deletions)
Structural variants are associated with cancer and neurological disease
SV Detection Approaches
Four main algorithmic approaches
Read pair (RP): look for paired reads that map to the reference at a distance or orientation that disagrees with the expected characteristics of the library
Read depth (RD): infer deletions and duplications from the number of reads mapped to each locus
Split read mapping (SR): split individual reads into two parts and see if they can be mapped to either side of a breakpoint
De novo assembly (AS): assemble the reads into their original sequence and compare to the reference
Hybrid approaches
SV Detection from Sequencing Data
Mills et al. Nature 2011
SV Detection is Hard
Sensitivity and FDR of deletion detection methods used in the 1000 Genomes Project.
Mills et al. Nature 2011
Read-pair (RP) SV Detection
Building the sample library involves selecting the size of the DNA fragments
Only the ends of each fragment are sequenced, from the outside in
Therefore the distance between the two sequenced reads (the insert size) is approximately known; it is typically modeled as a normal distribution
Discordant read pairs
When the reads of a pair map to the reference farther apart than expected, this indicates a deletion in the sample between the mapping locations of the two ends
Reads that map closer together than expected imply an insertion
Reads in the wrong orientation imply an inversion (see the sketch below)
Medvedev et al. 2009
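As a toy illustration of these signals, here is a minimal Python sketch that classifies a single mapped pair; it assumes coordinates are on the same chromosome with left_pos ≤ right_pos, and the library parameters and 3σ cutoff are illustrative values, not ones from Cloudbreak:

# Classify a read pair by comparing its implied insert size and
# mapping orientation against the expected library distribution.
def classify_pair(left_pos, right_pos, same_strand,
                  lib_mean=300.0, lib_sd=30.0, n_sd=3.0):
    insert = right_pos - left_pos
    if same_strand:
        return "inversion"    # proper pairs map to opposite strands
    if insert > lib_mean + n_sd * lib_sd:
        return "deletion"     # ends mapped farther apart than expected
    if insert < lib_mean - n_sd * lib_sd:
        return "insertion"    # ends mapped closer together than expected
    return "concordant"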
Read Pair Algorithms
Identify all read pairs with discordant mappings
Attempt to cluster discordant pairs supporting the same variant
Typically ignore concordant mappings
Some algorithms consider reads with multiple mappings by choosing the mappings that minimize the number of predicted variants; this has been shown to increase sensitivity in repetitive regions of the genome
Mapping results for a high coverage human genome are very large (100GB of compressed alignment data storing only the best mappings for a 30X genome)
MapReduce and Hadoop
Provides a distributed filesystem across a cluster with redundant storage
Divides computation into Map and Reduce phases: Mappers emit key-value pairs for a block of data, Reducers process all of the values for each key
Good at handling data sets of the size seen in sequencing experiments, and much larger
Able to harness a cluster of commodity machines rather than single high-powered servers
Some algorithms translate easily to the MapReduce model; others are much harder
A natural abstraction in resequencing experiments: use a key for each location in the genome. Examples: SNP calling in GATK or Crossbow
SV Detection in MapReduce
Clustering of read pairs as in traditional RP algorithms typically involves global computations or graph structures
MapReduce, on the other hand, forces local, parallel computations
Our approach: use MapReduce to compute features for each location in the genome from the alignments relevant to that location
Locations can be small tiled windows to make the problem more tractable
Make SV calls from the features computed along the genome in a post-processing step
An Algorithmic Framework for SV Detection in MapReduce
job Alignment
    function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
        for all Alignments a ∈ Align(⟨s, q⟩) do
            Emit(ReadPairId rpid, Alignment a)
    function Reduce(ReadPairId rpid, Alignments a1,2,...)
        AlignmentPairList ap ← ValidAlignmentPairs(a1,2,...)
        Emit(ReadPairId rpid, AlignmentPairList ap)

job Compute SV Features
    function Map(ReadPairId rpid, AlignmentPairList ap)
        for all AlignmentPairs ⟨a1, a2⟩ ∈ ap do
            for all GenomicLocations l ∈ Loci(a1, a2) do
                ReadPairInfo rpi ← ⟨InsertSize(a1, a2), AlignmentScore(a1, a2)⟩
                Emit(GenomicLocation l, ReadPairInfo rpi)
    function Reduce(GenomicLocation l, ReadPairInfos rpi1,2,...)
        SVFeatures φl ← Φ(InsertSizes i1,2,..., AlignmentScores q1,2,...)
        Emit(GenomicLocation l, SVFeatures φl)

StructuralVariationCalls svs ← PostProcess(φ1,2,...)
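As a concrete illustration, below is a minimal Hadoop Streaming-style sketch of the Compute SV Features job in Python. The tab-separated record layout, the 25bp window size, and the mean-insert-size stand-in for Φ are illustrative assumptions, not Cloudbreak's actual serialization or feature function:

import sys
from itertools import groupby

WINDOW = 25  # window (locus) size in bp; an illustrative choice

def mapper(lines):
    # Input records: chrom, start, end, insert_size, score (tab-separated).
    # Emit the pair's info to every window the pair spans (the Loci function).
    for line in lines:
        chrom, start, end, insert, score = line.rstrip("\n").split("\t")
        for w in range(int(start) // WINDOW, int(end) // WINDOW + 1):
            print(f"{chrom}:{w:09d}\t{insert}\t{score}")

def reducer(lines):
    # Hadoop Streaming sorts by key, so all values for one window arrive
    # together; groupby collects them for the feature computation.
    records = (line.rstrip("\n").split("\t") for line in lines)
    for window, group in groupby(records, key=lambda r: r[0]):
        inserts = [float(r[1]) for r in group]
        # Stand-in for Phi: the mean insert size; Cloudbreak fits a GMM instead.
        print(f"{window}\t{sum(inserts) / len(inserts):.1f}")

if __name__ == "__main__":
    # Run the same script as the map or reduce stage under hadoop-streaming.
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)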
Three user-defined functions
This framework leaves three functions to be defined
There may be many different approaches to take within this framework, depending on the application
Loci : ⟨a_1, a_2⟩ → L_m ⊆ L
Φ : {ReadPairInfo rpi_{m,i,j}} → ℝ^N
PostProcess : {φ_1, φ_2, …, φ_N} → {⟨SVType s, l_start, l_end⟩}
Cloudbreak implementation
We focus on detecting deletions and small insertions
Implemented as a native Hadoop application
Use features computed from fitting a mixture model to the observed distribution of insert sizes at each locus
Process as many mappings as possible for ambiguously mapped reads
Local distributions of insert sizes
Estimate the distribution of insert sizes observed at each window as a Gaussian mixture model (GMM)
Similar to the idea in MoDIL (Lee et al. 2009)
Use a constrained expectation-maximization (EM) algorithm: constrain one component to have the library mean insert size, constrain both components to have the same variance, and fit the mean and weight of the second component
Features include the log-likelihood ratio of the fitted two-component model to the likelihood of the insert sizes under a no-variant model: a single normal distribution with the library parameters
Other features: the weight of the second component and its estimated mean (see the sketch below)
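A minimal numpy/scipy sketch of this constrained EM fit and the log-likelihood ratio feature; the initialization, iteration count, and the choice to pin both components to the library standard deviation are assumptions, not necessarily Cloudbreak's exact choices:

import numpy as np
from scipy.stats import norm

def fit_window(inserts, lib_mu, lib_sd, n_iter=30):
    # Fit a two-component GMM to the insert sizes at one window.
    # Component 0 is pinned to the library distribution N(lib_mu, lib_sd);
    # EM estimates only the weight w and mean mu2 of component 1. Fixing
    # both components at lib_sd is one reading of the shared-variance
    # constraint described above.
    x = np.asarray(inserts, dtype=float)
    w, mu2 = 0.5, x.mean()
    for _ in range(n_iter):
        p0 = (1 - w) * norm.pdf(x, lib_mu, lib_sd)   # no-variant component
        p1 = w * norm.pdf(x, mu2, lib_sd)            # variant component
        r = p1 / (p0 + p1)                           # E-step responsibilities
        w = r.mean()                                 # M-step: free params only
        if r.sum() > 0:
            mu2 = (r * x).sum() / r.sum()
    # Log-likelihood ratio feature: fitted mixture vs. no-variant model.
    ll_mix = np.log((1 - w) * norm.pdf(x, lib_mu, lib_sd)
                    + w * norm.pdf(x, mu2, lib_sd)).sum()
    ll_null = norm.logpdf(x, lib_mu, lib_sd).sum()
    return w, mu2, ll_mix - ll_null

Over a heterozygous deletion roughly half of the spanning pairs draw from the shifted component, so the fit should recover w ≈ 0.5 and mu2 ≈ library mean plus the deletion length.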
Local distributions of insert sizes
(Figure: example local insert size distributions, insert sizes 0-500bp, for four cases: No Variant, Homozygous Deletion, Heterozygous Deletion, Heterozygous Insertion)
Lee et al. 2009
Cloudbreak output example
Handling ambiguous mappings
Incorrect mappings of read pairs are unlikely to form clusters of insert sizes at a given window
Before fitting the GMM, remove outliers using a nearest neighbor method: if the kth nearest neighbor of a mapping's insert size is more than c × (library fragment size SD) away, remove that mapping
Control the number of mappings with an adaptive cutoff on alignment score: discard mapping m if the ratio of the best alignment score for that window to the score of m is larger than some cutoff. This allows visibility into regions where no reads are mapped unambiguously. (A sketch of both filters follows.)
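A rough Python sketch of both filters; k, c, and the score-ratio cutoff are illustrative parameter choices, and Cloudbreak's actual values and score scale may differ:

import numpy as np

def knn_outlier_filter(inserts, lib_sd, k=3, c=5.0):
    # Drop a mapping if its k-th nearest neighbor among the other insert
    # sizes at this window is more than c * lib_sd away. (In this toy
    # version, windows with <= k mappings keep nothing.)
    x = np.asarray(inserts, dtype=float)
    keep = []
    for xi in x:
        dists = np.sort(np.abs(x - xi))  # dists[0] == 0 is the point itself
        if len(dists) > k and dists[k] <= c * lib_sd:
            keep.append(float(xi))
    return keep

def score_filter(mappings, cutoff=2.0):
    # Adaptive alignment-score cutoff: keep mapping m only while
    # best_score / score(m) stays below the cutoff. 'mappings' is a list
    # of (insert_size, alignment_score) pairs with positive scores.
    best = max(score for _, score in mappings)
    return [(i, s) for i, s in mappings if best / s <= cutoff]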
Postprocessing
First extract contiguous genomic loci where the log-likelihood ratio of the two models is greater than a given threshold.
To eliminate noise we apply a median filter with window size 5.
Let µ′ be the estimated mean of the second component and µ be the library insert size. We end regions when µ′ changes by more than 60bp (2σ), and discard regions whose length differs from µ′ by more than µ.
Cloudbreak loses some breakpoint resolution due to the genome windows and filters. (A sketch of this pass follows.)
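A sketch of this post-processing pass, assuming per-window LLR values and second-component means (µ′, here mu2s) are already available; the threshold and window size are user choices and the region-closing logic follows the rules above literally:

import numpy as np

def median_filter(values, width=5):
    # Smooth the per-window log-likelihood ratios with a width-5 median filter.
    half = width // 2
    padded = np.pad(np.asarray(values, dtype=float), half, mode="edge")
    return [float(np.median(padded[i:i + width])) for i in range(len(values))]

def call_regions(llrs, mu2s, lib_mu, lib_sd, threshold, window_size=25):
    # Open a region when the smoothed LLR exceeds the threshold; close it
    # when the LLR drops back below, or when mu2 jumps by more than
    # 2 * lib_sd between adjacent windows.
    smoothed = median_filter(llrs)
    regions, start = [], None
    for i, llr in enumerate(smoothed):
        jump = start is not None and abs(mu2s[i] - mu2s[i - 1]) > 2 * lib_sd
        if llr > threshold and start is None:
            start = i
        elif start is not None and (llr <= threshold or jump):
            length = (i - start) * window_size
            mu2 = float(np.mean(mu2s[start:i]))
            # Keep the region only if its length is consistent with mu2.
            if abs(length - mu2) <= lib_mu:
                regions.append((start * window_size, i * window_size))
            start = i if (jump and llr > threshold) else None
    # (A region still open at the end of the chromosome is dropped here.)
    return regions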
Results Comparison
We compare Cloudbreak to a selection of widely used algorithms taking different approaches:
Breakdancer (Chen et al. 2009): traditional RP based approach
DELLY (Rausch et al. 2012): RP based approach with SR refinement of calls
GASVPro (Sindi et al. 2012): RP based approach; uses ambiguous mappings of discordant read pairs, which it resolves through an MCMC algorithm; looks for RD signals at predicted breakpoint locations by examining concordant pairs
Pindel (Ye et al. 2009): SR approach; looks for clusters of read pairs where only one read could be mapped and searches for split read mappings for the other read
MoDIL (Lee et al. 2009): mixture-of-distributions approach; run only on simulated data due to its runtime requirements
Simulated Data
Very little publicly available NGS data exists for a genome with fully characterized structural variations
We can match algorithm output to validated SVs, but we don't know whether novel predictions are wrong or simply undiscovered
Way to get a simulated data set with known ground truth and realistic events: take a (somewhat) fully characterized genome, apply its variants to the reference sequence, and simulate reads from the modified reference
We use the Venter genome (Levy et al. 2007), chromosome 2
To simulate heterozygosity, randomly assign half of the variants to be homozygous and half heterozygous, and create two modified references (see the sketch below)
Simulated 100bp paired reads with a 100bp insert size to 30X coverage
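A toy sketch of the diploid construction; the variant representation and coordinate handling are simplified, and a real pipeline would operate on FASTA files and feed both haplotypes to a read simulator at half the target coverage each:

import random

def apply_variants(seq, variants):
    # Apply (pos, ref_len, alt_seq) edits right-to-left so that earlier
    # coordinates stay valid. Deletions have alt_seq == ""; insertions
    # have ref_len == 0.
    for pos, ref_len, alt in sorted(variants, reverse=True):
        seq = seq[:pos] + alt + seq[pos + ref_len:]
    return seq

def make_diploid_references(reference, variants, seed=42):
    # Randomly mark half of the variants homozygous and half heterozygous,
    # then build two haplotypes: homozygous variants appear in both,
    # heterozygous variants in only one.
    rng = random.Random(seed)
    shuffled = variants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    hom, het = shuffled[:half], shuffled[half:]
    hap1 = apply_variants(reference, hom + het)
    hap2 = apply_variants(reference, hom)
    return hap1, hap2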
ROC curve for Chromosome 2 Deletion Simulation
(ROC curve: true positives vs. false positives for deletions in the Venter diploid chr2 simulation, comparing Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY)
Caveat: methods perform better on simulated data than on real whole genome datasets.
Ability to find simulated deletions by size at 10% FDR
Number of deletions found in each size class (number of exclusive predictions for that algorithm in parentheses)
Cloudbreak is competitive across a range of size classes
               40-100bp   101-250bp   251-500bp   501-1000bp   >1000bp
Total Number     224          84          82          31          26
Cloudbreak        47 (7)      50 (2)      55 (4)      12 (4)      15 (0)
Breakdancer       52 (10)     49 (2)      49 (0)       7 (0)      14 (0)
GASVPro           31 (4)      25 (0)      23 (0)       2 (0)       6 (0)
DELLY             22 (2)      56 (3)      40 (0)       8 (0)      12 (0)
Pindel            60 (35)     16 (0)      41 (2)       1 (0)      12 (0)
Insertions in Simulated Data
(ROC curve: true positives vs. false positives for insertions in the Venter diploid chr2 simulation, comparing Cloudbreak, Breakdancer, and Pindel)
NA18507 Data Set
Well studied sample from a Yoruban male individual
High quality sequence to 37X coverage, 100bp reads with a 100bp insert size
We created a gold standard set of deletions from three different studies with low false discovery rates: Mills et al. 2011, the Human Genome Structural Variation Project (Kidd et al. 2008), and the 1000 Genomes Project (Mills et al. 2011)
ROC Curve for NA18507 Deletions
All algorithms look much worse on real data (possibly due to the lack of a complete truth set)
(ROC curve: true positives vs. novel predictions for NA18507 deletions, comparing Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY)
Ability to find NA18507 deletions by size
Using the same cutoffs that yielded a 10% FDR on the simulated chromosome 2 data set, adjusted for the difference in coverage from 30X to 37X.
Cloudbreak identifies more small deletions
Cloudbreak contributes more exclusive predictions
             Prec.    Recall   40-100bp    101-250bp   251-500bp   501-1000bp   >1000bp
Total Number                    7,466         235         218         110         375
Cloudbreak   0.0978   0.115      423 (179)   128 (9)     158 (8)      70 (3)     186 (12)
Breakdancer  0.122    0.112      261 (41)    132 (8)     167 (1)      92 (0)     288 (10)
GASVPro      0.134    0.0401     104 (17)     37 (2)      77 (0)      26 (0)      93 (0)
DELLY        0.0824   0.091      143 (9)     125 (7)     158 (1)      83 (1)     256 (3)
Pindel       0.16     0.0685     149 (12)     57 (0)     140 (1)      58 (0)     172 (2)
Ability to detect deletions in repetitive regions
Detected deletions on the simulated and NA18507 data sets identified by each tool, broken down by whether the deletion overlaps a RepeatMasker-annotated element (exclusive predictions in parentheses).
               Simulated Data           NA18507
               Non-repeat   Repeat      Non-repeat   Repeat
Total Number      120         327          553        7851
Cloudbreak         28 (4)     151 (13)     204 (46)    761 (165)
Breakdancer        29 (5)     142 (7)      186 (21)    754 (39)
GASVPro            15 (2)      72 (2)       71 (6)     266 (13)
DELLY              21 (2)     117 (3)      147 (11)    618 (10)
Pindel             18 (9)     112 (28)     103 (4)     473 (11)
Genotyping Deletions
We can use the mixing parameter that controls the weight of the two components in the GMM to accurately predict deletion genotypes.
By setting a simple cutoff of 0.2 on the average value of the weight in each prediction, we achieved 86.7% and 94.9% accuracy in predicting the genotype of the true positive deletions we detected in the simulated and real data sets, respectively. (A sketch of this rule follows the table.)
                           Actual Genotypes
                    Simulated Data               NA18507
Predicted           Homozygous  Heterozygous     Homozygous  Heterozygous
  Homozygous            88           3               70          11
  Heterozygous          18          70                4         209
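A minimal sketch of this genotyping rule; it assumes the tracked weight is that of the no-variant (library) component, so it is near 0 for homozygous deletions and near 0.5 for heterozygous ones, and the direction of the comparison is an assumption while the 0.2 cutoff is the value quoted above:

def genotype_deletion(weights, cutoff=0.2):
    # Genotype one predicted deletion from the average GMM mixing weight
    # over its windows. Assumption: the weight is that of the no-variant
    # component, near 0 for homozygous and near 0.5 for heterozygous calls.
    avg = sum(weights) / len(weights)
    return "homozygous" if avg < cutoff else "heterozygous"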
Running Times
Running times (wall clock) on both data sets
Cloudbreak used approx. 150 workers for the simulated data and 650 workers for NA18507 (42m of which was spent in MapReduce)
Breakdancer and DELLY were run on a single CPU but can be set to process each chromosome independently (roughly a 10X speedup)
Pindel was run in single-threaded mode
MoDIL was run on 200 cores
              Simulated Chromosome 2 Data   NA18507
Cloudbreak              835s                 106m
Breakdancer             653s                  36h
GASVPro                3339s                  33h
DELLY                  1964s                 208m
Pindel                 1336s                  38h
MoDIL                    48h                   **
Ongoing work: Generate additional features, improve postprocessing
Goals: increase accuracy and breakpoint resolution
Features involving split read mappings or pairs in which only one end is mapped
Features involving sequence and sequence variants
Annotations of sequence features and previously identified variants
Apply machine learning techniques: conditional random fields, Deep Learning
Potential future work: add local assembly of breakpoints
Ongoing work: automate deployment and execution on cloud providers
Many researchers don't have access to Hadoop clusters, or to servers powerful enough to process these data sets
On-demand creation of clusters with cloud providers can be cost-effective, especially with spot pricing
Developing scripts to automate on-demand construction of Hadoop clusters in the cloud (Amazon EC2, Rackspace) using the Apache Whirr project
Bottleneck: transferring data into and out of the cloud
Conclusions
A novel approach to applying MapReduce to the structural variation detection problem
Makes insert size distribution clustering approaches feasible in terms of run time
Improved accuracy over existing algorithms, especially in repetitive regions
Ability to accurately genotype calls
Costs: additional CPU hours and somewhat reduced breakpoint resolution