AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation

Cloudbreak: A MapReduce Algorithm for Genomic Structural Variation Detection

Chris Whelan & Kemal Sonmez

Oregon Health & Science University

March 5, 2013


Overview

Background

Current Approaches

MapReduce Framework for SV Detection

Cloudbreak Algorithm

Results

Ongoing Work


Background - High Throughput Sequencing

High-throughput (Illumina) sequencing produces millions of paired short (∼100bp) reads of DNA from an input sample

The challenge: use these reads to find characteristics of the DNA sample relevant to disease or phenotype

The approach: in resequencing experiments, align short reads to a reference genome for the species and find the differences

Sequencing error, diploid genomes, and hard-to-map repetitive sequences can make this difficult

Need high coverage (e.g. 30X) to detect all single nucleotide polymorphisms (SNPs); results in large data sets (100GB compressed raw data for human)


Structural Variations

Harder to detect than SNPs are structural variations: deletions, insertions, inversions, duplications, etc.

Generally events that affect more than 40 or 50 bases

The majority of variant bases in a normal individual genome are due to structural variations (primarily insertions and deletions)

Variants are associated with cancer, neurological disease


SV Detection Approaches

Four main algorithmic approaches

Read pair (RP): Look for paired reads that map to the reference at a distance or orientation that disagrees with the expected characteristics of the library

Read depth (RD): Infer deletions and duplications from the number of reads mapped to each locus

Split read mapping (SR): Split individual reads into two parts, see if you can map them to either side of a breakpoint

De novo assembly (AS): Assemble the reads into their original sequence and compare to the reference

Hybrid approaches


SV Detection from Sequencing Data

Mills et al. Nature 2011


SV Detection is Hard

Sensitivity and FDR of deletion detection methods used on the 1,000 Genomes Project.

Mills et al. Nature 2011


Read-pair (RP) SV Detection

Building the sample library involves selecting the size of DNA fragments

Only the ends of each fragment are sequenced, from the outside in

Therefore the distance between the two sequenced reads (the insert size) is known - typically modeled as a normal distribution


Discordant read pairs

Reads that map to the reference at a greater than expected distance apart indicate a deletion in the sample between the mapping locations of the two ends

Reads that map closer than expected imply an insertion

Reads in the wrong orientation imply an inversion

Medvedev et al. 2009
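
The three rules above can be sketched as a small classifier. This is only an illustration, not the method of any tool named in these slides: the function name, the `n_sd` threshold of 3 standard deviations, and the boolean orientation flag are all assumptions made for the example.

```python
def classify_pair(mapped_distance, lib_mean, lib_sd, expected_orientation=True, n_sd=3.0):
    """Label one read pair from its mapped distance and orientation.

    n_sd is a hypothetical cutoff: pairs within n_sd standard deviations
    of the library mean insert size are treated as concordant.
    """
    if not expected_orientation:
        return "inversion"      # reads in the wrong orientation
    if mapped_distance > lib_mean + n_sd * lib_sd:
        return "deletion"       # reads map farther apart than expected
    if mapped_distance < lib_mean - n_sd * lib_sd:
        return "insertion"      # reads map closer together than expected
    return "concordant"
```

For a library with mean 300 and SD 30, a pair mapped 600bp apart would be labeled a deletion signal, and one mapped 150bp apart an insertion signal.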


Read Pair Algorithms

Identify all read pairs with discordant mappings

Attempt to cluster discordant pairs supporting the same variant

Typically ignore concordant mappings

Some algorithms consider reads with multiple mappings by choosing the mappings that minimize the number of predicted variants; this has been shown to increase sensitivity in repetitive regions of the genome

Mapping results for a high coverage human genome are very large (100GB of compressed alignment data storing only the best mappings for a 30X genome)


MapReduce and Hadoop

Provides a distributed filesystem across a cluster with redundant storage

Divides computation into Map and Reduce phases: Mappers emit key-value pairs for a block of data, Reducers process all of the values for each key

Good at handling data sets of the size seen in sequencing experiments, and much larger

Able to harness a cluster of commodity machines rather than single high-powered servers

Some algorithms translate easily to the MapReduce model; others are much harder

A natural abstraction in resequencing experiments: use a key for each location in the genome. Examples: SNP calling in GATK or Crossbow
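
To make the "key per genomic location" idea concrete, here is a minimal in-memory sketch of the Map and Reduce phases computing read depth per position. It does not use Hadoop; `map_reduce`, `depth_mapper`, and `depth_reducer` are hypothetical names, and alignments are assumed to be simple `(chromosome, start, length)` tuples.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process stand-in for the Map, shuffle, and Reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # Map: emit key-value pairs
            groups[key].append(value)          # shuffle: group values by key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def depth_mapper(alignment):
    """Emit one (location, 1) pair for every base the alignment covers."""
    chrom, start, length = alignment
    for pos in range(start, start + length):
        yield (chrom, pos), 1

def depth_reducer(key, values):
    """Sum the counts for one genomic location."""
    return sum(values)

alignments = [("chr2", 100, 5), ("chr2", 102, 5)]
depth = map_reduce(alignments, depth_mapper, depth_reducer)
```

Each reducer call sees only the values for its own location, which is what lets the work spread across a cluster.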


SV Detection in MapReduce

Clustering of read pairs as in traditional RP algorithms typically involves global computations or graph structures

MapReduce, on the other hand, forces local, parallel computations

Our approach: use MapReduce to compute features for each location in the genome from alignments relevant to that location

Locations can be small tiled windows to make the problem more tractable

Make SV calls from features computed along the genome in a post-processing step


An Algorithmic Framework for SV Detection in MapReduce

job Alignment
    function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
        for all Alignments a ∈ Align(⟨s, q⟩) do
            Emit(ReadPairId rpid, Alignment a)
    function Reduce(ReadPairId rpid, Alignments a1,2,...)
        AlignmentPairList ap ← ValidAlignmentPairs(a1,2,...)
        Emit(ReadPairId rpid, AlignmentPairList ap)

job Compute SV Features
    function Map(ReadPairId rpid, AlignmentPairList ap)
        for all AlignmentPairs ⟨a1, a2⟩ ∈ ap do
            for all GenomicLocations l ∈ Loci(a1, a2) do
                ReadPairInfo rpi ← ⟨InsertSize(a1, a2), AlignmentScore(a1, a2)⟩
                Emit(GenomicLocation l, ReadPairInfo rpi)
    function Reduce(GenomicLocation l, ReadPairInfos rpi1,2,...)
        SVFeatures φl ← Φ(InsertSizes i1,2,..., AlignmentScores q1,2,...)
        Emit(GenomicLocation l, SVFeatures φl)

StructuralVariationCalls svs ← PostProcess(φ1,2,...)


Three user-defined functions

This framework leaves three functions to be defined

Many different approaches are possible within this framework, depending on the application

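$\mathrm{Loci} : \langle a_1, a_2 \rangle \rightarrow L_m \subseteq L$

$\Phi : \{\mathrm{ReadPairInfo}\ rpi_{m,i,j}\} \rightarrow \mathbb{R}^N$

$\mathrm{PostProcess} : \{\phi_1, \phi_2, \ldots, \phi_N\} \rightarrow \{\langle \mathrm{SVType}\ s,\ l_{start},\ l_{end} \rangle\}$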
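
As a rough illustration of how Loci and Φ plug into the Compute SV Features job, the sketch below runs the job in memory over tiled windows. Everything here is an assumption for the example: the 25bp `WINDOW` size, half-open `start`/`end` coordinates, treating the outer span of the two alignments as the insert size, and using a plain mean as a stand-in for Φ.

```python
from collections import defaultdict
from statistics import mean

WINDOW = 25  # hypothetical tile size in bp

def loci(a1, a2):
    """Loci: the tiled windows overlapped by the outer span of the pair."""
    start = min(a1["start"], a2["start"])
    end = max(a1["end"], a2["end"])            # half-open: end is exclusive
    return range(start // WINDOW, (end - 1) // WINDOW + 1)

def insert_size(a1, a2):
    """Outer distance between the two alignments (an assumed definition)."""
    return max(a1["end"], a2["end"]) - min(a1["start"], a2["start"])

def compute_features(alignment_pairs, phi):
    by_window = defaultdict(list)
    for a1, a2 in alignment_pairs:             # Map: emit (window, insert size)
        for window in loci(a1, a2):
            by_window[window].append(insert_size(a1, a2))
    # Reduce: apply the feature function to all values for each window
    return {w: phi(sizes) for w, sizes in sorted(by_window.items())}

pairs = [
    ({"start": 0, "end": 10}, {"start": 40, "end": 50}),
    ({"start": 30, "end": 40}, {"start": 90, "end": 100}),
]
features = compute_features(pairs, phi=mean)
```

Swapping `phi` for a different function changes the features without touching the rest of the job, which is the point of leaving Φ user-defined.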


Cloudbreak implementation

We focus on detecting deletions and small insertions

Implemented as a native Hadoop application

Use features computed from fitting a mixture model to the observed distribution of insert sizes at each locus

Process as many mappings as possible for ambiguously mapped reads


Local distributions of insert sizes

Estimate the distribution of insert sizes observed at each window as a Gaussian mixture model (GMM)

Similar to the idea in MoDIL (Lee et al. 2009)

Use a constrained expectation-maximization algorithm: constrain one component to have the library mean insert size, and constrain both components to have the same variance. Find the mean and weight of the second component.

Features computed include the log-likelihood ratio of the fitted two-component model to the likelihood of the insert sizes under a model with no variant (a normal distribution under the library parameters).

Other features: weight of the second component, estimated mean of the second component.
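
The constrained fit can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not Cloudbreak's implementation: the shared standard deviation is fixed to the library value rather than estimated, the second mean is initialized at the sample mean, the iteration count is fixed, and all names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def log_add(a, b):
    """log(exp(a) + exp(b)) computed stably."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def fit_constrained_gmm(sizes, lib_mean, lib_sd, iterations=50):
    """Return (log-likelihood ratio, weight, mean) of the second component."""
    w, mu2 = 0.5, sum(sizes) / len(sizes)      # assumed initialization
    for _ in range(iterations):
        # E-step: responsibility of the free (second) component for each point
        resp = []
        for x in sizes:
            l1 = math.log(1 - w) + log_normal_pdf(x, lib_mean, lib_sd)
            l2 = math.log(w) + log_normal_pdf(x, mu2, lib_sd)
            resp.append(math.exp(l2 - log_add(l1, l2)))
        total = sum(resp)
        # M-step: only the second component's weight and mean are updated
        w = min(max(total / len(sizes), 1e-6), 1 - 1e-6)
        if total > 0:
            mu2 = sum(r * x for r, x in zip(resp, sizes)) / total
    ll_mix = sum(log_add(math.log(1 - w) + log_normal_pdf(x, lib_mean, lib_sd),
                         math.log(w) + log_normal_pdf(x, mu2, lib_sd)) for x in sizes)
    ll_null = sum(log_normal_pdf(x, lib_mean, lib_sd) for x in sizes)
    return ll_mix - ll_null, w, mu2

# Synthetic windows: a heterozygous 200bp deletion and a window with no variant
rng = random.Random(0)
het = [rng.gauss(300, 30) for _ in range(50)] + [rng.gauss(500, 30) for _ in range(50)]
none = [rng.gauss(300, 30) for _ in range(100)]
llr_het, w_het, mu2_het = fit_constrained_gmm(het, lib_mean=300, lib_sd=30)
llr_none, w_none, mu2_none = fit_constrained_gmm(none, lib_mean=300, lib_sd=30)
```

On the synthetic heterozygous window the likelihood ratio should be large and the second component should sit near 500 with a weight near one half; on the no-variant window the ratio should stay close to zero.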


Local distributions of insert sizes

[Figure: insert size distributions (x-axis 0 to 500) in four panels: No Variant, Homozygous Deletion, Heterozygous Deletion, Heterozygous Insertion. Lee et al. 2009]


Cloudbreak output example


Handling ambiguous mappings

Incorrect mappings of read pairs are unlikely to form clusters of insert sizes at a given window

Before fitting the GMM, remove outliers using a nearest neighbor method: if the kth nearest neighbor of a mapped pair is more than c * (library fragment size SD) away, remove that mapping

Control the number of mappings with an adaptive cutoff for alignment score: discard mapping m if the ratio of the best alignment score for that window to the score of m is larger than some cutoff. This allows visibility into regions where no reads are mapped unambiguously.
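
Both filters can be sketched briefly. The sketch below makes several assumptions not stated in the slides: distances are measured between insert sizes, a mapping with fewer than k neighbors is dropped, and alignment scores are positive with higher meaning better. Function names are hypothetical.

```python
def remove_outliers(insert_sizes, k, c, lib_sd):
    """Keep a mapping only if its kth nearest neighbor is within c * lib_sd."""
    kept = []
    for i, x in enumerate(insert_sizes):
        dists = sorted(abs(x - y) for j, y in enumerate(insert_sizes) if j != i)
        if len(dists) >= k and dists[k - 1] <= c * lib_sd:
            kept.append(x)
    return kept

def filter_by_score(mappings, cutoff):
    """Drop mappings whose score is too far below the best score in the window.

    mappings is a list of (mapping_id, score) pairs with positive scores.
    """
    best = max(score for _, score in mappings)
    return [(m, s) for m, s in mappings if best / s <= cutoff]
```

With `k=2`, `c=3`, and a library SD of 30, the insert sizes `[300, 305, 310, 295, 900]` would lose only the isolated 900. Because the score cutoff is relative to the best score in the window, a window containing only ambiguous mappings still keeps its best candidates.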


Postprocessing

First extract contiguous genomic loci where the log-likelihood ratio of the two models is greater than a given threshold.

To eliminate noise we apply a median filter with window size 5.

Let µ′ be the estimated mean of the second component and µ be the library insert size. We end regions when µ′ changes by more than 60bp (2σ), and discard regions where the length of the region differs from µ′ by more than µ.

Cloudbreak loses some resolution in breakpoint location because of the genome windows and filters.
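
The first two postprocessing steps can be sketched as follows. The order of operations is an assumption (here the median filter is applied to the per-window log-likelihood ratios before thresholding), as is the handling of edges by truncating the filter window; the µ′-based splitting and length checks are left out.

```python
from statistics import median

def median_filter(values, window=5):
    """Median filter; the window is truncated at the ends of the sequence."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def regions_above(values, threshold):
    """Return (first_index, last_index) for each contiguous run above threshold."""
    regions, start = [], None
    for i, v in enumerate(values):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(values) - 1))
    return regions

llr = [0, 0, 10, 0, 0, 10, 10, 10, 10, 0, 0]   # made-up per-window values
smoothed = median_filter(llr)
calls = regions_above(smoothed, threshold=5)
```

In this toy input the isolated spike at index 2 is smoothed away, while the sustained run survives as a single region, shifted by one window. That shift is a small example of the resolution loss noted above.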


Results Comparison

We compare Cloudbreak to a selection of widely used algorithms taking different approaches:

Breakdancer (Chen et al. 2009): Traditional RP based approach

DELLY (Rausch et al. 2012): RP based approach with SR refinement of calls

GASVPro (Sindi et al. 2012): RP based approach, uses ambiguous mappings of discordant read pairs which it resolves through an MCMC algorithm; looks for RD signals at predicted breakpoint locations by examining concordant pairs

Pindel (Ye et al. 2009): SR approach; looks for clusters of read pairs where only one read could be mapped and searches for split read mappings for the other read

MoDIL (Lee et al. 2009): Mixture of distributions; only on simulated data due to runtime requirements.


Simulated Data

Very little publicly available NGS data from a genome with fully characterized structural variations

Can match algorithm output to validated SVs, but don't know whether novel predictions are wrong or simply undiscovered.

A way to get a simulated data set with known ground truth and realistic events: take a (somewhat) fully characterized genome, apply its variants to the reference sequence, and simulate reads from the modified reference.

Use Venter genome (Levy et al. 2007), chromosome 2.

To simulate heterozygosity, randomly assign half of the variants to be homozygous and half heterozygous, and create two modified references.

Simulated 100bp paired reads with a 100bp insert size to 30X coverage.


ROC curve for Chromosome 2 Deletion Simulation

[Figure: ROC curve, "Deletions in Venter diploid chr2 simulation". x-axis: False Positives (0 to 400); y-axis: True Positives (0 to 350). Curves: Cloudbreak, Breakdancer, Pindel, GASVPro, DELLY]

Caveat: Methods perform better on simulated data than on real whole genome datasets.


Ability to find simulated deletions by size at 10% FDR

Number of deletions found in each size class (number of exclusive predictions for algorithm in that class)

Cloudbreak competitive for a range of size classes

Algorithm     40-100bp   101-250bp  251-500bp  501-1000bp  >1000bp
Total Number  224        84         82         31          26
Cloudbreak    47 (7)     50 (2)     55 (4)     12 (4)      15 (0)
Breakdancer   52 (10)    49 (2)     49 (0)     7 (0)       14 (0)
GASVPro       31 (4)     25 (0)     23 (0)     2 (0)       6 (0)
DELLY         22 (2)     56 (3)     40 (0)     8 (0)       12 (0)
Pindel        60 (35)    16 (0)     41 (2)     1 (0)       12 (0)


Insertions in Simulated Data

[Figure: ROC curve, "Insertions in Venter diploid chr2 simulation". x-axis: False Positives (0 to 80); y-axis: True Positives (0 to 80). Curves: Cloudbreak, Breakdancer, Pindel]


NA18507 Data Set

Well studied sample from a Yoruban male individual

High quality sequence to 37X coverage, 100bp reads with a 100bp insert size

We created a gold standard set of deletions from three different studies with low false discovery rates: Mills et al. 2011, Human Genome Structural Variation Project (Kidd et al. 2008), and the 1000 Genomes Project (Mills et al. 2011)


ROC Curve for NA18507 Deletions

All algorithms look much worse on real data (possibly because the truth set is incomplete)

[Figure: ROC curve, "NA18507". x-axis: Novel Predictions (0 to 15000); y-axis: True Positives (0 to 2000). Curves: Cloudbreak, Breakdancer, Pindel, GASVPro, DELLY]


Ability to find NA18507 deletions by size

Using the same cutoffs that yielded a 10% FDR on the simulated chromosome 2 data set, adjusted for the difference in coverage from 30X to 37X.

Cloudbreak identifies more small deletions

Cloudbreak contributes more exclusive predictions

Algorithm     Prec.   Recall  40-100bp   101-250bp  251-500bp  501-1000bp  >1000bp
Total Number                  7,466      235        218        110         375
Cloudbreak    0.0978  0.115   423 (179)  128 (9)    158 (8)    70 (3)      186 (12)
Breakdancer   0.122   0.112   261 (41)   132 (8)    167 (1)    92 (0)      288 (10)
GASVPro       0.134   0.0401  104 (17)   37 (2)     77 (0)     26 (0)      93 (0)
DELLY         0.0824  0.091   143 (9)    125 (7)    158 (1)    83 (1)      256 (3)
Pindel        0.16    0.0685  149 (12)   57 (0)     140 (1)    58 (0)      172 (2)
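
As a worked check, the Recall column can be recomputed from the per-size counts in the table: recall is the number of gold standard deletions found divided by the total number in the gold standard. The variable names are just for this example. Precision cannot be recomputed the same way, because the total number of predictions per tool is not given here.

```python
total_by_size = [7466, 235, 218, 110, 375]     # "Total Number" row

found_by_size = {
    "Cloudbreak":  [423, 128, 158, 70, 186],
    "Breakdancer": [261, 132, 167, 92, 288],
    "GASVPro":     [104, 37, 77, 26, 93],
    "DELLY":       [143, 125, 158, 83, 256],
    "Pindel":      [149, 57, 140, 58, 172],
}

# recall = true positives found / all deletions in the gold standard
recall = {tool: sum(found) / sum(total_by_size) for tool, found in found_by_size.items()}
```

For Cloudbreak this gives 965 / 8404, which rounds to the 0.115 shown in the table.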


Ability to detect deletions in repetitive regions

Deletions detected by each tool on the simulated and NA18507 data sets, broken down by whether the deletion overlaps a RepeatMasker-annotated element.

              Simulated Data          NA18507
Algorithm     Non-repeat  Repeat      Non-repeat  Repeat
Total Number  120         327         553         7851
Cloudbreak    28 (4)      151 (13)    204 (46)    761 (165)
Breakdancer   29 (5)      142 (7)     186 (21)    754 (39)
GASVPro       15 (2)      72 (2)      71 (6)      266 (13)
DELLY         21 (2)      117 (3)     147 (11)    618 (10)
Pindel        18 (9)      112 (28)    103 (4)     473 (11)


Genotyping Deletions

We can use the mixing parameter that controls the weight of the two components in the GMM to accurately predict deletion genotypes.

By setting a simple cutoff of .2 on the average value of the weight in each prediction, we were able to achieve 86.7% and 94.9% accuracy in predicting the genotype of the true positive deletions we detected in the simulated and real data sets, respectively.

                     Actual Genotypes
                     Simulated Data              NA18507
Predicted Genotypes  Homozygous  Heterozygous    Homozygous  Heterozygous
Homozygous           88          3               70          11
Heterozygous         18          70              4           209
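
Genotyping accuracy can be recomputed from a confusion matrix as correct predictions (the diagonal) divided by all predictions. The sketch below does this for the counts as they appear in the table above; the dictionary layout is just for the example. The NA18507 counts reproduce the quoted 94.9%, while the simulated counts as transcribed here give about 88.3% rather than 86.7%, and the slides do not say why they differ.

```python
# (predicted, actual) -> number of true positive deletions
simulated = {("hom", "hom"): 88, ("hom", "het"): 3,
             ("het", "hom"): 18, ("het", "het"): 70}
na18507 = {("hom", "hom"): 70, ("hom", "het"): 11,
           ("het", "hom"): 4, ("het", "het"): 209}

def accuracy(confusion):
    """Fraction of calls where the predicted genotype equals the actual one."""
    correct = sum(n for (pred, actual), n in confusion.items() if pred == actual)
    return correct / sum(confusion.values())

acc_simulated = accuracy(simulated)   # 158 / 179
acc_na18507 = accuracy(na18507)       # 279 / 294
```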


Running Times

Running times (wall clock) on both data sets

Cloudbreak used approx. 150 workers for the simulated data and 650 workers for NA18507 (42m in MapReduce)

Breakdancer and DELLY were run on a single CPU but can be set to process each chromosome independently (10X speedup)

Pindel was run in single-threaded mode

MoDIL was run on 200 cores

Algorithm     Simulated Chromosome 2 Data  NA18507
Cloudbreak    835s                         106m
Breakdancer   653s                         36h
GASVPro       3339s                        33h
DELLY         1964s                        208m
Pindel        1336s                        38h
MoDIL         48h                          **


Ongoing work: Generate additional features, improve postprocessing

Goals: increase accuracy and breakpoint resolution

Features involving split read mappings or pairs in which only one end is mapped

Features involving sequence and sequence variants

Annotations of sequence features and previously identified variants

Apply machine learning techniques: conditional random fields, Deep Learning

Potential future work: add local assembly of breakpoints


Ongoing work: automate deployment and execution on cloud providers

Many researchers don’t have access to Hadoop clusters, or to servers powerful enough to process these data sets

On-demand creation of clusters with cloud providers can be cost-effective, especially with spot pricing

Developing scripts to automate on-demand construction of Hadoop clusters in the cloud (Amazon EC2, Rackspace) using the Apache Whirr project

Bottleneck: transferring data into and out of the cloud


Conclusions

Novel approach to applying a MapReduce algorithm to the structural variation problem

Makes insert size distribution clustering approaches feasible in terms of run time

Improved accuracy over existing algorithms, especially in repetitive regions

Ability to accurately genotype calls

Costs: additional CPU hours, somewhat lower breakpoint resolution

