Comparing and Benchmarking Large Deletion Callsets Justin Zook NIST Genome-Scale Measurements Group June 27, 2016

160627 giab for festival sv workshop


Page 1: 160627 giab for festival sv workshop

Comparing and Benchmarking Large Deletion Callsets

Justin Zook

NIST Genome-Scale Measurements Group

June 27, 2016

Page 2: 160627 giab for festival sv workshop

Sequencing technologies and bioinformatics pipelines disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Page 3: 160627 giab for festival sv workshop

Sequencing technologies and bioinformatics pipelines disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Who’s right?

Is anyone right?

Page 4: 160627 giab for festival sv workshop

Genome in a Bottle Consortium

Whole Genome Variant Calling

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates

• established consortium to develop reference materials, data, methods, performance metrics

• Characterized Pilot Genome NA12878 for small variants
– Now AJ Son also

• Ashkenazim Trio, Asian Trio from PGP in process

generic measurement process

Page 5: 160627 giab for festival sv workshop

Candidate NIST Reference Materials

Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Page 6: 160627 giab for festival sv workshop

Data for GIAB PGP Trios

Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly

Page 7: 160627 giab for festival sv workshop

Dataset | AJ Son | AJ Parents | Chinese son | Chinese parents | NA12878

Illumina Paired-end X X X X X
Illumina Long Mate pair X X X X X
Illumina “moleculo” X X X X X
Complete Genomics X X X X X
Complete Genomics LFR X X X
Ion exome X X X X
BioNano X X X X
10X X X X
PacBio X X X
SOLiD single end X X X
Illumina exome X X X X
Oxford Nanopore X

Page 8: 160627 giab for festival sv workshop

Paper describing data…

• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab

Page 9: 160627 giab for festival sv workshop

Integration Methods to Establish Benchmark Small Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

Page 10: 160627 giab for festival sv workshop

Integration Methods to Establish Benchmark Small Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

New version of calls now available for NA12878 and HG002

Page 11: 160627 giab for festival sv workshop

How can we extend this approach to SVs?

Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies

Differences from small variants
• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy

Page 12: 160627 giab for festival sv workshop

Callsets Contributed so far

Short reads
• Illumina
– Spiral Genetics
– cortex
– Commonlaw
– MetaSV
– Parliament/assembly
– Parliament/assembly-force
• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta

Long reads and Linked reads
• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/pacbio-force
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
• Nanopore mapping
– Nabsys force calls
• Optical mapping
– BioNano with and without haplotype-aware assembly
• 10X Genomics

Page 13: 160627 giab for festival sv workshop

Step 1: Merging calls

• Process
– Find union of calls >19bp from all deletion callsets and merge any regions if within 1000 bp (results in 28460 regions)
– Annotate each merged region with fraction covered by calls from each callset
– Split out those overlapping tandem repeats longer than 200bp by >25% (2715 regions)
• Helps mitigate different representations of calls in repetitive regions and imprecision of breakpoints from many callers
• Limitations
– May not appropriately call compound heterozygous SVs
– Ignores other types of SVs in the region
– Loses genotype information
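
The merge-and-annotate part of Step 1 could be sketched roughly as below. This is an illustration only, not the pipeline used for the talk: the function names `merge_regions` and `fraction_covered`, the `(chrom, start, end, callset)` tuple layout with half-open coordinates, and the toy coordinates are all assumptions. Only the >19 bp size cutoff and the 1000 bp merge distance come from the slide, and the tandem-repeat split is not shown.

```python
from collections import defaultdict

MIN_SIZE = 20          # slide: keep calls >19 bp
MERGE_DISTANCE = 1000  # slide: merge regions within 1000 bp


def merge_regions(calls, min_size=MIN_SIZE, max_gap=MERGE_DISTANCE):
    """Union deletion calls from all callsets into merged regions.

    `calls` holds (chrom, start, end, callset) tuples (hypothetical layout).
    Returns a sorted list of (chrom, start, end) merged regions.
    """
    kept = sorted((c, s, e) for c, s, e, _ in calls if e - s >= min_size)
    merged = []
    for chrom, start, end in kept:
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))  # extend region
        else:
            merged.append((chrom, start, end))                # start new region
    return merged


def fraction_covered(region, calls, min_size=MIN_SIZE):
    """Fraction of a merged region covered by the calls of each callset."""
    chrom, r_start, r_end = region
    by_callset = defaultdict(list)
    for c, s, e, name in calls:
        if c == chrom and e - s >= min_size and s < r_end and e > r_start:
            by_callset[name].append((max(s, r_start), min(e, r_end)))
    fractions = {}
    for name, intervals in by_callset.items():
        covered, cur_end = 0, r_start
        for s, e in sorted(intervals):
            s = max(s, cur_end)  # do not double-count overlapping calls
            if e > s:
                covered += e - s
                cur_end = e
        fractions[name] = covered / (r_end - r_start)
    return fractions


# Toy input; coordinates and callset names are made up for illustration.
calls = [
    ("chr1", 1000, 1500, "A"),
    ("chr1", 1400, 1600, "B"),
    ("chr1", 2500, 2600, "A"),
    ("chr1", 5000, 5010, "B"),  # 10 bp, dropped by the size cutoff
    ("chr1", 9000, 9100, "C"),
]
regions = merge_regions(calls)
coverage = fraction_covered(regions[0], calls)
```

Merging on the union of intervals is what makes the approach tolerant of imprecise breakpoints, and it is also why genotype and compound-heterozygous information is lost, as the limitations above note.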

Page 14: 160627 giab for festival sv workshop

Step 2: Find size prediction accuracy

• Find “size prediction accuracy” of each callset by calculating the difference from the median predicted size for regions with calls from >3 callers, and rank callers for <3kb and >3kb size ranges

Size <3kb
Spiral | 0.00%
Cortex | 0.24%
CGSV | 0.65%
AssemblyticsFalcon | 0.79%
CGvcf | 1.09%
fermikit | 1.28%
smrtsvdip | 1.43%
MetaSV | 1.57%
MultibreakSV | 1.62%
PBHoneySpots | 2.13%
AssemblyticsMHAP | 2.21%
ParliamentAssemblyForce | 2.26%
CSHLassembly | 2.29%
ParliamentPacBio | 2.92%
ParliamentAssembly | 3.00%

Size >3kb
Spiral | 0.04%
AssemblyticsFalcon | 0.06%
CGSV | 0.06%
CSHLassembly | 0.08%
AssemblyticsMHAP | 0.08%
MultibreakSV | 0.10%
fermikit | 0.11%
PBHoneyTails | 0.38%
CommonLaw | 0.48%
ParliamentPacBio | 0.58%
smrtsvdip | 0.62%
MetaSV | 1.12%
sniffles | 1.57%
Nabsys2tech01Force | 3.02%
BioNano | 3.67%
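
One way such a size-accuracy score could be computed is sketched below. The slide does not say which summary statistic was used, so taking the median of each caller's relative deviation from the per-region median size is an assumption, as are the function name `size_prediction_error`, the input layout, and the toy sizes.

```python
from statistics import median


def size_prediction_error(region_sizes, min_callers=4):
    """Median relative deviation of each caller from the per-region median size.

    `region_sizes` is a list with one dict per merged region, mapping
    caller name -> predicted deletion size in bp (hypothetical layout).
    Only regions with calls from >3 callers are used, as on the slide.
    """
    deviations = {}
    for sizes in region_sizes:
        if len(sizes) < min_callers:
            continue
        consensus = median(sizes.values())
        for caller, size in sizes.items():
            deviations.setdefault(caller, []).append(abs(size - consensus) / consensus)
    return {caller: median(devs) for caller, devs in deviations.items()}


# Toy regions; caller names and sizes are invented for illustration.
toy_regions = [
    {"A": 100, "B": 100, "C": 110, "D": 90},
    {"A": 200, "B": 210, "C": 200, "D": 180},
    {"A": 50, "B": 500},  # only 2 callers, skipped
]
errors = size_prediction_error(toy_regions)
ranking = sorted(errors, key=errors.get)  # most accurate caller first
```

In practice the score would be computed separately for the <3kb and >3kb ranges, which is why the slide shows two rankings.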

Page 15: 160627 giab for festival sv workshop

Step 3: Find calls supported by 2 techs

1. Find calls supported by calls from 2 or more technologies with size prediction within 20%
2. Find sensitivity of each caller to these calls in size ranges 20-50, 50-100, 100-1000, 1000-3000, and >3000 bp
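
The two parts of Step 3 could look roughly like the sketch below. Several details are assumptions rather than facts from the talk: "within 20%" is interpreted as within 20% of the region's median predicted size, the size ranges are treated as half-open bins, and the function names and data layouts are hypothetical.

```python
from statistics import median

# Size ranges from the slide, treated here as half-open [low, high) bins.
SIZE_BINS = [(20, 50), (50, 100), (100, 1000), (1000, 3000), (3000, float("inf"))]


def supported_by_two_technologies(region_calls, tolerance=0.20):
    """Return the consensus size if >=2 technologies agree on it, else None.

    `region_calls` is a list of (callset, technology, predicted_size)
    tuples for one merged region (hypothetical layout).
    """
    if not region_calls:
        return None
    consensus = median(size for _, _, size in region_calls)
    agreeing_techs = {
        tech
        for _, tech, size in region_calls
        if abs(size - consensus) <= tolerance * consensus
    }
    return consensus if len(agreeing_techs) >= 2 else None


def size_bin(size):
    """Map a size in bp to its (low, high) bin, or None if below 20 bp."""
    for low, high in SIZE_BINS:
        if low <= size < high:
            return (low, high)
    return None


def sensitivity_by_bin(benchmark, callset):
    """Fraction of benchmark calls in each size bin that `callset` found.

    `benchmark` is a list of (consensus_size, set_of_callsets_with_a_call).
    """
    found, total = {}, {}
    for size, callers in benchmark:
        b = size_bin(size)
        if b is None:
            continue
        total[b] = total.get(b, 0) + 1
        found[b] = found.get(b, 0) + (callset in callers)
    return {b: found[b] / total[b] for b in total}


# Toy data; sizes and support sets are invented for illustration.
multi_tech = [("sniffles", "PacBio", 310), ("MetaSV", "Illumina", 300),
              ("CGSV", "Complete Genomics", 900)]
single_tech = [("sniffles", "PacBio", 300), ("smrtsvdip", "PacBio", 305)]
toy_benchmark = [(30, {"fermikit", "CGvcf"}), (40, {"fermikit", "Spiral"}),
                 (500, {"MetaSV", "sniffles"}), (5000, {"sniffles", "BioNano"})]
```

Because sensitivity is measured against calls that already have support from two technologies, it is relative to the draft benchmark rather than to the true set of deletions in the genome.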

Page 16: 160627 giab for festival sv workshop

Step 4: Filter questionable calls supported by 2+ technologies

• 316 calls covered >25% by segmental duplication >10kb

• 631 calls with at least one caller predicting a size >2x different from the consensus size

• 34 calls missed by callsets from multiple technologies that would be unlikely to all miss a real call: the product of (1 - sensitivity) over the callsets missing the call is < 2% in this size tranche

• 87 calls that overlap Ns in the reference
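
The "multiplied (1 - sensitivity)" filter can be read as an estimate of the chance that every callset lacking the call would miss a real deletion, under the assumption that callsets miss calls independently. The sketch below follows that reading. The function names are hypothetical, requiring the missing callsets to span at least two technologies is an interpretation of the slide wording, and the sensitivities are the 100-1000bp values from the sensitivity table later in this deck.

```python
from math import prod


def miss_probability(missing_callsets, sensitivity):
    """Probability that all listed callsets miss a true call, assuming independence."""
    return prod(1.0 - sensitivity[name] for name in missing_callsets)


def is_questionable(missing_callsets, sensitivity, technology, threshold=0.02):
    """Flag a call if callsets from >=2 technologies miss it implausibly often."""
    techs = {technology[name] for name in missing_callsets}
    if len(techs) < 2:
        return False
    return miss_probability(missing_callsets, sensitivity) < threshold


# Sensitivities for the 100-1000bp tranche, taken from the sensitivity table.
sens_100_1000 = {"MetaSV": 0.75, "smrtsvdip": 0.77,
                 "ParliamentPacBio": 0.74, "MultibreakSV": 0.72}
# Technology of each callset, per the "Callsets Contributed so far" slide.
tech = {"MetaSV": "Illumina", "smrtsvdip": "PacBio",
        "ParliamentPacBio": "PacBio", "MultibreakSV": "PacBio"}

# 0.25 * 0.23 * 0.26 is about 0.015, below the 2% threshold.
three_missing = ["MetaSV", "smrtsvdip", "ParliamentPacBio"]
# 0.25 * 0.28 = 0.07, above the threshold.
two_missing = ["MetaSV", "MultibreakSV"]
```

The independence assumption is optimistic, since callers run on the same data type tend to fail in the same regions, so the product is better treated as a screening heuristic than as a calibrated probability.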

Page 17: 160627 giab for festival sv workshop

Number of Calls Supported by 2 Technologies by Size Range

Size range | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb
pre-filtered | 2404 | 1307 | 2288 | 481 | 600
filtered | 2325 | 1188 | 1875 | 379 | 341

Page 18: 160627 giab for festival sv workshop

Sensitivity to Draft Benchmark Calls

Callset | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb
AssemblyticsFalcon | 0% | 55% | 68% | 59% | 45%
AssemblyticsMHAP | 0% | 51% | 66% | 56% | 52%
CGvcf | 86% | 20% | 4% | 0% | 0%
CGCNV | 0% | 0% | 0% | 0% | 29%
CGSV | 0% | 0% | 39% | 65% | 56%
CSHLassembly | 0% | 47% | 62% | 49% | 42%
sniffles | 7% | 28% | 58% | 59% | 64%
BioNano | 0% | 0% | 2% | 26% | 37%
Spiral | 85% | 44% | 57% | 38% | 40%
Cortex | 39% | 15% | 7% | 2% | 0%
CommonLaw | 0% | 0% | 8% | 47% | 40%
PBHoneySpots | 0% | 39% | 63% | 9% | 0%
PBHoneyTails | 0% | 0% | 0% | 31% | 57%
MetaSV | 0% | 0% | 75% | 74% | 71%
ParliamentPacBio | 0% | 0% | 74% | 75% | 48%
ParliamentAssembly | 0% | 0% | 65% | 44% | 2%
MultibreakSV | 16% | 66% | 72% | 59% | 47%
CNVnator | 0% | 0% | 22% | 71% | 74%
ParliamentPacBioForce | 1% | 45% | 72% | 31% | 18%
ParliamentAssemblyForce | 0% | 42% | 63% | 11% | 2%
BionanoHaplo | 0% | 0% | 0% | 36% | 49%
NabsysForce160405 | 0% | 0% | 5% | 25% | 28%
smrtsvdip | 0% | 66% | 77% | 65% | 55%
fermikit | 94% | 86% | 83% | 59% | 56%

Page 19: 160627 giab for festival sv workshop

Size distributions

Page 20: 160627 giab for festival sv workshop

Concordance between technologies

All Calls

High-confidence Calls

Page 21: 160627 giab for festival sv workshop

Support for all candidate regions

# of callsets # of technologies

Page 22: 160627 giab for festival sv workshop

Support for benchmark calls

# of callsets # of technologies

Page 23: 160627 giab for festival sv workshop

Possible double deletion

Page 24: 160627 giab for festival sv workshop

Clear 1kb homozygous deletion

Page 25: 160627 giab for festival sv workshop

Possible Complex SV called a deletion

Page 26: 160627 giab for festival sv workshop

Het in Son and hom ref and alt in parents

Page 27: 160627 giab for festival sv workshop

Heterozygous deletions in phased 10X reads

~3kb Heterozygous Deletion

~5kb Heterozygous Deletion

Page 28: 160627 giab for festival sv workshop

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools
– vcfeval: Len Trigg
– hap.py: Peter Krusche
– vgraph: Kevin Jacobs

• Standardized bed files with difficult genome contexts for stratification

Credit: GA4GH, Abby Beeler, Ellie Wood

Stratification of FP Rates

Higher FP rates at Tandem Repeats

Page 29: 160627 giab for festival sv workshop

Challenges in Benchmarking Small Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites

• Challenges with benchmarking complex variants near boundaries of high-confidence regions

• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
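
As an illustration of the last point, the sketch below puts a confidence interval on recall or precision computed from TP/FP/FN counts. The slide does not name a method, so the Wilson score interval is an assumption chosen here because it behaves reasonably for small strata and proportions near 0 or 1. The function names and counts are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def recall_with_ci(tp, fn):
    """Recall (sensitivity) = TP / (TP + FN), with its Wilson interval."""
    return tp / (tp + fn), wilson_interval(tp, tp + fn)


def precision_with_ci(tp, fp):
    """Precision = TP / (TP + FP), with its Wilson interval."""
    return tp / (tp + fp), wilson_interval(tp, tp + fp)


# Invented counts: the same 95% recall is far less certain in a small stratum.
large_stratum = recall_with_ci(tp=95, fn=5)
small_stratum = recall_with_ci(tp=19, fn=1)
```

This matters most after stratification, because a difficult-region stratum may contain only a handful of benchmark variants and its point estimate alone can be misleading.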

Page 30: 160627 giab for festival sv workshop

Particular Challenges in Benchmarking SV Calling

• How to establish benchmark calls for difficult regions?
• How to establish non-SV regions to assess FP rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence
– Predicted genotype
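
To make the "multiple dimensions of accuracy" concrete, the sketch below scores one predicted call against one benchmark call along several of the dimensions listed above. It is a hypothetical illustration, not an existing tool: the `SVCall` record, the `compare_call` function, and the choice of simple overlap as the existence test are assumptions, and exact-sequence comparison is left out.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    chrom: str
    start: int      # half-open interval [start, end)
    end: int
    sv_type: str    # e.g. "DEL"
    genotype: str   # e.g. "0/1" or "1/1"


def compare_call(pred: SVCall, truth: SVCall) -> dict:
    """Report agreement between a predicted and a benchmark call per dimension."""
    overlap = 0
    if pred.chrom == truth.chrom:
        overlap = max(0, min(pred.end, truth.end) - max(pred.start, truth.start))
    return {
        "exists": overlap > 0,
        "type_match": pred.sv_type == truth.sv_type,
        "size_ratio": (pred.end - pred.start) / (truth.end - truth.start),
        "breakpoint_offsets": (pred.start - truth.start, pred.end - truth.end),
        "genotype_match": pred.genotype == truth.genotype,
    }


# Invented example: right place and type, slightly wrong size, wrong genotype.
truth = SVCall("chr1", 10000, 11000, "DEL", "0/1")
pred = SVCall("chr1", 10050, 11100, "DEL", "1/1")
report = compare_call(pred, truth)
```

A call like this one would count as a true positive under an existence-only comparison and as an error under a genotype-aware one, which is why a benchmark needs to state which dimensions it evaluates.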

Page 31: 160627 giab for festival sv workshop

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time

Page 32: 160627 giab for festival sv workshop

Acknowledgements

• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Hemang Parikh

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA
– Liz Mansfield

• SV Callset Contributors
– CSHL/JHU
– Mt Sinai
– 10X
– Nabsys
– Spiral Genetics/Stanford
– Heng Li/Mike Lin
– DNAnexus
– Complete Genomics
– Baylor
– Bina/Roche
– BioNano Genomics
– Mark Chaisson
– NIH/NCBI
– NIH/NHGRI
– Can Alkan/Stanford