Genome in a Bottle Workshop: Small Variant Data Jamboree

Justin Zook and Marc Salit

NIST Genome-Scale Measurements Group

September 15, 2016


Page 2: Sept2016 smallvar nist intro

Integration Methods to Establish Reference Variant Calls

• Candidate variants
• Concordant variants
• Find characteristics of bias
• Arbitrate using evidence of bias
• Confidence level

Zook et al., Nature Biotechnology, 2014.


NEW: Reproducible integration pipeline with new calls for NA12878 and PGP trios!

Page 4: Sept2016 smallvar nist intro

New calls (v3.3) vs. old calls (v2.19)

v3.3
• 3441361 match PG
• 550982 PG calls outside high conf
• 124715 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
  – 40 calls not in PG
  – 60 extra PG calls

v2.19
• 3030717 match PG
• 1018795 PG calls outside high conf
• 122359 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
  – 87 calls not in PG
  – 404 extra PG calls

More high-confidence calls match Platinum Genomes

Similar extra calls not in Platinum Genomes

~80% fewer differences from PG in high confidence regions
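
The ~80% figure can be checked from the counts reported for the two versions after the exclusions. The sketch below only redoes that arithmetic; the dictionary layout and function names are illustrative assumptions, not part of the integration pipeline.

```python
# Counts copied from the v3.3 / v2.19 comparison (after excluding low
# confidence regions and regions around filtered PG calls).
differences = {
    "v3.3": {"calls_not_in_pg": 40, "extra_pg_calls": 60},
    "v2.19": {"calls_not_in_pg": 87, "extra_pg_calls": 404},
}


def total_differences(version):
    """Sum both kinds of difference from Platinum Genomes for one version."""
    return sum(differences[version].values())


def fractional_reduction(old, new):
    """Fraction by which the total differences dropped from old to new."""
    return 1 - total_differences(new) / total_differences(old)


reduction = fractional_reduction("v2.19", "v3.3")
print(f"v2.19: {total_differences('v2.19')} differences")
print(f"v3.3:  {total_differences('v3.3')} differences")
print(f"reduction: {reduction:.1%}")
```

The totals are 491 for v2.19 and 100 for v3.3, a reduction of about 79.6%.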

Page 8: Sept2016 smallvar nist intro

New calls (v3.3) vs. old calls (v2.19)

Example VCF (Verily), stratified

v3.3
• 17% of SNPs not assessed
  – 23% of SNPs in RefSeq coding
  – 53% of SNPs in “bad promoters”
• 78% of indels not assessed
  – 0.7% difference rate
• 17% FP in regions homologous to decoy

v2.19
• 27% of SNPs not assessed
  – 36% of SNPs in RefSeq coding
  – 82% of SNPs in “bad promoters”
• 78% of indels not assessed
  – 1.2% difference rate
• 0.2% FP in regions homologous to decoy

Page 9: Sept2016 smallvar nist intro

Principles of Integration Process

• Form sensitive variant calls from each dataset

• Define “callable regions” for each callset

• Filter calls from each method with annotations unlike concordant calls

• Compare high-confidence calls to other callsets and manually inspect subset of differences
  – vs. pedigree-based calls
  – vs. common pipelines
  – Trio analysis

• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
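
The "annotations unlike concordant calls" principle can be sketched as an outlier filter. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the annotation names (`depth`, `strand_bias`), the dictionary-per-call layout, and the 5th/95th percentile cut-offs are all hypothetical choices made for the example.

```python
from statistics import quantiles


def annotation_bounds(concordant_values, tail_percent=5):
    """Return (low, high) cut-offs spanning the central bulk of concordant calls.

    tail_percent is an assumed, tunable cut-off, not a value from the slides.
    """
    cuts = quantiles(concordant_values, n=100, method="inclusive")
    # cuts[i] is the (i + 1)-th percentile.
    return cuts[tail_percent - 1], cuts[100 - tail_percent - 1]


def filter_outliers(calls, concordant_calls, annotations, tail_percent=5):
    """Split calls into (kept, filtered) by comparing to concordant calls."""
    bounds = {
        name: annotation_bounds([c[name] for c in concordant_calls], tail_percent)
        for name in annotations
    }
    kept, filtered = [], []
    for call in calls:
        is_outlier = any(
            not (bounds[name][0] <= call[name] <= bounds[name][1])
            for name in annotations
        )
        (filtered if is_outlier else kept).append(call)
    return kept, filtered


# Hypothetical concordant calls: depth 25..45, strand bias 0.00..0.20.
concordant = [{"depth": 25 + i, "strand_bias": i / 100} for i in range(21)]
candidates = [
    {"id": "A", "depth": 33, "strand_bias": 0.10},   # resembles concordant calls
    {"id": "B", "depth": 300, "strand_bias": 0.10},  # depth outlier
    {"id": "C", "depth": 33, "strand_bias": 0.90},   # strand-bias outlier
]
kept, filtered = filter_outliers(candidates, concordant, ["depth", "strand_bias"])
print([c["id"] for c in kept], [c["id"] for c in filtered])
```

Which annotations to use, and how far out a value must sit before it suggests bias, would be chosen per callset.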

Page 10: Sept2016 smallvar nist intro

Criteria for including new callsets

• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
  – Good coverage and MapQ
  – Use knowledge about technology and manual inspection to exclude repetitive regions difficult for each dataset
  – For new callsets, ensure most FNs in callable regions relative to current high-confidence calls are questionable in the current calls
• Filter calls from each method with annotations unlike concordant calls
  – Annotations for which outliers are expected to indicate bias should be selected for each callset
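
As a rough illustration of the "good coverage and MapQ" criterion, callable regions can be derived from per-position depth and mapping quality. The thresholds below (`min_depth`, `max_depth`, `min_mapq`) and the per-position list inputs are assumptions for this sketch only; the slides do not state the values or data structures actually used, and the technology-specific exclusion of repetitive regions is not modeled.

```python
def callable_regions(depths, mapqs, min_depth=20, max_depth=60, min_mapq=40):
    """Return half-open (start, end) intervals where every position passes.

    depths and mapqs are per-position values for one contig (0-based).
    All three thresholds are hypothetical defaults.
    """
    n = min(len(depths), len(mapqs))
    regions = []
    start = None
    for pos in range(n):
        ok = min_depth <= depths[pos] <= max_depth and mapqs[pos] >= min_mapq
        if ok and start is None:
            start = pos                      # open a new callable interval
        elif not ok and start is not None:
            regions.append((start, pos))     # close the current interval
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


# Hypothetical per-position values: low depth at 3-4, low MapQ at 5,
# excessive depth (possible collapsed repeat) at 7.
depths = [30, 32, 31, 5, 4, 29, 30, 120, 33, 35]
mapqs = [60, 60, 60, 60, 60, 10, 60, 60, 60, 60]
regions = callable_regions(depths, mapqs)
print(regions)
```

An upper depth bound is included because unusually high coverage often points to mis-mapped reads from repeats; whether to use one is a per-dataset decision.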

Page 11: Sept2016 smallvar nist intro

Ongoing work: With sufficient coverage, 10X phasing seems to specifically identify most SNP errors identified by pedigree phasing

Collaboration with Nathan Edwards and Zhezhen Wang at Georgetown University

Page 12: Sept2016 smallvar nist intro

Ongoing work: How can we add more complex events that are not normalized?

• Current integration only breaks into primitives
  – Some complex calls end up uncertain
  – If part of a complex variant is uncertain, we exclude the whole region
• 3 approaches
  – Kevin Jacobs – vgraph
    • Merge all callsets into a single graph
    • Still need to work on partial complex calls
  – Chen Sun and Paul Medvedev – varmatch
    • Start with one callset and match other callers one at a time, adding in new variants from each
  – Sean Irvine and Len Trigg, RTG – vcfeval
    • Presentation today
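
Why un-normalized complex events are hard to integrate can be seen in a toy example: a complex call and its primitives look different as records but produce the same sequence, while a partial set of primitives does not. The sketch below compares representations by applying them to a reference string. It is only an illustration of the underlying idea and does not reproduce how vgraph, varmatch, or vcfeval work internally; the reference sequence, variant tuples, and function names are made up, and diploid genotypes and phasing are ignored.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    Positions are 0-based. Raises ValueError on overlap or REF mismatch.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"overlapping variant at {pos}")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF mismatch at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def same_haplotype(reference, variants_a, variants_b):
    """True if both variant sets yield the same altered sequence."""
    return apply_variants(reference, variants_a) == apply_variants(reference, variants_b)


reference = "GATTACAGG"                            # made-up reference
complex_call = [(2, "TTA", "CTG")]                 # one complex record
primitives = [(2, "T", "C"), (4, "A", "G")]        # same event as two SNPs
partial = [(2, "T", "C")]                          # only one primitive called

print(same_haplotype(reference, complex_call, primitives))
print(same_haplotype(reference, complex_call, partial))
```

The partial case corresponds to the situation above in which part of a complex variant is uncertain and the whole region is excluded.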

Page 13: Sept2016 smallvar nist intro

Ongoing work: GRCh38

• Draft calls for chr20 on GRCh38

• Make calls on mapped reads for Illumina and 10X

• Lift over calls for CG, Ion, and SOLiD

• Preliminary comparisons to PG seem similar to those for GRCh37

Page 14: Sept2016 smallvar nist intro

Ongoing/Future Work and Questions

• Integrate with pedigree calls for NA12878
  – Mike Eberle, Illumina
• Integrating phasing information from family, linked reads, etc.
  – Sean Irvine/Len Trigg, RTG
• Integrate complex variants
  – Sean Irvine/Len Trigg, RTG
  – Chen Sun/Paul Medvedev, PSU
• Incorporate more calls in difficult-to-map regions
  – 10X
  – Dovetail
  – PacBio

• How to integrate indels 15-50bp?

• Using ALT loci

Page 15: Sept2016 smallvar nist intro

Acknowledgements

• NIST
  – Marc Salit
  – Jenny McDaniel
  – Lindsay Vang
  – David Catoe
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
  – Liz Mansfield
  – Zivana Tevak
  – David Litwack