Tools for Using NIST Reference Materials

Genome in a Bottle: Tools for Using NIST Reference Materials

Next Generation Diagnostics Summit Short Course, August 2014

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Learning Objectives

• How can Genome in a Bottle Reference Materials help with validating NGS assays?

• Comparing your variant calls to high-confidence calls

• Tools available for understanding potential false positives and false negatives

• Examples of how labs are using our high-confidence calls

NIST-hosted Genome in a Bottle Consortium

• Infrastructure for performance assessment of NGS

– support science-based regulatory oversight

• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…

• Genome in a Bottle Consortium is developing standards to address this…

– human genomes as Reference Materials (RMs), characterized and disseminated by NIST

– tools and methods to use these RMs with common sequencing instruments and bioinformatics workflows

http://genomeinabottle.org

Whole genome sequencing technologies disagree about hundreds of thousands of variants

[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3. Region values, as # SNPs (% of SNPs detected by any platform): 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%).]
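Region counts like those in the Venn diagram above can be tallied from per-platform call sets. The following is a minimal sketch, not the analysis behind the slide: the function name `venn_regions` and the `(chrom, pos, ref, alt)` variant key are illustrative assumptions, and the toy data are invented.

```python
def venn_regions(call_sets):
    """Count variants in each exclusive Venn region.

    call_sets maps a platform name to a set of variant keys. The
    (chrom, pos, ref, alt) key used in the example is an assumption.
    Returns {frozenset of platform names: (count, percent of union)}.
    """
    union = set().union(*call_sets.values())
    counts = {}
    for variant in union:
        # The region is identified by exactly which platforms called the variant.
        members = frozenset(
            name for name, calls in call_sets.items() if variant in calls
        )
        counts[members] = counts.get(members, 0) + 1
    total = len(union)
    return {m: (c, 100.0 * c / total) for m, c in counts.items()}


# Toy data (hypothetical variants, not real calls).
toy = {
    "Platform #1": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")},
    "Platform #2": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "Platform #3": {("1", 100, "A", "G"), ("3", 10, "T", "C")},
}
for members, (count, pct) in venn_regions(toy).items():
    print(sorted(members), count, f"{pct:.2f}%")
```

Percentages are taken relative to the union of all call sets, matching the "% of SNPs detected by any platform" convention in the figure legend.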

Bioinformatics programs also disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of process

– materials will be certified for their variants against a reference sequence, with confidence estimates

generic measurement process

NIST Human Genome RMs in the pipeline

• All are 10 µg samples of DNA isolated from multistage, large-growth cell cultures

– all are intended to act as stable, homogeneous references suitable for use in regulated applications

– all genomes also available from Coriell repository

• Pilot Genome

– ~8400 tubes

• Ashkenazim Jewish Trio

– ~10000 son; ~2500 each parent

• Asian Trio

– ~10000 son; parents not yet planned as NIST RM

Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform

– take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms

Integration Methods to Establish Reference Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

Assigning confidence to genotypes

High-confidence sites

• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement

• At least some methods have no evidence of bias

• Inherited as expected

Less confident sites

• In a region known to be difficult for current technologies

• State reasons for lower confidence

• If a site is near a low confidence site, make it low confidence
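The last rule, demoting sites that sit near a low-confidence site, can be sketched as follows. This is an illustration only: the slide does not say how "near" is defined, so the 50 bp `window` default, the function name `mark_confidence`, and the `(chrom, pos)` site representation are all assumptions.

```python
from bisect import bisect_left


def mark_confidence(sites, low_conf_sites, window=50):
    """Label each site "low" if any low-confidence site lies within
    `window` bp on the same chromosome, else "high".

    sites and low_conf_sites are iterables of (chrom, pos) tuples.
    The window size is a hypothetical parameter, not a NIST value.
    """
    low_by_chrom = {}
    for chrom, pos in low_conf_sites:
        low_by_chrom.setdefault(chrom, []).append(pos)
    for positions in low_by_chrom.values():
        positions.sort()

    labels = {}
    for chrom, pos in sites:
        positions = low_by_chrom.get(chrom, [])
        # First low-confidence position at or after the left window edge.
        i = bisect_left(positions, pos - window)
        near = i < len(positions) and positions[i] <= pos + window
        labels[(chrom, pos)] = "low" if near else "high"
    return labels


labels = mark_confidence(
    sites=[("1", 100), ("1", 140), ("1", 500), ("2", 100)],
    low_conf_sites=[("1", 120)],
)
print(labels)
```

A low-confidence site passed in `sites` is labeled "low" as well, since it lies within zero distance of itself.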

Reasons we exclude regions from high-confidence set

Challenges with assessing performance

• All variant types are not equal

• All regions of the genome are not equal

– Homopolymers, STRs, duplications

– Can be similar or different in different genomes

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)

– standard diagnostic accuracy measures not well posed
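One way to see why a two-class positive/negative framing is awkward is to tabulate truth against calls by genotype class. The sketch below is an assumption-laden illustration: the four class names, the treatment of missing sites as no-calls, and the function names are choices made here, not a scheme defined in the slides.

```python
from collections import Counter


def genotype_class(gt):
    """Map a diploid VCF GT string such as '0/1' or '1|1' to a coarse class."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no-call"
    if all(a == "0" for a in alleles):
        return "hom-ref"
    if len(set(alleles)) == 1:
        return "hom-alt"
    return "het"  # includes two different ALT alleles, e.g. '1/2'


def genotype_confusion(truth, calls):
    """Count (truth class, called class) pairs over the sites in truth.

    truth and calls map a site key to a GT string. Sites missing from
    calls are counted as 'no-call' (an assumption of this sketch).
    """
    table = Counter()
    for site, truth_gt in truth.items():
        called_gt = calls.get(site, "./.")
        table[(genotype_class(truth_gt), genotype_class(called_gt))] += 1
    return table


truth = {"s1": "0/1", "s2": "1/1", "s3": "0/0", "s4": "0/1"}
calls = {"s1": "0/1", "s2": "0/1", "s3": "0/0"}
for (t, c), n in sorted(genotype_confusion(truth, calls).items()):
    print(f"truth={t:8s} called={c:8s} n={n}")
```

In this toy table the hom-alt site called as het is neither a clean true positive nor a clean false negative, and the no-call row shows how marking sites uncertain removes them from an accuracy calculation.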

Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM

• These data are presently being used for benchmarking

– prior to release of RMs

– SNPs & indels

– ~77% of the genome

NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer

November 20, 2013

Integrating NIST Call Sets into a Validation Workflow

Validation Report

False Positive Rate: FPR = FP/(FP+TN)

False Discovery Rate: FDR = FP/(FP+TP)

Sensitivity: Sens. = TP/(TP+FN)

Specificity: Spec. = TN/(FP+TN)

Balanced Accuracy: (Sens. + Spec.)/2
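The validation-report formulas above translate directly into code. This is a minimal sketch; the function name `validation_metrics` and the choice to return NaN for an empty denominator are assumptions of the example, not part of any NIST tool.

```python
import math


def validation_metrics(tp, fp, fn, tn):
    """Compute the validation-report metrics from confusion counts."""

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else math.nan

    sens = ratio(tp, tp + fn)
    spec = ratio(tn, fp + tn)
    return {
        "FPR": ratio(fp, fp + tn),   # false positive rate
        "FDR": ratio(fp, fp + tp),   # false discovery rate
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
    }


# Made-up counts, for illustration only.
for name, value in validation_metrics(tp=90, fp=5, fn=10, tn=95).items():
    print(f"{name}: {value:.4f}")
```

Note that FPR and specificity need a true-negative count, which for variant calling depends on how the set of assessed reference positions is defined; FDR and sensitivity do not.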

GCAT – Interactive Performance Metrics

• NIST is working with GCAT to use our highly confident variant calls

• Assess performance of many combinations of mappers and variant callers

• Currently assesses only exome sequencing

• www.bioplanet.com/gcat

GCAT Tests

GCAT Variant Calling Tests

Pre-run Tests

Upload your own variant calls

GCAT – Upload your own exome calls

Background

• Clinical laboratory – Division of Genomic Diagnostics, certified by regulatory agencies (CAP).

• CWES test requires stringent validation per CAP criteria to establish performance metrics of the test.

Utilizing NIST data in validation of CWES Test

• Sequence and call variants of NA12878 at CHOP

• CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file

• Compare CHOP dataset to NIST data set for concordance

NIST* Data Set Details:

• High quality reference data set on NA12878 (Dec. 2013)

• NIST’s highly confident Regions of Interest (ROI)

• Variants called in 219,222 regions on hg19 assembly

*: National Institute of Standards and Technology
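Comparing a lab's calls to the NIST calls is only meaningful where the lab's ROI and the NIST high-confidence regions overlap. A minimal sketch of that interval intersection follows; it assumes BED-style 0-based, half-open `(chrom, start, end)` tuples and that neither input contains self-overlapping intervals. The function name and toy coordinates are hypothetical.

```python
from collections import defaultdict


def intersect_regions(a, b):
    """Intersect two lists of (chrom, start, end) half-open intervals.

    Assumes intervals within each list do not overlap one another.
    """
    by_chrom_a, by_chrom_b = defaultdict(list), defaultdict(list)
    for chrom, start, end in a:
        by_chrom_a[chrom].append((start, end))
    for chrom, start, end in b:
        by_chrom_b[chrom].append((start, end))

    out = []
    for chrom in by_chrom_a:
        xs = sorted(by_chrom_a[chrom])
        ys = sorted(by_chrom_b.get(chrom, []))
        i = j = 0
        # Two-pointer sweep over the sorted intervals of this chromosome.
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                out.append((chrom, start, end))
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return out


roi = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
confident = [("chr1", 50, 250), ("chr3", 0, 10)]
shared = intersect_regions(roi, confident)
print(shared)
print(sum(end - start for _, start, end in shared), "bases assessed")
```

In practice a dedicated interval tool would usually be used for this step; the sketch only shows the logic.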

Analytical Validation of Clinical Whole-Exome Sequencing (CWES) Test

SENSITIVITY / SPECIFICITY: RefGene +/- 15bp (SSV5+)

CHOP vs. NIST

        SNVs     INDELs
TP      18480    396
FP      26       3
FN      63       30

FP: False Positive; TP: True Positive; FN: False Negative; TN: True Negative

                                     SNVs      INDELs
Sensitivity (TP/(TP+FN))             99.66%    92.96%
Specificity (TN/(TN+FP))             ~100%     ~100%
FDR (FP/(FP+TN))                     0.02%     0.08%
Accuracy ((TP+TN)/(TP+TN+FP+FN))     ~100%     ~100%

TN = NIST highly confident regions – CHOP ROIs
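The sensitivity figures reported above can be checked from the TP and FN counts on the same slide. The sketch below does only that arithmetic; the rows that depend on TN are not recomputed because the slide gives TN as a region definition rather than a count.

```python
# Counts as reported on the slide.
counts = {
    "SNVs": {"TP": 18480, "FP": 26, "FN": 63},
    "INDELs": {"TP": 396, "FP": 3, "FN": 30},
}


def sensitivity_percent(tp, fn):
    """Sensitivity = TP / (TP + FN), expressed as a percentage."""
    return 100.0 * tp / (tp + fn)


for kind, c in counts.items():
    print(f"{kind}: sensitivity = {sensitivity_percent(c['TP'], c['FN']):.2f}%")
# SNVs: sensitivity = 99.66%
# INDELs: sensitivity = 92.96%
```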

Further analysis on presumptive 93 FNs and 29 FPs

• 93 FNs: 63 SNVs, 30 INDELs

• 29 FPs: 26 SNVs, 3 INDELs

Using the GeT-RM Browser

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of questionable calls

GeT-RM Load alignments for visualization

Chr6:151669820 Chr6:151669828

Difficult site in homopolymer in intron of gene AKAP12

Chr1:1666303

SNP in Gene SLC35E2, which is also in a pseudogene and a segmental duplication

Segmental Duplication

Pseudogene

Structural Variant

Feedback from MoCha lab in NCI

• We built a targeted amplicon NGS assay for detecting mutations in clinical tumor specimens

• To assess the assay’s specificity, we compared 84 runs of CEPH NA12878 data from our assay with NIST’s consensus variant list (VCF v2.15)

• We observed a high overall concordance with a few FP variants in homopolymeric regions unique to our platform

• We concluded that NIST GIAB is a useful reference standard to evaluate assay specificity

Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine

“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”

Benchmarking somatic variant calling at Qiagen

HSPH – Brad Chapman Comparing variant callers

http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/

NextSeq: New Chemistry – Does it work?

Whole Genome Metrics                                NextSeq500          HiSeq2500
% Genome Covered (>= 10X in Q20 bases)              96%                 96%
Mean Coverage in Q20 Bases                          28.3X               31.8X
SNPs Called (% dbSNP 129)                           3,643,998 (89%)     3,664,014 (88%)
InDels Called (% dbSNP 129)                         646,907 (65.7%)     686,547 (64.5%)
Genome in a Bottle SNP Sensitivity & Precision      99.07% | 99.04%     99.25% | 99.90%
Genome in a Bottle Indel Sensitivity & Precision    86.90% | 98.85%     93.29% | 97.54%

Ion Benchmarking I

Ion Benchmarking II

Command-line tools for variant benchmarking

• USeq VCFComparator

– http://sourceforge.net/projects/useq/

• RTG vcfeval

– ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

• bcbio.variation

– http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/

• SMaSH

– http://smash.cs.berkeley.edu/
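To make concrete what such benchmarking tools compute, here is a deliberately naive sketch that counts TP, FP, and FN by exact match of `(chrom, pos, ref, alt)` inside a set of confident regions. It is not how any of the tools listed above work internally: they handle issues such as differing representations of the same indel, which exact matching misses. All function names and the toy records are hypothetical.

```python
def parse_vcf_lines(lines):
    """Return a set of (chrom, pos, ref, alt) keys from VCF text lines."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per ALT allele
            variants.add((chrom, pos, ref, alt))
    return variants


def in_regions(variant, regions):
    """True if the 1-based VCF position falls in a 0-based half-open region."""
    chrom, pos = variant[0], variant[1]
    # Linear scan: fine for a toy example, too slow for whole genomes.
    return any(c == chrom and s < pos <= e for c, s, e in regions)


def compare_calls(truth, calls, regions):
    """Count exact-match TP, FP, FN restricted to the given regions."""
    truth = {v for v in truth if in_regions(v, regions)}
    calls = {v for v in calls if in_regions(v, regions)}
    return {
        "TP": len(truth & calls),
        "FP": len(calls - truth),
        "FN": len(truth - calls),
    }


truth_vcf = [
    "##fileformat=VCFv4.1",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tC\tT\t.\tPASS\t.",
    "1\t5000\t.\tG\tA\t.\tPASS\t.",
]
calls_vcf = [
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t300\t.\tT\tC\t.\tPASS\t.",
    "1\t5000\t.\tG\tA\t.\tPASS\t.",
]
confident = [("1", 0, 1000)]
print(compare_calls(parse_vcf_lines(truth_vcf), parse_vcf_lines(calls_vcf), confident))
```

The variant at position 5000 is ignored in both sets because it lies outside the confident region, which is the behavior the high-confidence BED file is meant to enforce.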

How Can I Get Involved?

• Use our integrated SNP/indel genotypes for NA12878 and give us feedback

– Cells and DNA currently available from Coriell

– NIST RM available late 2014

• Sequence/analyze the new Genome in a Bottle samples

• Help with Structural Variant calls

• Help with analyzing data from long-read technologies

• Attend our biannual workshops (January in CA, August in MD)

• Help develop methods to measure performance using our well-characterized genomes

http://genomeinabottle.org

Email:

Justin Zook – jzook@nist.gov

Marc Salit – salit@nist.gov

Slides on slideshare at: http://www.slideshare.net/GenomeInABottle
