Using the Genome in a Bottle (GIAB) pilot reference material: Its strengths and limitations for analytic validation of a diagnostic panel test
Stephen E. Lincoln, Invitae
2/9/2015 © 2013-2014 Invitae Corporation. All Rights Reserved | CONFIDENTIAL

Jan2015 using the pilot genome rm for clinical validation steve lincoln


Page 1: Jan2015 using the pilot genome rm for clinical validation steve lincoln


Page 2: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Genetic Diagnostic Tests ≠ Research

• Diagnostic tests are ordered in response to a medical question that needs an answer in order to make a specific decision for a specific patient
− Can be time critical; decisions may not be reversible
• Our job: Provide a highly accurate answer to the question asked in the time needed
− A complete answer is highly valued
o No matter how challenging (with some limits)
− Extra information is not valued (in most cases)
• Rigorous validation required by CLIA, CAP, the medical community and payers
− Focus on Analytic (not Clinical) validation here…

Page 3: Jan2015 using the pilot genome rm for clinical validation steve lincoln

29 Gene Hereditary Cancer Panel

Sub-Panel | Genes | Cumulative Total | Gene names
BRCA1/2 | 2 | 2 | BRCA1, BRCA2
Other High-Risk Breast/Ovarian | 4 | 6 | CDH1, PTEN, STK11, TP53
Moderate-Risk Breast/Ovarian | 6 | 12 | ATM, BRIP1, CHEK2, NBN, PALB2, RAD51C
Lynch Syndrome | 5 | 17 | EPCAM, MLH1, MSH2, MSH6, PMS2
Other Hereditary Cancer Syndromes | 11 | 28 | APC, BMPR1A, SMAD4, CDK4, CDKN2A, PALLD, MET, MEN1, RET, PTCH1, VHL
MUTYH | 1 | 29 | MUTYH

Page 4: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Technical Requirements For These 29 Genes

1. Multiple Enrichment Methods
− No one technology delivers adequate coverage of all 29 genes
2. Copy number and other structural variants play a significant role in addition to sequence variants
− CNVs as small as one exon
− Alu insertions
− Tandem duplications
3. Of these 29 genes, a number are “hard”
− PMS2 (last 4 exons) and CHEK2 have pseudogenes
− SMAD4 also does, in some people
− MSH2 has a large intronic homopolymer-A immediately next to a canonical splice site (known to harbor pathogenic mutations)
− CDKN2A has a low complexity 80% GC tandem duplication at the 5’ Met (also known to harbor pathogenic mutations)

Page 5: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Study Population

Group | N | Description | Previous Testing
Prospective Clinical | 735 | Prospectively accrued clinical cases | Clinical testing for BRCA1/2, occasionally other genes (depending on case)
High-Risk Clinical (Total 327) | 209 | Retrospective cases from a clinical biobank generally containing higher-risk individuals |
High-Risk Clinical (Total 327) | 118 | Cases referred due to known pathogenic variant in family | Clinical single-site testing
Reference Samples | 36 | Reference samples from public biobanks (Coriell, NIBSC) | Samples carry known pathogenic variants
Well-Characterized Genomes (WCGs) | 7 | Reference samples from public biobanks with high-quality whole genome sequencing (WGS) data | Variants in 29 cancer genes extracted from WGS data; most of these are benign
Total | 1105 | (1062 clinical cases) |
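The totals in the study population table can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the group sizes come from the table, while the dictionary keys are shorthand labels chosen here for illustration.

```python
# Group sizes as listed in the Study Population table.
# The key names are shorthand labels, not terms from the slides.
groups = {
    "Prospective Clinical": 735,
    "High-Risk Clinical (biobank)": 209,
    "High-Risk Clinical (familial variant)": 118,
    "Reference Samples": 36,
    "Well-Characterized Genomes": 7,
}

# Clinical cases are the prospective and high-risk groups combined.
clinical = sum(n for name, n in groups.items() if "Clinical" in name)
total = sum(groups.values())

print(f"Clinical cases:  {clinical}")
print(f"All individuals: {total}")
```

The two sums correspond to the 1062 clinical cases and the 1105 individuals quoted in the table.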

Page 6: Jan2015 using the pilot genome rm for clinical validation steve lincoln

7 Well Characterized Genomes (WCGs) Used

[Figure: two pedigree diagrams with selected samples check-marked (✔). CEPH/Utah Pedigree 1463: NA12877 to NA12893. Yoruba Family Y117: NA19238, NA19239, NA19240.]

Integrated Complete Genomics, Illumina Platinum and other data sets
Mendelian scrub (leveraging data from family members not used in this study)

Geoff Nilsen

Page 7: Jan2015 using the pilot genome rm for clinical validation steve lincoln

WCGs in Cancer Panel Validation

1. CLIA Validation and Performance Study (pre-GIAB):
• Integrated CG and Illumina Platinum data
• Compared scrubbed data against our Dx test data
2. Later reconciled NA12878 data against GIAB data set
• Substantially the same as our integrated data

Results presented here are a mix of the pre- and post-GIAB results

Geoff Nilsen, Shan Yang

Page 8: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Genetic Data for 1105 Individuals x 29 genes

• 58,708 variants detected (avg. 53 per patient)
• >90% are common polymorphisms (MAF>1% in 1KG)
• >99% are single nucleotide variants (SNVs)
• <0.1% are of the most technically challenging types*
− CNVs (single gene to single exon)
− Larger indels (≥10bp)
− Closely-spaced variants (≤25bp)
− Complex variants
− Variants in/near low complexity sequence

*We believe this largely reflects prevalence, not sensitivity limitations.

Page 9: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Analytic Validation

Page 10: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Variants Selected in Analytic Validation Study

Class | Type | Variants | Details
Sequence | Single Nucleotide Variants (SNVs) | 549 |
Sequence | Sequence deletions <10 base-pairs | 125 |
Sequence | Sequence insertions <5 base-pairs | 31 |
Sequence | Sequence insertions ≥5 base-pairs | 4 | 24, 5 bp
Sequence | Sequence deletions ≥10 base-pairs | 9 | 126, 40, 19, 15, 11 bp
Sequence | Complex variants | 6 | Delins, haplotypes, homopolymer-associated [1]
Copy Number | Single exon deletions | 9 | BRCA1, BRCA2, MSH2, PMS2
Copy Number | Single exon duplications | 4 | BRCA1, MLH1
Copy Number | Deletions of multiple exons or whole gene | 10 | BRCA1, MSH2, RAD51C
Copy Number | Duplications of multiple exons or whole gene | 6 | BRCA1, BRCA2, NBN, SMAD4
Total | | 750 |

“Hard Stuff”: Some published validation studies have few, if any, examples of these relatively challenging classes of variation [2,3]

All could be directly compared between NGS panel and reference/orthogonal data.

1. MSH2:c.942+3A>T
2. Bosdet et al, J Mol Dx, 2013
3. Chong et al, PLOS One, 2014

Page 11: Jan2015 using the pilot genome rm for clinical validation steve lincoln

WCGs Contribution to Analytic Validation Study

• 7 samples contributed 310 of 750 selected variants
− All variants in assay targets in the WCG data sets were used
− 41% of the total set of variants came from 0.6% of the samples
• In 15 of 29 genes the 7 WCGs doubled (or more) the selected variant count
• WCGs added variants in one gene (PTCH1) which otherwise had none selected
• Saved us 310 Sanger confirmations
− Unlike confirmation, WCGs contribute both to sensitivity and specificity measurements in a strong way
• As a replenishable resource, it’s easy to rerun WCGs
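The two percentages on this slide follow directly from the counts quoted in the deck. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Counts quoted in the deck.
wcg_variants = 310        # selected variants contributed by the 7 WCGs
selected_variants = 750   # all variants selected for the validation study
wcg_samples = 7
all_samples = 1105

variant_share = wcg_variants / selected_variants
sample_share = wcg_samples / all_samples

print(f"{variant_share:.1%} of selected variants came from {sample_share:.1%} of samples")
```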

Page 12: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Limitations of the 7 WCGs

• No coding variants in 5 of 29 genes
− CDKN2A, PALB2, RAD51C, SMAD4
− CHEK2 (a special case)
• Only 1 coding variant in 2 other genes
− PTEN, TP53
• The only errors in any reference data we saw were in WCGs (but not GIAB)
− 2 in NA19240, 1 in NA12892
− All errors in low-complexity sequence
• Many of the variants are repeated
− Partly due to using related individuals
− Partly because most are common polymorphisms

Gene | WCGs | All Others
APC | 31 | 9
ATM | 26 | 10
BMPR1A | 7 | 1
BRCA1 | 21 | 162
BRCA2 | 39 | 156
BRIP1 | 23 | 5
CDH1 | 12 | 4
CDKN2A | 0 | 3
CHEK2 | 0 | 4
EPCAM | 8 | 1
MEN1 | 18 | 1
MET | 18 | 2
MLH1 | 4 | 6
MSH2 | 4 | 8
MSH6 | 11 | 7
MUTYH | 4 | 23
NBN | 16 | 3
PALB2 | 0 | 8
PALLD | 6 | 1
PMS2 | 16 | 9
PTCH1 | 10 | 0
PTEN | 1 | 1
RAD51C | 0 | 4
RET | 27 | 2
SMAD4 | 0 | 3
TP53 | 1 | 3
VHL | 7 | 1
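The per-gene counts can be checked against the bullets on this slide. The sketch below transcribes the table and assumes that a blank cell means zero, with single-number rows assigned to a column according to the bullets (no WCG coding variants in CDKN2A, CHEK2, PALB2, RAD51C and SMAD4; PTCH1 variants only from WCGs). That column assignment is an interpretation of the slide, not something the slide states cell by cell.

```python
# (WCG count, all-other count) per gene, transcribed from the table.
# Assumption: blank cells in the original table are read as 0.
counts = {
    "APC": (31, 9), "ATM": (26, 10), "BMPR1A": (7, 1), "BRCA1": (21, 162),
    "BRCA2": (39, 156), "BRIP1": (23, 5), "CDH1": (12, 4), "CDKN2A": (0, 3),
    "CHEK2": (0, 4), "EPCAM": (8, 1), "MEN1": (18, 1), "MET": (18, 2),
    "MLH1": (4, 6), "MSH2": (4, 8), "MSH6": (11, 7), "MUTYH": (4, 23),
    "NBN": (16, 3), "PALB2": (0, 8), "PALLD": (6, 1), "PMS2": (16, 9),
    "PTCH1": (10, 0), "PTEN": (1, 1), "RAD51C": (0, 4), "RET": (27, 2),
    "SMAD4": (0, 3), "TP53": (1, 3), "VHL": (7, 1),
}

# Total contributed by the WCGs across all genes.
wcg_total = sum(wcg for wcg, _ in counts.values())

# Genes where the WCGs contribute nothing, and genes covered only by WCGs.
no_wcg = sorted(g for g, (wcg, _) in counts.items() if wcg == 0)
wcg_only = sorted(g for g, (wcg, other) in counts.items() if wcg > 0 and other == 0)

print("WCG variants in table:", wcg_total)
print("Genes with no WCG variants:", ", ".join(no_wcg))
print("Genes covered only by WCGs:", ", ".join(wcg_only))
```

Under that reading, the WCG column sums to the 310 variants quoted for the 7 WCGs, which supports the column assignment.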

Page 13: Jan2015 using the pilot genome rm for clinical validation steve lincoln


PALB2 in NA12878 (Get-RM browser)

Lots of GIAB variants but none are exonic

Page 14: Jan2015 using the pilot genome rm for clinical validation steve lincoln


CDKN2A in NA12878 (Get-RM browser)

Just one GIAB variant in 3’ UTR

(Similar situations in RAD51C, SMAD4)

Page 15: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Other Limitations of the WCGs in this study

• 304 of 310 sequence variants are SNVs
• 6 small deletions (max 4bp)
• 0 insertions
• 0 other variant types
• 0 variants in the most tricky regions for a Dx test
− Segdups, low-complexity, etc.
• No GIAB CNV data yet, but we’d expect 0 positives
• None of the WCG variants are clinically relevant
− None pathogenic or likely pathogenic under ACMG ISV criteria
− Unsurprisingly
• But unfortunately…

Page 16: Jan2015 using the pilot genome rm for clinical validation steve lincoln

A Significant Fraction of Pathogenic Variants in The Clinical Cases are Technically Challenging

Pathogenic and likely pathogenic variants (n=260) among the clinical cases (n=1062) by variant type:
• Small indel: 52.3%
• SNV: 34.2%
• CNV multi-exon: 4.6%
• CNV single-exon: 3.8%
• Large indel: 3.5%
• Complex: 1.5%
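The size of the technically challenging share can be worked out from the percentages above. This sketch assumes that "challenging" means everything other than SNVs and small indels; that grouping is an assumption made here, since the slide gives only the per-type percentages.

```python
# Share of pathogenic / likely pathogenic variants by type (percent, n=260).
shares = {
    "Small indel": 52.3,
    "SNV": 34.2,
    "CNV multi-exon": 4.6,
    "CNV single-exon": 3.8,
    "Large indel": 3.5,
    "Complex": 1.5,
}

# Assumed grouping: every type except SNVs and small indels counts as "hard".
hard_types = {"CNV multi-exon", "CNV single-exon", "Large indel", "Complex"}
n_pathogenic = 260

hard_share = sum(pct for kind, pct in shares.items() if kind in hard_types)
approx_hard = round(n_pathogenic * hard_share / 100)

print(f"Hard share of pathogenic variants: {hard_share:.1f}%")
print(f"Approximate number of hard pathogenic variants: {approx_hard}")
```

Under this grouping, roughly one in seven or eight pathogenic findings falls into a class that the 7 WCGs barely represent.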

Page 17: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Examples


Page 18: Jan2015 using the pilot genome rm for clinical validation steve lincoln

BRCA1: c.1175_1214del40

[Figure: read alignment view. Annotations: deletion mapped correctly in a fraction of reads; split-read signature in additional reads.]
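The two read signatures named on this slide can be told apart from the CIGAR string of each alignment in a SAM/BAM file. In a CIGAR string, a deletion placed inside the alignment shows up as a `D` operation, and a split read shows up as soft-clipped bases (`S`) at one end. The sketch below is a generic illustration, not the pipeline used in the study; the function names and example CIGAR strings are hypothetical.

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> Tuple[int, int]:
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so drop them before checking the ends.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)
    lead = ops[0][0] if ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (lead, trail)


def deleted_bases(cigar: str) -> int:
    """Return the total number of reference bases deleted within the alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "D")


# Hypothetical reads around a 40 bp deletion.
examples = {
    "deletion aligned in read": "60M40D41M",
    "split read (soft-clipped)": "70M31S",
    "reference-matching read": "101M",
}
for label, cigar in examples.items():
    print(f"{label}: deleted={deleted_bases(cigar)}, clipped={soft_clipped_bases(cigar)}")
```

Counting reads of each kind at a locus gives a rough picture of how much evidence for a large deletion sits in soft-clipped reads rather than in reads where the aligner placed the gap.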

Page 19: Jan2015 using the pilot genome rm for clinical validation steve lincoln

BRCA2: c.9203del126

[Figure: read alignment view over the exon target. Annotations: split-read signal at 3’ end of deletion; split-read signal at 5’ end of deletion.]

Page 20: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Deletion Affecting 2 Neighboring Exons

[Figure: read alignment view spanning exon, intron, exon. Annotations: split-read signal at 3’ end of deletion; split-read signal at 5’ end of deletion.]

Page 21: Jan2015 using the pilot genome rm for clinical validation steve lincoln

CDKN2A:c.9_32dup24

[Figure: read alignment view at the translation start (5’ Met), showing Repeat Copy 1 and Repeat Copy 2. Annotations: insertion of 3rd repeat in correctly mapped NGS reads; split-read signal from 3rd copy (soft-clipped reads), marked at two places.]

Lincoln et al., December 2014, Sup. Figures Page 21

Page 22: Jan2015 using the pilot genome rm for clinical validation steve lincoln

BRCA2 c.156_insAlu

[Figure: read alignment view. Annotation: split-read signal of Alu sequence.]

Page 23: Jan2015 using the pilot genome rm for clinical validation steve lincoln

MSH2:c.943+3T>C

• Get IGV

[Figure: read alignment view. Annotations: homopolymer-A; alignment and biochemical artifacts.]

Page 24: Jan2015 using the pilot genome rm for clinical validation steve lincoln

SMAD4 Whole-Gene Duplication

[Figure: read alignment view. Annotations: split-read signal of neighboring exon sequence, with the same signal (“Ditto”) marked at three further sites.]

Rare Pseudogene Insertion

Page 25: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Lies, Damned Lies and Statistics*

• Imagine this validation study:
− Test genes/exons of medical relevance in NA12878 (etc)
− Compare test results to GIAB reference data
− Count concordance, calculate sensitivity, specificity, and PPV
• Imagine an assay which silently fails to detect all “hard” variants, but which works highly accurately on the “easy” variants
• For the total spectrum of variants, sensitivity and specificity will be over 99.9% for a large enough panel/study
• But among the truly positive patients there is a >10% chance of a clinical false negative
− In targeted and validated assay regions!

*Mark Twain
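The argument on this slide can be made concrete with a small calculation. The sketch below models the hypothetical assay described above: it calls every "easy" variant and misses every "hard" one. Two inputs come from the deck: hard types are under 0.1% of all detected variants, and CNVs, large indels and complex variants together make up 13.4% of pathogenic variants (4.6 + 3.8 + 3.5 + 1.5). Treating those four types as the "hard" set, and assuming one pathogenic variant per positive patient, are simplifications made here.

```python
def sensitivity_if_hard_missed(hard_fraction: float) -> float:
    """Sensitivity of a hypothetical assay that calls every easy variant and no hard one."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    return 1.0 - hard_fraction


# Hard types are <0.1% of all detected variants; 0.001 is used as the upper bound.
overall_sensitivity = sensitivity_if_hard_missed(0.001)

# Assumed grouping: CNVs, large indels and complex variants, 13.4% of pathogenic variants.
clinical_sensitivity = sensitivity_if_hard_missed(0.134)
clinical_false_negative = 1.0 - clinical_sensitivity

print(f"Sensitivity over all variants: {overall_sensitivity:.1%}")
print(f"False-negative chance for a truly positive patient: {clinical_false_negative:.1%}")
```

The same assay looks close to perfect when scored over all variants and fails more than one in ten positive patients, because the two populations have very different mixes of variant types.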

Page 26: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Conclusion

• Well characterized genomes, in particular NA12878 with the GIAB data set, contributed significantly to the analytic validation of a hereditary cancer panel test
• But there were important limitations:
− Few if any coding variants in some genes
o These are the majority of regions targeted by most Dx assays
− Few deletions (in these regions)
o No insertions in these regions
− Very few complex or “hard” variants, including
o Large indels
o Small CNVs
o Variants in medically relevant low complexity regions
o Other tricky stuff

Page 27: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Wish List for GIAB Reference Samples

• More samples with greater genetic diversity
− This is in process!
• CNV/SV maps
− This is in process!
• Fill in some regions currently missing data
− Suggestion: Prioritize coding regions of known disease genes
o There are ~3,000 in total: ~700 generally used in Dx, ~100 commonly used
• Engineered control with lots of “hard” variants
− In subsets of those known disease genes (commonly used ones)
− Genetically engineered cell lines or spike-ins?
• Data in transcript coordinates, using HGVS
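One practical benefit of transcript coordinates is that a variant name can be sanity-checked on its own. The sketch below handles only the range-style names written on the example slides, such as `c.1175_1214del40` and `c.9_32dup24`, and checks that the stated length agrees with the coordinate span. It is a simplified pattern written for illustration, not a full HGVS parser, and the function name is hypothetical.

```python
import re

# Matches range-style names as written on the slides, e.g. "c.1175_1214del40".
# Groups: start, end, event type, stated length in base pairs.
RANGE_VARIANT = re.compile(r"^c\.(\d+)_(\d+)(del|dup)(\d+)$")


def stated_length_matches(name: str) -> bool:
    """Check that the length in the name equals the inclusive coordinate span."""
    match = RANGE_VARIANT.match(name)
    if match is None:
        raise ValueError(f"unsupported variant name: {name}")
    start, end, _kind, length = match.groups()
    return int(end) - int(start) + 1 == int(length)


for name in ("c.1175_1214del40", "c.9_32dup24"):
    print(name, "consistent" if stated_length_matches(name) else "inconsistent")
```

Names outside this narrow pattern, such as intronic offsets or insertions, raise `ValueError` rather than being guessed at.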

Page 28: Jan2015 using the pilot genome rm for clinical validation steve lincoln

Acknowledgements

• Steve Lincoln
• Yuya Kobayashi
• Michael Anderson
• Shan Yang
• Kevin Jacobs
• Josh Paul
• Geoff Nilsen
• Jon Sorenson
• Federico Monzon
• Swaroop Aradhya
• Scott Topper
• Martin Powers

• Jim Ford
• Allison Kurian
• Meredith Mills
• Leif Ellisen
• Andrea Desmond
• Michelle Gabree
• Kristen Shannon