© 2013 Real Time Genomics, Inc. NA12878 Trio/Pedigree Analysis Francisco M. De La Vega, D.Sc. VP Genome Science

Aug2013 real time genomics trio pedigree analysis


Page 1: Aug2013 real time genomics trio pedigree analysis

NA12878 Trio/Pedigree Analysis

Francisco M. De La Vega, D.Sc., VP Genome Science

Page 2: Aug2013 real time genomics trio pedigree analysis

Leveraging trio information

•  GiaB has selected reference materials in the form of father, mother, offspring trios
•  The goal was to leverage the Mendelian inheritance patterns to:
   –  Identify variant genotype errors that are inconsistent with Mendelian inheritance
   –  Remove these errors from the reference baseline calls
•  However, if variant identification methods don't directly use pedigree information and jointly analyze the trio alignments, an opportunity to improve the genotype calls is missed
•  We focused on using the RTG Family caller to better leverage the shared information in the trios and improve the call set, while reducing Mendelian-inconsistent genotype errors
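As a minimal sketch of what "inconsistent with Mendelian inheritance" means for a single site, the check below tests whether a child's genotype can be built from one paternal and one maternal allele. It assumes VCF-style diploid genotype strings on an autosome with no missing calls; the function names are illustrative and are not part of any RTG tool.

```python
from itertools import product


def parse_gt(gt):
    """Split a VCF-style diploid genotype such as '0/1' or '1|0' into allele indices."""
    return tuple(int(allele) for allele in gt.replace("|", "/").split("/"))


def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can be formed from one paternal and one maternal allele."""
    child_alleles = sorted(parse_gt(child))
    return any(
        sorted((f, m)) == child_alleles
        for f, m in product(parse_gt(father), parse_gt(mother))
    )


print(is_mendelian_consistent("0/0", "0/1", "0/1"))  # True: alt allele comes from the mother
print(is_mendelian_consistent("0/0", "0/0", "0/1"))  # False: neither parent carries the alt allele
print(is_mendelian_consistent("1/1", "1/1", "0/1"))  # False: neither parent carries the ref allele
```

A site that fails this check is either a genotyping error in one of the three samples or a candidate de novo mutation, which is why a per-sample caller alone cannot resolve it.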

Page 3: Aug2013 real time genomics trio pedigree analysis

Variant calling can be improved by jointly analyzing related samples

[Figure: read pileups for related samples, each with a genotype label; annotation: Shared haplotypes]

Page 4: Aug2013 real time genomics trio pedigree analysis

Variant calling can be improved by jointly analyzing related samples

[Figure: read pileups for related samples, each with a genotype label; annotations: Mendelian variant segregation, Shared haplotypes]

Page 5: Aug2013 real time genomics trio pedigree analysis

Mendelian inconsistency

[Figure: trio read pileups with per-sample genotype labels; one genotype call is flagged "Low QV"]

Page 6: Aug2013 real time genomics trio pedigree analysis

Joint trio analysis corrects Mendelian errors

[Figure: trio read pileups with per-sample genotype labels after joint calling; genotype labels include A/C, and one call is flagged "Good QV"]

Page 7: Aug2013 real time genomics trio pedigree analysis

NA12878 calls from trio calling

•  Comparing offspring variants from singleton vs. pedigree calling
   –  Both showing good quality metrics
•  Using family information, more good calls can be made and dubious calls are downgraded

Variant Statistics

NA12878 call set   SNVs        Indels    MNPs     SNV Het/Hom   Ti/Tv   % dbSNP (r129)
RTG single         3,329,797   558,242   31,070   1.55          2.11    90.8%
RTG trio           3,363,619   595,030   33,686   1.57          2.11    90.4%
GATK/VQSR          3,263,289   610,837   N/A      1.51          2.09    91.7%

Data: WGS 2x100bp >50X Illumina Platinum Genomes data (ENA Acc. No. ERP001960). RTG AVR score cut-off 0.15; GATK v1.7 & BWA 0.6.1.

[Figure: Venn diagram of Family vs. Singleton calls, labeled 142,848, 68,000, and 3,849,457; trio pedigree inset with NA12878, NA12891, NA12892]

Page 8: Aug2013 real time genomics trio pedigree analysis

NA12878 vs reference datasets

–  1000 Genomes Illumina OMNI SNP array
   •  Polymorphic sites – TP proxy
   •  Monomorphic sites – FP proxy
–  Get-RM high confidence call set
–  GiaB high confidence calls in BED region

NA12878 call set   1kP OMNI Poly (TP%)   1kP OMNI Mono (FP%)   Get-RM¶ (TP%)   GiaB (TP%)   GiaB-BED (TP%)
RTG single         97.5%                 0.10%                 97.4%           N/A          N/A
RTG trio           97.5%                 0.24%                 97.0%           90.5%        94.1%
GATK/VQSR          97.8%                 0.17%                 87.8%           88.4%        92.5%

§ Relative to dbSNP 137; statistics for SNVs only. ¶ Get-RM consistent high-quality variants; n=498

[Figure: trio pedigree inset with NA12878, NA12891, NA12892]

Page 9: Aug2013 real time genomics trio pedigree analysis

ROC Trio calls vs. GiaB baseline (BED)

RTG snpsimeval tool; SNV/indel/MNP; zygosity match

Page 10: Aug2013 real time genomics trio pedigree analysis

ROC Trio calls vs. GiaB baseline

RTG snpsimeval tool; SNV/indel/MNP; zygosity match

Page 11: Aug2013 real time genomics trio pedigree analysis

ROC Trio calls vs. CGI baseline

RTG snpsimeval tool; SNV/indel/MNP; zygosity match

Page 12: Aug2013 real time genomics trio pedigree analysis

Mendelian inconsistency errors

RTG family caller reduces Mendelian Inheritance Errors (MIE) over 60X vs. RTG singleton calling (over 70X vs. GATK/VQSR)

[Figure: bar chart, log counts of MIE per call set]

Call set     MIE count
RTG single   335,625
RTG trio     4,870
GATK/VQSR    351,904
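The fold reductions quoted on this slide follow directly from the three MIE counts. The short sketch below redoes that arithmetic; the dictionary and function names are illustrative only.

```python
# MIE counts as printed on the slide.
mie_counts = {"RTG single": 335_625, "RTG trio": 4_870, "GATK/VQSR": 351_904}


def fold_reduction(counts, improved="RTG trio"):
    """Ratio of each call set's MIE count to the count of the improved call set."""
    baseline = counts[improved]
    return {name: n / baseline for name, n in counts.items() if name != improved}


folds = fold_reduction(mie_counts)
for name, fold in folds.items():
    print(f"{name}: {fold:.1f}x the MIE count of RTG trio")
```

Both ratios land where the slide says: between 60X and 70X against singleton calling, and above 70X against GATK/VQSR.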

Page 13: Aug2013 real time genomics trio pedigree analysis

Pattern #1: Heterozygous variant

[Figure: pedigree diagram. NA12878 with parents NA12891 and NA12892 (labeled "Trio Calling"); NA12877 with parents NA12889 and NA12890; offspring NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893. Genotype labels as extracted: 0/1; 0/1, 0/0; offspring 0/0, 0/0, 0/0, 0/0, 0/0, 0/1, 0/1, 0/1, 0/1, 0/1]

Page 14: Aug2013 real time genomics trio pedigree analysis

Segregation of heterozygous variants

[Figure: four histograms (SNV, MNP, indel, All Variants) of variant count vs. number of offspring segregating (1 to 11)]

Segregation of NA12878 heterozygous variants called as family, GQ>50, homozygous reference in other parent.
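The quantity on the x-axis, the number of offspring segregating a variant, can be sketched as a simple count over the children's genotypes. The example below also gives the binomial expectation for a true heterozygous variant in one parent when the other parent is homozygous reference, assuming independent transmission with probability 0.5 to each of the 11 offspring. The genotype list is hypothetical, and the function names are illustrative rather than taken from any RTG tool.

```python
from math import comb


def offspring_segregating(child_gts):
    """Number of offspring whose genotype carries at least one alternate allele."""
    return sum(
        any(allele not in ("0", ".") for allele in gt.replace("|", "/").split("/"))
        for gt in child_gts
    )


def expected_fraction(k, n=11, p=0.5):
    """Binomial probability that exactly k of n offspring inherit the variant."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical genotypes for the 11 offspring at one site.
children = ["0/0"] * 5 + ["0/1"] * 6
k = offspring_segregating(children)
print(k, f"{expected_fraction(k):.3f}")
```

Under this model, variants seen in only one offspring or in all eleven are rare for a true heterozygous call, which is what makes the tails of the histogram informative about errors.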

Page 15: Aug2013 real time genomics trio pedigree analysis

Pattern #2: Homozygous-alt variant

[Figure: pedigree diagram. NA12878 with parents NA12891 and NA12892 (labeled "Trio Calling"); NA12877 with parents NA12889 and NA12890; offspring NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893. Genotype labels as extracted: 0/1; 1/1, 0/0; offspring 0/1 (10 labels)]

Page 16: Aug2013 real time genomics trio pedigree analysis

Segregation of homo-alt variants

[Figure: four histograms (SNV, MNP, indel, All Variants) of variant count vs. number of offspring segregating (1 to 11)]

Segregation of NA12878 homozygous alternative variants called as family, GQ>50, homozygous reference in other parent.

Page 17: Aug2013 real time genomics trio pedigree analysis

False positive estimate by segregation

GT Type                  All variants   SNV      MNP     indel
Het        TP (10-11)    123672         110262   693     12717
           FP (1-8)      1901           1000     47      854
           FP%           1.40%          0.88%    1.42%   5.67%
Homo-alt   TP (2-10)     373260         329642   2258    41360
           FP (1,11)     4457           3672     36      749
           FP%           1.18%          1.10%    1.57%   1.78%
Overall    TP            496932         439904   2951    54077
           FP            6358           4672     83      1603
           Overall FP%   1.26%          1.05%    2.74%   2.88%
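The Overall FP% row can be reproduced from the Overall TP and FP counts if FP% is taken as FP / (TP + FP). That formula is an inference from the numbers, not something the slide states. A minimal sketch:

```python
def fp_percent(tp, fp):
    """False positives as a percentage of all calls classified as TP or FP."""
    return 100.0 * fp / (tp + fp)


# Overall (TP, FP) counts as printed in the table.
overall = {
    "All variants": (496_932, 6_358),
    "SNV": (439_904, 4_672),
    "MNP": (2_951, 83),
    "indel": (54_077, 1_603),
}

for name, (tp, fp) in overall.items():
    print(f"{name}: {fp_percent(tp, fp):.2f}%")
```

The Homo-alt FP% row is reproduced the same way. The printed Het percentages are not, so that row may have been computed from a different denominator or a different set of counts.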

Page 18: Aug2013 real time genomics trio pedigree analysis

Data imputation by pedigree caller

•  For genomes with no data use population priors
   –  With care can iterate over offspring then each of parents independently
   –  Avoid exponential explosion so can do whole extended family in one calling step

Page 19: Aug2013 real time genomics trio pedigree analysis

Imputation of family members with no data

[Figure: True Positives plotted against False Positives on simulated data; series: 1 offspring, 2 offspring, 4 offspring, 4 offspring + father]

Page 20: Aug2013 real time genomics trio pedigree analysis

ROC vs NA12878 imputed baseline

RTG snpsimeval tool; SNV/indel/MNP; zygosity match

Page 21: Aug2013 real time genomics trio pedigree analysis

de novo mutation identification

–  Uses the parental genomes to identify & score de novo mutations in offspring
–  Greater than 7X improvement in precision to find de novo mutations vs. naïve methods

de novo Mutation Accuracy (NA12878)

Call set          de novo candidates   de novo germline*   de novo somatic*   TP/FP
Singleton calls   16,902               49 (100%)           941 (99%)          1:17
Trio calls        2,205                49 (100%)           941 (99%)          1:2.2

*Sensitivity vs. Conrad et al. (2011) validated dataset of germline and somatic cell line de novo mutations.

[Figure: trio pedigree inset with NA12878, NA12891, NA12892]
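The "greater than 7X" precision claim can be checked from the de novo table. The sketch below assumes that the 49 germline and 941 somatic mutations are the validated true positives recovered by each call set, and that precision is taken as validated true positives divided by de novo candidates. Under that reading, candidates not in the validated set count against precision even though some may be real but unvalidated.

```python
def precision(validated_tp, candidates):
    """Fraction of de novo candidates that are validated true positives."""
    return validated_tp / candidates


validated = 49 + 941  # germline + somatic validated mutations recovered

singleton = precision(validated, 16_902)
trio = precision(validated, 2_205)

print(f"singleton precision: {singleton:.1%}")
print(f"trio precision:      {trio:.1%}")
print(f"improvement:         {trio / singleton:.1f}x")
```

With these assumptions the candidate-to-TP ratios are about 17 and 2.2, matching the table's 1:17 and 1:2.2, and the improvement comes out a little above 7.5X.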

Page 22: Aug2013 real time genomics trio pedigree analysis

Status

•  Working through the complete trio datasets for producing joint pedigree calls for NA12878 trio
   –  Aiming for a trio call set and another that includes full Platinum pedigree data
   –  There is disproportionally more data for NA12878 than her parents or offspring
•  Comprehensive segregation analysis that includes all Mendelian patterns
•  Phasing analysis to identify variants that are inconsistent with transmitted phases

Page 23: Aug2013 real time genomics trio pedigree analysis

Issues

•  How to integrate pedigree calls with other data?
   –  Variants that segregate appropriately: candidates for inclusion in baseline
   –  Variants that don't segregate appropriately: candidates for removal from baseline
   –  Improvement of baseline genotypes using pedigree-based genotypes
•  Use of the imputed NA12878 baseline
•  Creation of a more inclusive baseline for ROC curves to compare new methods and select thresholds

Page 24: Aug2013 real time genomics trio pedigree analysis

Acknowledgements

•  RTG team at Hamilton, New Zealand
   –  Led by John Cleary, CTO
•  RTG team at San Bruno, CA
   –  Sahar Malakshah
   –  Minita Shah
   –  Brian Hilbush
•  Michael Eberle, Illumina, Inc. – Platinum Data
•  Justin Zook, NIST
•  1000 Genomes Project

© 2013 Real Time Genomics, Inc. All rights reserved. US Patent 7,640,256. Other patents pending. For research use only. Not for diagnostic applications.