Genome in a Bottle Workshop Justin Zook and Marc Salit NIST Genome-Scale Measurements Group JIMB September 16, 2016

Sept2016 plenary nist_intro


Page 1: Sept2016 plenary nist_intro

Genome in a Bottle Workshop

Justin Zook and Marc Salit

NIST Genome-Scale Measurements Group

JIMB

September 16, 2016

Page 2: Sept2016 plenary nist_intro

WELCOME

Page 3: Sept2016 plenary nist_intro
Page 4: Sept2016 plenary nist_intro

Today we’re releasing 4 new GIAB RM Genomes.

• PGP Human Genomes
– AJ son
– AJ trio
– Asian son

• Parents also characterized

• Available immediately

Page 5: Sept2016 plenary nist_intro

Today we’re releasing 4 new GIAB RM Genomes.

• New, reproducible methods applied to characterize high-confidence SNPs/indels in 85-90% of each genome

Page 6: Sept2016 plenary nist_intro

We’re also releasing a Microbial Genome RM

This Reference Material (RM) is intended for validation, optimization, process evaluation, and performance assessment of whole genome sequencing.

• Salmonella Typhimurium
• Pseudomonas aeruginosa
• Staphylococcus aureus
• Clostridium sporogenes

Page 7: Sept2016 plenary nist_intro
Page 8: Sept2016 plenary nist_intro

What’s JIMB?

• Joint Initiative for Metrology in Biology
– develop standards, methods, tools and measurement science
– make biology easier to engineer
– make reproducibility and reliability easier
• lower barriers to translation of innovation
• enable scaling through distribution of labor

Faculty
• Science
• Technology Development
• Innovation

NIST
• Metrology
• Standards Realization Lab
• Measurement Science

Trainees
• Postdocs
• Coursework
• Graduate Trainees

Commercial
• Customers
• Technology
• Metrology Training
• Workforce

Page 9: Sept2016 plenary nist_intro

is Genomics and Synthetic Biology.

DNA Read and Write.

Page 10: Sept2016 plenary nist_intro

Genome in a Bottle Consortium

Whole Genome Variant Calling

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates

• established consortium to develop reference materials, data, methods, performance metrics

• Characterized Pilot Genome NA12878

• Ashkenazim Trio, Asian son from PGP released today!

generic measurement process

Page 11: Sept2016 plenary nist_intro

Bringing Principles of Metrology to the Genome

• Reference materials
– DNA in a tube you can buy from NIST
– NA12878 pilot sample, now 2 PGP-sourced trios
• Extensive state-of-the-art characterization
– as good as we can get for small variants
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops

• Analysis of all samples ongoing as technology develops

• PGP genomes suitable for commercial derived products

• Developing benchmarking tools and software– with GA4GH

• Samples being used to develop and demonstrate new technology

Page 12: Sept2016 plenary nist_intro

We are liaising with…

• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference Consortium
• 1000 Genomes SV group
• CAP/CLIA
• Global Alliance for Genomics and Health Benchmarking Team
• ABRF
• FDA
• SEQC
• Global metrology system

Page 13: Sept2016 plenary nist_intro

Agenda

Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials

Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)

Please Note

Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).

Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.

Page 14: Sept2016 plenary nist_intro

NIST Reference Materials

Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Page 15: Sept2016 plenary nist_intro

Data for GIAB PGP Trios

Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly

Page 16: Sept2016 plenary nist_intro

Dataset AJ Son AJ Parents Chinese son Chinese parents NA12878

Illumina Paired-end X X X X X
Illumina Long Mate pair X X X X X
Illumina “moleculo” X X X X X
Complete Genomics X X X X X
Complete Genomics LFR X X X
Ion exome X X X X
BioNano X X X X
10X X X X
PacBio X X X
SOLiD single end X X X
Illumina exome X X X X
Oxford Nanopore X

Page 17: Sept2016 plenary nist_intro

Paper describing data…

• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab

Page 18: Sept2016 plenary nist_intro

[Figure: GIAB ftp site downloads/unique-IPs by month. x-axis: Month; y-axes: # downloads and # IPs]

Page 19: Sept2016 plenary nist_intro

Integration Methods to Establish Reference Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.

Page 20: Sept2016 plenary nist_intro

Integration Methods to Establish Reference Variant Calls

NEW: Reproducible integration pipeline with new calls for NA12878 and PGP Trios!

Page 21: Sept2016 plenary nist_intro

New calls (v3.3) vs. old calls (v2.19)

V3.3
• 3441361 match PG
• 550982 PG calls outside high conf
• 124715 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 40 calls not in PG
– 60 extra PG calls

V2.19
• 3030717 match PG
• 1018795 PG calls outside high conf
• 122359 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 87 calls not in PG
– 404 extra PG calls

Page 22: Sept2016 plenary nist_intro

New calls (v3.3) vs. old calls (v2.19)

More high-confidence calls match Platinum Genomes

Page 23: Sept2016 plenary nist_intro

New calls (v3.3) vs. old calls (v2.19)

Similar extra calls not in Platinum Genomes

Page 24: Sept2016 plenary nist_intro

New calls (v3.3) vs. old calls (v2.19)

~80% fewer differences from PG in high confidence regions
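
The ~80% figure can be checked from the counts on the v3.3 vs. v2.19 comparison slide. This is a minimal sketch that assumes "differences" means the calls not in PG plus the extra PG calls that remain after the exclusions; the variable and function names are illustrative and do not come from the GIAB pipeline.

```python
# Counts copied from the v3.3 vs. v2.19 comparison slide.
counts = {
    "v3.3": {"match_pg": 3441361, "not_in_pg": 40, "extra_pg": 60},
    "v2.19": {"match_pg": 3030717, "not_in_pg": 87, "extra_pg": 404},
}


def residual_differences(version):
    """Differences from PG left after the exclusions (assumed definition)."""
    c = counts[version]
    return c["not_in_pg"] + c["extra_pg"]


old = residual_differences("v2.19")
new = residual_differences("v3.3")
reduction = 1 - new / old
match_gain = counts["v3.3"]["match_pg"] / counts["v2.19"]["match_pg"] - 1

print(f"differences: {old} -> {new} ({reduction:.1%} fewer)")
print(f"calls matching PG: {match_gain:.1%} more")
```

Under that reading, the residual differences drop from 491 to 100, which is a reduction of about 79.6%.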

Page 25: Sept2016 plenary nist_intro

New calls (v3.3) vs. old calls (v2.19)

Example vcf (verily) Stratified

V3.3
• 17% of SNPs not assessed
– 23% of SNPs in RefSeq coding
– 53% of SNPs in “bad promoters”
• 78% of indels not assessed
– 0.7% difference rate
• 17% FP in regions homologous to decoy

V2.19
• 27% of SNPs not assessed
– 36% of SNPs in RefSeq coding
– 82% of SNPs in “bad promoters”
• 78% of indels not assessed
– 1.2% difference rate
• 0.2% FP in regions homologous to decoy
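
One way to read "not assessed" is the share of a query callset's variants that fall outside the high-confidence regions and so cannot be scored. The sketch below works under that assumption. The regions and positions are made-up toy data, and `fraction_not_assessed` is a hypothetical helper, not part of any GIAB or GA4GH tool.

```python
from bisect import bisect_right


def in_regions(pos, starts, ends):
    """True if pos lies in one of the sorted, non-overlapping [start, end) intervals."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def fraction_not_assessed(positions, regions):
    """Share of variant positions that fall outside the high-confidence regions."""
    regions = sorted(regions)
    starts = [s for s, _ in regions]
    ends = [e for _, e in regions]
    outside = sum(not in_regions(p, starts, ends) for p in positions)
    return outside / len(positions)


# Toy single-chromosome example: BED-style 0-based, half-open intervals,
# with variant positions assumed to be 0-based as well.
high_conf = [(100, 200), (300, 450)]
query_snps = [120, 199, 200, 250, 310, 449, 450, 900]

not_assessed = fraction_not_assessed(query_snps, high_conf)
print(f"{not_assessed:.0%} of SNPs not assessed")
```

Running the same calculation per stratum (for example, only positions inside coding regions) would give per-context figures like those on the slide.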

Page 26: Sept2016 plenary nist_intro

Principles of Integration Process

• Form sensitive variant calls from each dataset

• Define “callable regions” for each callset

• Filter calls from each method with annotations unlike concordant calls

• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis

• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
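
The first two principles can be illustrated with a toy sketch. It assumes each callset is a mapping of position to alternate allele plus a list of callable intervals, and it keeps a site only when every callset that is callable there makes the same call. The annotation-based filtering and arbitration steps are omitted. All names and data are hypothetical, and this is not the GIAB integration pipeline itself.

```python
def callable_at(pos, intervals):
    """True if pos falls inside any half-open [start, end) callable interval."""
    return any(start <= pos < end for start, end in intervals)


def integrate(callsets):
    """Return {pos: alt} for sites where all callable callsets agree.

    callsets maps a name to {"calls": {pos: alt}, "callable": [(start, end), ...]}.
    """
    candidates = set()
    for cs in callsets.values():
        candidates.update(cs["calls"])  # union of the sensitive calls

    high_conf = {}
    for pos in sorted(candidates):
        # Only callsets that consider the site callable get a vote.
        votes = [
            cs["calls"].get(pos)
            for cs in callsets.values()
            if callable_at(pos, cs["callable"])
        ]
        if votes and all(v is not None and v == votes[0] for v in votes):
            high_conf[pos] = votes[0]
    return high_conf


callsets = {
    "short_reads": {"calls": {10: "A", 25: "T", 40: "G"}, "callable": [(0, 50)]},
    "long_reads": {"calls": {10: "A", 40: "C", 70: "T"}, "callable": [(0, 20), (35, 100)]},
}

print(integrate(callsets))
```

In this toy data, position 40 is dropped because the two callsets disagree where both are callable, while positions 25 and 70 are kept because only one callset is callable at each.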

Page 27: Sept2016 plenary nist_intro

Criteria for including new callsets

• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Good coverage and MapQ
• Use knowledge about technology and manual inspection to exclude repetitive regions difficult for each dataset
• For new callsets, ensure most FNs in callable regions relative to current high-confidence calls are questionable in the current calls
• Filter calls from each method with annotations unlike concordant calls
– Annotations for which outliers are expected to indicate bias should be selected for each callset

Page 28: Sept2016 plenary nist_intro

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools
• Integrated into a single framework with standardized inputs and outputs

• Standardized bed files with difficult genome contexts for stratification

Credit: GA4GH, Abby Beeler, Ellie Wood

Stratification of FP Rates

Higher FP rates at Tandem Repeats

https://github.com/ga4gh/benchmarking-tools
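
Once TP, FP, and FN have been counted, the usual summary metrics follow directly, and stratification amounts to computing them separately per genome context. The sketch below uses the standard precision, recall, and F1 formulas. The stratum names and counts are invented for illustration and are not GIAB or GA4GH results, and the `Counts` class is a hypothetical helper rather than part of the benchmarking tools.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # query calls that match the benchmark
    fp: int  # query calls absent from the benchmark, inside benchmark regions
    fn: int  # benchmark calls missed by the query

    @property
    def precision(self):
        total = self.tp + self.fp
        return self.tp / total if total else float("nan")

    @property
    def recall(self):
        total = self.tp + self.fn
        return self.tp / total if total else float("nan")

    @property
    def f1(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else float("nan")


# Hypothetical per-stratum counts.
strata = {
    "all": Counts(tp=9900, fp=50, fn=100),
    "tandem_repeats": Counts(tp=450, fp=30, fn=50),
}

for name, c in strata.items():
    print(f"{name}: precision={c.precision:.4f} recall={c.recall:.4f} f1={c.f1:.4f}")
```

With these invented counts, precision is lower in the tandem-repeat stratum than genome-wide, which is the kind of pattern the stratified plot is meant to expose.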

Page 29: Sept2016 plenary nist_intro

Benchmarking Tools

Standardized comparison, counting, and stratification with Hap.py + vcfeval

https://precision.fda.gov/ https://github.com/ga4gh/benchmarking-tools

Page 30: Sept2016 plenary nist_intro

Microbial Genomic RM Characterization

PEPR Workflow

https://github.com/usnistgov/pepr

Page 31: Sept2016 plenary nist_intro

Acknowledgements

• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack

Page 32: Sept2016 plenary nist_intro

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser

Data: http://www.nature.com/articles/sdata201625

Global Alliance Benchmarking Team– https://github.com/ga4gh/benchmarking-tools

Public workshops – Next one Sep 15-16 at NIST, MD, USA

NIST postdoc opportunities available!

Justin Zook: [email protected]

Marc Salit: [email protected]

NIST Microbial RMs talk by Jason Kralj tomorrow at 5:15pm

Page 33: Sept2016 plenary nist_intro

Clinical Genome Sequencing Process

Preanalytical

Sequencing

Sequence Bioinformatics

Functional Variant Annotation

Clinical Variant Knowledgebase

Query

Clinical Interpretation Reporting

EHR Archival

Page 34: Sept2016 plenary nist_intro

What is the standards architecture to demonstrate safety and efficacy?

Page 35: Sept2016 plenary nist_intro

Analytical/Technical Performance Assessment

Page 36: Sept2016 plenary nist_intro