
GIAB ASHG 2017


Introduction

Genome in a Bottle: Towards benchmark structural variant calls

Justin Zook (1,2), Lesley Chapman (1,2), Noah Spies (1,2), Marc Salit (1,2), and the Genome in a Bottle Consortium

(1) Genome-Scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD and Stanford, CA

(2) Joint Initiative for Metrology in Biology, Stanford, CA

• NIST has hosted the Genome in a Bottle Consortium to develop well-characterized, whole human genome reference samples that are an enduring resource for benchmarking variant calls

Integrating data to form benchmark calls

Additional collection of public data underway!
• PacBio Sequel of GIAB Chinese Trio
• Oxford Nanopore ultralong reads of AJ Trio
• New collaborations to characterize difficult regions and variants in these genomes are welcome! Email [email protected] if you’re interested

Challenges

Current focus on variants >=20bp

Zook et al., Scientific Data, 2016. ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data

Developing Manual Curation & Visualization Tools

2012
• No human benchmark calls available
• GIAB Consortium formed

2014
• Small variant genotypes for ~77% of pilot genome NA12878

2015
• NIST releases first human genome Reference Material

2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38

2017+
• Characterizing difficult variants

Discover
• Discover sequence-resolved calls from multiple datasets & analyses
• Using indel and SV callers
• From short, long, and linked reads

Refine sequence
• Obtain sequence-resolved calls in candidate regions
• SVrefine uses global de novo assemblies from short, long, and linked reads
• Seven Bridges graph and Spiral Genetics use short read graphs

Compare inputs
• Compare variant and genotype calls from different methods
• Using SVanalyzer to measure similarity of predicted sequence change
• Accounts for differing representations in tandem repeats
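The comparison step above can be illustrated with a minimal sketch. This is not SVanalyzer's implementation; it assumes each call is a simple insertion given as a position and an inserted string, and the helpers `edit_distance`, `apply_insertion`, and `relative_distance` are hypothetical names introduced here. The point it shows is why tandem repeats need care: the same event reported one base to the right yields a rotated inserted sequence, so comparing the inserted sequences directly overstates the difference, while comparing the resulting haplotypes does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def apply_insertion(ref: str, pos: int, inserted: str) -> str:
    """Build the ALT haplotype by inserting `inserted` before ref[pos]."""
    return ref[:pos] + inserted + ref[pos:]


def relative_distance(seq_a: str, seq_b: str) -> float:
    """Edit distance scaled by the longer sequence (0.0 means identical)."""
    longest = max(len(seq_a), len(seq_b))
    return edit_distance(seq_a, seq_b) / longest if longest else 0.0


# Toy reference with a CAG tandem repeat (illustrative data, not from GIAB).
ref = "GG" + "CAG" * 3 + "TT"
hap_a = apply_insertion(ref, 2, "CAG")  # caller A: insertion at the repeat start
hap_b = apply_insertion(ref, 3, "AGC")  # caller B: same event, shifted by 1 bp

print(relative_distance("CAG", "AGC"))  # inserted sequences look different
print(relative_distance(hap_a, hap_b))  # haplotypes are identical
```

Comparing haplotypes over a shared window is one simple way to make the measure independent of where a caller places the event inside the repeat.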

Evaluate/genotype
• svviz – aligns reads to REF or ALT to produce GT and visualization
• BioNano – compares the size of discovered calls to our INS/DEL >1kb
• Nabsys – aligns nanopore maps to REF or ALT and uses an SVM to classify deletions >300bp as true or false
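As a rough illustration of turning REF/ALT read support into a genotype, the sketch below picks the genotype with the highest binomial likelihood. It is an assumption-laden toy, not svviz's actual model: the function name `genotype_from_counts`, the 5% error rate, and the expected ALT fractions are all hypothetical choices made for this example.

```python
import math


def genotype_from_counts(ref_count: int, alt_count: int,
                         error_rate: float = 0.05) -> str:
    """Return the most likely genotype given reads assigned to REF or ALT.

    Assumes (hypothetically) that ALT-supporting reads appear at a fraction
    of `error_rate` for 0/0, 0.5 for 0/1, and 1 - error_rate for 1/1.
    """
    if ref_count + alt_count == 0:
        return "./."  # no informative reads, so no call
    expected_alt_fraction = {
        "0/0": error_rate,
        "0/1": 0.5,
        "1/1": 1.0 - error_rate,
    }

    def log_likelihood(p: float) -> float:
        # Binomial log-likelihood without the constant combinatorial term.
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    return max(expected_alt_fraction,
               key=lambda gt: log_likelihood(expected_alt_fraction[gt]))


print(genotype_from_counts(20, 0))
print(genotype_from_counts(10, 11))
print(genotype_from_counts(1, 18))
```

A real genotyper would also weigh mapping quality and ambiguous reads; this sketch only shows the counting logic.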

Identify features
• Identify features associated with reliability of calls from each method
• e.g., counts of reads supporting REF & ALT, # of mismatches, clipping
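The kinds of features listed above could be summarized per call roughly as follows. This is a sketch under stated assumptions: the `ReadEvidence` record, its field names, and the `call_features` helper are hypothetical, and each read is assumed to have already been assigned to REF or ALT by an upstream tool. Only the CIGAR operation letters (with `S` and `H` marking soft and hard clipping) come from the SAM format.

```python
import re
from dataclasses import dataclass

# SAM CIGAR operations: a length followed by one of MIDNSHP=X.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class ReadEvidence:
    allele: str      # "REF", "ALT", or "AMBIGUOUS" (hypothetical labels)
    cigar: str       # CIGAR string of the read's best alignment
    mismatches: int  # mismatch count for that alignment


def clipped_bases(cigar: str) -> int:
    """Total soft- or hard-clipped bases in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "SH")


def call_features(reads: list) -> dict:
    """Summarize support, mismatches, and clipping for one candidate call."""
    ref = [r for r in reads if r.allele == "REF"]
    alt = [r for r in reads if r.allele == "ALT"]
    informative = ref + alt  # ambiguous reads are ignored
    n = len(informative)
    return {
        "ref_count": len(ref),
        "alt_count": len(alt),
        "mean_mismatches": sum(r.mismatches for r in informative) / n if n else 0.0,
        "clipped_read_fraction":
            sum(1 for r in informative if clipped_bases(r.cigar) > 0) / n if n else 0.0,
    }


reads = [
    ReadEvidence("REF", "100M", 0),
    ReadEvidence("ALT", "20S80M", 2),
    ReadEvidence("ALT", "90M10S", 1),
    ReadEvidence("AMBIGUOUS", "100M", 5),
]
features = call_features(reads)
print(features)
```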

Form benchmark
• Initially using heuristics like support from multiple technologies or callsets
• Longer term using machine learning with many more features and training data from manual curation

Compare
• Compare benchmark set to high-quality callsets and examine differences
• Receive feedback from the community and iterate to improve calls

Draft calls and README at http://tinyurl.com/GIABSV0-4-0

• Large sequence-resolved insertions
  • Many fewer multi-kb insertions than multi-kb deletions
• Dense calls
  • ~1/3 of v0.4.0 calls are within 1kb of a different call
  • Sequence-resolved insertion size doesn’t always match BioNano
  • Phasing will be important for these
• Calls with inaccurate or incomplete sequence change
  • Exploring training a model to predict sequence accuracy
• Homozygous Reference calls
  • Can we definitively state there is no SV?
• Benchmarking tool development
  • How to compare SVs to a benchmark?
  • What performance metrics are important?
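The "dense calls" observation in the list above (calls within 1kb of a different call) can be measured with a short sketch like the one below. It assumes calls are given as `(chrom, start, end)` tuples; the function name `fraction_near_neighbor` and the toy coordinates are hypothetical and are not taken from the v0.4.0 callset.

```python
from collections import defaultdict


def fraction_near_neighbor(calls, max_gap=1000):
    """Fraction of calls lying within `max_gap` bp of a different call.

    `calls` is an iterable of (chrom, start, end) tuples. The gap between
    two calls is the distance between their nearest edges; overlapping
    calls have a gap of zero or less and therefore count as near.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    total = sum(len(v) for v in by_chrom.values())
    near = 0
    for intervals in by_chrom.values():
        intervals.sort()
        max_prev_end = None  # furthest end among calls starting earlier
        for i, (start, end) in enumerate(intervals):
            near_left = max_prev_end is not None and start - max_prev_end <= max_gap
            near_right = (i + 1 < len(intervals)
                          and intervals[i + 1][0] - end <= max_gap)
            if near_left or near_right:
                near += 1
            max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return near / total if total else 0.0


toy_calls = [
    ("chr1", 1000, 1050),
    ("chr1", 1500, 1600),    # 450 bp from the previous call
    ("chr1", 10000, 10300),  # isolated
    ("chr2", 1200, 1300),    # only call on its chromosome
]
print(fraction_near_neighbor(toy_calls))
```

Tracking the furthest previous end, rather than only the immediately preceding call, keeps the check correct when a long call spans several shorter ones.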

• Display images from svviz, dotplots, IGV, gEVAL, etc. in web interface to “crowd-source” curation of GIAB benchmark calls

https://github.com/svviz/svviz

Tandem Repeat Overlap?

>1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio

>500k unique sequence-resolved calls

30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support

28k INS and 29k DEL genotyped by svviz in 1+ individuals

v0.4.0
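The support criterion stated above (2+ techs, or 5+ callers, or BioNano/Nabsys support) can be written as a small filter. This is only a sketch of that heuristic, not the pipeline used to build v0.4.0: the `CandidateSV` record and its field names are hypothetical, and it assumes the sets already contain only technologies and callers whose predicted sequences are <20% different from one another.

```python
from dataclasses import dataclass


@dataclass
class CandidateSV:
    sv_type: str                # "INS" or "DEL"
    technologies: frozenset     # techs with a concordant sequence prediction
    callers: frozenset          # callers with a concordant sequence prediction
    map_support: bool = False   # supported by BioNano or Nabsys


def passes_support_filter(sv: CandidateSV,
                          min_techs: int = 2,
                          min_callers: int = 5) -> bool:
    """Keep a call with enough technologies, enough callers, or map support."""
    return (len(sv.technologies) >= min_techs
            or len(sv.callers) >= min_callers
            or sv.map_support)


# Toy candidates; the caller names are placeholders, not real callsets.
candidates = [
    CandidateSV("INS", frozenset({"PacBio", "Illumina"}), frozenset({"c1", "c2"})),
    CandidateSV("DEL", frozenset({"Illumina"}), frozenset({"c1"})),
    CandidateSV("DEL", frozenset({"Illumina"}), frozenset({"c1"}), map_support=True),
    CandidateSV("INS", frozenset({"PacBio"}),
                frozenset({"c1", "c2", "c3", "c4", "c5"})),
]
kept = [sv for sv in candidates if passes_support_filter(sv)]
print(len(kept), "of", len(candidates), "candidates pass")
```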

Insertion size compared to BioNano

Size Distribution

• Short reads
  • Illumina paired end
  • Illumina 6kb mate-pair
  • Complete Genomics
• Linked reads
  • 10X Genomics Chromium
• Long reads
  • PacBio
  • Oxford Nanopore (coming late 2017!)
• Marker-based
  • BioNano Genomics
  • Nabsys

Public data for Ashkenazi Trio

[Figure: Relative distance from exact match, shown for Illumina local assembly, PacBio raw read, and PacBio consensus assembly]