GIAB ASHG 2017

Introduction

Genome in a Bottle: Towards benchmark structural variant calls

Justin Zook(1,2), Lesley Chapman(1,2), Noah Spies(1,2), Marc Salit(1,2), and the Genome in a Bottle Consortium

(1) Genome-Scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD and Stanford, CA

(2) Joint Initiative for Metrology in Biology, Stanford, CA

• NIST has hosted the Genome in a Bottle Consortium to develop well-characterized, whole human genome reference samples that are an enduring resource for benchmarking variant calls

Integrating data to form benchmark calls

Additional collection of public data underway!
• PacBio Sequel of GIAB Chinese Trio
• Oxford Nanopore ultralong reads of AJ Trio
• New collaborations to characterize difficult regions and variants in these genomes are welcome! Email [email protected] if you’re interested

Challenges

Current focus on variants >=20bp

Zook et al., Scientific Data, 2016. ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data

Developing Manual Curation & Visualization Tools

2012
• No human benchmark calls available
• GIAB Consortium formed

2014
• Small variant genotypes for ~77% of pilot genome NA12878

2015
• NIST releases first human genome Reference Material

2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38

2017+
• Characterizing difficult variants

Discover

• Discover sequence-resolved calls from multiple datasets & analyses
  • Using indel and SV callers
  • From short, long, and linked reads

Refine sequence

• Obtain sequence-resolved calls in candidate regions
  • SVrefine uses global de novo assemblies from short, long, and linked reads
  • Seven Bridges graph and Spiral Genetics use short read graphs

Compare inputs

• Compare variant and genotype calls from different methods
  • Using SVanalyzer to measure similarity of predicted sequence change
  • Accounts for differing representations in tandem repeats
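One way to read "similarity of predicted sequence change" is to apply each call to the reference and compare the resulting haplotypes instead of the call records themselves. The sketch below is not SVanalyzer's implementation. The function names, the simplified call encoding (0-based position, REF allele, ALT allele, no anchor base), and the normalisation by SV size are all assumptions made for illustration.

```python
def apply_variant(ref, pos, ref_allele, alt_allele):
    """Return the haplotype after replacing ref_allele at 0-based pos with alt_allele."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference at this position")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def relative_distance(ref, call_a, call_b):
    """Edit distance between the two predicted haplotypes, divided by the larger SV size.

    Each call is a (pos, ref_allele, alt_allele) tuple (hypothetical encoding).
    """
    hap_a = apply_variant(ref, *call_a)
    hap_b = apply_variant(ref, *call_b)
    size = max(abs(len(call_a[2]) - len(call_a[1])),
               abs(len(call_b[2]) - len(call_b[1])),
               1)
    return edit_distance(hap_a, hap_b) / size


# Toy tandem repeat: deleting the first or the third CAT copy gives the same haplotype.
reference = "GGG" + "CAT" * 4 + "TTT"
delete_first_copy = (3, "CAT", "")
delete_third_copy = (9, "CAT", "")
print(relative_distance(reference, delete_first_copy, delete_third_copy))
```

Because the comparison is made on haplotypes, two deletions reported at different positions inside the same tandem repeat come out as an exact match, which is the representation problem the bullet above refers to. On real data the comparison would be limited to a window around the calls, since the dynamic programme is quadratic in sequence length.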

Evaluate/genotype

• svviz – aligns reads to REF or ALT to produce GT and visualization
• BioNano – compares size of discovered calls to our INS/DEL >1kb
• Nabsys – aligns nanopore maps to REF or ALT and uses an SVM to classify deletions >300bp as true or false
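As a rough illustration of turning REF- and ALT-supporting read counts into a genotype, the sketch below applies simple allele-fraction cutoffs. It is not svviz's actual model. The function name, the minimum read count, and the 0.2/0.8 cutoffs are hypothetical values chosen only for the example.

```python
def genotype_from_support(ref_reads, alt_reads, min_reads=8, het_low=0.2, het_high=0.8):
    """Assign a diploid genotype from counts of reads that align better to REF or to ALT.

    All thresholds are illustrative assumptions, not values from the poster.
    """
    total = ref_reads + alt_reads
    if total < min_reads:
        return "./."  # too few informative reads to call
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        return "0/0"  # homozygous reference
    if alt_fraction > het_high:
        return "1/1"  # homozygous variant
    return "0/1"      # heterozygous


for counts in [(20, 0), (10, 10), (1, 19), (2, 1)]:
    print(counts, genotype_from_support(*counts))
```

Reads that align equally well to both alleles carry no information for this purpose and would need to be excluded before counting.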

Identify features

• Identify features associated with reliability of calls from each method
  • e.g., counts of reads supporting REF & ALT, # of mismatches, clipping

Form benchmark

• Initially using heuristics like support from multiple technologies or callsets
• Longer term using machine learning with many more features and training data from manual curation

Compare

• Compare benchmark set to high-quality callsets and examine differences
• Receive feedback from the community and iterate to improve calls

Draft calls and README at http://tinyurl.com/GIABSV0-4-0

• Large sequence-resolved insertions
  • Many fewer multi-kb insertions than multi-kb deletions

• Dense calls
  • ~1/3 v0.4.0 calls are within 1kb of a different call
  • Sequence-resolved insertion size doesn’t always match BioNano
  • Phasing will be important for these

• Calls with inaccurate or incomplete sequence change
  • Exploring training a model to predict sequence accuracy

• Homozygous Reference calls
  • Can we definitively state there is no SV?

• Benchmarking tool development
  • How to compare SVs to a benchmark?
  • What performance metrics are important?
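The "dense calls" figure (about a third of v0.4.0 calls lying within 1kb of a different call) is the kind of statistic that can be recomputed from any callset. The following is a minimal sketch, assuming calls are given as (chromosome, start, end) tuples. The function name and the toy coordinates are hypothetical, and exact duplicate records would count as neighbours of each other.

```python
from collections import defaultdict


def fraction_near_another_call(calls, max_gap=1000):
    """Fraction of calls whose nearest other call on the same chromosome is within max_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    near = 0
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        max_prev_end = None  # furthest end among all earlier calls on this chromosome
        for i, (start, end) in enumerate(intervals):
            close_to_prev = max_prev_end is not None and start - max_prev_end <= max_gap
            close_to_next = i + 1 < len(intervals) and intervals[i + 1][0] - end <= max_gap
            if close_to_prev or close_to_next:
                near += 1
            total += 1
            max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return near / total if total else 0.0


toy_calls = [
    ("chr1", 1000, 1050),
    ("chr1", 1500, 1600),    # 450 bp after the previous call
    ("chr1", 10000, 10300),  # isolated
    ("chr2", 1200, 1250),    # only call on its chromosome
]
print(fraction_near_another_call(toy_calls))
```

Tracking the furthest end seen so far, not just the previous interval's end, keeps the check correct when a long call spans several shorter ones.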

• Display images from svviz, dotplots, IGV, gEVAL, etc. in web interface to “crowd-source” curation of GIAB benchmark calls

https://github.com/svviz/svviz

Tandem Repeat Overlap?

>1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio

>500k unique sequence-resolved calls

30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support

28k INS and 29k DEL genotyped by svviz in 1+ individuals

v0.4.0
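The support filter in the funnel above can be sketched as a simple predicate. This assumes the concordant calls have already been grouped (for example, predicted sequences less than 20% different) and reads the criterion as "2+ technologies, or 5+ callers, or BioNano/Nabsys support". That reading, the function name, and the input layout are assumptions for illustration, not the consortium's code.

```python
def passes_support_heuristic(supporting_calls, mapping_support=False,
                             min_techs=2, min_callers=5):
    """Decide whether a group of concordant calls has enough independent support.

    supporting_calls: iterable of (technology, caller) pairs for calls already
    judged to predict similar sequences (hypothetical input layout).
    mapping_support: True if BioNano or Nabsys data support the call.
    """
    supporting_calls = list(supporting_calls)
    techs = {tech for tech, _ in supporting_calls}
    callers = {caller for _, caller in supporting_calls}
    return len(techs) >= min_techs or len(callers) >= min_callers or mapping_support


# Hypothetical caller labels, used only to exercise the function.
group = [("Illumina", "caller_a"), ("PacBio", "caller_b")]
print(passes_support_heuristic(group))
```

Counting distinct technologies and distinct callers separately means five short-read callers can rescue a call that only one technology sees, while a single caller run on two technologies also passes.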

Insertion size compared to BioNano

Size Distribution

• Short reads
  • Illumina paired end
  • Illumina 6kb mate-pair
  • Complete Genomics

• Linked reads
  • 10X Genomics Chromium

• Long reads
  • PacBio
  • Oxford Nanopore (coming late 2017!)

• Marker-based
  • BioNano Genomics
  • Nabsys

Public data for Ashkenazi Trio

[Figure: Relative distance from exact match, for Illumina local assembly, PacBio raw read, and PacBio consensus assembly]
