Introduction
Genome in a Bottle: Towards benchmark structural variant calls
Justin Zook1,2, Lesley Chapman1,2, Noah Spies1,2, Marc Salit1,2, and the Genome in a Bottle Consortium
(1) Genome-Scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD and Stanford, CA
(2) Joint Initiative for Metrology in Biology, Stanford, CA
• NIST has hosted the Genome in a Bottle Consortium to develop well-characterized, whole human genome reference samples that are an enduring resource for benchmarking variant calls
Integrating data to form benchmark calls
Additional collection of public data underway!
• PacBio Sequel of GIAB Chinese Trio
• Oxford Nanopore ultralong reads of AJ Trio
• New collaborations to characterize difficult regions and variants in these genomes are welcome! Email [email protected] if you’re interested
Challenges
Current focus on variants ≥20 bp
Zook et al., Scientific Data, 2016. ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data
Developing Manual Curation & Visualization Tools
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
Discover
• Discover sequence-resolved calls from multiple datasets & analyses
  • Using indel and SV callers
  • From short, long, and linked reads
Refine sequence
• Obtain sequence-resolved calls in candidate regions
  • SVrefine uses global de novo assemblies from short, long, and linked reads
  • Seven Bridges graph and Spiral Genetics use short-read graphs
Compare inputs
• Compare variant and genotype calls from different methods
  • Using SVanalyzer to measure similarity of predicted sequence change
  • Accounts for differing representations in tandem repeats
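Why comparing the predicted sequence change copes with differing representations in tandem repeats can be illustrated with a small Python sketch. It is not SVanalyzer's implementation: the function names, the toy reference, and the three calls are hypothetical, and plain Levenshtein distance stands in for whatever similarity measure a real tool uses. The idea is to apply each call to the reference and compare the resulting haplotype sequences rather than the call records themselves.

```python
def apply_variant(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Replace ref_allele at 0-based position pos with alt_allele."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref_allele does not match the reference at pos")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


def relative_distance(hap_a: str, hap_b: str) -> float:
    """Edit distance scaled by the longer haplotype (0.0 means identical)."""
    longest = max(len(hap_a), len(hap_b))
    return edit_distance(hap_a, hap_b) / longest if longest else 0.0


# Toy reference containing a (CA)x4 tandem repeat (hypothetical data).
reference = "GGT" + "CA" * 4 + "TTG"

# Two callers report the same extra repeat unit at different positions,
# one as an inserted "CA" and the other as an inserted "AC".
call_a = (2, "T", "TCA")
call_b = (3, "C", "CAC")
# A third call inserts unrelated sequence at the same site as call_a.
call_c = (2, "T", "TGG")

hap_a = apply_variant(reference, *call_a)
hap_b = apply_variant(reference, *call_b)
hap_c = apply_variant(reference, *call_c)

print(hap_a == hap_b)                    # the two representations agree
print(relative_distance(hap_a, hap_c))   # the unrelated insertion does not
```

Comparing haplotypes rather than positions and alleles means that two callsets do not need to be normalized to the same left- or right-aligned representation before they are compared.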
Evaluate/genotype
• svviz – aligns reads to REF or ALT to produce GT and visualization
• BioNano – compare size of discovered calls to our INS/DEL >1 kb
• Nabsys – aligns nanopore maps to REF or ALT and uses an SVM to classify deletions >300 bp as true or false
Identify features
• Identify features associated with reliability of calls from each method
  • e.g., counts of reads supporting REF & ALT, # of mismatches, clipping
Form benchmark
• Initially using heuristics like support from multiple technologies or callsets
• Longer term using machine learning with many more features and training data from manual curation
Compare
• Compare benchmark set to high-quality callsets and examine differences
• Receive feedback from the community and iterate to improve calls
Draft calls and README at http://tinyurl.com/GIABSV0-4-0
• Large sequence-resolved insertions
  • Many fewer multi-kb insertions than multi-kb deletions
• Dense calls
  • ~1/3 of v0.4.0 calls are within 1 kb of a different call
  • Sequence-resolved insertion size doesn’t always match BioNano
  • Phasing will be important for these
• Calls with inaccurate or incomplete sequence change
  • Exploring training a model to predict sequence accuracy
• Homozygous Reference calls
  • Can we definitively state there is no SV?
• Benchmarking tool development
  • How to compare SVs to a benchmark?
  • What performance metrics are important?
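The "dense calls" statistic above (the share of calls lying within 1 kb of a different call) could be estimated from any callset with a sketch like the following. It is a minimal illustration under stated assumptions: calls are given as hypothetical (chromosome, start, end) tuples, "within 1 kb" is taken to mean that the gap between the two intervals is at most 1000 bp, and the example coordinates are invented rather than taken from v0.4.0.

```python
from collections import defaultdict


def fraction_near_another_call(calls, max_gap=1000):
    """Fraction of calls whose interval lies within max_gap bp of another call.

    calls: iterable of (chrom, start, end) tuples with start <= end.
    Overlapping calls count as near each other (their gap is negative).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    near = 0
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        total += len(intervals)
        max_end_before = None  # furthest end among calls that start earlier
        for i, (start, end) in enumerate(intervals):
            close_to_previous = (
                max_end_before is not None and start - max_end_before <= max_gap
            )
            # Among later calls, the next one in sorted order starts first,
            # so it is the closest on the right-hand side.
            close_to_next = (
                i + 1 < len(intervals) and intervals[i + 1][0] - end <= max_gap
            )
            if close_to_previous or close_to_next:
                near += 1
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return near / total if total else 0.0


# Invented example: two neighbouring calls and two isolated ones.
example_calls = [
    ("chr1", 1000, 1050),
    ("chr1", 1500, 1600),
    ("chr1", 10000, 10300),
    ("chr2", 1200, 1250),
]
print(fraction_near_another_call(example_calls))
```

Tracking the furthest end seen so far, rather than only the immediately preceding call, keeps the check correct when a long earlier call extends past shorter ones that start after it.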
• Display images from svviz, dotplots, IGV, gEVAL, etc. in web interface to “crowd-source” curation of GIAB benchmark calls
https://github.com/svviz/svviz
Tandem Repeat Overlap?
>1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
>500k unique sequence-resolved calls
30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
28k INS and 29k DEL genotyped by svviz in 1+ individuals
v0.4.0
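The support heuristic in the filtering steps above can be written down as a small predicate. This is only a sketch of one reading of the rule (keep a call if 2+ technologies, or 5+ callers, predict sequences less than 20% different, or if BioNano/Nabsys supports it); the `CandidateCall` record, its field names, and the example calls are hypothetical and do not describe the actual GIAB pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateCall:
    """Hypothetical record for one clustered, sequence-resolved call."""
    call_id: str
    sv_type: str                                      # "INS" or "DEL"
    technologies: set = field(default_factory=set)    # techs with a concordant sequence
    callers: set = field(default_factory=set)         # callers with a concordant sequence
    marker_support: bool = False                      # BioNano or Nabsys support


def passes_heuristic(call: CandidateCall, min_techs: int = 2, min_callers: int = 5) -> bool:
    """True if the call meets any one of the three support criteria.

    'Concordant' is assumed to mean the predicted sequences differ by <20%;
    that comparison is taken as already done when the sets were filled in.
    """
    return (
        len(call.technologies) >= min_techs
        or len(call.callers) >= min_callers
        or call.marker_support
    )


# Invented candidates illustrating each criterion.
candidates = [
    CandidateCall("c1", "DEL", {"Illumina", "PacBio"}, {"callerA", "callerB"}),
    CandidateCall("c2", "INS", {"PacBio"}, {"callerA"}),
    CandidateCall("c3", "DEL", {"Illumina"}, {"callerA"}, marker_support=True),
    CandidateCall("c4", "INS", {"Illumina"}, {"a", "b", "c", "d", "e"}),
]
kept = [c.call_id for c in candidates if passes_heuristic(c)]
print(kept)
```

Keeping the thresholds as parameters makes it straightforward to see how the retained set changes as the support requirements are tightened or relaxed.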
Insertion size compared to BioNano
Size Distribution
• Short reads
  • Illumina paired end
  • Illumina 6kb mate-pair
  • Complete Genomics
• Linked reads
  • 10X Genomics Chromium
• Long reads
  • PacBio
  • Oxford Nanopore (coming late 2017!)
• Marker-based
  • BioNano Genomics
  • Nabsys
Public data for Ashkenazi Trio
Relative distance from exact match (Illumina local assembly, PacBio raw read, PacBio consensus assembly)