GIAB Update and Roadmap Jan 25, 2018

What’s Genome in a Bottle?

• Authoritative Characterization of Human Genomes

– enduring commitment to resource availability

    • Samples

    • Data

– widely available open resources

– no restrictions on use or distribution

• Enable technology and tool-building with benchmark samples and methods for…

– development

– optimization

– demonstration

• Germline samples available now

• Developing capacity for somatic sample development

What GIAB Isn’t

• Population genetics

• Disease-specific

• Many clinical samples

• Non-human

• Genome, not transcriptome, epigenome, proteome, metabolome…

Prior workshop takeaways

http://jimb.stanford.edu/giabworkshops

UPDATE SINCE SEPTEMBER ’16

Benchmarking strides forward

• Draft GA4GH manuscript describing best practices for benchmarking germline small variants

– >15 co-authors actively editing manuscript

• Robust, sophisticated benchmarking tools publicly available

– GitHub

– PrecisionFDA

High-confidence calls are in use

• 286 citations of 2014 paper

• PrecisionFDA challenges

• Clinical labs

• Demonstration of new variant callers

https://blog.dnanexus.com/2017-12-05-evaluating-deepvariant-googles-machine-learning-variant-caller/

Clinical Community adopting GIAB

• Justin is serving a 2-year term on the Association for Molecular Pathology (AMP) Clinical Practice Committee

• Monica Basehore appointed GIAB/AMP liaison

• GIAB-derived products meeting needs in clinical labs

– AMP RM Forum hosted by CDC

    • Somatic variants

    • ctDNA

    • Difficult variants

Derived Products are on the market

• 31 products from 3 companies now available based on GIAB PGP cell lines

• DNA + spike-ins

– Clinical variants

– Somatic variants

– Difficult variants

• FFPE

• ctDNA

Open data is being used!

• 82 citations of Scientific Data paper

GIAB Developing New Data

• 10X Genomics

– Chinese trio now available

• PacBio Sequel of Chinese trio with Mt Sinai

– Read insert N50: ~15kb

– 202Gb son and 98-116Gb on each parent

– Data undergoing QC

• BioNano

– New DLS labeling method

• Complete Genomics/BGI

– stLFR linked reads

• Oxford Nanopore

– NIST/Birmingham/Nottingham Ultra-long reads

    • Starting soon

    • Very preliminarily 80-90kb N50

        – Max reads >1Mb!

    • Current throughput will give ~40x total on AJ trio, but may improve

• Strand-seq

– Collaboration with Korbel lab
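
The read N50 and fold-coverage figures quoted for the PacBio and Oxford Nanopore data can be reproduced with a short calculation. The following is a minimal Python sketch, not a GIAB tool: the function names, the toy read lengths, and the 3.1 Gb human genome size are assumptions introduced here for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def fold_coverage(total_bases, genome_size=3.1e9):
    """Approximate fold coverage; genome_size is an assumed ~3.1 Gb."""
    return total_bases / genome_size


# Toy read lengths (hypothetical, chosen so the N50 is 15 kb).
toy_reads = [10_000, 15_000, 20_000, 5_000]
print(n50(toy_reads))                 # 15000
print(round(fold_coverage(202e9)))    # 65
```

Under the assumed 3.1 Gb genome size, 202Gb for the son corresponds to roughly 65x coverage, and 98-116Gb per parent to roughly 32-37x.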

Progress on New Samples

Germline Samples

• Performed rfmix ancestry analysis on 6 PGP individuals with WGS and cell lines

– 2 differently admixed Hispanic

– 1 76% African + 24% European

– 1 84% European + 15% South Asian

– 1 77% European + 21% East Asian

– 1 99% European

– PGP1

– 1 self-reported Chinese/Filipino

• Working on MTA for open dissemination

Somatic Samples

• Discussion Friday Morning

Goals for This Workshop

• Update consortium and onboard new members

• Review progress on SVs

• Demo Manual Curation App

• Learn about new methods for characterizing difficult regions

• Review, revisit, and update Principles for Dissemination of GIAB Samples

• Discuss plans for new Germline and Tumor Samples

Workshop Agenda

• THURSDAY, JANUARY 25, 2018

• 9:00 AM - 10:30 AM: Welcome, Onboarding, and GIAB Progress Update

• 10:30 AM - 11:00 AM: Break

• 11:00 AM - 12:15 PM: Training and Trial Manual Curation of Structural Variants

• 12:15 PM - 1:45 PM: Lunch

• 1:45 PM - 3:15 PM: Feedback about v0.5 Draft Structural Variant Benchmark Set

• 3:15 PM - 3:45 PM: Break

• 3:45 PM - 5:00 PM: New Approaches to Characterizing Difficult Variants and Regions

• 5:00 PM - 5:30 PM: Discussion of Future Work

• 5:30 PM - 6:30 PM: Happy Hour, Sponsored by PacBio and Invitae

• FRIDAY, JANUARY 26, 2018

• 9:00 AM - 10:45 AM: Panel Discussion about Principles for Dissemination of GIAB Samples

• 10:45 AM - 11:15 AM: Break

• 11:15 AM - 12:00 PM: Discussion about Future Germline and Tumor Samples

• 12:00 PM - 1:00 PM: Break

• 1:00 PM - 2:30 PM: GIAB Steering Committee Meeting

Steering Committee Agenda

• Roadmap

• Next 2-3 workshops

• Resourcing GIAB work

• Communications

– Best practices

– Ways of working together

• Liaisons

– HGSVC

– Clinical labs

• Samples, consents, repository relationships

– NIST RMs needed?

– Distribute all NIST RMs together

– Cells instead of DNA

– GIAB Imprimatur

• Research v. Standards-making

– Tool development vs reference sample development

GIAB Roadmap

• Develop open-access samples and data for broad uses in industry, academia, and government

• Convene community of experts to characterize genomes -> GIAB/NIST integrates results to form benchmarks

• Develop tools to calculate accurate and standardized performance metrics

Unique GIAB roles in genomics

Progress Update

• New sequencing with long and linked reads

• Developing plan for open access to GIAB materials

• Selecting samples from new ancestries

• Developing cancer samples for somatic benchmarking

• Stay for Friday morning!

Ongoing and Future Work

• Developing cancer samples for somatic benchmarking

• Draft publications about high-confidence calls and benchmarking methods

Progress Update

• Best methods agree on 99.9%+ of “easy” calls

• Evaluating “straw man” large indel/SV callsets

Ongoing and Future Work

• Characterize challenging 10-20% of genome

• New methods for reference characterization of somatic genomes

• Refining principled integration methods

• Assembly metrology

Progress Update

• GIAB Analysis Team focused on large indels and SVs

Ongoing and Future Work

• Individual collaborations exploring expanding calls for other variant types

Progress Update

• Released 3 “straw man” sequence-resolved benchmark callsets >=20bp

• Analysis Team gave critical feedback in each round

• v0.5.0 released Jan 2018

Ongoing and Future Work

• Evaluate v0.5.0

• Write manuscript

• Manual Curation

• Resolve clusters of variants

• Integrate new technologies and methods

Our SV Integration Strategy

Collect many candidate calls for AJ Trio

• Gather candidate calls from a variety of approaches

– Many technologies

    • Short, linked, and long reads

    • Optical and nanopore mapping

– Many approaches

    • Small variant callers

    • Structural variant callers

    • Local and global de novo assemblies

• Community submitted >1 million calls from 30+ methods using 5+ technologies

Refine/evaluate/genotype candidates

• Obtain sequence-resolved calls as often as possible using assembly-based approaches

• Compare sequence predictions of candidate calls and merge similar calls

• Determine raw data’s support of each sequence-resolved call and its genotype

Evolution of SV calls for AJ Trio

v0.2.0

• Only deletions

• Overlap and size-based clustering

• Output sites with multi-tech support

v0.3.0

• New calling methods

• Deletions and insertions

• Sequence-resolved calls

• Sequence-based clustering

• Output sites with multi-tech support

v0.4.0

• Include some single tech calls

• Evaluate read support to remove some false positives

• Add genotypes for trio

v0.5.0

• Better calling methods, especially for large insertions

• Include more single tech calls

• Add some phasing info

Future

• Resolve clusters of differing calls

• Improve phasing

• Add new data types

• Improve sequence resolution

• Collaborate with HGSVC?
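
The v0.2.0 approach listed above (deletions only, overlap and size-based clustering, output of sites with multi-tech support) can be illustrated with a small Python sketch. This is not the GIAB integration pipeline: the `Deletion` record, the greedy clustering rule, the 0.5 reciprocal-overlap and 0.7 size-ratio thresholds, and the example calls are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deletion:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive; assumed to be greater than start
    tech: str   # technology that produced the call


def reciprocal_overlap(a, b):
    """Shared length as a fraction of the longer-affected call (0.0 if disjoint)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def size_ratio(a, b):
    """Smaller deletion length divided by larger deletion length."""
    len_a, len_b = a.end - a.start, b.end - b.start
    return min(len_a, len_b) / max(len_a, len_b)


def cluster_calls(calls, min_overlap=0.5, min_size_ratio=0.7):
    """Greedily add each call to the first cluster holding a similar call."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if any(reciprocal_overlap(call, member) >= min_overlap
                   and size_ratio(call, member) >= min_size_ratio
                   for member in cluster):
                cluster.append(call)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([call])
    return clusters


def multitech_sites(clusters, min_techs=2):
    """Keep clusters supported by at least min_techs distinct technologies."""
    return [c for c in clusters if len({m.tech for m in c}) >= min_techs]


# Hypothetical candidate calls, for illustration only.
calls = [
    Deletion("chr1", 1000, 1500, "illumina"),
    Deletion("chr1", 1010, 1490, "pacbio"),
    Deletion("chr1", 5000, 5200, "illumina"),
    Deletion("chr1", 5050, 6000, "10x"),
]
clusters = cluster_calls(calls)
supported = multitech_sites(clusters)
print(len(clusters), len(supported))  # 3 1
```

In this toy input, only the first two calls agree closely enough in position and size to form a multi-technology site; the last two overlap but differ too much in size to be merged. Later versions in the list above replaced this kind of coordinate-based rule with sequence-based clustering of sequence-resolved calls.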

Progress Update

• Initiated discussions with several groups working on phasing and calling variants in difficult-to-map regions

• Similar data and methods used for both problems

Ongoing and Future Work

• Work with several groups developing new methods

• Integrate difficult-to-map variants into high-confidence calls

• Integrate phasing into high-confidence calls

Progress Update

• Initiated discussions with several groups working on short tandem repeats and complex variants

• Explored using RTG vcfeval and VarMatch to harmonize multiple VCFs for integration

Ongoing and Future Work

• Add STR methods into integration methods

• Test variant harmonization methods for integration

• Find collaborators for HLA and ALT loci characterization (e.g., graph-based methods, linked/long reads)

Progress Update

• Draft manuscript for v3.3.2 small variants

• Preliminary machine learning methods can reproduce SV genotypes from svviz

• Demo of SV manual curation web app (see next session!)

Ongoing and Future Work

• Crowd-source manual curation of SVs

• Use crowd-sourced labels for machine learning

Progress Update

• Using assemblies to call and refine structural variants

Ongoing and Future Work

• Need to develop integration methods for all types of somatic variants

• Need to develop methods to integrate and benchmark diploid assemblies

Progress Update

• GA4GH made available sophisticated, standardized tools for benchmarking small variants

• “Best practices” manuscript for small variant benchmarking

Ongoing and Future Work

• Develop new methods for structural variant benchmarking

• Develop new methods for somatic variant benchmarking

• Predict performance on clinically interesting variants

Benchmarking Best Practices Manuscript

• Focus on germline small variants

• Describe benchmark callsets

• Define performance metrics at different stringency levels

• Sophisticated comparison tools are important

• Stratify performance by variant type and genome context

• Tools available on GitHub and PrecisionFDA
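
As a rough illustration of the points above about performance metrics and stratification, the following Python sketch computes precision, recall, and F1 per stratum from true-positive, false-positive, and false-negative labels. It is a simplification written for this summary, not the GA4GH tooling: the function names, the single shared TP count, the stratum labels, and the toy records are assumptions.

```python
from collections import defaultdict


def performance_metrics(tp, fp, fn):
    """Precision, recall, and F1 from counts; NaN where undefined."""
    nan = float("nan")
    precision = tp / (tp + fp) if tp + fp else nan
    recall = tp / (tp + fn) if tp + fn else nan
    # The comparison is False when either value is NaN, giving NaN for F1.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"precision": precision, "recall": recall, "f1": f1}


def stratified_metrics(records):
    """Compute metrics per (variant type, genome context) stratum.

    records: iterable of (variant_type, context, outcome) tuples,
    where outcome is one of "TP", "FP", or "FN".
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for variant_type, context, outcome in records:
        counts[(variant_type, context)][outcome] += 1
    return {
        stratum: performance_metrics(c["TP"], c["FP"], c["FN"])
        for stratum, c in counts.items()
    }


# Made-up comparison outcomes, for illustration only.
records = (
    [("SNP", "easy", "TP")] * 8
    + [("SNP", "easy", "FP")] * 2
    + [("SNP", "easy", "FN")] * 2
    + [("indel", "homopolymer", "TP")] * 3
    + [("indel", "homopolymer", "FN")] * 3
)
for stratum, metrics in stratified_metrics(records).items():
    print(stratum, metrics)
```

Reporting metrics per stratum rather than genome-wide makes it visible when a method performs well on common, easy variants but poorly in a specific variant type or genome context.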

The road ahead...

2018

• Large variants

• Difficult small variants

• Phasing

2019

• Difficult large variants

• Somatic sample development

• Germline samples from new ancestries

2020+

• Diploid assembly

• Somatic structural variation

• Segmental duplications

• Centromere/telomere

• ...

Extra slides below

Outstanding work summary

• Many variant calls cannot be assessed by comparison to current benchmark callsets (>20% of SNPs, >50% of indels, ~100% of SVs outside our high-confidence regions)

– Currently mostly assessing “easy” things

• No broadly consented tumor-normal cell lines are available

• Benchmarking tools for SVs are not standardized

GIAB Roadmap

[Roadmap chart covering 2018, 2019, and 2020. Rows: Genome Measurement Science, Germline Samples, Somatic Samples, Benchmarking, Publications. The placement of items by row and year was lost in extraction; the chart items are listed below in extraction order.]

• IRB approval

• Strategy for cell line developer/distributor

• Using variant calls to benchmark assemblies

• Identify cell line developer/distributor

• Small repeats

• Difficult to map w/ phasing

• Initial large indels/SVs

• Challenging large indels/SVs

• Non-variant regions for large indels/SVs

• More difficult variant calls in all samples

• Machine learning to integrate indels/SVs

• Further automate arbitration/integration for new techs and difficult variants

• X/Y

• Complex variants

• Select samples for new ancestries

• Diploid assemblies are important part of integration

• SV comparison tools integrated into Benchmarking framework

• Benchmarking/new integrated calls

• SV Integration

• Machine learning

• Paper with HGSVC?

• Predict performance for clinical variants of interest

• Establish cell lines

• Characterize cell lines

• Develop integration methods for somatic variants

• Implement SV callers

• Ultralong read science