
Ashg2017 workshop schneider


Page 1: Ashg2017 workshop schneider

GRC/GIAB Workshop: Getting the Most from the Reference Assembly and Reference Materials
Oct 17: 1-4 pm

Valerie Schneider (NCBI): GRCh38 assembly basics and updates
Tina Lindsay (MGI): Reference-grade human assemblies
Karen Miga (UCSC): Centromere assemblies
BREAK (15 min)
Benedict Paten (UCSC): Building human variation graphs
Fritz Sedlazeck (BCM): Structural Variation Characterization Across the Human Genome and Populations
Justin Zook (NIST): GIAB benchmarks for difficult variants

Page 2: Ashg2017 workshop schneider

GRCh38 assembly basics and updates

Valerie Schneider, Ph.D.
NCBI

17 October 2017

https://genomereference.org

Page 3: Ashg2017 workshop schneider

https://genomereference.org

Twitter: @GenomeRef

Announcements: [email protected]

Page 4: Ashg2017 workshop schneider

Outline

• Assembly basics
• GRCh38 updates
• Taking advantage of the data

Page 5: Ashg2017 workshop schneider

Assembly Basics

Page 6: Ashg2017 workshop schneider

Reference Assembly Basics

[Figure legend: GRCh38 (reference); Other assemblies]
(For updated assemblies, only date of initial submission is counted)

Page 7: Ashg2017 workshop schneider
Page 8: Ashg2017 workshop schneider

Reference Assembly Basics

Sanger-sequenced, clone-based assembly
[Figure labels: BAC insert; BAC vector; GAPS]

• Shotgun sequence clone
• Assemble clone
• Finish (via PCR)

Minimal Clone Tiling Path
• Define consensus from switch points of adjacent clones

Ordering the Path
• Fingerprint maps
• Genetic linkage maps
• Radiation hybrid maps

Consequences:
• Highly contiguous
• High sequence accuracy (error rate <10^-5)
• Haploid mosaic
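A minimal sketch of how a consensus can be stitched from a clone tiling path at switch points, and why the result is a haploid mosaic. This is an illustration only, not the GRC's actual pipeline: the `Clone` class, the `stitch` function, the clone names, donors, and sequences are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clone:
    name: str      # hypothetical clone identifier
    donor: str     # hypothetical label for the library/individual of origin
    start: int     # 0-based path coordinate of the clone's first base
    sequence: str  # finished clone sequence


def stitch(path: List[Clone], switch_points: List[int]) -> Tuple[str, List[str]]:
    """Take each clone up to the switch point into its neighbor.

    switch_points[i] is the path coordinate where the consensus stops using
    path[i] and starts using path[i + 1]; it must fall inside their overlap.
    """
    if len(switch_points) != len(path) - 1:
        raise ValueError("need one switch point per adjacent clone pair")
    pieces, donors = [], []
    begin = path[0].start
    for i, clone in enumerate(path):
        clone_end = clone.start + len(clone.sequence)
        end = switch_points[i] if i < len(switch_points) else clone_end
        if not (clone.start <= begin <= end <= clone_end):
            raise ValueError(f"switch point outside clone {clone.name}")
        pieces.append(clone.sequence[begin - clone.start:end - clone.start])
        donors.append(clone.donor)
        begin = end
    return "".join(pieces), donors


# Three overlapping toy clones from two hypothetical donors.
path = [
    Clone("cloneA", "donor1", 0, "ACGTACGTAC"),
    Clone("cloneB", "donor2", 6, "GTACTTGGCA"),
    Clone("cloneC", "donor1", 12, "GGCATTAA"),
]
consensus, donors = stitch(path, switch_points=[8, 14])
print(consensus)  # ACGTACGTACTTGGCATTAA
print(donors)     # ['donor1', 'donor2', 'donor1']
```

Each stretch of the consensus comes from exactly one clone, so the finished sequence is haploid at every position but alternates between donors along its length.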

Page 9: Ashg2017 workshop schneider

Reference Assembly Basics

Coverage: the likelihood a base is sequenced
Lander and Waterman (1988) Genomics

Coverage    Sequenced    Not sequenced
1X          63%          37%
5X          99.4%        0.6%
10X         99.995%      0.005%

Contig N50: measure of contiguity. Half of the assembly is in contigs this length or greater.
Chaisson and Eichler (2015), with modification

[Figure labels: HuRef; SOAPdenovo NA12878; ALLPATHS NA12878; MHAP CHM1; AK1; HX1; NA12878_prelim]
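The two metrics on this slide can be sketched in a few lines. The coverage figures follow from the Poisson approximation associated with Lander and Waterman: at coverage c, the expected fraction of bases with no reads is e^-c. The function names and the contig lengths are hypothetical, chosen only for illustration.

```python
import math
from typing import Sequence


def fraction_unsequenced(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


def contig_n50(lengths: Sequence[int]) -> int:
    """Largest length L such that contigs of length >= L hold half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    print(f"{c}X: {missed:.3%} not sequenced, {1 - missed:.3%} sequenced")

# Hypothetical contig lengths (total 300): 80 + 70 reaches half, so N50 is 70.
print(contig_n50([80, 70, 50, 40, 30, 20, 10]))
```

The computed values are close to the slide's rounded percentages. Note that N50 says nothing about correctness: a misassembly that joins two contigs raises N50.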

Page 10: Ashg2017 workshop schneider

Reference Assembly Basics

Why all this matters:
• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Better annotation
• More robust mapping target

Page 11: Ashg2017 workshop schneider

Reference Assembly Basics

Today’s reference assembly does not represent:
1. The most common allele/haplotype
2. The longest allele/haplotype
3. The ancestral allele/haplotype

It represents the sequence available from the HGP

Page 12: Ashg2017 workshop schneider

Reference Assembly Basics

Reference assembly influence
[Figure labels: Sample; Ref Assembly; Gene1; Gene2; chromosome; variant]

• 75% off-target alignments
• 25% no alignment

PLoS Biology (Jul 5, 2011)
Slide Credit: Deanna Church

Page 13: Ashg2017 workshop schneider

Reference Assembly Basics

Original assembly model: compress into a consensus
[Figure labels: Sequences from haplotype 1; Sequences from haplotype 2; false gap; chromosome]

Current assembly model: represent both haplotypes
[Figure labels: chromosome; alt loci scaffold; alt scaffold; many; Gene1; Gene2; Sample; Reference]

Page 14: Ashg2017 workshop schneider

Reference Assembly Basics

GRCh38 (Dec. 2013)
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
• >150 genes only represented on alt loci
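Counts like the ones on this slide can be tallied from a tab-delimited assembly report. The sketch below assumes the NCBI-style layout, with the sequence role in the second column and the sequence length in the ninth; check this against the header of the file you actually download. The rows, names, and lengths are made up for illustration, and `summarize_roles` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Column positions are an assumption based on NCBI-style assembly reports;
# verify them against the header line of the real file.
ROLE_COL, LENGTH_COL = 1, 8


def summarize_roles(report_text: str) -> Dict[str, Tuple[int, int]]:
    """Return {sequence role: (number of sequences, total bases)}."""
    counts = defaultdict(int)
    bases = defaultdict(int)
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        role = fields[ROLE_COL]
        counts[role] += 1
        bases[role] += int(fields[LENGTH_COL])
    return {role: (counts[role], bases[role]) for role in counts}


# Hypothetical rows: names, accessions, and lengths are invented.
ROWS = [
    ("chrA", "assembled-molecule", "A", "Chromosome", "na", "=", "na", "Primary Assembly", "1000000", "na"),
    ("altA_1", "alt-scaffold", "A", "Chromosome", "na", "=", "na", "ALT_REF_LOCI_1", "400000", "na"),
    ("altA_2", "alt-scaffold", "A", "Chromosome", "na", "=", "na", "ALT_REF_LOCI_2", "150000", "na"),
    ("patchA_1", "fix-patch", "A", "Chromosome", "na", "=", "na", "PATCHES", "50000", "na"),
]
SAMPLE_REPORT = "# hypothetical header line\n" + "\n".join("\t".join(r) for r in ROWS)

summary = summarize_roles(SAMPLE_REPORT)
n_alt, alt_bases = summary["alt-scaffold"]
print(f"{n_alt} alt scaffolds, mean length {alt_bases / n_alt / 1000:.0f} kb")
```

Total alt scaffold length is not the same as novel sequence: most of an alt scaffold aligns to the chromosome, which is why 261 alt loci contribute only 3.6 Mb of novel sequence.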

Page 15: Ashg2017 workshop schneider

Reference Assembly Basics

GRCh38

• Closed gaps
• Targeted base fixes
• Corrected path errors
• Addition of missing paralogs
• Better representation of variation
• Better annotation
• Modeled centromeres
• Genome Research 27(5):849-864 (2017)
• PubMed: 28396521

• Changed coordinates
• Remapping challenges
• Alt Loci Usability
• Allelic duplication/Aligners
• Reporting multiple locations
• Variant analysis
• Clinical validation

[Figure: 2016 growth in SRA submission over prior year; series: GRCh38, GRCh37]

Page 16: Ashg2017 workshop schneider

Outline

• Assembly basics
• GRCh38 updates
• Taking advantage of the data

Page 17: Ashg2017 workshop schneider

GRCh38 Updates

GRCh38: Dec. 2013

(n=1797)

(n=1396) (n=401)

Page 18: Ashg2017 workshop schneider

GRCh38 Updates

(rare allele analysis)

Page 19: Ashg2017 workshop schneider

GRCh38 Updates

chromosome

novel patch scaffold

alt loci scaffold

chromosome

fix patch scaffold

Patch release: No change to chromosome coordinates
Assembly nomenclature: GRCh38.p$

GRCh38.p11
• 64 FIX, 59 NOVEL
• Added >1.5 Mb novel sequence
• >20 genes affected
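Because patch releases leave chromosome coordinates unchanged, tools can compare assembly names to decide whether coordinates are interchangeable. A minimal sketch, assuming names follow the major-release-plus-patch pattern shown on this slide (for example GRCh38.p11); the regular expression and the helper names are hypothetical.

```python
import re
from typing import NamedTuple


class AssemblyName(NamedTuple):
    major: str  # e.g. "GRCh38"
    patch: int  # 0 for the initial release


# Assumed naming pattern: GRC + species letter(s) + major number + optional .pN
_NAME = re.compile(r"^(GRC[a-z]+\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name: str) -> AssemblyName:
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    return AssemblyName(match.group(1), int(match.group(2) or 0))


def same_chromosome_coordinates(a: str, b: str) -> bool:
    """Patches do not change chromosome coordinates; major releases do."""
    return parse_assembly_name(a).major == parse_assembly_name(b).major


print(parse_assembly_name("GRCh38.p11"))
print(same_chromosome_coordinates("GRCh38", "GRCh38.p11"))      # True
print(same_chromosome_coordinates("GRCh37.p13", "GRCh38.p11"))  # False
```

This applies only to chromosome coordinates. Fix and novel patch scaffolds are separate sequences with their own coordinates, so a file that includes patch scaffolds still depends on the exact patch level.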

Page 20: Ashg2017 workshop schneider

GRCh38 Updates

GRCh38: 5S rRNA cluster under-represented (19 copies)

GRCh38 patch: 5S rRNA cluster valid representation (35 copies)

Poster 423F (11:30-12:30)
Updates to the human reference genome assembly
Tayebeh Rezaie

Page 21: Ashg2017 workshop schneider

GRCh38 Updates

• Ideals:
  • Chromosome context for any common human sequence >500 bp
  • Unambiguous data interpretation at all clinically relevant loci
  • No systematic error/bias in genome-wide analyses

• Real-World:
  • Community interest
  • Resources for curation

• GRCh39
  • Substantial added value
  • User must-haves

Page 22: Ashg2017 workshop schneider

Outline

• Assembly basics
• GRCh38 & updates
• Taking advantage of the data

Page 23: Ashg2017 workshop schneider

Accessing the Data

Assembly Stats

https://genomereference.org

Page 24: Ashg2017 workshop schneider

Accessing the Data

Page 25: Ashg2017 workshop schneider

Accessing the Data

Page 26: Ashg2017 workshop schneider

Accessing the Data

https://www.ncbi.nlm.nih.gov/genome/gdv/

Learn more about GDV:
• Data CoLab #159, Weds 10:30-11:00
• Poster 1531W, Weds 2:00-3:00

Page 27: Ashg2017 workshop schneider

Accessing the Data

Assembly Support Track Set

Page 28: Ashg2017 workshop schneider

Accessing the Data

http://www.ensembl.org/
GRC Tracks

Page 29: Ashg2017 workshop schneider

Accessing the Data

ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt

Page 30: Ashg2017 workshop schneider

Outline

• Assembly basics
• GRCh38 updates
• Taking advantage of the data

Page 31: Ashg2017 workshop schneider

Credits

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• Karyn Meltz Steinberg
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

GRC
Tina Graves-Lindsay
Tayebeh Rezaie
Kerstin Howe
Richard Durbin
Paul Flicek
Laura Clarke
Deanna Church
Curators!
Developers!