GRC/GIAB Workshop: Getting the Most from the Reference Assembly and Reference Materials
Oct 17: 1-4 pm
Valerie Schneider (NCBI): GRCh38 assembly basics and updates
Tina Lindsay (MGI): Reference-grade human assemblies
Karen Miga (UCSC): Centromere assemblies
BREAK (15 min)
Benedict Paten (UCSC): Building human variation graphs
Fritz Sedlazeck (BCM): Structural Variation Characterization Across the Human Genome and Populations
Justin Zook (NIST): GIAB benchmarks for difficult variants
GRCh38 assembly basics and updates
Valerie Schneider, Ph.D.
NCBI
17 October 2017
https://genomereference.org
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
Assembly Basics
Reference Assembly Basics
[Figure legend: GRCh38 (reference); other assemblies. For updated assemblies, only date of initial submission is counted.]
Sanger-sequenced, clone-based assembly
[Figure steps: BAC insert and BAC vector; shotgun sequence clone; assemble clone (gaps); finish (via PCR)]
Minimal Clone Tiling Path
Define consensus from switch points of adjacent clones
Consequences:
• Highly contiguous
• High sequence accuracy (<10^-5)
• Haploid mosaic
Ordering the Path
Fingerprint maps
Genetic linkage maps
Radiation hybrid maps
Reference Assembly Basics
HuRef
SOAPdenovo NA12878
ALLPATHS NA12878
Lander and Waterman (1988) Genomics
The likelihood a base is sequenced, by coverage:
Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
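The coverage figures can be approximated with a short calculation. This is a minimal sketch, assuming reads land uniformly at random so that the number of reads covering a base is Poisson-distributed; the function name fraction_unsequenced is illustrative and not from the slides.

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases covered by zero reads.

    Assumes a Poisson model with mean equal to the coverage,
    so P(zero reads) = e^(-coverage).
    """
    return math.exp(-coverage)


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    print(f"{c}X: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

Under this assumption the results agree with the values listed for 1X, 5X, and 10X up to rounding.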
Contig N50
MHAP CHM1
Chaisson and Eichler (2015), with modification
Measure of contiguity. Half of the assembly is in contigs this length or greater.
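The N50 definition can be turned into a few lines of code. This is a minimal sketch following the definition as stated; the function name contig_n50 and the contig lengths are hypothetical, chosen only for illustration.

```python
def contig_n50(lengths):
    """Return the contig N50 for a list of contig lengths.

    Walk the contigs from longest to shortest and return the length
    at which the running total first reaches half of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths (total = 300, half = 150).
example = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(example))  # prints 70, since 80 + 70 reaches 150
```

Because the value depends on total assembly size, N50 values are most comparable between assemblies of the same genome.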
Reference Assembly Basics
AK1
HX1
NA12878_prelim
Why all this matters:
• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Better annotation
• More robust mapping target
Reference Assembly Basics
Today’s reference assembly does not represent:
1. The most common allele/haplotype
2. The longest allele/haplotype
3. The ancestral allele/haplotype
It represents the sequence available from the HGP
Reference Assembly Basics
[Figure: Reference assembly influence. Labels: Sample, Ref Assembly, Gene1, Gene2]
Slide Credit: Deanna Church
Reference Assembly Basics
75% off-target alignments
25% no alignment
[Figure labels: chromosome; variant; sequences from haplotype 1; sequences from haplotype 2. Source: PLoS Biology (Jul 5, 2011)]
Reference Assembly Basics
Original assembly model: compress into a consensus
[Figure labels: chromosome, false gap]
Current assembly model: represent both haplotypes
[Figure labels: chromosome, alt loci scaffold]
[Figure labels: Sample (Gene1, Gene2); Reference (Gene1, Gene2; chromosome, alt scaffold)]
GRCh38 (Dec. 2013)
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
• >150 genes only represented on alt loci
Reference Assembly Basics
• Closed gaps
• Targeted base fixes
• Corrected path errors
• Addition of missing paralogs
• Better representation of variation
• Better annotation
• Modeled centromeres
• Genome Research 27(5):849-864 (2017)
• PubMed: 28396521
GRCh38
• Changed coordinates
• Remapping challenges
• Alt Loci Usability
• Allelic duplication/Aligners
• Reporting multiple locations
• Variant analysis
• Clinical validation
[Figure: 2016 growth in SRA submission over prior year, GRCh38 and GRCh37]
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
GRCh38 Updates
GRCh38: Dec. 2013
(n=1797)
(n=1396) (n=401)
GRCh38 Updates
(rare allele analysis)
GRCh38 Updates
chromosome
novel patch scaffold
alt loci scaffold
chromosome
fix patch scaffold
Patch release: No change to chromosome coordinates
Assembly nomenclature: GRCh38.p$
GRCh38.p11
• 64 FIX, 59 NOVEL
• Added >1.5 Mb novel sequence
• >20 genes affected
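Because patch releases leave chromosome coordinates unchanged, a pipeline can check whether two assembly names refer to the same coordinate system by comparing only the major version. This is a minimal sketch, assuming names of the form "GRCh38" or "GRCh38.p11"; the regular expression, the helper names, and the choice to treat an unpatched name as patch 0 are illustrative assumptions, not a GRC specification.

```python
import re

# Assumed name pattern: "GRCh" + major version, optionally ".p" + patch number.
ASSEMBLY_NAME = re.compile(r"^GRCh(\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name):
    """Split an assembly name into (major version, patch number)."""
    match = ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    major = int(match.group(1))
    # Assumption: a name without a patch suffix is treated as patch 0.
    patch = int(match.group(2)) if match.group(2) is not None else 0
    return major, patch


def same_chromosome_coordinates(name_a, name_b):
    """True when both names share a major version (patches keep coordinates)."""
    return parse_assembly_name(name_a)[0] == parse_assembly_name(name_b)[0]


print(parse_assembly_name("GRCh38.p11"))
print(same_chromosome_coordinates("GRCh38", "GRCh38.p11"))
print(same_chromosome_coordinates("GRCh37", "GRCh38"))
```

The check covers chromosome coordinates only. Patch scaffolds differ between patch releases, so analyses that use them still need the exact patch number.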
GRCh38 Updates
GRCh38: 5S rRNA cluster under-represented (19 copies)
GRCh38 patch: 5S rRNA cluster valid representation (35 copies)
Poster 423F (11:30-12:30)
Updates to the human reference genome assembly
Tayebeh Rezaie
GRCh38 Updates
• Ideals:
  • Chromosome context for any common human sequence >500 bp
  • Unambiguous data interpretation at all clinically relevant loci
  • No systematic error/bias in genome-wide analyses
• Real-World:
  • Community interest
  • Resources for curation
• GRCh39
  • Substantial added value
  • User must-haves
Outline
• Assembly basics
• GRCh38 & updates
• Taking advantage of the data
Accessing the Data
Assembly Stats
https://genomereference.org
Accessing the Data
https://www.ncbi.nlm.nih.gov/genome/gdv/
Learn more about GDV:
Data CoLab #159 Weds 10:30-11:00
Poster 1531W Weds 2:00-3:00
Accessing the Data
Assembly Support Track Set
Accessing the Data
ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
Outline
• Assembly basics
• GRCh38 updates
• Taking advantage of the data
Credits
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• Karyn Meltz Steinberg
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC
Tina Graves-Lindsay
Tayebeh Rezaie
Kerstin Howe
Richard Durbin
Paul Flicek
Laura Clarke
Deanna Church
Curators!
Developers!