Getting the Most from the Reference Assembly
Valerie Schneider, Ph.D., NCBI
6 October 2015
http://genomereference.org
Twitter: @[email protected]
Outline
• Assembly basics
• The assembly model
• GRCh38 & updates
• Taking advantage of the data
Reference Assembly Basics
Sims et al. (2014) Nat Rev Genet. 15(2):121-32
Coverage depths shown: 30x and 1x
Reference Assembly Basics
Lander and Waterman (1988) Genomics

Coverage    Sequenced    Not sequenced
1X          63%          37%
5X          99.4%        0.6%
10X         99.995%      0.005%
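The percentages above are consistent with the usual Poisson reading of the Lander and Waterman model, in which the expected fraction of bases left unsequenced at coverage c is about e^(-c). The slide does not show the formula, so treating it this way is an assumption; the function name below is illustrative only.

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases covered by no read.

    Assumes read starts are Poisson-distributed along the genome,
    so the chance that a base is missed at coverage c is exp(-c).
    """
    return math.exp(-coverage)


for c in (1, 5, 10):
    missed = fraction_unsequenced(c)
    print(f"{c}X: {missed:.4%} not sequenced, {1 - missed:.4%} sequenced")
```

The computed values (about 37%, 0.67%, and 0.0045% unsequenced) agree with the slide's figures up to rounding.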
Reference Assembly Basics
FINISHED?
BAC insert
BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
GAPS
“finishers” go in to manually fill the gaps, often by PCR
Clone-based assemblies
Reference Assembly Basics
Minimal Tiling Path
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
Reference Assembly Basics
Oct. 2014: 13 assemblies
Oct. 2015: 25 assemblies
Populations labeled in the figure: YRI, CEU, CEU, CHB
Reference Assembly Basics
Reads:     Sanger    Sanger    Illumina    Illumina    PacBio (older)    PacBio (newer)
Method:    clone     WGS       WGS         WGS         WGS               WGS
N50: a measure of continuity. Contigs of this length or greater contain half of the assembled bases.
Why all this matters:
• Longer haplotype blocks
• Fewer collapsed repeats & segmental duplications
• Improved annotation
• More robust mapping target
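N50 is conventionally computed over bases rather than contig counts: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. A minimal sketch, with made-up contig lengths used purely for illustration:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Walks contigs from longest to shortest and returns the length at
    which the running total first reaches half of the assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Illustrative lengths only (not taken from any real assembly).
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))
```

In this toy set the total is 300 bases, and the two longest contigs (80 + 70) already reach the halfway mark of 150, so the N50 is 70 even though most of the contigs are shorter than that.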
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both haplotypes
GRC Assembly Model
Assembly (e.g. GRCh38)
  Primary Assembly Unit
  Non-nuclear assembly unit (e.g. MT)
  ALT 1
  ALT 2
  ALT 3
  ALT 4
  ALT 5
  ALT 6
  ALT 7
  PAR
  Genomic Region (MHC)
  Genomic Region (UGT2B17)
  Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
GRC Assembly Model
Alt loci alignments are an integral part of the assembly model
Alignment to chr + scaffold sequence = Alt
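One way to picture the model is as a small data structure in which an alt locus is only complete when its scaffold sequence is paired with an alignment to the chromosome. The sketch below is an assumption-laden illustration: the class names, fields, region name, and coordinates are all hypothetical and do not reflect any GRC or NCBI schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alignment:
    """Placement of an alt scaffold on a chromosome (hypothetical fields)."""
    chromosome: str
    start: int  # 1-based, inclusive
    stop: int   # 1-based, inclusive


@dataclass
class AltLocus:
    """Scaffold sequence plus its alignment to the chromosome."""
    name: str
    scaffold_length: int
    alignment: Alignment


@dataclass
class GenomicRegion:
    """A chromosome interval that may carry one or more alt loci."""
    name: str
    chromosome: str
    start: int
    stop: int
    alt_loci: List[AltLocus] = field(default_factory=list)

    def add_alt(self, alt: AltLocus) -> None:
        """Attach an alt locus, rejecting placements outside the region."""
        aln = alt.alignment
        overlaps = (aln.chromosome == self.chromosome
                    and aln.start <= self.stop
                    and aln.stop >= self.start)
        if not overlaps:
            raise ValueError(f"{alt.name} is not placed within region {self.name}")
        self.alt_loci.append(alt)


# Made-up region and coordinates, for illustration only.
region = GenomicRegion("EXAMPLE_REGION", "chrN", 1_000_000, 1_500_000)
region.add_alt(AltLocus("ALT_1", 420_000, Alignment("chrN", 1_050_000, 1_450_000)))
print([alt.name for alt in region.alt_loci])
```

Making the alignment a required field mirrors the slide's point that an alt scaffold without a chromosome placement is not a usable alt locus.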
GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38: Alt Loci
Alignment legend: no alignment, mismatch, deletion
Tracks: chromosome, alt/patch, reads
On-target alignment
Off-target alignments (n=122,922)
GRCh38: Annotation Stats
GRCh38 Base Updates
Targeted PCR/WGS: n=91
GRCh38 Centromeres
Miga et al., Genome Res. 2014 Apr;24(4):697-707
GRCh38 Novel Sequence
Assembly (e.g. GRCh38.p1)
  Primary Assembly Unit
  Non-nuclear assembly unit (e.g. MT)
  ALT 1
  ALT 2
  ALT 3
  ALT 4
  ALT 5
  ALT 6
  ALT 7
  PAR
  Genomic Region (MHC)
  Genomic Region (UGT2B17)
  Genomic Region (MAPT)
  Patches
  Genomic Region (ABO)
  Genomic Region (FOXO6)
  Genomic Region (FCGBP)
Assembly Updates
Patches
                                                  FIX               NOVEL
Scaffold status at next major assembly release    -- (integrated)   ALT LOCI
Treat as                                          Preferred         Allelic
GRCh38.p4
• 55 Patches: >400 kb novel sequence
• 37 FIX
• 18 NOVEL
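The two patch types can be summarized in a few lines of code. This sketch reads the patch table as FIX patches being integrated into the chromosome sequence at the next major release and treated as preferred, and NOVEL patches becoming alt loci and treated as allelic. The names below are hypothetical, and only the GRCh38.p4 counts come from the slide.

```python
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


# Scaffold status at the next major assembly release, per patch type.
NEXT_MAJOR_RELEASE = {
    PatchType.FIX: "integrated into the chromosome sequence",
    PatchType.NOVEL: "becomes an alt locus",
}

# How the patch should be treated relative to the chromosome sequence.
TREAT_AS = {
    PatchType.FIX: "preferred",
    PatchType.NOVEL: "allelic",
}

# Patch counts reported for GRCh38.p4.
GRCH38_P4_COUNTS = {PatchType.FIX: 37, PatchType.NOVEL: 18}


def describe(patch_type: PatchType) -> str:
    """Return a one-line summary of how a patch type is handled."""
    return (f"{patch_type.value}: treat as {TREAT_AS[patch_type]}; "
            f"at next major release, {NEXT_MAJOR_RELEASE[patch_type]}")


for patch_type in PatchType:
    print(describe(patch_type))
print("total patches:", sum(GRCH38_P4_COUNTS.values()))
```

The total printed at the end matches the 55 patches reported for GRCh38.p4.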
Assembly Updates
Learn more about assembly updates at the GRC poster: 1834W (6-7 pm)
Accessing the Data
http://genomereference.org
GRC Assembly Management
ftp://ngs.sanger.ac.uk/production/grit/track_hub/hub.txt
NCBI Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view/
Learn more about viewing GRCh38 at NCBI: 1748T (12-1 pm)
http://www.ncbi.nlm.nih.gov/genome/tools/remap
GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC Credits
http://genomereference.org