Assembly background slides from Valerie Schneider
Getting the Most from the Human
Genome: Understanding Updates
and Making Use of Improvements
in the Reference Assembly
ASHG
18 Oct 2014
Valerie Schneider (NCBI)
Tina Graves-Lindsay (TGI-WashU)
Deanna Church (Personalis)
Laura Clarke (EMBL-EBI)
Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs
GRC Credits
Outline
Reference Assembly Basics
GRC: Assembly management and model
GRCh38
Accessing the assembly and data
http://genomereference.org
What is the Reference Assembly?
Reference Assembly Basics
Dilthey et al.; Paten et al.
Reference Assembly Basics
Reference ≠ Error-Free
• Highest quality mammalian genome, but it’s not perfect
• Errors can influence data that relies upon the reference
• Annotation
• Variant calling and interpretation
• Guided genome assemblies
What factors influence the production and quality
of genome assemblies?
Lander and Waterman (1988) Genomics
Assumptions:
• Reads are randomly distributed
• Overlap between reads does not vary
Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)
Poisson distribution: $P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$
y = number of events in an interval
λ = mean number of events in an interval
Reference Assembly Basics
For sequence calculations, coverage can be viewed as λ.
Using this equation, you can calculate the probability that a base has been sequenced y times.
By manipulating this formula, you can estimate the number of gaps for any given level of coverage.
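The gap estimate can be sketched numerically. This is a minimal sketch, assuming the standard Lander-Waterman result that the expected number of contigs is N·e^(-C(1 - T/L)); the number of gaps is then roughly the number of contigs. The genome size, read length, and overlap values are illustrative assumptions, not figures from the slides.

```python
import math

def expected_contigs(G, L, N, T):
    """Expected number of contigs under the Lander-Waterman assumptions.

    G: haploid genome length (bp), L: read length (bp),
    N: number of reads, T: overlap needed for detection (bp).
    """
    C = L * N / G          # coverage, as defined on the slide
    sigma = 1 - T / L      # fraction of a read that is not needed for overlap
    return N * math.exp(-C * sigma)

# Illustrative (assumed) values: 3 Gb genome, 500 bp reads, 50 bp overlap.
G, L, T = 3_000_000_000, 500, 50
for C in (1, 5, 10):
    N = C * G // L
    print(f"{C}X coverage: about {expected_contigs(G, L, N, T):,.0f} contigs")
```

Raising coverage shrinks the expected contig count exponentially, but it only reaches zero in the limit, which is why some gaps remain at any practical depth.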
Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
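The percentages above follow from the Poisson formula with y = 0. The sketch below assumes coverage is used directly as the Poisson mean, as the slides suggest; the function name `poisson_pmf` is chosen here for illustration.

```python
import math

def poisson_pmf(y, lam):
    """Probability that a base is sequenced exactly y times at mean coverage lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

for coverage in (1, 5, 10):
    not_sequenced = poisson_pmf(0, coverage)   # y = 0: the base was never read
    print(f"{coverage}X: {not_sequenced:.3%} not sequenced, "
          f"{1 - not_sequenced:.3%} sequenced")
```

The same function gives the probability of any other depth, for example `poisson_pmf(3, 5)` for a base covered exactly three times at 5X.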
Reference Assembly Basics
Sims et al. (2014) Nat Rev Genet. 15(2):121-32
Reference Assembly Basics
Even if you sequence to an “appropriate” coverage, you’re
still likely to have missing sequence in your assembly.
Complicating Factors:
• Library construction
• Cloning bias
• Sequencing Limitations
• Assembly method
• Underlying biology
Biology
• Repetitive sequence (interspersed repeats, segmental duplications)
• Variation (regions of high diversity, structural variation)
Kidd et al., 2008
Reference Assembly Basics
Eugene Yaschenko, NCBI GRCh37
Technology
• Read length: long reads vs. short reads
• Mate lengths: distribution of insert sizes
• Read accuracy: error model for your technology
• Read depth: coverage at each base
• Genome distribution: reads covering entire genome equally
Ajay et al., 2011
Genome Research, May, 1997
Reference Assembly Basics
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information (“mate-pairs”)
Find sequence overlaps
Each end sequence is referred to as a read
WGS contig
WGS: Sanger Reads
Scaffold
Reference Assembly Basics
Contig: a sequence constructed from
smaller, overlapping sequences, which
contains no gaps.
Scaffold: a sequence constructed from
smaller sequences, which may contain
gaps.
Genome Vocabulary
Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ
Typically built from sequences in GenBank/EMBL/DDBJ
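The contig/scaffold distinction can be illustrated by cutting a scaffold at its gaps. This is a minimal sketch, assuming gaps are written as runs of N (the usual FASTA convention); the gap positions and lengths in the example are made up for illustration.

```python
import re

def split_scaffold(scaffold_seq):
    """Return (start, end, sequence) for each gap-free stretch (contig).

    Coordinates are 0-based, end-exclusive, relative to the scaffold.
    """
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"[^Nn]+", scaffold_seq)]

# Hypothetical scaffold: three contigs joined by gaps of 5 and 3 bases.
scaffold = "ATTTTCCCTT" + "N" * 5 + "CTGAAATGAT" + "N" * 3 + "GAAAGAGTC"
for start, end, seq in split_scaffold(scaffold):
    print(start, end, seq)
```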
Reference Assembly Basics
Schatz et al, 2010
Reference Assembly Basics
A T T T T C C C T T C T G A A A T G A T G A A A G A G T C
Reference Assembly Basics
BAC insert
BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
“Finishers” go in to manually fill the gaps, often by PCR
Clone based assemblies
Reference Assembly Basics
Ideally…
[Figure: sequence contigs (A to O) ordered and oriented (one flipped) against a non-sequence based map]
Reference Assembly Basics
More like…
[Figure: contig orders that conflict with one another and with the map, leaving some placements unresolved (?)]
Reference Assembly Basics
Sequence vs. Non-sequence based maps
[Figure: Mmu7 compared against the WI Genetic and WI/MRC RH maps]
Human assemblies available in the NCBI assembly database
http://www.ncbi.nlm.nih.gov/assembly
Reference Assembly Basics
N50: Measure of contiguity. Contigs of this length or greater together contain at least half of the bases in the assembly.
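A minimal sketch of the N50 calculation, assuming the usual definition: sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembled bases. The contig lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:   # at least half of all bases reached
            return length
    raise ValueError("no contig lengths given")

# Four short contigs and one long one: the median length is 1, the N50 is 10.
print(n50([1, 1, 1, 1, 10]))
```

The example shows why N50 is not the median contig length: it is weighted by bases, so one long contig dominates.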
Reference Assembly Basics
Fragmented genomes tend to have
more partial models
Fragmented genomes have
fewer frameshifts
Alexander Souvorov, NCBI
Outline
Reference Assembly Basics
GRC: Assembly management and model
GRCh38
Accessing the assembly and data
http://genomereference.org
Distributed data
Genome not in INSDC Database
Old Assembly Model
GRC Assembly Management
Human Genome Project (HGP)
AECOM BCM Beijing CGM CHGC CMGWCH CSHL GBF GS GTC IIGB-CNR
IMB JGI JST Keio MPIMG RIKEN SC SDSTDC SHGC TIGR Tokai
UOKNOR UTSW UUGC UWGC UWMSC WIBR WUGSC YMGC unknown
GRC Assembly Management
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
GRC Assembly Management
Issue tracking system (based on JIRA)
GRC Assembly Management
http://genomereference.org
GRC Assembly Management
http://genomereference.org
GRC Assembly Management
ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
Tiling Path File (TPF)
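The rows shown on this slide can be read with a few lines of code. This is a sketch that handles only the whitespace-separated layout visible above (component rows with accession, clone name, and contig; GAP rows with a type and an optional length). It is not a parser for the full TPF specification, and the class names are chosen here for illustration.

```python
from typing import NamedTuple, Optional

class Component(NamedTuple):
    accession: str
    clone_name: str
    contig: str

class Gap(NamedTuple):
    gap_type: str
    length: Optional[int]   # None when the row gives no length

def parse_tpf_rows(text):
    """Parse the simplified TPF rows shown on the slide into records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "ACCESSION":   # skip blanks and header
            continue
        if fields[0] == "GAP":
            length = int(fields[2]) if len(fields) > 2 else None
            rows.append(Gap(fields[1], length))
        else:
            rows.append(Component(*fields[:3]))
    return rows

SLIDE_ROWS = """\
ACCESSION NAME CONTIG
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""
for row in parse_tpf_rows(SLIDE_ROWS):
    print(row)
```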
GRC Assembly Management
Full Dovetail
Half-dovetail
Contained
Short/Blunt
GRC Assembly Management
Build sequence contigs based on contigs
defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Representative chromosome sequence
GRC Assembly Management
AGP: A Golden Path
Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types
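How an AGP drives sequence construction can be sketched as follows. The sketch assumes the tab-delimited nine-column AGP layout (object, object start, object end, part number, component type, then either component ID/start/end/orientation or gap length/type/linkage/evidence). The component names and sequences are hypothetical. The component start and end columns are where the switch points are encoded: they say which slice of each clone contributes to the path.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_from_agp(agp_lines, components):
    """Build each object's sequence from AGP rows and component sequences."""
    objects = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        obj, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):              # gap row: column 6 is its length
            piece = "N" * int(cols[5])
        else:                                    # component row: 1-based, inclusive
            start, end = int(cols[6]), int(cols[7])
            piece = components[cols[5]][start - 1:end]
            if cols[8] == "-":
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects

# Hypothetical components and AGP rows, for illustration only.
COMPONENTS = {"CLONE_A.1": "AAAACCCCGG", "CLONE_B.1": "GGTTTTACGT"}
AGP_LINES = [
    "chrDemo\t1\t8\t1\tF\tCLONE_A.1\t1\t8\t+",
    "chrDemo\t9\t13\t2\tN\t5\tcontig\tno\tna",
    "chrDemo\t14\t19\t3\tF\tCLONE_B.1\t3\t8\t-",
]
print(build_from_agp(AGP_LINES, COMPONENTS)["chrDemo"])
```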
GRC Produces
GRC Assembly Management
• AGP
• FASTA
Distributed data
Old Assembly Model
Centralized Data
Updated Assembly Model
GRC Assembly Management
Genome not in INSDC Database
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
GRC Assembly Management
Assembly (e.g. GRCh38)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
The human reference genome assembly is not a haploid model
[Figure: alternate loci units ALT 1 through ALT 7]
Alternate loci are not synonymous with haplotypes
GRC Assembly Management
Assembly (e.g. GRCh38.p1)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1 through ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091
Patches
• Genomic Region (ABO)
• Genomic Region (FOXO6)
• Genomic Region (FCGBP)
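The assembly model above can be pictured as a small data structure: an assembly holds assembly units, and each unit holds scaffolds tied to genomic regions. This is only a sketch of that idea. The class names are invented here, and the assignment of regions to particular ALT units in the example is an assumption made for illustration, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssemblyUnit:
    name: str
    # Genomic regions that have a scaffold in this unit (empty for the primary unit).
    regions: List[str] = field(default_factory=list)

@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit]

    def units_for_region(self, region):
        """Names of the units holding an alternate or patch scaffold for a region."""
        return [u.name for u in self.units if region in u.regions]

# Illustrative layout; which region sits in which ALT unit is assumed.
grch38_p1 = Assembly("GRCh38.p1", [
    AssemblyUnit("Primary Assembly Unit"),
    AssemblyUnit("Non-nuclear assembly unit (MT)"),
    AssemblyUnit("ALT 1", ["MHC", "UGT2B17", "MAPT"]),
    AssemblyUnit("ALT 2", ["MHC"]),
    AssemblyUnit("PATCHES", ["ABO", "FOXO6", "FCGBP"]),
])
print(grch38_p1.units_for_region("MHC"))
```

A region with several alternate representations appears in several units, which is one way to see why alternate loci units are not the same thing as haplotypes.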
GRC Assembly Management
Patches
PATCH TYPE    SCAFFOLD STATUS AT NEXT MAJOR ASSEMBLY RELEASE
FIX           -- (integrated)
NOVEL         ALT LOCI
1q32 1q21 1p21
Dennis et al., 2012
Fix patches are different than novel patches
GRC Assembly Management
The alignments of the alternate loci scaffolds to the chromosomes are part of the assembly
Anatomy of an alt
Alignment Legend: no alignment, mismatch, deletion
Anatomy of an alt
AC012314.8
CU151838.1
ALT LOCI
AC012314.8
AC245052.3 CHR. 19
Alternate loci contain some sequence that is redundant to the primary assembly unit
Masks and alt aware aligners reduce the incidence of
ambiguous alignments observed when aligning reads to
the full assembly
Mask1: mask chr for fix patches, scaffold for novel/alts. Mask2: mask only on scaffolds
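Masking itself is simple to sketch: the redundant stretch is overwritten with N so that reads cannot align to it. The function below is a generic hard-masking helper written for illustration. How the GRC masks choose their intervals is not shown here, and the interval in the example is made up.

```python
def hard_mask(seq, intervals):
    """Replace bases in 0-based, end-exclusive intervals with N."""
    bases = list(seq)
    for start, end in intervals:
        if not 0 <= start <= end <= len(bases):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        bases[start:end] = "N" * (end - start)
    return "".join(bases)

# Hypothetical example: mask positions 2 to 4 of a short sequence.
print(hard_mask("ACGTACGTAC", [(2, 5)]))
```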
Simulated Reads
GRCh38: Alt Loci
GRC Assembly Management
GRCh38.p1
• 192 Regions
• 261 ALT LOCI (61.9 Mb, 3.6 Mb unique to alts)
• PATCHES (94 kb unique to patches)
  • 13 FIX
  • 3 NOVEL
GRCh38: Alt Loci
GRCh38 alt loci alignment
GRCh37 chr. 7
[Figure: reads aligned to the chromosome and to the alt/patch, showing the on-target alignment and off-target alignments (n=122,922)]
GRCh38: Alt Loci
Distributed data
Genome not in INSDC Database
Old Assembly Model
Centralized Data
Updated Assembly Model
Genome in INSDC Database
Genome not in INSDC Database
GRC Assembly Management
Outline
Reference Assembly Basics
GRC: Assembly management and model
GRCh38
Accessing the assembly and data
http://genomereference.org
GRCh38: Annotation Stats
GRCh38 Base Updates
Targeted PCR/WGS: n=91
GRCh38 Sequence Updates
Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components
79% of these bases are heterozygous in RP11 WGS
n=10489
GRCh38 Centromeres
Miga et al., Genome Res. 2014 Apr;24(4):697-707
GRCh38 Model Centromeres
GRCh38 Impact
NOVEL GENES!
GRCh38 Impact
Sudmant et al., 2010
GRCh38: Novel Sequence
GRCh38 Impact
Outline
Reference Assembly Basics
GRC: Assembly management and model
GRCh38
Accessing the assembly and data
http://genomereference.org
Accessing the Data
http://genomereference.org
Accessing the Data
http://twitter.com/[email protected]
Accessing the Data
http://genomeref.blogspot.com/
Accessing the Data
Search
Gene and exon navigator
Variant Filter
Variant Table
Sequence Viewer
Slide: Peter Cooper, NCBI
http://www.ncbi.nlm.nih.gov/variation/view/
NCBI Variation Viewer
Accessing the Data
http://www.ncbi.nlm.nih.gov/genome/tools/remap
Accessing the Data