The Human Reference Assembly
Deanna M. Church, Staff Scientist, NCBI
@deannachurch
Short Course in Medical Genetics 2013
Updating the assembly
Oh No! Not a new version of the human genome!
Updating the assembly
120 fix patches: chromosome will be updated in GRCh38
71 novel patches: additional sequence added
(adds >5 Mb of novel sequence to the assembly)
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
GRCh37.p13 (160 regions: >3% of chromosomes)
Summer of 2013
Data Model
[Figure: assembly data model. An assembly (e.g. GRCh37.p5) is made up of the primary assembly, a non-nuclear assembly unit (e.g. MT), alternate loci units ALT 1 through ALT 9, and patches. Labeled genomic regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1. Other labels: PAR, …]
Data Model: accessions by assembly unit

Assembly unit                        GenBank            RefSeq
Assembly                             GCA_000001405.6    GCF_000001405.17
Primary Assembly                     GCA_000001305.1    GCF_000001305.13
ALT 1                                GCA_000001315.1    GCF_000001315.1
ALT 2                                GCA_000001325.1    GCF_000001325.2
ALT 3                                GCA_000001335.1    GCF_000001335.1
ALT 4                                GCA_000001345.1    GCF_000001345.1
ALT 5                                GCA_000001355.1    GCF_000001355.1
ALT 6                                GCA_000001365.1    GCF_000001365.2
ALT 7                                GCA_000001375.1    GCF_000001375.1
ALT 8                                GCA_000001385.1    GCF_000001385.1
ALT 9                                GCA_000001395.1    GCF_000001395.1
Patches                              GCA_000005045.5    GCF_000005045.4
Non-nuclear assembly unit (e.g. MT)  GCA_000006015.1    GCF_000006015.1
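Each assembly unit listed above carries a paired GenBank (GCA_) and RefSeq (GCF_) accession, and the two can have different version numbers. A minimal Python sketch of splitting such strings into prefix, identifier, and version follows. The names `Accession`, `parse_accession`, and `parse_pair` are illustrative only and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple, Tuple


class Accession(NamedTuple):
    prefix: str   # "GCA" (GenBank) or "GCF" (RefSeq)
    number: str   # nine-digit identifier shared by the pair
    version: int  # version after the dot


# Pattern assumed from the accessions shown on the slide.
_PATTERN = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_accession(text: str) -> Accession:
    """Split one accession.version string into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    return Accession(match.group(1), match.group(2), int(match.group(3)))


def parse_pair(text: str) -> Tuple[Accession, Accession]:
    """Parse a 'GCA_.../GCF_...' pair as written on the slide."""
    left, right = text.split("/")
    return parse_accession(left), parse_accession(right)


if __name__ == "__main__":
    genbank, refseq = parse_pair("GCA_000001405.6 /GCF_000001405.17")
    print(genbank, refseq)
```

Keeping the version as an integer makes it easy to check which assembly release a result was produced against.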
GRCh38 is coming (September 2013)
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
http://genomereference.org
Why does missing sequence matter?
[Figure: GRCh37 contains only Duplicon A; the sample genome contains both Duplicon A and Duplicon B.]
May or may not detect increased coverage, depending on sequencing depth and library quality (easier to find with new technologies than with old, low-throughput technologies)
G>A (allelic difference: true variant)
G>C (paralogous sequence variant: false positive)
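The effect of a missing duplicon can be put in numbers with an idealized model. The sketch below assumes that every read from Duplicon B aligns to Duplicon A, that coverage is uniform, and that both duplicons are present in two copies. These are simplifying assumptions for illustration, and the function name `collapsed_pileup` is hypothetical.

```python
def collapsed_pileup(depth_per_copy: float,
                     copies_a: int = 2,
                     copies_b: int = 2,
                     het_copies_a: int = 1) -> dict:
    """Expected pileup over Duplicon A when Duplicon B is missing
    from the reference and its reads all align to A (idealized)."""
    total_copies = copies_a + copies_b
    observed_depth = depth_per_copy * total_copies
    expected_depth = depth_per_copy * copies_a
    return {
        # Fold increase in coverage relative to a correctly assembled region.
        "coverage_fold": observed_depth / expected_depth,
        # Fraction of reads carrying the Duplicon B base at a site where
        # B differs from A (paralogous sequence variant).
        "psv_fraction": copies_b / total_copies,
        # Fraction of reads carrying a true heterozygous allele in A,
        # diluted by the reads from B.
        "true_het_fraction": het_copies_a / total_copies,
    }


if __name__ == "__main__":
    print(collapsed_pileup(depth_per_copy=15))
```

Under these assumptions the paralogous difference looks like a clean heterozygous call, while the true heterozygous variant is diluted below the expected one-half, which is why the coverage signal matters.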
CDC27
1KG Phase 1 Strict accessibility mask
SNP (all)
SNP (not 1KG)
Sudmant et al., 2010
Kidd et al., 2007
APOBEC cluster
Part of chr22 assembly
Alternate locus for chr22
White: insertion; Black: deletion
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
129S6/SvEvTac tiling path
Alignment to C57BL/6J chr1
B6 Genes
129S6/SvEvTac Genes
+ 32Kb in 129S6/SvEvTac
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
129S6/SvEvTac Alt Locus Alignment (allelic)
FVB/N Transcript Alignment (paralog)
129S6/SvEvTac Ren1
FVB Ren2 Tx
Paralogous diff
SNP + paralogous diff
Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36; unlocalized in GRCh37; finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID
Alignment to Hydin1 CHM1_1.0, >99.9% ID
Doggett et al., 2006
Dennis et al., 2012
1q32 1q21 1p21
1p21 patch alignment to chromosome 1
Preview of GRCh38 (scheduled Fall 2013)
TEX28 TKTL1
LOC101060233(opsin related)
LOC101060234(TEX28 related)
GRCh37 (current reference assembly), chrX
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-21
NCBI36 (hg18)
GRCh37 (hg19)
NCBI35 (hg17)
GRCh37 (hg19)
AL139246.20
AL139246.21
Fixing Rare/Incorrect Bases
GRCh37B Sites for Update: n=1164
Sites with unique successful ctg: 1148 (98.6%)
Avg Length: 448 bp
Min/Max Success Length: 51/791 bp
Avg Coverage: 80x
Read Source (all contigs)
High coverage: 32%
Low coverage: 57%
Exome: 10%
Fixing Rare/Incorrect Bases
A = 0.000, G = 1.000
rs4732519
RP11 WGS reads
Private RP11 variant? Missing in 1000G?
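One way to look for reference bases that are rare or absent in population data is to compare the reference allele at each site with per-site allele frequencies. The sketch below is illustrative only: the function name, the data layout, the threshold, and the example sites are all hypothetical and are not taken from the GRC pipeline or from the slide.

```python
from typing import Dict, List, Tuple

# (reference allele, {allele: population frequency}) -- hypothetical layout
Site = Tuple[str, Dict[str, float]]


def rare_reference_sites(sites: Dict[str, Site],
                         max_ref_freq: float = 0.01) -> List[str]:
    """Return IDs of sites whose reference allele frequency is at or
    below max_ref_freq (an arbitrary illustrative threshold)."""
    flagged = []
    for site_id, (ref_allele, freqs) in sites.items():
        # An allele absent from the frequency table counts as frequency 0.
        if freqs.get(ref_allele, 0.0) <= max_ref_freq:
            flagged.append(site_id)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up example data, not real variant records.
    example = {
        "site_1": ("A", {"A": 0.0, "G": 1.0}),
        "site_2": ("C", {"C": 0.6, "T": 0.4}),
    }
    print(rare_reference_sites(example))
```

A flagged site still needs review, since a frequency of zero could mean either a private variant in the sequenced individual or an error in the reference.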
FAM23_MRC1 Region, chr10
Segmental Duplications
1KG accessibility Mask
Novel Patch 250 kb of artificial duplication
Genovese et al., 2013
Adding Novel Sequence
Karen Hayden and Jim Kent
Human Resolved for GRCh38
Richa Agarwala
MHC Alternate locus
Alignment to chr6
Making the assembly accessible to existing tools: masking
Query set: 439,109,084 NA12878 HiSeq reads
Masking effectively blocks alignments in regions with high identity
Simulated reads from GRCh37.p9
• Unpaired reads
• 101 bp
• 1x coverage
• Default wgsim parameters
Masking parameters
• Percent Id: 100%
• Step size: 5 bp
• Minimum length: 101 bp
• Center SNPs in unmasked regions
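The masking idea can be sketched in a few lines. The example below is one possible reading of the parameters above, not the method used for the slides: it assumes a gap-free alignment between an alternate locus or patch and the primary assembly, assumes the alternate sequence is the one being masked, and replaces runs of 100% identity at least `min_length` bases long with N. Step size and SNP centering are not modeled, and the function name `mask_identical_runs` is hypothetical.

```python
def mask_identical_runs(alt: str, primary: str, min_length: int = 101) -> str:
    """Mask (with N) every run of the alt sequence that is identical to
    the primary sequence for at least min_length consecutive bases.
    Both inputs must be the same length (gap-free alignment assumed)."""
    if len(alt) != len(primary):
        raise ValueError("sequences must be aligned to the same length")
    masked = list(alt)
    run_start = None
    # Iterate one position past the end so a trailing run is closed.
    for i in range(len(alt) + 1):
        same = i < len(alt) and alt[i].upper() == primary[i].upper()
        if same:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                masked[run_start:i] = "N" * (i - run_start)
            run_start = None
    return "".join(masked)


if __name__ == "__main__":
    # Toy sequences with a small min_length so the effect is visible.
    alt = "ACGTACGTAC" + "T" + "ACG"
    primary = "ACGTACGTAC" + "G" + "ACG"
    print(mask_identical_runs(alt, primary, min_length=5))
```

With a minimum length equal to the read length, a read can only align to the alternate sequence if it overlaps a position that differs from the primary assembly, which is the behavior the masking is meant to produce.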
Masking improves alignments in regions with alternate loci or patches
NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone
Masking reduces the increase in NA12878 reads with MAPQ=0 alignments that occurs when the full assembly is used as the alignment target
Take home messages
The assembly you use for analysis is an important part of your analysis package.
The reference assembly is not a set of linear sequences but can now represent allelic diversity.
Tools still need to catch up.
The human reference assembly is updating soon! (Remember: assemblies are not static, if you are lucky!)