Characterizing extreme diversity in
the human genome using a single
haplotype genomic resource
Karyn Meltz Steinberg, Ph.D.
AGBT 2015 GRC Workshop
@KMS_Meltzy
Human Genetic Variation
Types of genetic variants
[Figure: frequency vs. size of variant (1 bp, 1 kb, 1 Mb, 1 chr); SNP, copy number variants, trisomy/monosomy]
Slide courtesy of S. Girirajan
Human Genetic Variation
How do we assay them?
[Figure: throughput vs. size of variant (1 bp, 1 kb, 1 Mb, 1 chr); SNP genotyping, Sequencing, Array-CGH, Karyotyping]
Slide courtesy of S. Girirajan
Extreme diversity in the human genome
• <99.5% identity to the reference
• Refractory to traditional sequencing efforts
• Loci often contain gene families associated with
immune response and xenobiotic metabolism
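As a rough illustration of the identity threshold above, the following sketch computes percent identity over a pairwise alignment and flags a locus when it falls below 99.5%. The function names and the choice to count gap columns as mismatches are assumptions made for this example, not something stated in the talk.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns.

    Assumption for this sketch: any column containing a gap ('-')
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def is_extremely_diverse(aligned_a: str, aligned_b: str,
                         threshold: float = 99.5) -> bool:
    """True when identity to the reference is below the threshold."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

In practice the aligned strings would come from an aligner's output; here they are simply two equal-length strings.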
HLA is a classic example of an extremely diverse locus
• Critical to immune response
• Characterized by overdominant
selection
• Alleles are linked and segregate as
distinct haplotypes
• Shaped by gene duplication and
diversification
Segmental duplications can predispose loci to further
rearrangement via non-allelic homologous recombination (NAHR)
Diploid Genome
[Figure: allelic copies and repeat copies of a sequence; repeat copies noted by color difference]
With a diploid genome, there is significant ambiguity in
sorting allelic copies from repeat copies
Haploid Genome
[Figure: repeat copies only, noted by color differences]
With a haploid genome, allelic differences are eliminated, and
base differences are likely indicative of repeat copies
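The reasoning above can be sketched in code: when reads from a haploid sample are aligned to an assembly of the same sample, a site with a mixed allele fraction cannot be a true heterozygote, so it is a candidate for collapsed repeat (paralogous) copies. The `SiteCounts` type, the category names, and the depth and allele-fraction thresholds are all hypothetical choices for illustration.

```python
from typing import NamedTuple


class SiteCounts(NamedTuple):
    """Read support at one assembly position (hypothetical structure)."""
    position: int
    ref_reads: int
    alt_reads: int


def classify_site(site: SiteCounts, min_depth: int = 10,
                  het_low: float = 0.25, het_high: float = 0.75) -> str:
    """Classify a site in a haploid sample by its alternate-allele fraction.

    Thresholds are illustrative assumptions, not values from the talk.
    """
    depth = site.ref_reads + site.alt_reads
    if depth < min_depth:
        return "low_depth"
    alt_fraction = site.alt_reads / depth
    if alt_fraction < het_low:
        return "reference"
    if alt_fraction > het_high:
        # Nearly all reads disagree with the assembly base.
        return "alt"
    # A haploid genome has no allelic copies, so a mixed signal
    # suggests reads from more than one repeat copy piled up here.
    return "apparent_het"


def candidate_repeat_sites(sites):
    """Positions whose mixed signal hints at collapsed repeat copies."""
    return [s.position for s in sites if classify_site(s) == "apparent_het"]
```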
Hydatidiform mole
SRGAP2 Homology between genes
Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
Dennis et al. 2012
SRGAP2A
SRGAP2B
SRGAP2C
1q21
1q21 patch alignment to chromosome 1
1q32 1q21 1p21
Hydatidiform mole
Let’s sequence and assemble the whole genome!
CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
Total Sequence Length:      3,037,866,619 bp
Total Assembly Gap Length:  210,229,812 bp
Number of Scaffolds:        163
Scaffold N50:               50,362,920 bp
CHM1 Assembly Paper - Genome Research Steinberg et al. 2014
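For readers unfamiliar with the scaffold N50 statistic quoted above, here is a minimal sketch of the usual definition: the length of the scaffold at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total. The function name and toy lengths are illustrative; the per-scaffold lengths of CHM1_1.1 are not given in the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the cumulative
        # length to at least half of the assembly total.
        if 2 * running >= total:
            return length


# Ungapped sequence implied by the two totals reported on the slide.
TOTAL_SEQUENCE_LENGTH = 3_037_866_619
TOTAL_GAP_LENGTH = 210_229_812
ungapped_length = TOTAL_SEQUENCE_LENGTH - TOTAL_GAP_LENGTH
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` is 8, because the two longest scaffolds already cover half of the total length of 32.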
CHM1_1.1 assembly is highly contiguous compared to
other WGS based assemblies
Integrating BAC tiling paths improved assembly
Alignment of CHM1 Illumina data to assembly revealed
regions of extreme heterogeneity
Category                          Heterozygous   Homozygous   Total
Variants                          64033          22513        86546
In RepeatMasked (RM) sequence     37060          14833        51893
In Segmental duplication (SD)     30670          4843         35513
In RM and SD                      51466          17174        68640
Ts:Tv                             1.5            0.7          1.2
Mean SNV density/kb               0.02           0.008        0.03
There are significantly more heterozygous variants in repetitive
sequence than expected (p < 1 × 10^-16). BAC ends that map discordantly
and to multiple loci are significantly enriched for segmental
duplications (p < 1 × 10^-5).
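To make the pattern in the table concrete, the sketch below computes the share of heterozygous and homozygous calls that fall in segmental duplications, plus a simple odds ratio between the two. The counts are copied from the table; the helper names are hypothetical, and the odds ratio is only a descriptive summary, not the statistical test behind the p-values on the slide (which the slide does not name).

```python
# Counts taken from the variant table on the slide.
HET_TOTAL, HOM_TOTAL = 64033, 22513
HET_IN_SD, HOM_IN_SD = 30670, 4843


def fraction(part: int, whole: int) -> float:
    """Share of calls in a category."""
    return part / whole


def odds_ratio(a_in: int, a_total: int, b_in: int, b_total: int) -> float:
    """Odds of being inside the region for group A relative to group B.

    Assumes neither group lies entirely inside or outside the region.
    """
    odds_a = a_in / (a_total - a_in)
    odds_b = b_in / (b_total - b_in)
    return odds_a / odds_b


het_sd_fraction = fraction(HET_IN_SD, HET_TOTAL)
hom_sd_fraction = fraction(HOM_IN_SD, HOM_TOTAL)
sd_odds_ratio = odds_ratio(HET_IN_SD, HET_TOTAL, HOM_IN_SD, HOM_TOTAL)
```

Because CHM1 is haploid, a larger share of "heterozygous" calls in segmental duplications is consistent with reads from paralogous copies mapping to the same place.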
Identified 549 novel protein coding genes not annotated
in GRCh37
CHM1 BioNano Genome Map Aligned to GRCh38
GRCh38
CHM1 BioNano Map: ~15 kb additional data
BioNano SV Calls Identified Assembly Problems
[Figure: CHM1_1.1 Assembly vs. CHM1 BioNano Map; labels: Collapse, Expansion in Assembly, Gap in Sequence]
Conclusion
• Extremely diverse regions of the genome are difficult to
characterize due to issues distinguishing allelic from
paralogous duplications
• CHM1_1.1 is a highly contiguous, single-haplotype
representation of the genome
• Identified regions of misassembly or reference-ized
regions
• Utilize long read technology and nanopore technology to
attempt to fix these regions
Need to add more diversity to reference
• Finish another hydatidiform mole to platinum
status
• Finish 5 genomes to gold status
• NA19240 (Yoruban)
• NA12878 (European)
• HG00513 (Han Chinese)
• 2 “wildcards”
• Looking for underrepresented minority population
• Add high quality alternative sequences to
reference to create a population reference graph
or “pan genome”
Use colored de Bruijn graph structure to represent
population reference graph
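As a toy illustration of the colored de Bruijn graph idea, the sketch below breaks each sample's sequence into k-mers, stores every k-mer as an edge between its (k-1)-mer prefix and suffix, and "colors" each edge with the set of samples that contain it. The function names, sample labels, and sequences are invented for this example; real tools use far more compact data structures.

```python
from collections import defaultdict


def build_colored_dbg(samples, k):
    """Build a colored de Bruijn graph.

    samples: dict mapping a sample name (the "color") to its sequence.
    Returns a dict mapping each edge (prefix, suffix) to the set of
    colors whose sequence contains the corresponding k-mer.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges = defaultdict(set)
    for color, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(color)
    return dict(edges)


def private_edges(edges, color):
    """Edges seen only in one sample, i.e. where it diverges from the rest."""
    return {edge for edge, colors in edges.items() if colors == {color}}


# Two toy haplotypes that differ by a single base.
toy_samples = {"reference": "AACGTT", "alt": "AACCTT"}
toy_graph = build_colored_dbg(toy_samples, k=3)
```

Shared edges carry both colors, while a variant shows up as a bubble of edges carrying only one color; `private_edges(toy_graph, "alt")` returns the edges on the alternate side of that bubble.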
Bioinformatic tool development in the future
• Alignment of short reads to population reference
graph
• Variant calling
• Variant reporting/Haplotype resolution
Adapted from Weinstein et al, 2009
The GRCh37 reference sequence was assembled
from three lymphoblastoid cell lines
Not a true haplotype
Incomplete
The CH17 haplotype is quite different from the reference
• Novel insertion
• Complex indel
• Hotspot/recurrent mutation: 60 kbp insertion (hotspot)
• Duplication (influenza): 44 kbp duplication (influenza)
[Figure panels for the insertion and the duplication: African, Asian, European]
Summary of hydatidiform mole sequence
• 47 functional V genes
• 24 total variants (SNV and CNV) involving 29 IGHV
genes
• 5 structural variants
• 19 single nucleotide variants
• 15 non-synonymous mutations
• 20 out of 24 variants represent differences in amino acid
sequence or gene copy number
100 kbp of novel sequence
Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in
Genbank as phase 3)
• Active cell line
• >100X coverage Illumina 100bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data
CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to GRCh37 reference to provide appropriate gap sizes
• Assembly submitted to Genbank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Steinberg et al., 2014, Genome Research (Dec;24(12):2066-76)
LILR (leukocyte immunoglobulin-like receptor)/KIR (killer immunoglobulin receptor)
Immunoglobulin Kappa chain
Immunoglobulin Lambda chain
TCRA/B
17q21.31 inversion polymorphism
Immunoglobulin heavy chain locus
CYP2D6
SRGAP2
15q13.3 inversion polymorphism