
Characterizing extreme diversity in the human genome using a single haplotype genomic resource

Karyn Meltz Steinberg, Ph.D.

AGBT 2015 GRC Workshop

@KMS_Meltzy

Types of genetic variants

[Figure: frequency versus size of variant (1 bp, 1 kb, 1 Mb, 1 chr), showing SNPs, copy number variants, and trisomy/monosomy.]

Slide courtesy of S. Girirajan

Human Genetic Variation



How do we assay them?

[Figure: throughput versus size of variant (1 bp, 1 kb, 1 Mb, 1 chr), with assays added one at a time across slides: SNP genotyping, array-CGH, karyotyping, sequencing.]




Extreme diversity in the human genome

• <99.5% identity to the reference

• Refractory to traditional sequencing efforts

• Loci often contain gene families associated with immune response and xenobiotic metabolism
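The 99.5% cutoff above can be made concrete with a small sketch. It assumes two already-aligned sequences of equal length and counts gap columns as mismatches, which is only one of several conventions; the function name and the toy alignment are hypothetical and not from the talk.

```python
def percent_identity(aligned_ref, aligned_query):
    """Percent of aligned columns where both sequences carry the same base.

    Gap columns ('-') count as mismatches; this is one convention among several.
    """
    if not aligned_ref or len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(
        1 for r, q in zip(aligned_ref, aligned_query) if r == q and r != "-"
    )
    return 100.0 * matches / len(aligned_ref)

THRESHOLD = 99.5  # from the slide: <99.5% identity to the reference

# Hypothetical 1,000-column alignment with 6 mismatches (invented data).
ref = "A" * 1000
query = "A" * 994 + "C" * 6
identity = percent_identity(ref, query)
print(identity, identity < THRESHOLD)
```

Under this convention, a locus with more than 5 mismatches per 1,000 aligned bases would fall below the threshold.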

HLA is a classic example of an extremely diverse locus

• Critical to immune response

• Characterized by overdominant selection

• Alleles are linked and segregate as distinct haplotypes

• Shaped by gene duplication and diversification

Segmental duplications can predispose loci to further rearrangement via NAHR


Diploid Genome

[Figure: example bases at repeat copies (noted by color difference) and at allelic copies.]

With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies

Haploid Genome

[Figure: example bases at repeat copies ONLY (noted by color differences).]

With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
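The haploid argument above can be illustrated with a small sketch. It assumes a hypothetical string of read bases observed at one site and a hypothetical minor-allele-fraction cutoff of 0.2; neither comes from the talk. In a haploid sample, a site with two well-supported bases cannot be allelic, so the sketch labels it as a candidate repeat copy.

```python
from collections import Counter

def minor_allele_fraction(bases):
    """Fraction of reads carrying the second most common base at one site."""
    counts = Counter(bases).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(bases)

def classify_site(bases, ploidy, min_fraction=0.2):
    """Label a site from its read bases.

    min_fraction is a hypothetical cutoff, chosen only for illustration.
    """
    if not bases or minor_allele_fraction(bases) < min_fraction:
        return "single base"
    if ploidy == 1:
        # No second allele exists, so mixed bases point to repeat copies.
        return "candidate repeat copy"
    # In a diploid sample the same signal is ambiguous.
    return "allelic or repeat copy (ambiguous)"

reads_at_site = "AAAAACCCCC"  # invented pileup, 5 reads with A and 5 with C
print(classify_site(reads_at_site, ploidy=2))
print(classify_site(reads_at_site, ploidy=1))
```

The same pileup yields an ambiguous label for a diploid sample and a repeat-copy label for a haploid one, which is the distinction the two slides draw.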

Hydatidiform mole

SRGAP2: homology between genes

Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs

Shows homology between SRGAP2B and SRGAP2C

Dennis et al. 2012

[Figure: SRGAP2A, SRGAP2B, and SRGAP2C; 1q21 patch alignment to chromosome 1, with band labels 1q32, 1q21, 1p21.]

Hydatidiform mole

Let’s sequence and assemble the whole genome!

CHM1_1.1 Assembly

• Reference-guided assembly (SRPRISM v2.3, R. Agarwala)

• Alignment of Illumina reads to GRCh37 primary assembly

• CHORI-17 BAC clone tilepaths were then incorporated

• 428 total clones

• 324 clones in 45 tilepaths

• 104 clones as singletons

http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2

Total Sequence Length 3,037,866,619 bp

Total Assembly Gap Length 210,229,812 bp

Number of Scaffolds 163

Scaffold N50 50,362,920 bp
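For readers unfamiliar with the statistics above, here is a minimal sketch of how a scaffold N50 is commonly computed, plus the gap fraction implied by the two reported totals. The toy scaffold lengths are invented, and the gap fraction assumes the reported total sequence length includes gap bases.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least
    half of the total summed length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")

# Toy scaffold lengths, invented for illustration (not CHM1_1.1 data).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))

# Gap fraction from the two totals reported on the slide
# (assumes the total sequence length includes gap bases).
total_length = 3_037_866_619
gap_length = 210_229_812
gap_fraction = gap_length / total_length
print(f"{gap_fraction:.1%} of the assembly length is gap")
```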

CHM1 assembly paper: Steinberg et al. 2014, Genome Research

http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2

http://www.ncbi.nlm.nih.gov/pubmed/25373144?dopt=Abstract

CHM1_1.1 assembly is highly contiguous compared to other WGS-based assemblies

Integrating BAC tiling paths improved assembly


Alignment of CHM1 Illumina data to assembly revealed regions of extreme heterogeneity

                                 Heterozygous   Homozygous   Total
Variants                         64033          22513        86546
In RepeatMasked (RM) sequence    37060          14833        51893
In Segmental duplication (SD)    30670          4843         35513
In RM and SD                     51466          17174        68640
Ts:Tv                            1.5            0.7          1.2
Mean SNV density/kb              0.02           0.008        0.03

There are significantly more heterozygous variants in repetitive sequence than expected (p < 1 × 10^-16). BAC ends mapping discordantly and in multiple loci are significantly enriched for segmental duplications (p < 1 × 10^-5).
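The counts reported above can be checked for internal consistency, and the Ts:Tv row can be illustrated with a short sketch. The counts are copied from the slide; the `ts_tv` helper and the example SNVs are hypothetical and only show how a transition:transversion ratio is usually computed.

```python
# Counts copied from the slide: (heterozygous, homozygous, total).
table = {
    "Variants": (64033, 22513, 86546),
    "In RepeatMasked (RM) sequence": (37060, 14833, 51893),
    "In Segmental duplication (SD)": (30670, 4843, 35513),
    "In RM and SD": (51466, 17174, 68640),
}

for row, (het, hom, total) in table.items():
    # Each total should equal heterozygous + homozygous.
    assert het + hom == total, row
    print(f"{row}: {het / total:.0%} heterozygous")

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv(snvs):
    """Transition:transversion ratio for (ref, alt) pairs with ref != alt."""
    ts = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")

# Hypothetical SNVs, invented for illustration only.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ts_tv(example_snvs))
```

Every row of counts sums correctly. Because CHM1 is a single-haplotype sample, the heterozygous calls are the signal of interest here rather than true allelic variation.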

Identified 549 novel protein-coding genes not annotated in GRCh37

CHM1 BioNano Genome Map Aligned to GRCh38

[Figure labels: GRCh38; CHM1 BioNano Map; ~15 kb additional data.]

BioNano SV Calls Identified Assembly Problems

[Figure: CHM1_1.1 assembly compared with the CHM1 BioNano map; labels: collapse, expansion in assembly, gap in sequence.]

Conclusion

• Extremely diverse regions of the genome are difficult to characterize due to issues distinguishing allelic from paralogous duplications

• CHM1_1.1 is a highly contiguous single-haplotype representation of the genome

• Identified regions of misassembly or reference-ized regions

• Use long-read and nanopore technology to attempt to fix these regions

Need to add more diversity to reference

• Finish another hydatidiform mole to platinum status

• Finish 5 genomes to gold status

  • NA19240 (Yoruban)

  • NA12878 (European)

  • HG00513 (Han Chinese)

  • 2 “wildcards”

    • Looking for underrepresented minority population

• Add high quality alternative sequences to reference to create a population reference graph or “pan genome”

Use colored de Bruijn graph structure to represent population reference graph
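As a rough illustration of the colored de Bruijn graph idea, the sketch below builds a toy graph in which each k-mer edge records which samples ("colors") contain it. The sequences, sample names, and k value are hypothetical; real tools use far more compact data structures and handle reverse complements, which this sketch ignores.

```python
from collections import defaultdict

def colored_de_bruijn(samples, k):
    """Build a toy colored de Bruijn graph.

    Nodes are (k-1)-mers, edges are k-mers, and each edge stores the
    set of sample names ("colors") whose sequence contains that k-mer.
    """
    edges = defaultdict(set)
    for name, sequence in samples.items():
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(name)
    return edges

# Hypothetical sequences differing by one base; not real genomic data.
samples = {
    "reference": "ACGTACGGA",
    "alternate": "ACGTTCGGA",
}
graph = colored_de_bruijn(samples, k=4)

for (source, target), colors in sorted(graph.items()):
    label = "shared" if len(colors) == len(samples) else ",".join(sorted(colors))
    print(f"{source} -> {target}  [{label}]")
```

Edges carrying both colors are shared sequence, while the single-colored edges form a "bubble" around the differing base. This is how such a graph can hold alternative sequences alongside the reference path.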

Bioinformatic tool development in the future

• Alignment of short reads to population reference graph

• Variant calling

• Variant reporting/Haplotype resolution

Adapted from Weinstein et al, 2009

The GRCh37 reference sequence was assembled from three lymphoblastoid cell lines

Not a true haplotype

Incomplete

The CH17 haplotype is quite different from the reference

[Figure panels: novel insertion; complex indel; hotspot/recurrent mutation (60 kbp insertion); duplication (influenza; 44 kbp duplication). Population labels: African, Asian, European.]

Summary of hydatidiform mole sequence

• 47 functional V genes

• 24 total variants (SNV and CNV) involving 29 IGHV genes

  • 5 structural variants

  • 19 single nucleotide variants

    • 15 non-synonymous mutations

• 20 out of 24 variants represent differences in amino acid sequence or gene copy number


100 kbp of novel sequence

Current status of CHM1 resources

• CHORI-17 BAC library (created from the CHM1 cell line)

• CHORI-17 BAC end sequences (n=325,659)

• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)

• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)

• Active cell line

• >100X coverage Illumina 100 bp reads

  • 300 bp, 500 bp, and 3 kb inserts

• Reference-assisted assembly CHM1_1.1

• BioNano genome map

• >50X coverage of PacBio long-read data

CHM1_1.1 Assembly

• Reference-guided assembly (SRPRISM v2.3, R. Agarwala)

• Alignment of Illumina reads to GRCh37 primary assembly

• CHORI-17 BAC clone tilepaths were then incorporated

• 428 total clones

• 324 clones in 45 tilepaths

• 104 clones as singletons

• Comparison back to GRCh37 reference to provide appropriate gap sizes

• Assembly submitted to Genbank

• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2

• Steinberg et al. 2014, Genome Research (Dec;24(12):2066-76)


LILR (leukocyte immunoglobulin-like receptor)/KIR (killer immunoglobulin receptor)

Immunoglobulin Kappa chain

Immunoglobulin Lambda chain

TCRA/B

17q21.31 inversion polymorphism

Immunoglobulin heavy chain locus

CYP2D6

SRGAP2

15q13.3 inversion polymorphism