Joe Mychaleckyj Slide 1 Linkage Disequilibrium Joe Mychaleckyj Center for Public Health Genomics 982-1107 [email protected]

Joe Mychaleckyj

Slide 1

Linkage Disequilibrium

Joe Mychaleckyj

Center for Public Health Genomics

982-1107

[email protected]

Slide 2

Today we’ll cover…

• Haplotypes

• Linkage Disequilibrium

• Visualizing LD

• HapMap

Slide 3

References

Daniel L. Hartl and Andrew G. Clark. Principles of Population Genetics, Fourth Edition.

Bruce S. Weir. Genetic Data Analysis II.

Slide 4

References

Benjamin M. Neale, Manuel A.R. Ferreira, Sarah E. Medland, Danielle Posthuma (Eds.). Statistical Genetics: Gene Mapping Through Linkage and Association.

Slide 5

Haplotype: specific combination of alleles occurring (cis) on the same chromosome (segment of chromosome)

SNP1       SNP2       SNP3
[A / T]    [C / G]    [A / G]
A          C          G
A          C          A
T          G          G

N SNPs: how many haplotypes are possible?

$2^N$ (i.e. very large diversity possible)
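As a rough illustration of how the haplotype count grows with the number of SNPs, the sketch below enumerates every allele combination for the three biallelic SNPs shown above. The helper name `all_haplotypes` is an assumption made for this example and does not come from the slides.

```python
from itertools import product

# Alleles at each SNP, taken from the three-SNP example on this slide.
snps = [("A", "T"), ("C", "G"), ("A", "G")]

def all_haplotypes(snp_alleles):
    """Return every possible haplotype (one allele per SNP) as a string."""
    return ["".join(combo) for combo in product(*snp_alleles)]

haplotypes = all_haplotypes(snps)
print(len(haplotypes))  # number of possible haplotypes for 3 biallelic SNPs
print(haplotypes)

# The count doubles with every additional biallelic SNP.
print(len(all_haplotypes([("A", "a")] * 10)))
```

Only three of the possible haplotypes appear in the slide's example, which is the point taken up later under haplotype blocks.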

Slide 6

Terminology

• Haplotype: Specific combination (phasing) of alleles occurring (cis) on the same chromosomal segment

• Linkage/Linked Markers: Physical co-location of markers on the same chromosome

• Diplotype: Haplogenotype ie pair of phased haplotypes one maternally, one paternally inherited

Slide 7

SNP1 [ A / a ]    SNP2 [ B / b ]

Major Allele Freq: p(A), p(B)

Minor Allele Freq: p(a), p(b)

(How many haplotypes in total?)

Independently segregating SNPs:

LINKAGE EQUILIBRIUM

Haplotype Frequency p(ab) = p(a) x p(b)

Otherwise:

LINKAGE DISEQUILIBRIUM

Haplotype Frequency p(ab) ≠ p(a) x p(b)

Slide 8

Linkage Disequilibrium

• Non-random assortment of alleles at 2 (or more) loci

• The closer the markers, the stronger the LD since recombination will have occurred at a low rate

• Markers co-segregate within and between families

Slide 9

* LINKAGE EQUILIBRIUM *

                          SNP2 Allele
                     B             b
SNP1 Allele    A     p(A)p(B)      p(A)p(b)      p(A)
               a     p(a)p(B)      p(a)p(b)      p(a)
                     p(B)          p(b)

Example (column margin):

p(A)p(B) + p(a)p(B) = p(B){ p(A) + p(a) } = p(B)

Not a Punnett Square!

Slide 10

SNP1 [ A / a ]    SNP2 [ B / b ]

Major Allele Freq: p(A) p(B)

Minor Allele Freq: p(a) p(b)

LINKAGE DISEQUILIBRIUM

Haplotype Frequency p(ab) = p(a) p(b) + D

(sign of D is generally arbitrary, unless comparing D values between populations or studies)

D: Lewontin’s LD Parameter (Lewontin 1960)

Slide 11

* LINKAGE DISEQUILIBRIUM *

                          SNP2 Allele
                     B               b
SNP1 Allele    A     p(A)p(B)+D      p(A)p(b)-D      p(A)
               a     p(a)p(B)-D      p(a)p(b)+D      p(a)
                     p(B)            p(b)

Example (column margin):

p(A)p(B)+D + p(a)p(B)-D = p(B){ p(A)+p(a) } = p(B)
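A minimal sketch of the 2 x 2 haplotype table, assuming the convention p(ab) = p(a)p(b) + D used on these slides. The function name `haplotype_table` and the example allele frequencies (0.2 and 0.3) are illustrative choices. It checks numerically that the margins stay equal to the allele frequencies whatever the value of D.

```python
def haplotype_table(p_a, p_b, D=0.0):
    """Haplotype frequencies under the convention p(ab) = p(a)p(b) + D.

    p_a and p_b are the frequencies of alleles a and b; D = 0 gives
    linkage equilibrium.
    """
    p_A, p_B = 1 - p_a, 1 - p_b
    return {
        "AB": p_A * p_B + D,
        "Ab": p_A * p_b - D,
        "aB": p_a * p_B - D,
        "ab": p_a * p_b + D,
    }

for D in (0.0, 0.1):
    t = haplotype_table(0.2, 0.3, D)
    # The +D and -D terms cancel in every row and column sum.
    col_B = t["AB"] + t["aB"]
    row_a = t["aB"] + t["ab"]
    print(D, round(col_B, 10), round(row_a, 10), round(sum(t.values()), 10))
```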

Slide 12

What is the LD?

        b            B
a       0.16         0.04         p(a)=0.20
A       0.14         0.66         p(A)=0.80
        p(b)=0.30    p(B)=0.70

p(ab) ≠ p(a) p(b), so D ≠ 0

p(ab) = p(a) p(b) + D

0.16 = 0.2 x 0.3 + D

D = 0.1

Since p(ab) = p(a)p(b) + D was used, D is +ve here, but the sign is arbitrary, e.g. can relabel alleles A, B as minor
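The worked example above can be reproduced in a few lines. This is a sketch under the slide's convention p(ab) = p(a)p(b) + D, and the name `ld_D` is hypothetical. It also evaluates the standard equivalent form D = p(AB)p(ab) - p(Ab)p(aB), which is not on the slide but gives the same value.

```python
def ld_D(p_ab, p_a, p_b):
    """Lewontin's D under the convention p(ab) = p(a)p(b) + D."""
    return p_ab - p_a * p_b

# Haplotype frequencies from the 2 x 2 table on this slide.
freqs = {"ab": 0.16, "aB": 0.04, "Ab": 0.14, "AB": 0.66}

p_a = freqs["ab"] + freqs["aB"]  # row margin for allele a
p_b = freqs["ab"] + freqs["Ab"]  # column margin for allele b

D = ld_D(freqs["ab"], p_a, p_b)
print(round(D, 4))

# Equivalent cross-product form, computed from haplotype frequencies only.
D_alt = freqs["AB"] * freqs["ab"] - freqs["Ab"] * freqs["aB"]
print(round(D_alt, 4))
```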

Slide 13

Range of D values (-ve to +ve)

D has a minimum and maximum value that depends on the allele frequencies of the markers

Since haplotype frequencies cannot be -ve:

p(aB) = p(a)p(B) - D ≥ 0, so D ≤ p(a)p(B)

p(Ab) = p(A)p(b) - D ≥ 0, so D ≤ p(A)p(b)

Both must hold, so D ≤ min( p(a)p(B), p(A)p(b) )

p(ab) = p(a)p(b) + D ≥ 0, so D ≥ -p(a)p(b)

p(AB) = p(A)p(B) + D ≥ 0, so D ≥ -p(A)p(B)

Both must hold, so D ≥ max( -p(a)p(b), -p(A)p(B) )

* Similar equations if we had defined p(ab) = p(a)p(b) - D
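A small sketch of these limits, assuming the same sign convention p(ab) = p(a)p(b) + D. The function name `d_bounds` and the example allele frequencies are assumptions for illustration.

```python
def d_bounds(p_a, p_b):
    """Return (D_min, D_max) for the convention p(ab) = p(a)p(b) + D.

    The bounds follow from requiring all four haplotype frequencies
    to be non-negative.
    """
    p_A, p_B = 1 - p_a, 1 - p_b
    d_min = max(-p_a * p_b, -p_A * p_B)
    d_max = min(p_a * p_B, p_A * p_b)
    return d_min, d_max

# The allowed range changes with the allele frequencies.
for p_a, p_b in [(0.2, 0.3), (0.5, 0.5), (0.05, 0.4)]:
    d_min, d_max = d_bounds(p_a, p_b)
    print(p_a, p_b, round(d_min, 4), round(d_max, 4))
```

The widest range, -0.25 to +0.25, occurs when both minor allele frequencies are 0.5, which is why raw D values are hard to compare between marker pairs.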

Slide 14

Limits of D LD Parameter

Limits of D are a function of allele frequencies

Standardize D by rescaling to a proportion of its maximal value for the given allele frequencies (D’):

D’ = D / Dmax

Slide 15

D’ (Lewontin, 1964)

D’ = D / Dmax

Dmax = min( p(A)p(B), p(a)p(b) )   if D < 0

Dmax = min( p(A)p(b), p(a)p(B) )   if D > 0

Again, sign of D’ depends on definition

D’ = 1 or -1 if one of the haplotype frequencies p(AB), p(Ab), p(aB), p(ab) = 0

= Complete LD (i.e. only 3 haplotypes seen)

D’ = 1 or -1 suggests that no recombination has taken place between markers

Beware rare markers - may not have enough power/sample size to detect 4th haplotype
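To make the warning about rare markers concrete, here is a rough sketch of how often a rare fourth haplotype would be missed entirely. It assumes chromosomes are sampled independently from the population, and the function name and the example frequency of 0.001 are hypothetical choices for illustration.

```python
def prob_haplotype_unseen(freq, n_chromosomes):
    """Probability that a haplotype with population frequency `freq`
    is absent from a sample of `n_chromosomes` independent chromosomes."""
    return (1 - freq) ** n_chromosomes

# A haplotype at frequency 0.001 is easily missed in small samples,
# which would make D' look like exactly 1 or -1.
for n in (100, 500, 2000):
    print(n, round(prob_haplotype_unseen(0.001, n), 3))
```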

Slide 16

D’ Interpretation

Example 1:

        b            B
a       0.06         0.14         p(a)=0.20
A       0.24         0.56         p(A)=0.80
        p(b)=0.30    p(B)=0.70

D = 0 ; Dmax undefined

Example 2:

        b            B
a       0.2          0            p(a)=0.20
A       0.1          0.7          p(A)=0.80
        p(b)=0.30    p(B)=0.70

D = Dmax = 0.14 ; D’ = +1

D’ = 1 (perfect LD using D’ measure): no recombination between markers, only 3 haplotypes are seen
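The two examples above can be checked with a short function. This is a sketch assuming the convention p(ab) = p(a)p(b) + D. The name `d_prime` is hypothetical, and returning 0 when D is numerically zero is a choice made here, since the slide treats Dmax as undefined in that case.

```python
def d_prime(p_ab, p_a, p_b, tol=1e-12):
    """D' = D / Dmax, with D defined by p(ab) = p(a)p(b) + D."""
    p_A, p_B = 1 - p_a, 1 - p_b
    D = p_ab - p_a * p_b
    if abs(D) < tol:
        return 0.0  # assumption: report 0 when there is no LD
    if D > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    return D / d_max  # keeps the sign of D

print(round(d_prime(0.06, 0.2, 0.3), 3))  # first table, p(ab) = 0.06
print(round(d_prime(0.20, 0.2, 0.3), 3))  # second table, p(ab) = 0.2
print(round(d_prime(0.16, 0.2, 0.3), 3))  # earlier worked example, D = 0.1
```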

Slide 17

Creation of LD

• Easiest to understand when markers are physically linked

• Creation of LD
– Mutation
– Founder effect
– Admixture
– Inbreeding / non-random mating
– Selection
– Population bottleneck or stratification
– Epistatic interaction

• LD can occur between unlinked markers

• Gametic phase disequilibrium is a more general term

Slide 18

[Figure: haplotypes present at each stage]

SNP1 only (alleles A, a): n=2 haplotypes

SNP1 and SNP2 (A B, A b, a B): n=3 haplotypes

After recombination (A B, A b, a B, a b): n=4 haplotypes

Slide 19

Destruction of LD

• Main force is recombination

• Gene conversion may also act at short distances (~ 100-1,000 bases)

• LD decays over time (generations of interbreeding)

Slide 20

Initial LD between SNP1 - SNP2: $D_0$

Between SNP1 and SNP2: probability recombination occurs = θ

Probability recombination does not occur = 1-θ

After 1 generation, preservation of LD: $D_1 = D_0 (1-\theta)$

After t generations: $D_t = D_0 (1-\theta)^t$

NB: Overly simple model - does not account for allele frequency drift over time
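A minimal sketch of this decay model, with the same caveat that it ignores drift. The function names and the starting value of 0.1 for the initial LD are assumptions for illustration. The second helper solves the decay equation for the number of generations needed to reach a given fraction of the initial LD.

```python
import math

def ld_after(d0, theta, t):
    """LD remaining after t generations: D_t = D_0 * (1 - theta)**t."""
    return d0 * (1 - theta) ** t

def generations_to_fraction(theta, fraction):
    """Generations needed for D_t / D_0 to fall to `fraction` (0 < theta < 1)."""
    return math.log(fraction) / math.log(1 - theta)

# Tightly linked markers (small theta) keep their LD far longer.
for theta in (0.5, 0.1, 0.01, 0.001):
    remaining = ld_after(0.1, theta, 10)
    half_life = generations_to_fraction(theta, 0.5)
    print(theta, round(remaining, 5), round(half_life, 1))
```

With θ = 0.5 (unlinked markers) the LD halves every generation, so even LD between unlinked markers does not disappear immediately.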

Slide 21

$D_t = D_0 (1-\theta)^t$

Slide 22

$r^2$ LD Parameter (Hill & Robertson, 1968)

$r^2 = \frac{D^2}{p(a)\,p(b)\,p(A)\,p(B)}$

• Squared correlation coefficient varies 0 - 1

• Frequency dependent

• Better LD measure for allele correlation between markers - predictive power of SNP1 alleles for those at SNP2

• Used extensively in disease gene or phenotype mapping through association testing
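To see why this quantity is called a squared correlation coefficient, the sketch below codes each haplotype as a pair of 0/1 allele indicators and compares the squared Pearson correlation with the formula. The 100 haplotypes are a hypothetical sample constructed to match the frequencies 0.16, 0.04, 0.14, 0.66 from the earlier worked example, and `pearson_r` is a helper written for this illustration.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Hypothetical sample of 100 haplotypes: counts of (SNP1 allele, SNP2 allele).
counts = {("a", "b"): 16, ("a", "B"): 4, ("A", "b"): 14, ("A", "B"): 66}

x, y = [], []
for (s1, s2), n in counts.items():
    x += [1 if s1 == "a" else 0] * n  # indicator for allele a at SNP1
    y += [1 if s2 == "b" else 0] * n  # indicator for allele b at SNP2

r2_corr = pearson_r(x, y) ** 2

# Same quantity from the formula r^2 = D^2 / (p(a) p(b) p(A) p(B)).
D = 0.16 - 0.2 * 0.3
r2_formula = D ** 2 / (0.2 * 0.3 * 0.8 * 0.7)

print(round(r2_corr, 4), round(r2_formula, 4))
```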

Slide 23

$r^2$ Interpretation

Example 1:

        b            B
a       0.06         0.14         p(a)=0.20
A       0.24         0.56         p(A)=0.80
        p(b)=0.30    p(B)=0.70

D = 0 ; Dmax undefined

$r^2$ = 0

Example 2:

        b            B
a       0.2          0            p(a)=0.20
A       0.1          0.7          p(A)=0.80
        p(b)=0.30    p(B)=0.70

D = Dmax = 0.14 ; D’ = +1

$r^2$ = 0.14/0.24 = 0.58

$r^2$ ≠ 1: correlation is not perfect, even though D’ = 1

$r^2$ = 1 if D’ = 1 and p(a) = p(b) (e.g. both 0.3)

Slide 24

$r^2$ Interpretation

p(a) = 0.3, p(b) = 0.3

Only 2 haplotypes:

$r^2$ = 1: correlation is perfect

D’ = 1 (fewer than 4 haplotypes)

p(a) = p(b) (= 0.3 in this example)

• $r^2$ = 1 when there is perfect correlation between markers and one genotype predicts the other exactly
– Only 2 haplotypes present

• D’ = 1 does not imply $r^2$ = 1

• $r^2$ = 1 requires no recombination AND markers must have identical allele frequency
– SNPs are of similar age

• Corollary
– Low $r^2$ values do not necessarily = high recombination
– Discrepant allele frequencies
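The three situations discussed on the last two slides (no LD, D’ = 1 with unequal allele frequencies, D’ = 1 with equal allele frequencies) can be compared side by side. This is a sketch assuming the convention p(ab) = p(a)p(b) + D, and the function name `r_squared` is hypothetical.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 = D^2 / (p(a) p(b) p(A) p(B)), with D = p(ab) - p(a) p(b)."""
    p_A, p_B = 1 - p_a, 1 - p_b
    D = p_ab - p_a * p_b
    return D ** 2 / (p_a * p_b * p_A * p_B)

# Each case is (p(ab), p(a), p(b)).
cases = {
    "D = 0": (0.06, 0.2, 0.3),
    "D' = 1, p(a) != p(b)": (0.20, 0.2, 0.3),
    "D' = 1, p(a) == p(b)": (0.30, 0.3, 0.3),
}
for label, (p_ab, p_a, p_b) in cases.items():
    print(label, round(r_squared(p_ab, p_a, p_b), 2))
```

Only the last case, where the two markers have identical allele frequencies and just two haplotypes exist, reaches perfect correlation.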

Slide 25

Common Measures of Linkage Disequilibrium

D’: ranges from -1 to 1 (Recombination)

$r^2$: ranges from 0 to 1 (Correlation)

Other LD measures exist, less common usage

Slide 26

Visualizing LD metrics

Slide 27

[Figure: pairwise LD plot for SNP1 to SNP6, shaded by | D’ | on a scale from 0 to 1.0]

Not usually worried about sign of D’

Slide 28

Slide 29

Haploview: TCN2 ($r^2$)

Slide 30

Launched October 2002

http://www.hapmap.org

Slide 31

International HapMap Project

• Initiated Oct 2002

• Collaboration of scientists worldwide

• Goal: describe common patterns of human DNA sequence variation

• Identify LD and haplotype distributions

• Populations of different ancestry (European, African, Asian)
– Identify common haplotypes and population-specific differences

• Has had major impact on:
– Understanding of human population history as reflected in genetic diversity and similarity
– Design and analysis of genetic association studies

Slide 32

HapMap samples

• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)

• 90 individuals (30 trios) of European descent from Utah (CEU)

• 45 Han Chinese individuals from Beijing (CHB)

• 44 Japanese individuals from Tokyo (JPT)

Slide 33

Project feasible because of:

• The availability of the human genome sequence

• Databases of common SNPs (subsequently enriched by HapMap) from which genotyping assays could be designed

• Development of inexpensive, accurate technologies for high-throughput SNP genotyping

• Web-based tools for storing and sharing data

• Frameworks to address associated ethical and cultural issues

Slide 34

HapMap goals

• Define patterns of genetic variation across human genome

• Guide selection of SNPs efficiently to “tag” common variants

• Public release of all data (assays, genotypes)

• Phase I: 1.3 M markers in 269 people
– 1 SNP/5kb (1.3M markers)
– Minor allele frequency (MAF) >5%

• Phase II: +2.8 M markers in 270 people
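The idea of choosing SNPs that "tag" common variants can be sketched with a simple greedy pairwise procedure. This is only an illustration, not the procedure the HapMap project itself used. The function name `greedy_tags`, the $r^2$ threshold of 0.8, and the small $r^2$ matrix are all hypothetical.

```python
def greedy_tags(r2, threshold=0.8):
    """Greedy pairwise tagging.

    Repeatedly pick the SNP that covers the most still-untagged SNPs,
    where SNP i covers SNP j if r2[i][j] >= threshold (and every SNP
    covers itself). Returns the chosen tag SNP indices in pick order.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []

    def coverage(i):
        return {j for j in untagged if j == i or r2[i][j] >= threshold}

    while untagged:
        best = max(range(n), key=lambda i: len(coverage(i)))
        covered = coverage(best)
        tags.append(best)
        untagged -= covered
    return tags

# Hypothetical pairwise r^2 values for 5 SNPs forming two LD clusters.
r2 = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.70, 0.10, 0.00],
    [0.85, 0.70, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.00, 0.00, 0.10, 0.95, 1.00],
]
print(greedy_tags(r2))
```

In this toy matrix two tag SNPs are enough to capture all five at the chosen threshold, which is the saving in genotyping effort that tagging aims for.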

Slide 35

http://www.hapmap.org/

Slide 36

Slide 37

Slide 38

HapMap publications

• The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

• The International HapMap Consortium. The International HapMap Project. Nature 426, 789-796. 2003.

• The International HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genetics 5, 467-475. 2004.

• Thorisson, G.A., Smith, A.V., Krishnan, L., and Stein, L.D. The International HapMap Project Web site. Genome Research, 15:1591-1593. 2005.

Slide 39

ENCODE project

• Aim: To compare the genome-wide resource to a more complete database of common variation—one in which all common SNPs and many rarer ones have been discovered and tested

• Selected a representative collection of ten regions, each 500 kb in length

• Each 500-kb region was sequenced in 48 individuals, and all SNPs in these regions (discovered or in dbSNP) were genotyped in the complete set of 269 DNA samples

Slide 40

Comparison of linkage disequilibrium and recombination for two ENCODE regions

Nature 437, 1299-1320. 2005

Slide 41

LD in Human Populations

Slide 42

Haplotype Blocks

N SNPs = $2^N$ haplotypes possible, i.e. very large diversity possible

But: we do not see the full extent of haplotype diversity in human populations

Extensive LD, especially at short distances, e.g. ~20 kb.

Haplotypes are broken into blocks of markers with high mutual LD separated by recombination hotspots

Non-uniform LD across genome

Slide 43

Haplotype Blocks

Haplotype blocks: at least 80% of observed haplotypes with frequency >= 5% could be grouped into common patterns

Whole Genome Patterns of Common DNA Variation in Three Human Populations, Science 2005, Hinds et al.

Slide 44

Length of LD spans

$r^2$

Slide 45

Example: Large block of LD on chromosome 17

• Cluster of common (frequent) SNPs in high LD

• 518 SNPs, spanning 800 kb

• 25% in EUR, 9% in AFR, missing in CHN

Genes:

• Microtubule-associated protein tau
– Mutations associated with a variety of neurodegenerative disorders

• Gene coding for a protease similar to presenilins
– Mutations result in Alzheimer’s disease

• Gene for corticotropin-releasing hormone receptor
– Immune, endocrine, autonomic, behavioral response to stress

Slide 46

Chromosome 17 LD Region

Prevalent inversion in EUR human population

~25%
