Lectures on Human Genetics - Jurg Ott · Newton-Raphson and similar algorithms. Note: In practice, start with equal phase probabilities – the two possible pairs of haplotypes for

1

Lectures on Human GeneticsI. History of Disease Gene MappingII. Statistical GeneticsIII. Genes and Environment

7 Jan 2014

Jurg Ott, [email protected]

Institute of PsychologyChinese Academy of Sciences, Beijing

Rockefeller University, New YorkLaboratory of Statistical Geneticshttp://lab.rockefeller.edu/ott/

Statistical Inference

Estimation Method of moments Maximum likelihood method

Hypothesis testing Null and alternative hypothesis Test statistics Significance levels Randomization tests

2J. Ott "Statistical Genetics"

2

Maximum likelihood (ML) method

In n meioses, observe krecombinations (graph: k = 1, n = 5). Recombination probability = r(human genetics: θ)

Observed recombination rate, R = k/n.

Likelihood, L(r) = r k (1 – r)n – k

log[L(r)] = k log(r) + (n – k) log(1 – r)

knkrLd )](log[ k /ˆ

J. Ott "Statistical Genetics"

Setting first derivative, , equal to 0 leads to

Thus, sample mean = ML estimate of mean. Testing: Under H0 (r = ½), 2 × ln[L( )/L(½) ~ 2, 1 df

rkn

rk

rrLd

1

)](log[ nkr /ˆ

3

r̂

Estimating nonpaternity rateExclusion probabilities

Single marker with two galleles, f = P(A) = 0.4

Man is excluded (cannot possibly be the father) if and only if he is A/A, which has probability f 2.

Thus, exclusion probabilit t 0 16

A/a

/

4

probability, t = 0.16. For multiple independent

markers, total exclusion probability, T = 1 – Π(1 – ti)

J. Ott "Statistical Genetics"

a/a

3

Estimating nonpaternity rateNonpaternity vs exclusion probability

p = nonpaternity rate, constant

ti = exclusion probabilities

pti = probability that an exclusion is observed

For n children k For n children, kexclusions observed → exclusion rate, E = k/n.

J. Ott "Statistical Genetics" 5

Estimating nonpaternity rateMoment estimate

If the man is not the father, he is excluded with probability 1 if ti = 1. Sum = k.

If ti < 1, he is excluded with probability pti. Sum = Σpti.

Equating these two quantities leads to the moment estimate,


.ˆ

it

kp

4

Estimating nonpaternity rateML estimate

Let fi = probability of occurrence of each man’s status, either “excluded” or “not excluded.”

Thus, fi = pti if he is excluded, and fi = 1 – pti if he is not. Likelihood = probability of occurrence of the data,

L(p) = Πfi, and log[L(p)] = Σlog(fi). Thus,

excluded excl.not

)1log()log()](log[ ii ptptpL

First derivative:


excluded excl.not

)1log()log()log( ii pttpk

)1/(/)](log[excl.not iidpd pttpkpL

Estimates of pSasse et al (1994) Hum Hered 44, 337

Setting first derivative equal to zero leads to:

ˆ tkpzero leads to:

Requires iterative solution exclnot 1 iipt

tp

Such iterations my converge or diverge. Try! General scheme for iteratively finding ML estimates: Gene

counting, EM algorithm Study in Switzerland:

Number of children 1607


Exclusion rate, R 11/1607 = 0.0068 Moment estimate of p 0.0078 ML estimate of p 0.0078

Other populations: 1% - 20%

5

Estimating allele frequenciesCeppellini, Siniscalco & Smith (1955) Ann Hum Genet 20, 97-115

Principle of “gene (allele) counting” Principle of gene (allele) counting Simple example:


Two Marker Loci (SNPs)

Locus 1:Alleles C and A, genotype C/A L 2 All l G d A t G/A Locus 2: Alleles G and A, genotype G/A Haplotype = set of alleles at different

loci (inherited in a gamete from one parent)

C G

1 2

(genetics)haplotypeOneics)(cytogenet chromosome One

C

A

G

A

(genetics)haplotypeOne

Other possible haplotypes:C-A, A-G


6

Genotypes and Haplotypes

Locus 2Locus 1 G/G G/A A/A

C/C C-G, C-G C-G, C-A C-A, C-AC/A C-G, A-G ? C-A, A-AA/A A-G, A-G A-G, A-A A-A, A-A

orA-AG-CC G

G -A A,-C

orA -AG,-C?

C

A

G

A

C A A G11J. Ott "Statistical Genetics"

Counting Haplotypes

Locus 1Locus 2

G/G G/A A/A

Known haplotypes

No FreqNew

countsLocus 1 G/G G/A A/A

C/C 0 1 2

C/A 0 1 2

A/A 1 0 1

No. Freq

C-G 1 0.071

C-A 7 0.500

A-G 2 0.143

A-A 4 0.286

Total 14 1

bi l f

counts

1.221

7.779

2.779

4.221

16*) Assumes HWE

Ambiguous Frequency Rel. freq.1 C-G, 1 A-A 0.071 0.286 0.2211 C-A, 1 A-G 0.500 0.143 0.779

Sum 0.092 1

New counts0.221 C-G, 0.221 A-A0.779 C-A, 0.779 A-G

1 1

or


7

EM Algorithm

Th it ti d h th i lid i The iterative procedure shown on the previous slides is known to lead to maximum likelihood estimates.

EM algorithm: Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc 39B, 1-38.

EM algorithm generally slower, but more stable, than Newton-Raphson and similar algorithms.

Note: In practice, start with equal phase probabilities – the two possible pairs of haplotypes for doubly heterozygous individuals are given equal weight.


Implementations

EH program (Xie & Ott Am J Hum Genet abstr 1993) EH program (Xie & Ott, Am J Hum Genet, abstr, 1993) snphap computer program, David Clayton, Cambridge

UK Estimation of haplotype frequencies by MLE using

different starting values. For individuals with multiple phases, genotypes with probability < 0.01 disregarded.

Assign (infer) haplotypes to individuals using MCMC approach (Gibbs sampling). Assumes a prior distribution

hl f h l f(Dirichlet) of haplotype frequencies. Phase program: Somewhat better then others (Marchini

et al [2006] Am J Hum Genet 78, 437-450), but modify default parameter values! (“p = 0.01”)


8

Example: LEPR GeneHoehe (2003) Pharmacogenomics 4, 547-70

In 564 individuals, gene fully sequenced, g y q Found 83 SNPs Potential number of haplotypes = 283

= 9.7 1024. Most common haplotypes with estimated

frequencies:

11111111111111111112111111211121112211111111111111111111111111111111111111111111111 0.106

11111111111111111111111111111111111111111111111111111111111111111111111111111111111 0.081

11111111111111111111111111111121211111111111111111121111211111212211111112111112112 0.078

11111112121212111111112111211121112111111111111111111111212112111211111112111112111 0.056


Estimation Results

T t l f 851 h l t ti t d t Total of 851 haplotypes estimated to be present.

Of these, 295 with f > 0.000,001 and 556 with f < 0.000,001

Smallest “real” frequency: 564 2 1128 h 1/1128 n = 564 2n = 1128 haps 1/1128 =

0.000,887


9

Hap Frequencies > 0.000,001

0.04

0.06

0.08

0.1

0.12

0

0.02

1 18 35 52 69 86 103

120

137

154

171

188

205

222

239

256

273

290


Enlargement

Horizontal line: Many values of 0.000,887“R l” b f h l t 240?

0.000840.000850.000860.000870.000880.000890.00090

“Real” number of haplotypes 240?

0.000800.000810.000820.00083

140 150 160 170 180 190 200 210 220 230 240 250


10

Potential Solutions

Work with assigned/inferred haplotypesg / p yp Not the same as multiplying haplotype

frequencies by total number of haplotypes. Of the 1128 inferred haps, only 16 have

assignment probabilities < 0.50. Total of 265 different haplotypes inferred,

compared with 240 haplotypes with compared with 240 haplotypes with frequencies > 0.000,887

Problem: Different assignment schemes are based on different priors different results.


Number Hap cases prop. controls prop. OR 1/OR chisquare1 GCCIGCA 253 0 4765 608 0 4780 0 99 1 01 0 004

Dataset from BeijingAssigned haplotypes, partial table

1 GCCIGCA 253 0.4765 608 0.4780 0.99 1.01 0.0042 ATADATA 120 0.2260 263 0.2068 1.12 0.89 0.8285 ACCIGCA 12 0.0226 60 0.0472 0.47 2.14 5.89913 ACADATA 0 0 5 0.0039 0 inf 2.09314 GCADATT 0 0 5 0.0039 0 inf 2.09315 GCCIGTA 0 0 5 0.0039 0 inf 2.09326 ACCDGCA 0 0 1 0.0008 0 inf 0.41827 ACCIATT 0 0 1 0.0008 0 inf 0.41836 GCCDGCT 2 0.0038 0 0 inf 0 4.79637 GCCIGTT 1 0.0019 0 0 inf 0 2.39738 GTCDATT 1 0.0019 0 0 inf 0 2.39739 GTCIATT 1 0.0019 0 0 inf 0 2.397

531 1 1272 1 25.832


11

Pooling Haplotypes

Pearson ChiData

Pearson chi-sq df table p Fisher p

Chi-square table p

No pooling of cells 53.75 38 0.0466 0.0217 64.24 0.0049Cells with 0 in one group and 1 in the

other group are merged 53.75 28 0.0024 0.0016 64.24 0.0001Cells with 0 in one group are merged 53.75 21 0.0001

12

LD across genome

4-gamete test: Pairs of adjacent SNPs. The more haplotypes, the smaller LD the larger the recombination intensitythe smaller LD, the larger the recombination intensity

LEPR gene, sliding window of s SNPs:

s SNPs:Max. # haps = 2s

s 2s

Number of haplo-types seen in 564

Recomb. hot spot

7 1285 323 8

SNP number across gene

indiv-iduals


Haplotype Frequencies from Family DataTerwilliger & Ott, Handbook, section 23.3

S l diSample pedigree

To estimate haplotypefrequencies (1) based on founder individuals alone (EH program) and (2) jointly with recombination fraction (ILINK program).


13

Resulting Estimates

ILINK results could be different from founder results, even with θ = 0, when some founders are not genotyped but their genotype can be inferred from offspring.


Errors in Genotyping Data

Meiosis

Ott (1977) Clin Genet 12:119-24

Meiosis

R N

p 1-p

r 1-r• In a phase-known mating,

can score offspring as k(apparent) recombinants and (n – k) non-recombinants.

• P(R’) = r(1 – p) + (1 – r)p = r + p(1 – 2r)

• P(R’) > P(R): Errors lead to p 1-p

R' R'N' N'an overestimate of the recombination fraction and increased map length.


14

Simple Error Models for SNPsKeats, Sherman & Ott (1990) Cytogenet Cell Genet 55:387Lincoln & Lander (1992) Genomics 14:604


The TDTSpielman, McGinnis, Ewens (1993) Am J Hum Genet 52, 506

Focus on heterozygous parents T t h th B i d t Test whether B is passed on to

child 50% of the time Null hypothesis: No linkage. Allows

using multiple affected offspring. Powerful only in presence of association.


15

Increase of false-positive rate in TDT due to genotype errorsHeath (1998) Am J Hum Genet 63 (suppl):A292


TDTae Builds Errors into AnalysisGordon et al. (2001) Am J Hum Genet 69:371


16

Interpreting R1 and R2

Strategy: Run TDTae under each of the three


Strategy: Run TDTae under each of the three inheritance models and collect results.

Plausible values of R1 and R2B

Biologically reasonable values in triangle A-B-C


AC

17

Genome-wide: TDTaeGordon et al (2004) Eur J Hum Genet 12, 752-761

89 trio families, 1,140,419 SNPs. After QC (MAF > 0.05, call rate > 96%): 762,867 SNPs

Run plink to find Mendel errors Display number N of errors by

family Table: 20 families with largest

numbers of errors


Initial analyses with plink


18

Results


AGRE Families, autism ~400,000 SNPs. 695 families ranked by number of

M d l Mendel errors


Documents

Lectures on Human Genetics - Jurg Ott · Newton-Raphson and similar algorithms. Note: In practice, start with equal phase probabilities – the two possible pairs of haplotypes for