Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
1
Lectures on Human GeneticsI. History of Disease Gene MappingII. Statistical GeneticsIII. Genes and Environment
7 Jan 2014
Jurg Ott, [email protected]
Institute of PsychologyChinese Academy of Sciences, Beijing
Rockefeller University, New YorkLaboratory of Statistical Geneticshttp://lab.rockefeller.edu/ott/
Statistical Inference
Estimation Method of moments Maximum likelihood method
Hypothesis testing Null and alternative hypothesis Test statistics Significance levels Randomization tests
2J. Ott "Statistical Genetics"
2
Maximum likelihood (ML) method
In n meioses, observe krecombinations (graph: k = 1, n = 5). Recombination probability = r(human genetics: θ)
Observed recombination rate, R = k/n.
Likelihood, L(r) = r k (1 – r)n – k
log[L(r)] = k log(r) + (n – k) log(1 – r)
knkrLd )](log[ k /ˆ
J. Ott "Statistical Genetics"
Setting first derivative, , equal to 0 leads to
Thus, sample mean = ML estimate of mean. Testing: Under H0 (r = ½), 2 × ln[L( )/L(½) ~ 2, 1 df
rkn
rk
rrLd
1
)](log[ nkr /ˆ
3
r̂
Estimating nonpaternity rateExclusion probabilities
Single marker with two galleles, f = P(A) = 0.4
Man is excluded (cannot possibly be the father) if and only if he is A/A, which has probability f 2.
Thus, exclusion probabilit t 0 16
A/a
/
4
probability, t = 0.16. For multiple independent
markers, total exclusion probability, T = 1 – Π(1 – ti)
J. Ott "Statistical Genetics"
a/a
3
Estimating nonpaternity rateNonpaternity vs exclusion probability
p = nonpaternity rate, constant
ti = exclusion probabilities
pti = probability that an exclusion is observed
For n children k For n children, kexclusions observed → exclusion rate, E = k/n.
J. Ott "Statistical Genetics" 5
Estimating nonpaternity rateMoment estimate
If the man is not the father, he is excluded with probability 1 if ti = 1. Sum = k.
If ti < 1, he is excluded with probability pti. Sum = Σpti.
Equating these two quantities leads to the moment estimate,
J. Ott "Statistical Genetics" 6
.ˆ
it
kp
4
Estimating nonpaternity rateML estimate
Let fi = probability of occurrence of each man’s status, either “excluded” or “not excluded.”
Thus, fi = pti if he is excluded, and fi = 1 – pti if he is not. Likelihood = probability of occurrence of the data,
L(p) = Πfi, and log[L(p)] = Σlog(fi). Thus,
excluded excl.not
)1log()log()](log[ ii ptptpL
First derivative:
J. Ott "Statistical Genetics" 7
excluded excl.not
)1log()log()log( ii pttpk
)1/(/)](log[excl.not iidpd pttpkpL
Estimates of pSasse et al (1994) Hum Hered 44, 337
Setting first derivative equal to zero leads to:
ˆ tkpzero leads to:
Requires iterative solution exclnot 1 iipt
tp
Such iterations my converge or diverge. Try! General scheme for iteratively finding ML estimates: Gene
counting, EM algorithm Study in Switzerland:
Number of children 1607
J. Ott "Statistical Genetics" 8
Exclusion rate, R 11/1607 = 0.0068 Moment estimate of p 0.0078 ML estimate of p 0.0078
Other populations: 1% - 20%
5
Estimating allele frequenciesCeppellini, Siniscalco & Smith (1955) Ann Hum Genet 20, 97-115
Principle of “gene (allele) counting” Principle of gene (allele) counting Simple example:
9J. Ott "Statistical Genetics"
Two Marker Loci (SNPs)
Locus 1:Alleles C and A, genotype C/A L 2 All l G d A t G/A Locus 2: Alleles G and A, genotype G/A Haplotype = set of alleles at different
loci (inherited in a gamete from one parent)
C G
1 2
(genetics)haplotypeOneics)(cytogenet chromosome One
C
A
G
A
(genetics)haplotypeOne
Other possible haplotypes:C-A, A-G
10J. Ott "Statistical Genetics"
6
Genotypes and Haplotypes
Locus 2Locus 1 G/G G/A A/A
C/C C-G, C-G C-G, C-A C-A, C-AC/A C-G, A-G ? C-A, A-AA/A A-G, A-G A-G, A-A A-A, A-A
orA-AG-CC G
G -A A,-C
orA -AG,-C?
C
A
G
A
C A A G11J. Ott "Statistical Genetics"
Counting Haplotypes
Locus 1Locus 2
G/G G/A A/A
Known haplotypes
No FreqNew
countsLocus 1 G/G G/A A/A
C/C 0 1 2
C/A 0 1 2
A/A 1 0 1
No. Freq
C-G 1 0.071
C-A 7 0.500
A-G 2 0.143
A-A 4 0.286
Total 14 1
bi l f
counts
1.221
7.779
2.779
4.221
16*) Assumes HWE
Ambiguous Frequency Rel. freq.1 C-G, 1 A-A 0.071 0.286 0.2211 C-A, 1 A-G 0.500 0.143 0.779
Sum 0.092 1
New counts0.221 C-G, 0.221 A-A0.779 C-A, 0.779 A-G
1 1
or
12J. Ott "Statistical Genetics"
7
EM Algorithm
Th it ti d h th i lid i The iterative procedure shown on the previous slides is known to lead to maximum likelihood estimates.
EM algorithm: Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc 39B, 1-38.
EM algorithm generally slower, but more stable, than Newton-Raphson and similar algorithms.
Note: In practice, start with equal phase probabilities – the two possible pairs of haplotypes for doubly heterozygous individuals are given equal weight.
13J. Ott "Statistical Genetics"
Implementations
EH program (Xie & Ott Am J Hum Genet abstr 1993) EH program (Xie & Ott, Am J Hum Genet, abstr, 1993) snphap computer program, David Clayton, Cambridge
UK Estimation of haplotype frequencies by MLE using
different starting values. For individuals with multiple phases, genotypes with probability < 0.01 disregarded.
Assign (infer) haplotypes to individuals using MCMC approach (Gibbs sampling). Assumes a prior distribution
hl f h l f(Dirichlet) of haplotype frequencies. Phase program: Somewhat better then others (Marchini
et al [2006] Am J Hum Genet 78, 437-450), but modify default parameter values! (“p = 0.01”)
14J. Ott "Statistical Genetics"
8
Example: LEPR GeneHoehe (2003) Pharmacogenomics 4, 547-70
In 564 individuals, gene fully sequenced, g y q Found 83 SNPs Potential number of haplotypes = 283
= 9.7 1024. Most common haplotypes with estimated
frequencies:
11111111111111111112111111211121112211111111111111111111111111111111111111111111111 0.106
11111111111111111111111111111111111111111111111111111111111111111111111111111111111 0.081
11111111111111111111111111111121211111111111111111121111211111212211111112111112112 0.078
11111112121212111111112111211121112111111111111111111111212112111211111112111112111 0.056
15J. Ott "Statistical Genetics"
Estimation Results
T t l f 851 h l t ti t d t Total of 851 haplotypes estimated to be present.
Of these, 295 with f > 0.000,001 and 556 with f < 0.000,001
Smallest “real” frequency: 564 2 1128 h 1/1128 n = 564 2n = 1128 haps 1/1128 =
0.000,887
16J. Ott "Statistical Genetics"
9
Hap Frequencies > 0.000,001
0.04
0.06
0.08
0.1
0.12
0
0.02
1 18 35 52 69 86 103
120
137
154
171
188
205
222
239
256
273
290
17J. Ott "Statistical Genetics"
Enlargement
Horizontal line: Many values of 0.000,887“R l” b f h l t 240?
0.000840.000850.000860.000870.000880.000890.00090
“Real” number of haplotypes 240?
0.000800.000810.000820.00083
140 150 160 170 180 190 200 210 220 230 240 250
18J. Ott "Statistical Genetics"
10
Potential Solutions
Work with assigned/inferred haplotypesg / p yp Not the same as multiplying haplotype
frequencies by total number of haplotypes. Of the 1128 inferred haps, only 16 have
assignment probabilities < 0.50. Total of 265 different haplotypes inferred,
compared with 240 haplotypes with compared with 240 haplotypes with frequencies > 0.000,887
Problem: Different assignment schemes are based on different priors different results.
19J. Ott "Statistical Genetics"
Number Hap cases prop. controls prop. OR 1/OR chisquare1 GCCIGCA 253 0 4765 608 0 4780 0 99 1 01 0 004
Dataset from BeijingAssigned haplotypes, partial table
1 GCCIGCA 253 0.4765 608 0.4780 0.99 1.01 0.0042 ATADATA 120 0.2260 263 0.2068 1.12 0.89 0.8285 ACCIGCA 12 0.0226 60 0.0472 0.47 2.14 5.89913 ACADATA 0 0 5 0.0039 0 inf 2.09314 GCADATT 0 0 5 0.0039 0 inf 2.09315 GCCIGTA 0 0 5 0.0039 0 inf 2.09326 ACCDGCA 0 0 1 0.0008 0 inf 0.41827 ACCIATT 0 0 1 0.0008 0 inf 0.41836 GCCDGCT 2 0.0038 0 0 inf 0 4.79637 GCCIGTT 1 0.0019 0 0 inf 0 2.39738 GTCDATT 1 0.0019 0 0 inf 0 2.39739 GTCIATT 1 0.0019 0 0 inf 0 2.397
531 1 1272 1 25.832
20J. Ott "Statistical Genetics"
11
Pooling Haplotypes
Pearson ChiData
Pearson chi-sq df table p Fisher p
Chi-square table p
No pooling of cells 53.75 38 0.0466 0.0217 64.24 0.0049Cells with 0 in one group and 1 in the
other group are merged 53.75 28 0.0024 0.0016 64.24 0.0001Cells with 0 in one group are merged 53.75 21 0.0001
12
LD across genome
4-gamete test: Pairs of adjacent SNPs. The more haplotypes, the smaller LD the larger the recombination intensitythe smaller LD, the larger the recombination intensity
LEPR gene, sliding window of s SNPs:
s SNPs:Max. # haps = 2s
s 2s
Number of haplo-types seen in 564
Recomb. hot spot
7 1285 323 8
SNP number across gene
indiv-iduals
23J. Ott "Statistical Genetics"
Haplotype Frequencies from Family DataTerwilliger & Ott, Handbook, section 23.3
S l diSample pedigree
To estimate haplotypefrequencies (1) based on founder individuals alone (EH program) and (2) jointly with recombination fraction (ILINK program).
24J. Ott "Statistical Genetics"
13
Resulting Estimates
ILINK results could be different from founder results, even with θ = 0, when some founders are not genotyped but their genotype can be inferred from offspring.
25J. Ott "Statistical Genetics"
Errors in Genotyping Data
Meiosis
Ott (1977) Clin Genet 12:119-24
Meiosis
R N
p 1-p
r 1-r• In a phase-known mating,
can score offspring as k(apparent) recombinants and (n – k) non-recombinants.
• P(R’) = r(1 – p) + (1 – r)p = r + p(1 – 2r)
• P(R’) > P(R): Errors lead to p 1-p
R' R'N' N'an overestimate of the recombination fraction and increased map length.
26J. Ott "Statistical Genetics"
14
Simple Error Models for SNPsKeats, Sherman & Ott (1990) Cytogenet Cell Genet 55:387Lincoln & Lander (1992) Genomics 14:604
27J. Ott "Statistical Genetics"
The TDTSpielman, McGinnis, Ewens (1993) Am J Hum Genet 52, 506
Focus on heterozygous parents T t h th B i d t Test whether B is passed on to
child 50% of the time Null hypothesis: No linkage. Allows
using multiple affected offspring. Powerful only in presence of association.
J. Ott "Statistical Genetics" 28
15
Increase of false-positive rate in TDT due to genotype errorsHeath (1998) Am J Hum Genet 63 (suppl):A292
29J. Ott "Statistical Genetics"
TDTae Builds Errors into AnalysisGordon et al. (2001) Am J Hum Genet 69:371
30J. Ott "Statistical Genetics"
16
Interpreting R1 and R2
Strategy: Run TDTae under each of the three
J. Ott "Statistical Genetics" 31
Strategy: Run TDTae under each of the three inheritance models and collect results.
Plausible values of R1 and R2B
Biologically reasonable values in triangle A-B-C
J. Ott "Statistical Genetics" 32
AC
17
Genome-wide: TDTaeGordon et al (2004) Eur J Hum Genet 12, 752-761
89 trio families, 1,140,419 SNPs. After QC (MAF > 0.05, call rate > 96%): 762,867 SNPs
Run plink to find Mendel errors Display number N of errors by
family Table: 20 families with largest
numbers of errors
J. Ott "Statistical Genetics" 33
Initial analyses with plink
J. Ott "Statistical Genetics" 34
18
Results
J. Ott "Statistical Genetics" 35
AGRE Families, autism ~400,000 SNPs. 695 families ranked by number of
M d l Mendel errors
36J. Ott "Statistical Genetics"