  • 1

    Lectures on Human Genetics
    I. History of Disease Gene Mapping
    II. Statistical Genetics
    III. Genes and Environment

    7 Jan 2014

    Jurg Ott, Ph.D.

    Institute of Psychology, Chinese Academy of Sciences, Beijing

    Rockefeller University, New York, Laboratory of Statistical Genetics
    http://lab.rockefeller.edu/ott/

    Statistical Inference

    Estimation: method of moments, maximum likelihood method

    Hypothesis testing: null and alternative hypothesis, test statistics, significance levels, randomization tests

    J. Ott "Statistical Genetics"

  • 2

    Maximum likelihood (ML) method

    In n meioses, observe k recombinations (graph: k = 1, n = 5). Recombination probability = r (human genetics: θ).

    Observed recombination rate, R = k/n.

    Likelihood, $L(r) = r^k (1 - r)^{n - k}$

    $\log[L(r)] = k \log(r) + (n - k) \log(1 - r)$

    Setting the first derivative, $\frac{d \log[L(r)]}{dr} = \frac{k}{r} - \frac{n - k}{1 - r}$, equal to 0 leads to $\hat{r} = k/n$.

    Thus, sample mean = ML estimate of mean. Testing: Under H0 (r = ½), $2 \ln[L(\hat{r})/L(\tfrac{1}{2})] \sim \chi^2$ with 1 df.
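The ML estimate and the likelihood ratio test for the recombination fraction can be sketched in a few lines of Python. This is a minimal illustration using the slide's values k = 1 and n = 5; the function names are hypothetical, and the p-value uses the identity that the upper tail of a chi-square variable with 1 df equals erfc(sqrt(x/2)).

```python
import math

def log_likelihood(r, k, n):
    """log L(r) = k log(r) + (n - k) log(1 - r), valid for 0 < r < 1."""
    return k * math.log(r) + (n - k) * math.log(1 - r)

def ml_recombination(k, n):
    """Return (r_hat, LR statistic, p-value) for k recombinants in n meioses."""
    r_hat = k / n
    # At the boundaries k = 0 or k = n the maximized likelihood is 1,
    # so its log is 0 and log(0) must be avoided.
    if k == 0 or k == n:
        loglik_hat = 0.0
    else:
        loglik_hat = log_likelihood(r_hat, k, n)
    # Likelihood ratio statistic against H0: r = 1/2.
    lr = max(0.0, 2.0 * (loglik_hat - log_likelihood(0.5, k, n)))
    # Upper tail probability of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return r_hat, lr, p_value

r_hat, lr, p_value = ml_recombination(k=1, n=5)
print(f"r_hat = {r_hat:.3f}, LR statistic = {lr:.3f}, p = {p_value:.3f}")
```

With only five meioses the chi-square approximation is rough, so the p-value here is illustrative rather than exact.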

    Estimating nonpaternity rate
    Exclusion probabilities

    Single marker with two alleles, f = P(A) = 0.4

    [Figure: pedigree with genotype labels A/a and a/a]

    Man is excluded (cannot possibly be the father) if and only if he is A/A, which has probability $f^2$.

    Thus, exclusion probability, t = 0.16.

    For multiple independent markers, total exclusion probability, $T = 1 - \prod_i (1 - t_i)$
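As a small worked example, the combined exclusion probability can be computed as follows. The single-marker value t = f² = 0.16 is from the slide; the set of three markers with identical t is a hypothetical input chosen only for illustration.

```python
def total_exclusion(ts):
    """T = 1 - prod(1 - t_i) for independent markers."""
    prob_not_excluded = 1.0
    for t in ts:
        prob_not_excluded *= (1.0 - t)
    return 1.0 - prob_not_excluded

f = 0.4                      # allele frequency P(A) from the slide
t_single = f ** 2            # exclusion probability for one marker
# Hypothetical: three independent markers, each with t = 0.16.
T = total_exclusion([t_single] * 3)
print(f"t = {t_single:.2f}, T = {T:.4f}")
```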

  • 3

    Estimating nonpaternity rate
    Nonpaternity vs exclusion probability

    p = nonpaternity rate, constant

    $t_i$ = exclusion probabilities

    $p t_i$ = probability that an exclusion is observed

    For n children, k exclusions observed → exclusion rate, E = k/n.

    Estimating nonpaternity rate
    Moment estimate

    If the man is not the father, he is excluded with probability 1 if $t_i = 1$. Sum (observed number of exclusions) = k.

    If $t_i < 1$, he is excluded with probability $p t_i$. Sum (expected number of exclusions) = $\sum_i p t_i$.

    Equating these two quantities leads to the moment estimate, $\hat{p} = k / \sum_i t_i$

  • 4

    Estimating nonpaternity rate
    ML estimate

    Let $f_i$ = probability of occurrence of each man’s status, either “excluded” or “not excluded.”

    Thus, $f_i = p t_i$ if he is excluded, and $f_i = 1 - p t_i$ if he is not. Likelihood = probability of occurrence of the data,

    $L(p) = \prod_i f_i$, and $\log[L(p)] = \sum_i \log(f_i)$. Thus,

    $\log[L(p)] = \sum_{\text{excluded}} \log(p t_i) + \sum_{\text{not excl.}} \log(1 - p t_i) = k \log(p) + \sum_{\text{excluded}} \log(t_i) + \sum_{\text{not excl.}} \log(1 - p t_i)$

    First derivative:

    $\frac{d \log[L(p)]}{dp} = \frac{k}{p} - \sum_{\text{not excl.}} \frac{t_i}{1 - p t_i}$

    Estimates of p
    Sasse et al (1994) Hum Hered 44, 337

    Setting first derivative equal to zero leads to:

    $\hat{p} = k \Big/ \sum_{\text{not excl.}} \frac{t_i}{1 - \hat{p} t_i}$

    Requires iterative solution. Such iterations may converge or diverge. Try!

    General scheme for iteratively finding ML estimates: Gene counting, EM algorithm

    Study in Switzerland:

    Number of children        1607
    Exclusion rate, R         11/1607 = 0.0068
    Moment estimate of p      0.0078
    ML estimate of p          0.0078

    Other populations: 1% - 20%
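The moment estimate and the iterative ML solution can be sketched as below. The individual exclusion probabilities of the Swiss study are not given on the slides, so the `ts` values and the set of excluded men are hypothetical; the iteration is the simple fixed-point scheme suggested by the ML equation, which, as the slide warns, is not guaranteed to converge in general.

```python
def moment_estimate(k, ts):
    """p_hat = k / sum(t_i), summing over all men."""
    return k / sum(ts)

def score(p, t_not_excluded, k):
    """First derivative of log L(p): k/p - sum over non-excluded of t/(1 - p t)."""
    return k / p - sum(t / (1.0 - p * t) for t in t_not_excluded)

def ml_estimate(t_not_excluded, k, p_start, max_iter=200, tol=1e-12):
    """Fixed-point iteration p <- k / sum(t / (1 - p t)) over non-excluded men."""
    p = p_start
    for _ in range(max_iter):
        p_new = k / sum(t / (1.0 - p * t) for t in t_not_excluded)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical data: 300 men with known exclusion probabilities,
# of whom the first 6 are excluded.
ts = [0.80, 0.90, 0.95] * 100
excluded = set(range(6))
k = len(excluded)
t_not_excl = [t for i, t in enumerate(ts) if i not in excluded]

p_mom = moment_estimate(k, ts)
p_ml = ml_estimate(t_not_excl, k, p_start=p_mom)
print(f"moment estimate = {p_mom:.5f}, ML estimate = {p_ml:.5f}")
```

Starting the iteration at the moment estimate is a design choice of this sketch; with small p the two estimates tend to be close, as in the Swiss study.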

  • 5

    Estimating allele frequencies
    Ceppellini, Siniscalco & Smith (1955) Ann Hum Genet 20, 97-115

    Principle of “gene (allele) counting”

    Simple example:

    Two Marker Loci (SNPs)

    Locus 1: Alleles C and A, genotype C/A

    Locus 2: Alleles G and A, genotype G/A

    Haplotype = set of alleles at different loci (inherited in a gamete from one parent)

    [Figure: two chromosomes carrying loci 1 and 2, one with alleles C and G, the other with alleles A and A. One chromosome (cytogenetics) = one haplotype (genetics).]

    Other possible haplotypes: C-A, A-G

  • 6

    Genotypes and Haplotypes

    Locus 1    Locus 2: G/G    Locus 2: G/A    Locus 2: A/A
    C/C        C-G, C-G        C-G, C-A        C-A, C-A
    C/A        C-G, A-G        ?               C-A, A-A
    A/A        A-G, A-G        A-G, A-A        A-A, A-A

    ? = C-G, A-A or C-A, A-G

    Counting Haplotypes

    Genotype counts:

    Locus 1    Locus 2: G/G    Locus 2: G/A    Locus 2: A/A
    C/C        0               1               2
    C/A        0               1               2
    A/A        1               0               1

    Known haplotypes and new counts:

    Haplotype    No.    Freq     New counts
    C-G          1      0.071    1.221
    C-A          7      0.500    7.779
    A-G          2      0.143    2.779
    A-A          4      0.286    4.221
    Total        14     1        16

    Ambiguous (doubly heterozygous) individual:

    Ambiguous         Frequency*)        Rel. freq.    New counts
    1 C-G, 1 A-A      0.071 × 0.286      0.221         0.221 C-G, 0.221 A-A
    1 C-A, 1 A-G      0.500 × 0.143      0.779         0.779 C-A, 0.779 A-G
    Sum               0.092              1

    *) Assumes HWE
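The counting step on this slide can be sketched as one gene-counting (EM) iteration that is then repeated until the frequencies stop changing. The known counts and the single ambiguous individual are taken from the slide; the function name `em_step` and the convergence threshold are assumptions of this sketch. Because the slide works with frequencies rounded to three decimals, its first-iteration counts (for example 1.221 for C-G) differ slightly from the unrounded value 1.222 computed here.

```python
def em_step(freqs, known, n_ambiguous):
    """One gene-counting iteration for two biallelic loci (assumes HWE)."""
    # E-step: split each double heterozygote between its two possible phases
    # in proportion to the products of the current haplotype frequencies.
    cis = freqs["C-G"] * freqs["A-A"]
    trans = freqs["C-A"] * freqs["A-G"]
    w = cis / (cis + trans)
    counts = {h: float(c) for h, c in known.items()}
    counts["C-G"] += n_ambiguous * w
    counts["A-A"] += n_ambiguous * w
    counts["C-A"] += n_ambiguous * (1.0 - w)
    counts["A-G"] += n_ambiguous * (1.0 - w)
    # M-step: new frequencies are the expected counts divided by their total.
    total = sum(counts.values())
    return counts, {h: c / total for h, c in counts.items()}

known = {"C-G": 1, "C-A": 7, "A-G": 2, "A-A": 4}   # from unambiguous genotypes
n_ambiguous = 1                                     # one C/A, G/A individual

# Start from the frequencies of the known haplotypes, as on the slide.
start = {h: c / sum(known.values()) for h, c in known.items()}
first_counts, freqs = em_step(start, known, n_ambiguous)
print("after one iteration:", {h: round(c, 3) for h, c in first_counts.items()})

final_counts = first_counts
for _ in range(1000):
    new_counts, new_freqs = em_step(freqs, known, n_ambiguous)
    change = max(abs(new_freqs[h] - freqs[h]) for h in freqs)
    final_counts, freqs = new_counts, new_freqs
    if change < 1e-12:
        break
print("converged counts:", {h: round(c, 3) for h, c in final_counts.items()})
```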

  • 7

    EM Algorithm

    The iterative procedure shown on the previous slides is known to lead to maximum likelihood estimates.

    EM algorithm: Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc 39B, 1-38.

    EM algorithm generally slower, but more stable, than Newton-Raphson and similar algorithms.

    Note: In practice, start with equal phase probabilities – the two possible pairs of haplotypes for doubly heterozygous individuals are given equal weight.

    Implementations

    EH program (Xie & Ott, Am J Hum Genet, abstr, 1993)

    snphap computer program, David Clayton, Cambridge UK: Estimation of haplotype frequencies by MLE using different starting values. For individuals with multiple phases, genotypes with probability < 0.01 disregarded.

    Assign (infer) haplotypes to individuals using MCMC approach (Gibbs sampling). Assumes a prior distribution (Dirichlet) of haplotype frequencies.

    Phase program: Somewhat better than others (Marchini et al [2006] Am J Hum Genet 78, 437-450), but modify default parameter values! (“p = 0.01”)

  • 8

    Example: LEPR Gene
    Hoehe (2003) Pharmacogenomics 4, 547-70

    In 564 individuals, gene fully sequenced. Found 83 SNPs. Potential number of haplotypes = $2^{83} = 9.7 \times 10^{24}$.

    Most common haplotypes with estimated frequencies:

    11111111111111111112111111211121112211111111111111111111111111111111111111111111111 0.106

    11111111111111111111111111111111111111111111111111111111111111111111111111111111111 0.081

    11111111111111111111111111111121211111111111111111121111211111212211111112111112112 0.078

    11111112121212111111112111211121112111111111111111111111212112111211111112111112111 0.056

    Estimation Results

    Total of 851 haplotypes estimated to be present.

    Of these, 295 with f > 0.000,001 and 556 with f < 0.000,001

    Smallest “real” frequency: n = 564 individuals, 2n = 1128 haps, 1/1128 = 0.000,887
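The arithmetic behind these figures is easy to check. The snippet below is only a restatement of the slide's numbers (83 SNPs, 564 individuals, 295 and 556 haplotypes); the variable names are illustrative.

```python
n_snps = 83
n_individuals = 564

# Each SNP has two alleles, so there are 2^83 potential haplotypes.
potential_haplotypes = 2 ** n_snps

# Each individual carries two haplotypes; one observed copy is the
# smallest frequency a haplotype actually present in the sample can have.
n_haps = 2 * n_individuals
smallest_real_freq = 1 / n_haps

# The two frequency classes on the slide add up to the estimated total.
total_estimated = 295 + 556

print(f"potential haplotypes = {potential_haplotypes:.2e}")
print(f"haplotypes in sample = {n_haps}, smallest real frequency = {smallest_real_freq:.6f}")
print(f"haplotypes estimated to be present = {total_estimated}")
```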

  • 9

    Hap Frequencies > 0.000,001

    [Figure: plot of estimated haplotype frequencies (vertical axis, 0 to 0.12) for haplotypes 1 to 290 (horizontal axis)]

    Enlargement

    Horizontal line: Many values of 0.000,887

    “Real” number of haplotypes 240?

    [Figure: enlargement of the frequency plot for haplotypes 140 to 250, vertical axis from 0.00080 to 0.00090]

  • 10

    Potential Solutions

    Work with assigned/inferred haplotypes

    Not the same as multiplying haplotype frequencies by total number of haplotypes.

    Of the 1128 inferred haps, only 16 have assignment probabilities < 0.50.

    Total of 265 different haplotypes inferred, compared with 240 haplotypes with frequencies > 0.000,887

    Problem: Different assignment schemes are based on different priors → different results.

    Dataset from Beijing
    Assigned haplotypes, partial table

    Number  Hap      cases  prop.   controls  prop.   OR    1/OR  chisquare
    1       GCCIGCA  253    0.4765  608       0.4780  0.99  1.01  0.004
    2       ATADATA  120    0.2260  263       0.2068  1.12  0.89  0.828
    5       ACCIGCA  12     0.0226  60        0.0472  0.47  2.14  5.899
    13      ACADATA  0      0       5         0.0039  0     inf   2.093
    14      GCADATT  0      0       5         0.0039  0     inf   2.093
    15      GCCIGTA  0      0       5         0.0039  0     inf   2.093
    26      ACCDGCA  0      0       1         0.0008  0     inf   0.418
    27      ACCIATT  0      0       1         0.0008  0     inf   0.418
    36      GCCDGCT  2      0.0038  0         0       inf   0     4.796
    37      GCCIGTT  1      0.0019  0         0       inf   0     2.397
    38      GTCDATT  1      0.0019  0         0       inf   0     2.397
    39      GTCIATT  1      0.0019  0         0       inf   0     2.397
    Total            531    1       1272      1                   25.832
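The OR and chisquare columns appear to come from a 2 × 2 table for each haplotype (this haplotype versus all others, cases versus controls); that reading is an assumption, but it reproduces the tabulated values for the rows checked below. The function name is hypothetical.

```python
def haplotype_two_by_two(cases_with, cases_total, controls_with, controls_total):
    """Odds ratio and Pearson chi-square (1 df) for one haplotype vs all others."""
    a = cases_with
    b = cases_total - cases_with
    c = controls_with
    d = controls_total - controls_with
    n = a + b + c + d
    # Odds ratio (a/b) / (c/d); infinite when the denominator is zero.
    if b * c == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2 x 2 table without continuity correction.
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi_square

# Haplotype 5 (ACCIGCA): 12 of 531 case haplotypes, 60 of 1272 control haplotypes.
or_5, chi_5 = haplotype_two_by_two(12, 531, 60, 1272)
# Haplotype 36 (GCCDGCT): 2 of 531 case haplotypes, 0 of 1272 control haplotypes.
or_36, chi_36 = haplotype_two_by_two(2, 531, 0, 1272)
print(f"hap 5:  OR = {or_5:.2f}, chi-square = {chi_5:.3f}")
print(f"hap 36: OR = {or_36}, chi-square = {chi_36:.3f}")
```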

  • 11

    Pooling Haplotypes

    Data                                                            Pearson chi-sq  df  table p  Fisher p  Chi-square  table p
    No pooling of cells                                             53.75           38  0.0466   0.0217    64.24       0.0049
    Cells with 0 in one group and 1 in the other group are merged   53.75           28  0.0024   0.0016    64.24       0.0001
    Cells with 0 in one group are merged                            53.75           21  0.0001

  • 12

    LD across genome

    4-gamete test: Pairs of adjacent SNPs. The more haplotypes, the smaller LD, the larger the recombination intensity

    LEPR gene, sliding window of s SNPs: Max. # haps = $2^s$

    s    2^s
    7    128
    5    32
    3    8

    [Figure: number of haplotypes seen in 564 individuals (vertical axis) against SNP number across gene (horizontal axis); a recombination hot spot is marked]
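The sliding-window count can be sketched as follows, assuming phased haplotypes are available as strings of allele codes (1 and 2, as in the LEPR example). The six short haplotypes below are hypothetical and serve only to show the mechanics; the function name is likewise an assumption.

```python
def window_haplotype_counts(haplotypes, s):
    """Number of distinct haplotypes in each window of s consecutive SNPs."""
    n_snps = len(haplotypes[0])
    counts = []
    for start in range(n_snps - s + 1):
        # Collect the distinct allele strings seen in this window.
        distinct = {h[start:start + s] for h in haplotypes}
        counts.append(len(distinct))
    return counts

# Hypothetical phased haplotypes at 6 SNPs (alleles coded 1 and 2).
haps = ["111111", "111111", "112211", "112212", "121212", "111112"]
s = 3
counts = window_haplotype_counts(haps, s)
print(f"window size s = {s}, maximum possible = {2 ** s}")
print("distinct haplotypes per window:", counts)
```

Windows in which the count approaches the maximum of 2^s would, on the slide's reasoning, point to low LD and a higher recombination intensity.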

    Haplotype Frequencies from Family Data
    Terwilliger & Ott, Handbook, section 23.3

    [Figure: sample pedigree]

    To estimate haplotype frequencies (1) based on founder individuals alone (EH program) and (2) jointly with recombination fraction (ILINK program).

  • 13

    Resulting Estimates

    ILINK results could be different from founder results, even with θ = 0, when some founders are not genotyped but their genotype can be inferred from offspring.


    Errors in Genotyping Data

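    Ott (1977) Clin Genet 12:119-24

    [Figure: tree diagram. A meiosis is a recombinant (R) with probability r or a non-recombinant (N) with probability 1 – r; each outcome is then misclassified with probability p or scored correctly with probability 1 – p, giving the apparent outcomes R’ and N’.]

    • In a phase-known mating, can score offspring as k (apparent) recombinants and (n – k) non-recombinants.

    • P(R’) = r(1 – p) + (1 – r)p = r + p(1 – 2r)

    • P(R’) > P(R): Errors lead to an overestimate of the recombination fraction and increased map length.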
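A short numerical illustration of the error formula: the function below evaluates P(R’) for a true recombination fraction r and error rate p. The values r = 0.1 and p = 0.05 are hypothetical. Note that the inflation p(1 – 2r) is positive only for r < ½ and p > 0.

```python
def apparent_recombination(r, p):
    """P(R') = r(1 - p) + (1 - r)p: apparent recombination probability
    when each meiosis is misclassified with probability p."""
    return r * (1 - p) + (1 - r) * p

r, p = 0.10, 0.05            # hypothetical true value and error rate
r_apparent = apparent_recombination(r, p)
print(f"true r = {r:.3f}, apparent r = {r_apparent:.3f}")

# The two algebraic forms on the slide agree.
assert abs(r_apparent - (r + p * (1 - 2 * r))) < 1e-12
```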

  • 14

    Simple Error Models for SNPs
    Keats, Sherman & Ott (1990) Cytogenet Cell Genet 55:387
    Lincoln & Lander (1992) Genomics 14:604

    The TDT
    Spielman, McGinnis, Ewens (1993) Am J Hum Genet 52, 506

    Focus on heterozygous parents

    Test whether B is passed on to child 50% of the time

    Null hypothesis: No linkage. Allows using multiple affected offspring.

    Powerful only in presence of association.
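The slide does not spell out the test statistic. A minimal sketch, using the usual McNemar-type form of the TDT, (b – c)² / (b + c) with 1 df, is given below; b and c are the numbers of heterozygous parents who did and did not transmit allele B to an affected child, and the counts used are hypothetical.

```python
import math

def tdt(b, c):
    """TDT statistic and p-value.

    b: heterozygous parents transmitting allele B to an affected child
    c: heterozygous parents transmitting the other allele
    """
    chi_square = (b - c) ** 2 / (b + c)
    # Upper tail probability of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Hypothetical counts: B transmitted 60 times, not transmitted 40 times.
chi_square, p_value = tdt(b=60, c=40)
print(f"TDT chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```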

  • 15

    Increase of false-positive rate in TDT due to genotype errors
    Heath (1998) Am J Hum Genet 63 (suppl):A292

    TDTae Builds Errors into Analysis
    Gordon et al. (2001) Am J Hum Genet 69:371

  • 16

    Interpreting R1 and R2

    Strategy: Run TDTae under each of the three inheritance models and collect results.

    Plausible values of R1 and R2

    Biologically reasonable values in triangle A-B-C

    [Figure: triangle with vertices A, B and C]

  • 17

    Genome-wide: TDTae
    Gordon et al (2004) Eur J Hum Genet 12, 752-761

    89 trio families, 1,140,419 SNPs. After QC (MAF > 0.05, call rate > 96%): 762,867 SNPs

    Run plink to find Mendel errors

    Display number N of errors by family

    Table: 20 families with largest numbers of errors

    Initial analyses with plink
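What counts as a Mendel error in a trio can be illustrated with a small check. This is not plink's implementation, only a sketch for autosomal SNPs with complete genotypes; the function names and the four example trios are hypothetical.

```python
from itertools import product

def is_mendel_consistent(father, mother, child):
    """True if the child's genotype can be formed from one paternal
    and one maternal allele (autosomal SNP, no missing data)."""
    child_sorted = tuple(sorted(child))
    return any(tuple(sorted((f, m))) == child_sorted
               for f, m in product(father, mother))

def count_mendel_errors(trio_genotypes):
    """Number of SNPs at which a trio shows a Mendelian inconsistency."""
    return sum(not is_mendel_consistent(f, m, c) for f, m, c in trio_genotypes)

# Hypothetical genotypes (father, mother, child) at four SNPs of one trio.
trio = [
    (("A", "A"), ("A", "G"), ("A", "G")),   # consistent
    (("A", "A"), ("A", "A"), ("A", "G")),   # error: G has no parental source
    (("G", "G"), ("A", "G"), ("A", "A")),   # error: father cannot transmit A
    (("A", "G"), ("A", "G"), ("G", "G")),   # consistent
]
n_errors = count_mendel_errors(trio)
print(f"Mendel errors in this trio: {n_errors}")
```

Summing such counts per family gives the kind of ranking by number N of errors that the slide describes.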

  • 18

    Results

    AGRE Families, autism

    ~400,000 SNPs. 695 families ranked by number of Mendel errors