
Bayesian Estimators of Time to Most Recent Common Ancestry

Ecology and Evolutionary Biology

Adjunct Appointments
Molecular and Cellular Biology
Plant Sciences
Epidemiology & Biostatistics
Animal Sciences

Bruce Walsh

[email protected]


Definitions

MRCA - Most Recent Common Ancestor

TMRCA - Time to Most Recent Common Ancestor

Question: Given molecular marker information from a pair of individuals, what is the estimated time back to their most recent common ancestor?

With even a small number of highly polymorphic autosomal markers, it is trivial to assess MRCAs of zero (subject/biological sample) and one (parent-offspring) generations.


Problems with Autosomal Markers

Often we are very interested in MRCAs that are modest (5-10 generations) or large (100’s to 10,000’s of generations)

Unlinked autosomal markers simply don’t work over these time scales.

Reason: IBD probabilities for individuals sharing an MRCA 5 or more generations ago are extremely small and hence very hard to estimate (need a VERY large number of markers).


MRCA-I vs. MRCA-G

We need to distinguish between the MRCA for a pair of individuals (MRCA-I) and the MRCA for a particular genetic marker G (MRCA-G).

MRCA-G varies between any two individuals over recombination units.

For example, we could easily have for a pair of relatives
MRCA (mtDNA) = 180 generations
MRCA (Y) = 350 generations
MRCA (one -globulin allele) = 90 generations
MRCA (other -globulin allele) = 400 generations


MRCA-G > MRCA-I

[Figure: diagram with the labels MRCA-I, "lost", and MRCA-G for two different markers.]


mtDNA and Y Chromosomes

So how can we accurately estimate TMRCA for a modest to large number of generations?

Answer: Use a set of completely linked markers

With autosomes, unlinked markers assort each generation, leaving only a small amount of IBD information on each marker, which we must then multiply together. IBD information decays on the order of 1/2 each generation.

With completely linked marker loci, information on IBD does not assort away via recombination. IBD information decay is on the order of the mutation rate.


Y chromosome microsatellite mutation rates - I

Estimate of u     Source                   Reference
0.0028            Y chromosome             Kayser et al. 2000
0.0021            Y chromosome             Heyer et al. 1997
0.001 - 0.0021    Autosomal chromosomes    Wong & Weber 1993; Brinkmann 1998

Estimates of the human mutation rate in microsatellites are fairly consistent over both the Y and the autosomes


Basic Structure of the Problem

What is the probability that the two marker alleles at a haploid locus from two related individuals agree, given that their MRCA was t generations ago?

Phrased another way, what is their probability of identity in state (IBS), given they are identical by descent (IBD), when their TMRCA is t generations?


Infinite Alleles Model

The first step in answering this question is to assume a particular mutational model

Our (initial) assumption will be the infinite alleles model (IAM)

The key assumption of this model (originally due to Kimura and Crow, 1964) is that each new mutation gives rise to a new allele.

The IAM was the first population-genetics model to attempt to formally incorporate the structure of DNA into a model


Key: Under the infinite alleles model, two alleles that are identical in state and are also IBD have not experienced any mutations since their MRCA.

Let q(t) = Probability two alleles with an MRCA t generations ago are identical in state

If u = per generation mutation rate, then

$$q(t) = (1-u)^{2t}$$

[Figure: the MRCA at the top with two descending lineages, one to A and one to B, each labeled $(1-u)^t$.]

Pr(No mutation from MRCA->A) = $(1-u)^t$

Pr(No mutation from MRCA->B) = $(1-u)^t$
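
As a small numerical illustration of the match probability above, the following sketch evaluates both the exact form (1-u)^(2t) and its exponential approximation exp(-2ut). The function names and the example value u = 0.002 are choices made here for illustration only.

```python
import math


def match_prob_iam(t, u):
    """Exact IAM match probability: no mutation on either lineage for t generations."""
    return (1.0 - u) ** (2 * t)


def match_prob_iam_approx(t, u):
    """Exponential approximation exp(-2ut), accurate when u is small."""
    return math.exp(-2.0 * u * t)


if __name__ == "__main__":
    u = 0.002  # illustrative per-generation mutation rate
    for t in (1, 10, 100, 500):
        print(t, round(match_prob_iam(t, u), 4), round(match_prob_iam_approx(t, u), 4))
```

For small u the two columns agree closely, which is why the exponential form is convenient in the likelihood.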


Building the Likelihood Function for n Loci

For any single marker locus, the probability of IBS given a TMRCA of t generations is

$$q(t) = (1-u)^{2t} \approx e^{-2ut} = e^{-\tau}, \qquad \tau = 2ut$$

The probability that k of n marker loci are IBS is just

$$\Pr(k) = \frac{n!}{(n-k)!\,k!}\, q(t)^k\, [1 - q(t)]^{n-k}$$

a binomial distribution with success parameter q(t)

Likelihood function for t given k of n matches:

$$L(t \mid n, k) = \frac{n!}{(n-k)!\,k!}\, e^{-k\tau} \left(1 - e^{-\tau}\right)^{n-k}$$


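ML Analysis of TMRCA

It would seem that we now have all the pieces in hand for a likelihood analysis of TMRCA given the marker data (k of n matches)

Likelihood function ($\tau = 2ut$):

$$L(t \mid n, k) = \frac{n!}{(n-k)!\,k!}\, e^{-k\tau} \left(1 - e^{-\tau}\right)^{n-k}$$

MLE for t is the solution of $\partial L / \partial t = 0$:

$$\hat{\tau} = 2\hat{t}u = \ln\left(\frac{n}{k}\right) = \ln\left(\frac{1}{\hat{p}}\right) = -\ln(\hat{p})$$

where $\hat{p} = k/n$ is the fraction of matches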


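In particular, the MLE for t becomes

$$\hat{t} = \frac{1}{2u} \ln\left(\frac{n}{k}\right) = -\frac{1}{2u} \ln(\hat{p})$$

Likewise, the precision of this estimator follows from the (negative) inverse of the 2nd derivative of the log-likelihood function evaluated at the MLE,

$$\mathrm{Var}(\hat{t}\,) = -\left( \left. \frac{\partial^2 \ln L(t \mid n, k)}{\partial t^2} \right|_{t=\hat{t}} \right)^{-1} = \frac{1}{4u^2}\, \frac{1}{n} \left( \frac{1-\hat{p}}{\hat{p}} \right)$$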
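
The MLE and its large-sample variance can be evaluated directly. The sketch below is a minimal illustration of the two expressions for k matches out of n loci under the IAM; the function name `tmrca_mle` and the example inputs are assumptions made here, not part of the original slides.

```python
import math


def tmrca_mle(n, k, u):
    """Return (t_hat, var_hat) for k matches at n loci with mutation rate u.

    t_hat   = -ln(p_hat) / (2u),                with p_hat = k / n
    var_hat = (1 - p_hat) / (4 u^2 n p_hat)     (large-sample approximation)
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    p_hat = k / n
    t_hat = -math.log(p_hat) / (2.0 * u)
    var_hat = (1.0 - p_hat) / (4.0 * u ** 2 * n * p_hat)
    return t_hat, var_hat


if __name__ == "__main__":
    t_hat, var_hat = tmrca_mle(n=10, k=8, u=0.002)
    print(round(t_hat, 1), round(math.sqrt(var_hat), 1))
```

With only n = 10 loci the standard error is of the same order as the estimate itself, so the approximation should be read with caution.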


Likewise, we can easily (numerically) find 1-LOD support intervals for t and hence construct approximate 95% confidence intervals for TMRCA

Finally, hypothesis testing, say Ho: MRCA = t0, is easily accomplished by computing -2 times the natural log of the ratio of the value of the likelihood function at t = t0 to its value at the MLE, $t = \hat{t}$

The resulting likelihood-ratio statistic LR is (asymptotically) distributed as a chi-square with one degree of freedom


Trouble in Paradise

The ML machinery seems to have done its job, giving us an estimate, its approximate sampling error, approximate confidence intervals, and a scheme for hypothesis testing.

Hence, all seems well.

Problem: Look at k=n (= complete match at all markers).

MLE (TMRCA) = 0 (independent of n)

Var(MLE) = 0 (ouch!)


With n = k, the value of the likelihood function is

$$L(t) = (1-u)^{2tn} \approx e^{-2tun}$$

What about one-LOD support intervals (approximate 95% CI)?

L has a maximum value of one at the MLE

Hence, any value of t that gives a likelihood value of 0.1 or larger is in the one-LOD support interval

Solving, the one-LOD support interval is from t = 0 to

$$t = \frac{1}{2n} \left[ \frac{-\ln(10)}{\ln(1-u)} \right] \approx \frac{1}{n} \left[ \frac{\ln(10)}{2u} \right]$$

For u = 0.002, the CI is (0, 575/n)
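
The one-LOD upper bound for the complete-match case can be checked numerically. This is a minimal sketch; the function name `one_lod_upper` and its `exact` flag are introduced here for illustration.

```python
import math


def one_lod_upper(n, u, exact=True):
    """Upper end of the one-LOD support interval for t when all n markers match.

    exact=True  solves (1 - u)^(2 t n) = 0.1 for t;
    exact=False uses the approximation ln(10) / (2 u n).
    """
    if exact:
        return -math.log(10.0) / (2.0 * n * math.log(1.0 - u))
    return math.log(10.0) / (2.0 * n * u)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, round(one_lod_upper(n, 0.002), 1), round(one_lod_upper(n, 0.002, exact=False), 1))
```

Both forms give a bound close to 575/n for u = 0.002, and the bound shrinks in proportion to 1/n as more matching markers are scored.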


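With n = k, the likelihood function reduces to

$$L(t) = (1-u)^{2tn} \approx e^{-2tun}$$

[Figure: L(t) versus t for n = 5, 10, and 20 (plots for u = 0.002). MLE(t) = 0 for all values of n. A horizontal line marks 0.1 of the maximum value (1) of the likelihood function; the 1-LOD bounds are t ≈ 115 (n = 5), t ≈ 58 (n = 10), and t ≈ 29 (n = 20).]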


What about Hypothesis testing?

Again recall that for k = n the likelihood at t = t0 is

$$L(t_0) \approx \exp(-2 t_0 u n)$$

Hence, the LR test statistic for Ho: t = t0 is just

$$LR = -2 \ln\left[ \frac{L(t_0)}{L(0)} \right] = -2 \ln\left[ \frac{\exp(-2 t_0 u n)}{1} \right] = 4 t_0 u n$$

Thus the probability for the test that TMRCA = t0 is just

$$\Pr\left( \chi^2_1 > 4 t_0 u n \right)$$
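
The tail probability of a one-degree-of-freedom chi-square can be written with the complementary error function, Pr(chi2_1 > x) = erfc(sqrt(x/2)), so the test needs only the standard library. The sketch below assumes the complete-match case (k = n); the function name is illustrative.

```python
import math


def lr_test_all_match(t0, n, u):
    """LR statistic and asymptotic p-value for Ho: TMRCA = t0 when all n markers match."""
    lr = 4.0 * t0 * u * n
    # Pr(chi-square with 1 df > lr), via the standard normal tail
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


if __name__ == "__main__":
    lr, p = lr_test_all_match(t0=48, n=10, u=0.002)
    print(round(lr, 2), round(p, 3))
```

As with the support interval, this relies on the large-sample chi-square approximation.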


The problem(s) with ML

The expressions developed for the sampling variance, approximate confidence intervals, and hypothesis testing are all large-sample approximations

Problem 1: Here our sample size is the number of markers scored in the two individuals. Not likely to be large.

Problem 2: These expressions are obtained by taking appropriate limits of the likelihood function. If the MLE is exactly at the boundary of the admissible space on the likelihood surface, this limit may not formally exist, and hence the above approximations are incorrect.


The solution?

“Ain’t Too Proud to Bayes” -- Brad Carlin


Why Go Bayesian

An extension of likelihood is Bayesian statistics

Instead of simply computing a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x

$$p(\theta \mid x) = C \cdot l(x \mid \theta)\, p(\theta)$$

• $p(\theta \mid x)$: posterior distribution of θ given x
• $l(x \mid \theta)$: likelihood function for θ given the data x
• $p(\theta)$: prior distribution for θ
• C: the appropriate constant so that the posterior integrates to one

Why Bayesian?

• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods


The Prior on TMRCA

The first step in any Bayesian analysis is the choice of an appropriate prior distribution p(t) -- our thoughts on the distribution of TMRCA in the absence of any of the marker data

Standard approach: Use a flat or uninformative prior, with p(t) = a constant over the admissible range of the parameter. Can cause problems if the likelihood function is unbounded (integrates to infinity)

In our case, population-genetic theory provides the prior: under very general settings, the time to MRCA for a pair of individuals follows a geometric distribution


In particular, for a haploid gene, TMRCA follows a geometric distribution with parameter 1/Ne (mean Ne generations).

Hence, our prior is just

$$p(t) = \lambda (1-\lambda)^t \approx \lambda e^{-\lambda t}, \quad \text{where } \lambda = 1/N_e$$

Hence, we can use an exponential prior with hyperparameter (the parameter fully characterizing the distribution) λ = 1/Ne. The posterior thus becomes

$$p(t \mid k) \propto L(t \mid n, k)\, p(t) = \exp[-(2uk + \lambda)t] \left(1 - \exp[-2ut]\right)^{n-k}$$

Here $\exp[-2ukt]\,(1 - \exp[-2ut])^{n-k}$ is the previous likelihood function (ignoring constants that cancel when we compute the normalizing factor C), and $\exp[-\lambda t]$ is the prior, with prior hyperparameter λ = 1/Ne.


The Normalizing constant

$$p(t \mid k) = \frac{\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}}{I(u, k, n, \lambda)}$$

where

$$I(u, k, n, \lambda) = \int_0^\infty \exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}\, dt$$

I ensures that the posterior distribution integrates to one, and hence is formally a probability distribution


What is the effect of the hyperparameter?

$$p(t \mid k) = \frac{\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}}{I(u, k, n, \lambda)}$$

If 2uk >> λ, then there is essentially no dependence on the actual value of λ chosen.

Hence, if 2Neuk >> 1, there is essentially no dependence on the (hyperparameter) assumptions of the prior.

For a typical microsatellite rate of u = 0.002, this is just Nek >> 250, which is a very weak assumption. For example, with k = 10 matches, Ne >> 25. Even with only 1 match (k = 1), we just require Ne >> 250.


Closed-form Solutions for the Posterior Distribution

Complete analytic solutions for the posterior can be obtained by using a series expansion (of the $(1 - e^{x})^n$ term) to give

$$\begin{aligned}
\exp[-(2uk+\lambda)t]\,\left(1-\exp[-2ut]\right)^{n-k}
&= \exp[-(2uk+\lambda)t] \left( \sum_{i=0}^{n-k} (-1)^i \frac{(n-k)!}{i!\,(n-k-i)!} \exp[-2uti] \right) \\
&= \sum_{i=0}^{n-k} (-1)^i \frac{(n-k)!}{i!\,(n-k-i)!} \exp[-(2u(k+i)+\lambda)t]
\end{aligned}$$

Each term is just $a \cdot e^{bt}$, which is easily integrated


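$$\begin{aligned}
I(u, k, n, \lambda)
&= \sum_{i=0}^{n-k} (-1)^i \frac{(n-k)!}{i!\,(n-k-i)!} \int_0^\infty \exp[-(2u(k+i)+\lambda)t]\, dt \\
&= \sum_{i=0}^{n-k} (-1)^i \frac{(n-k)!}{i!\,(n-k-i)!}\, \frac{1}{2u(k+i)+\lambda} \\
&= \frac{2^{n-k}\,(n-k)!\,u^{n-k}}{\prod_{i=0}^{n-k} [\lambda + 2u(n-i)]}
\end{aligned}$$

With the assumption of a flat prior, λ = 0, this reduces to

$$I(u, k, n, 0) = \frac{(n-k)!\,(k-1)!}{(2u)\, n!}$$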
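
The three forms of the normalizing constant (alternating series, product form, and flat-prior special case) can be cross-checked numerically. The sketch below is illustrative only; the function names and the example values (u = 0.002, k = 8, n = 10, Ne = 5000) are assumptions made here.

```python
import math


def norm_const_series(u, k, n, lam):
    """I(u,k,n,lam) as the alternating sum of the term-by-term integrals."""
    m = n - k
    return sum((-1) ** i * math.comb(m, i) / (2.0 * u * (k + i) + lam)
               for i in range(m + 1))


def norm_const_closed(u, k, n, lam):
    """I(u,k,n,lam) in product form: (2u)^(n-k) (n-k)! / prod_i [lam + 2u(n-i)]."""
    m = n - k
    prod = 1.0
    for i in range(m + 1):
        prod *= lam + 2.0 * u * (n - i)
    return (2.0 * u) ** m * math.factorial(m) / prod


def norm_const_flat(u, k, n):
    """Flat-prior special case (lam = 0): (n-k)! (k-1)! / (2u n!)."""
    return math.factorial(n - k) * math.factorial(k - 1) / (2.0 * u * math.factorial(n))


if __name__ == "__main__":
    u, k, n, lam = 0.002, 8, 10, 1.0 / 5000
    print(norm_const_series(u, k, n, lam), norm_const_closed(u, k, n, lam))
    print(norm_const_closed(u, k, n, 0.0), norm_const_flat(u, k, n))
```

The alternating series can lose precision when n - k is large, so the product form is the safer one to evaluate in practice.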


Hence, the complete analytic solution of the posterior is

$$p(t \mid k, \lambda) = \left( \frac{\prod_{i=0}^{n-k} [\lambda + 2u(n-i)]}{2^{n-k}\,(n-k)!\,u^{n-k}} \right) \frac{\left(1-\exp[-2ut]\right)^{n-k}}{\exp[t(2uk+\lambda)]}$$

Suppose k = n (no mismatches)

In this case, the posterior is simply an exponential distribution with rate 2un + λ (mean 1/(2un + λ)),

$$p(t \mid k=n) = (\lambda + 2nu) \exp[-(2un+\lambda)t]$$


Analysis of n = k case

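Mean TMRCA and its standard deviation:

$$\mu_t = \sigma_t = \frac{1}{\lambda + 2nu} \simeq \frac{1}{2nu}$$

Cumulative probability:

$$\Pr(t \le T) = \int_0^T p(t \mid k=n)\, dt = 1 - \exp(-(2un+\lambda)T)$$

In particular, the time $T_\alpha$ satisfying $P(t < T_\alpha) = \alpha$ is

$$T_\alpha = \frac{-\ln(1-\alpha)}{2un + \lambda}$$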


For a flat prior (λ = 0), the 95% (one-sided) credible interval is thus given by -ln(0.05)/(2nu) ≈ 1.50/(nu)

Hence, under a Bayesian analysis for u = 0.002, the 95% upper credible limit is given by ≈ 749/n

Recall that the one-LOD support interval (approximate 95% CI) under an ML analysis is ≈ 575/n

The ML solution’s asymptotic approximation significantly underestimates the true interval relative to the exact analysis under a full Bayesian model
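
The comparison between the Bayesian credible limit and the one-LOD bound in the complete-match case is easy to reproduce. This is a minimal sketch assuming the exponential posterior with rate 2un + λ; the function name `credible_upper` is introduced here for illustration.

```python
import math


def credible_upper(alpha, n, u, lam=0.0):
    """Time T with Pr(t < T) = alpha under the k = n posterior (rate 2un + lam)."""
    return -math.log(1.0 - alpha) / (2.0 * u * n + lam)


if __name__ == "__main__":
    u = 0.002
    for n in (10, 20):
        bayes = credible_upper(0.95, n, u)        # flat prior, lam = 0
        one_lod = math.log(10.0) / (2.0 * u * n)  # ML one-LOD bound
        print(n, round(bayes, 1), round(one_lod, 1))
```

Passing a nonzero `lam` (for example 1/5000) shows how little the prior changes the limit once 2un is much larger than λ.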


Why the difference?

Under ML, we plot the likelihood function and look for the 0.1 value

Under a Bayesian analysis, we look at the posterior probability distribution (likelihood adjusted to integrate to one) and find the values that give an area of 0.95

[Figure: Pr(TMRCA < t) versus t. For n = 20, the area to the left of t = 38 is 0.95 (t0.95 = 38); for n = 10, the area to the left of t = 75 is 0.95 (t0.95 = 75).]


Sample Posteriors for u = 0.002

[Figure: five panels of the posterior p(t | k) versus time t to MRCA. Posteriors for n = 10 (curves for k = 5 to 10), n = 20 (k = 15 to 20), n = 50 (k = 45 to 50), n = 100 (k = 95 to 100), and n = 100 (k = 89 to 94).]


Key points

• By using the appropriate number of markers we can get accurate estimates for TMRCA for even just a few generations. 20-50 markers will do.

• By using markers on a non-recombining chromosomal section, we can estimate TMRCA over much, much longer time scales than with unlinked autosomal markers

• Hence, we have a fairly large window of resolution for TMRCA when using a modest number of completely linked markers.


Extensions I: Different Mutation Rates

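Let marker locus k have mutation rate $u_k$.

Code the observations as $x_k = 1$ if a match, otherwise code $x_k = 0$

The posterior becomes:

$$p(t \mid x) \propto \exp\left[ -t \left( \lambda + 2 \sum_{k=1}^{n} u_k x_k \right) \right] \prod_{k=1}^{n} \left[ 1 - e^{-2t u_k} \right]^{1 - x_k}$$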
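
A sketch of how the locus-specific posterior might be evaluated: the function below returns the log of the unnormalized posterior, with matching loci contributing -2 u_k t and mismatching loci contributing ln(1 - exp(-2 u_k t)). The function name, argument layout, and example rates are hypothetical.

```python
import math


def log_post_unnorm(t, rates, matches, lam):
    """Log of the unnormalized posterior for t > 0 with locus-specific mutation rates.

    rates[k]   : mutation rate u_k of locus k
    matches[k] : 1 if the two haplotypes match at locus k, else 0
    lam        : prior hyperparameter (1/Ne); use 0.0 for a flat prior
    """
    total = -lam * t
    for u_k, x_k in zip(rates, matches):
        if x_k:
            total += -2.0 * u_k * t  # match: no mutation on either lineage
        else:
            total += math.log1p(-math.exp(-2.0 * u_k * t))  # mismatch term
    return total


if __name__ == "__main__":
    rates = [0.002, 0.002, 0.004, 0.001]
    matches = [1, 1, 0, 1]
    for t in (10, 50, 200):
        print(t, round(log_post_unnorm(t, rates, matches, lam=1.0 / 5000), 3))
```

Working on the log scale avoids underflow when many loci are scored; the normalizing constant can then be obtained by numerical integration.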


Stepwise Mutation Model (SMM)

The infinite alleles model (IAM) is not especially realistic with microsatellite data, unless the fraction of matches is very high.

Microsatellite allelic variants are scored by their number of repeat units. Hence, two “matching” alleles can actually hide multiple mutations (and hence more time to the MRCA)

[Figure: two alleles that end up in the same state after Mutation 1 and Mutation 2. Under IAM, this is scored as a match, and hence no mutations. In reality, there are two mutations.]


Y chromosome microsatellite mutation rates - II

The SMM model is an attempt to correct for multiple hits by fully accounting for the mutational structure.

Good fit to array sizes in natural populations when assuming the symmetric single-step model
• Equal probability of a one-step move up or down

In direct studies of (Y chromosome) microsatellites, 35 of 37 detected mutations in pedigrees were single-step; the other 2 were two-step


SMM0 model -- match/no match under SMM

The simplest implementation of the SMM model is to replace the match probabilities q(t) under the IAM model with those for the SMM model.

This simply codes the marker loci as match / no match

We refer to this as the SMM0 model


Formally, the SMM model assumes the following transition probabilities

$$\Pr(X(t+1) = i-1 \mid X(t) = i) = \Pr(X(t+1) = i+1 \mid X(t) = i) = \frac{u}{2}$$

$$\Pr(X(t+1) = i \mid X(t) = i) = 1 - u$$

$$\Pr(|X(t+1) - X(t)| \ge 2 \mid X(t) = i) = 0$$

Note that two alleles can match only if they have experienced an even number of mutations in total between them. In such cases, the match probabilities become

$$\Pr(\text{match} \mid 2M \text{ moves}) = \frac{1}{2^{2M}} \binom{2M}{M} = \frac{1}{2^{2M}}\, \frac{(2M)!}{(M!)^2}$$


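$$\Pr(\text{match} \mid 2M \text{ moves}) = \frac{1}{2^{2M}} \binom{2M}{M} = \frac{1}{2^{2M}}\, \frac{(2M)!}{(M!)^2}$$

Number of mutations    Prob(Match)
2                      0.500
4                      0.375
6                      0.313
8                      0.273
10                     0.246

$$\begin{aligned}
q(t) &= \sum_{M=0}^{\infty} \Pr(\text{match} \mid 2M \text{ moves}) \Pr(2M \text{ moves} \mid t) \\
&= \sum_{M=0}^{\infty} \left( \frac{1}{2^{2M}}\, \frac{(2M)!}{(M!)^2} \right) \left( \frac{(2ut)^{2M}}{(2M)!} \right) \exp(-2tu) \\
&= \exp(-2tu) \left( \sum_{M=0}^{\infty} \frac{(ut)^{2M}}{(M!)^2} \right)
\end{aligned}$$

The sum is a Bessel function: $\sum_{k=0}^{\infty} \frac{x^{2k}}{(k!)^2} = I_0(2x)$, the zero-order modified Type I Bessel function

Hence,

$$q(\tau) = \exp(-\tau)\, I_0(\tau)$$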
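
The match probabilities in the table and the Bessel-function form of q can be reproduced with a few lines of standard-library code. This is a sketch only: `bessel_i0` sums the power series directly with a fixed number of terms (adequate for the moderate τ values used here), and all function names are illustrative.

```python
import math


def match_given_moves(m):
    """Pr(match | 2m moves) = C(2m, m) / 2^(2m) under the symmetric single-step model."""
    return math.comb(2 * m, m) / 4 ** m


def bessel_i0(x, terms=60):
    """Zero-order modified Bessel function via its series sum_k (x/2)^(2k) / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def match_prob_smm(t, u):
    """SMM match probability q(tau) = exp(-tau) I0(tau), with tau = 2ut."""
    tau = 2.0 * u * t
    return math.exp(-tau) * bessel_i0(tau)


if __name__ == "__main__":
    for m in range(1, 6):
        print(2 * m, round(match_given_moves(m), 3))
    for t in (50, 250, 1000):
        print(t, round(match_prob_smm(t, 0.002), 3), round(math.exp(-2 * 0.002 * t), 3))
```

The second loop prints the SMM match probability next to the IAM value exp(-τ); the SMM value is always at least as large because back-mutations can restore a match.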


[Figure: two panels of q(t) versus τ = 2μt (μ is the mutation rate u), comparing the infinite-alleles and stepwise models, one over 0 ≤ τ ≤ 1 and one over 0 ≤ τ ≤ 20.]


Under the SMM model, the prior hyperparameter can now become important.

This is the case when the number n of markers is small and/or k/n is not very close to one

Why? Under the prior, TMRCA is constrained by a geometric distribution with parameter 1/Ne. Under the IAM model, for most values this still allows much more time than the likelihood function predicts from the marker data

Under the SMM model, the likelihood alone predicts a much longer value, so that the constraining effect of the initial geometric prior can come into play


[Figure: Pr(TMRCA < t) versus time t for n = 5, k = 3, u = 0.02. Curves: IAM (both flat prior and Ne = 5000); SMM0, Ne = 5000; SMM0, flat prior; prior with Ne = 5000.]


An Exact Treatment: SMME

With a little work we can show that the probability two sites differ by j steps is just

$$q^{(j)}(\tau) = 2 \exp(-\tau)\, I_j(\tau) \quad \text{for } j \ge 1$$

where $I_j$ is the jth-order modified Type I Bessel function

The resulting likelihood thus becomes

$$L(t \mid n_0, \cdots, n_k) = \frac{n!}{n_0!\, n_1! \cdots n_k!} \prod_{j=0}^{k} \left[ q^{(j)}(2ut) \right]^{n_j}$$

where $n_j$ is the number of sites that differ by j (observed) steps


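With this likelihood, the resulting posterior becomes

$$p(t \mid n_0, \cdots, n_k) \propto \prod_{j=0}^{k} \left[ q^{(j)}(2ut) \right]^{n_j} e^{-\lambda t}$$

This rearranges to give the general posterior under the Exact SMM model (SMME) as

$$p(t \mid n_0, \cdots, n_k) = \frac{e^{-(\lambda + 2un)t} \prod_{j=0}^{k} \left[ I_j(2ut) \right]^{n_j}}{\int_0^\infty e^{-(\lambda + 2un)t} \prod_{j=0}^{k} \left[ I_j(2ut) \right]^{n_j}\, dt}$$

where $n_0$ is the number of exact matches and $n_k$ is the number of k-step differences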
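
One way the SMME posterior might be evaluated numerically is sketched below. Several choices here are assumptions for illustration rather than the method used for the slides: the Bessel functions are summed from their power series, the normalizing integral is truncated at a finite `t_max` and approximated with the trapezoid rule, and all function names are hypothetical.

```python
import math


def bessel_i(j, x, tol=1e-15, max_terms=500):
    """Modified Bessel function I_j(x) from its series sum_m (x/2)^(2m+j) / (m! (m+j)!)."""
    half = x / 2.0
    term = half ** j / math.factorial(j)
    total = 0.0
    for m in range(max_terms):
        total += term
        if term <= tol * total:
            break
        term *= half * half / ((m + 1) * (m + 1 + j))
    return total


def smme_unnorm(t, counts, u, lam):
    """Unnormalized SMME posterior; counts[j] = number of loci differing by j steps."""
    n = sum(counts)
    value = math.exp(-(lam + 2.0 * u * n) * t)
    for j, n_j in enumerate(counts):
        value *= bessel_i(j, 2.0 * u * t) ** n_j
    return value


def smme_posterior(counts, u, lam, t_max, steps=2000):
    """Normalize on [0, t_max] with the trapezoid rule; returns (grid, density)."""
    h = t_max / steps
    grid = [i * h for i in range(steps + 1)]
    raw = [smme_unnorm(t, counts, u, lam) for t in grid]
    area = h * (sum(raw) - 0.5 * (raw[0] + raw[-1]))
    return grid, [r / area for r in raw]


if __name__ == "__main__":
    # Illustrative inputs: 4 exact matches, 1 one-step and 1 two-step difference
    grid, dens = smme_posterior([4, 1, 1], u=0.002, lam=1.0 / 5000, t_max=5000)
    mode_t = grid[max(range(len(dens)), key=dens.__getitem__)]
    print("posterior mode near t =", mode_t)
```

Because the posterior tail can be long under the SMM, `t_max` should be increased until the summaries of interest stop changing.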


Example

Consider comparing haplotypes 1 and 3 from Thomas et al.’s (2000) study on the Lemba and Cohen Y chromosome modal haplotypes. Six markers were used: four match exactly, one differs by one repeat, and the other by two repeats

Hence, n = 6, k = 4 for IAM and SMM0 models

n0 = 4, n1 = 1, n2 = 1, n = 6 under SMME model

Assume Hammer’s value of Ne=5000 for the prior


TMRCA for Lemba and Cohen Y

[Figure: posterior P(t | markers) versus time to MRCA, t, under the IAM, SMM0, and SMME models.]

Model used    Mean     Median    2.5%    97.5%
IAM           152.3    135.4     31.1    371
SMM0          454.7    233.7     40.4    2389
SMME          422.3    286.2     65.1    1631


[Figure: Pr(TMRCA < t) versus time to MRCA, t, under the IAM, SMM0, and SMME models.]


Technology Transfer

Family Tree DNA (ftDNA) -- provides Y chromosome marker kits for genealogical studies

So far, ftDNA has processed over 80,000 such kits

This amounts to a rough gross of around 8 million dollars.

The expressions developed above have direct commercial applications


Forensic applications of the Y

• A not uncommon situation is that the only DNA is from fingernail scrapings.

• The result is a mixture wherein the victim's DNA often overwhelms the DNA of the perpetrator.

• Result: only modest match probability as many autosomal markers cannot be detected

• One solution: use Y chromosome markers. Easily amplified over (female) background


Problem: How do we combine Y match with autosomal match?

NRC 1996 recommendations (autosomal loci)
• Product rule across markers
• Population substructure correction within markers

Prob(Y match)*Prob(autosomal match)

Problem: Y markers may provide information about population substructure membership

For example, a particular haplotype may be restricted to a certain subpopulation, e.g., Native Americans


Correcting for Y substructure

Let y denote the observed Y haplotype and A the multilocus autosomal marker genotype

P(y,A) = P(A | y)*P(y)

Simple approach: knowledge of y indicates membership in a particular subpopulation, so P(A | y) is computed using allele frequencies for that subpopulation.

Suggestion: Multiply freq(y) * max freq(A over subgroups)


A more precise accounting

Suppose two individuals share the same y haplotype. What is their average coancestry, θ?

Balding and Nichols give expressions for autosomal single-locus genotype frequencies given that the population shows structure with coancestry θ.

Second approach: Compute θ from haplotype matching. Use this value in the Balding-Nichols expressions to compute (single-locus) autosomal frequencies.


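Posterior distribution for a match at all n markers with a prior of λ = 1/Ne:

$$P(t \mid t \ge k) = \frac{(1-u)^{2nt} \cdot (1-\lambda)^{t-1}}{\sum_{\tau=k}^{\infty} (1-u)^{2n\tau} \cdot (1-\lambda)^{\tau-1}}$$

With a MRCA of t generations, $\theta = (1/2)^{2t+1}$, so

$$E[\theta \mid t \ge k] = \sum_{t=k}^{\infty} \left( \frac{1}{2^{2t+1}} \right) P(t \mid t \ge k)$$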


Typical situation is where we can exclude father-son and paternal half-sibs, k ≥ 2:

$$E[\theta \mid t \ge 2] \simeq \frac{\lambda + 2nu}{24 - 8(\lambda + 2nu)}$$

Typical values, n = 11, u = 1/500

• For λ = 1/5000, E[θ] = 0.00186

• For λ = 1/500, E[θ] = 0.00194

• For λ = 1/50, E[θ] = 0.00272

For these values, unless $p_i$ < 0.01, the Balding-Nichols expressions are essentially HW.


Formal procedure

Estimate P(y) from a database (counting methods, Bayesian estimators)

Compute multilocus autosomal frequencies for each major ethnic group using the product of the single-locus genotype frequencies, computed using group-specific allele frequencies and a θ = 0.002 correction.

Conservative P(y,A) = P(y)*max P(A)