
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel

  • Slide 1
  • Realistic evolutionary models Marjolijn Elsinga & Lars Hemel
  • Slide 2
  • Realistic evolutionary models. Contents: models with different rates at different sites; models which allow gaps; evaluating different models; break; probabilistic interpretation of parsimony; maximum likelihood distances.
  • Slide 3
  • Unrealistic assumptions: (1) the same rate of evolution at each site in the substitution matrix, whereas in reality the structure of proteins and the base pairing of RNA result in different rates; (2) ungapped alignments, which discard the useful information given by the pattern of deletions and insertions.
  • Slide 4
  • Different rates in the matrix: maximum likelihood with independent sites; the data are the sequences x^j for j = 1, …, n, and the likelihood is a product over the sites of the alignment.
  • Slide 5
  • Different rates in the matrix (2): introduce a site-dependent rate variable r_u, which scales the amount of evolution (the edge lengths) at site u.
  • Slide 6
  • Different rates in the matrix (3): we don't know r_u, so we use a prior. Yang [1993] suggests a gamma distribution g(r; α, α), with mean 1 and variance 1/α.
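Written out as formulas (a sketch in standard notation, not necessarily the slides' exact equations: x_u is the data at site u of the alignment, T the tree, t the edge lengths, and g(r; α, α) the gamma prior above):

```latex
\begin{align*}
P(x \mid T, t) &= \prod_{u=1}^{N} P(x_u \mid T, t)
      && \text{independent sites}\\
P(x_u \mid T, t, r_u) &= P(x_u \mid T, r_u t)
      && \text{the rate } r_u \text{ scales every edge length at site } u\\
P(x_u \mid T, t) &= \int_0^{\infty} g(r; \alpha, \alpha)\, P(x_u \mid T, r t)\, dr
      && \text{integrate out the unknown rate}
\end{align*}
```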
  • Slide 7
  • Problem: the number of terms grows exponentially with the number of sequences, which is computationally slow. Solution: approximation. Replace the integral by a discrete sum, subdivide the domain into m intervals, and let r_k denote the mean of the gamma distribution in the k-th interval.
  • Slide 8
  • Solution: Yang [1993] found that m = 3 or 4 already gives a good approximation, with only m times as much computation as for non-varying sites.
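As a concrete illustration of the discretisation (my own sketch, not code from the presentation), the category means r_k for m equal-probability intervals of the Gamma(α, α) prior can be computed with SciPy's incomplete-gamma functions:

```python
import numpy as np
from scipy.special import gammainc, gammaincinv

def discrete_gamma_rates(alpha, m):
    """Mean rate r_k of each of m equal-probability categories of a
    Gamma(alpha, alpha) prior (mean 1, variance 1/alpha)."""
    # Category boundaries = quantiles at 0, 1/m, ..., 1 of Gamma(shape=alpha, rate=alpha).
    bounds = gammaincinv(alpha, np.arange(m + 1) / m) / alpha
    # Mass of r * g(r; alpha, alpha) over each slice, via the identity
    # int_a^b r g(r; alpha, alpha) dr = F_{alpha+1}(b) - F_{alpha+1}(a),
    # where F_{alpha+1} is the CDF of Gamma(shape=alpha+1, rate=alpha).
    cdf = gammainc(alpha + 1, alpha * bounds)
    cdf[-1] = 1.0                       # the last boundary is infinite
    return m * np.diff(cdf)             # divide each slice mean by its probability 1/m

# Example: four categories for alpha = 0.5; the category rates average to 1.
print(discrete_gamma_rates(0.5, 4))
```

The per-site likelihood is then approximated by the average (1/m) Σ_k P(x_u | T, r_k t), which is why the cost is only m times that of the constant-rate model.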
  • Slide 9
  • Evolutionary models with gaps (1). Idea 1: introduce '_' as an extra character of the alphabet of K residues and replace the (K x K) matrix with a (K+1) x (K+1) matrix. Drawback: there is no way to assign a lower cost to a gap character that follows another gap; gap positions are treated independently.
  • Slide 10
  • Evolutionary models with gaps (2) Idea 2: Allison, Wallace & Yee [1992] introduce delete and insertion states to ensure affine-type gaps Drawback: computationally intractable
  • Slide 11
  • Evolutionary models with gaps (3) Idea 3: Thorne, Kishino & Felsenstein [1992] use fragment substitution to get a degree of biological plausibility Drawback: usable for only two sequences
  • Slide 12
  • Finally: find a way to use affine-type gap penalties in a computationally reasonable way. Mitchison & Durbin [1995] made a tree HMM, which uses a profile-HMM architecture and treats paths through the model as objects that undergo evolutionary change.
  • Slide 13
  • Assumptions needed again: we will use an architecture simpler than that of the profile HMM of Krogh et al. [1994]: it has only match and delete states. Match state: M_k; delete state: D_k, where k is the position in the model.
  • Slide 14
  • Tree HMM with gaps (1): sequence y is the ancestor of sequence x. Both sequences are aligned to the model, so both follow a prescribed path through the model.
  • Slide 15
  • Tree HMM with gaps (2): x emits residue x_i at M_k and y emits residue y_j at M_k. The probability of the substitution y_j → x_i is P(x_i | y_j, t).
  • Slide 16
  • Tree HMM with gaps (3): what if x follows a different path than y? x: M_k → D_{k+1} (= MD); y: M_k → M_{k+1} (= MM). The probability of this transition substitution is P(MD | MM, t).
  • Slide 17
  • Tree HMM with gaps (4): x: D_{k+1} → M_{k+2} (= DM); y: M_{k+1} → M_{k+2} (= MM). We assume that the choice between DD and DM is controlled by a mutational process that operates independently of y.
  • Slide 18
  • Substitution matrix: the probabilities of such transitions on the path of x are given by priors: D_{k+1} → M_{k+2} has probability q_DM.
  • Slide 19
  • How it works. At position k: q_{y_j} · P(x_i | y_j, t). Transition k → k+1: q_MM · P(MD | MM, t). Transition k+1 → k+2: q_MM · q_DM.
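To make slides 14-19 concrete, here is a toy Python sketch (my own illustration, not Mitchison & Durbin's implementation). It scores a descendant's path and emissions against its ancestor's, multiplying a prior for every choice the ancestor makes with a conditional substitution probability for the descendant's choice where the two are aligned, and a prior where they are not. The data structures (q_res, p_sub, q_move, p_move) and the handling of positions where the ancestor is in a delete state are my own assumptions:

```python
import math

def joint_log_prob(y_path, x_path, y_res, x_res, q_res, p_sub, q_move, p_move):
    """Joint log probability of ancestor y and descendant x in a match/delete tree HMM.

    y_path, x_path : lists of 'M' or 'D', one entry per model position k
    y_res, x_res   : residues emitted at match positions (None where deleted)
    q_res[a]       : prior probability of residue a
    p_sub[a][b]    : P(b | a, t), residue substitution probability
    q_move['MD']   : prior probability of a path transition (q_MD, ...)
    p_move['MM']['MD'] : P(MD | MM, t), path-transition substitution probability
    """
    logp = 0.0
    # Emissions, position by position.
    for k, (ys, xs) in enumerate(zip(y_path, x_path)):
        if ys == 'M':
            logp += math.log(q_res[y_res[k]])                 # prior for the ancestor's residue
            if xs == 'M':
                logp += math.log(p_sub[y_res[k]][x_res[k]])   # substitution y_j -> x_i (slide 15)
        elif xs == 'M':
            logp += math.log(q_res[x_res[k]])                 # assumed: ancestor deleted, so use a prior
    # Transitions between consecutive positions.
    for k in range(len(y_path) - 1):
        y_move = y_path[k] + y_path[k + 1]                    # e.g. 'MM'
        x_move = x_path[k] + x_path[k + 1]                    # e.g. 'MD'
        logp += math.log(q_move[y_move])                      # prior for the ancestor's transition
        if y_path[k] == 'M' and x_path[k] == 'M':
            logp += math.log(p_move[y_move][x_move])          # aligned: condition on the ancestor (slide 16)
        else:
            logp += math.log(q_move[x_move])                  # unaligned: independent prior (slides 17-18)
    return logp
```

For the example on slides 16-19 (y: M_k → M_{k+1} → M_{k+2}, x: M_k → D_{k+1} → M_{k+2}) this reproduces exactly the factors q_{y_j} P(x_i | y_j, t), q_MM P(MD | MM, t) and q_MM q_DM listed above.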
  • Slide 20
  • Another example
  • Slide 21
  • Evaluating models: evidence. Comparing models is difficult. Compare the probabilities P(D | M_1) and P(D | M_2), obtained by integrating over all parameters θ of each model, with prior probabilities P(θ).
  • Slide 22
  • Comparing two models: the natural way to compare M_1 and M_2 is to compute the posterior probability of M_1.
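In formulas (a sketch of standard Bayesian model comparison; θ_i denotes the parameters of model M_i and P(M_i) its prior probability):

```latex
\begin{align*}
P(D \mid M_i) &= \int P(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i
      && \text{the evidence for model } M_i\\[4pt]
P(M_1 \mid D) &= \frac{P(D \mid M_1)\, P(M_1)}{P(D \mid M_1)\, P(M_1) + P(D \mid M_2)\, P(M_2)}
      && \text{posterior probability of } M_1
\end{align*}
```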
  • Slide 23
  • Parametric bootstrap. Let L̂_1 be the maximum likelihood of the data D for the model M_1, and let L̂_2 be the maximum likelihood of the data D for the model M_2.
  • Slide 24
  • Parametric bootstrap (2): simulate datasets D_i with the values of the parameters of M_1 that gave the maximum likelihood for D. If the observed difference L̂_2 − L̂_1 exceeds almost all of the corresponding values computed for the simulated D_i, then M_2 captured aspects of the data that M_1 did not mimic, and therefore M_1 is rejected.
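A schematic Python sketch of this procedure (my own illustration; fit_ml and simulate are hypothetical callables standing in for whatever likelihood maximisation and sampling routines are actually used):

```python
import numpy as np

def parametric_bootstrap(D, fit_ml, simulate, n_boot=1000, seed=0):
    """Parametric bootstrap comparison of M1 against M2.

    fit_ml(data, model)                -> (max log-likelihood, ML parameters)  [hypothetical]
    simulate(params, model, size, rng) -> one simulated dataset                [hypothetical]
    """
    rng = np.random.default_rng(seed)
    logL1, theta1 = fit_ml(D, "M1")
    logL2, _ = fit_ml(D, "M2")
    observed = logL2 - logL1                          # how much better M2 fits the real data

    simulated = []
    for _ in range(n_boot):
        Di = simulate(theta1, "M1", len(D), rng)      # data generated under M1 at its ML parameters
        l1, _ = fit_ml(Di, "M1")
        l2, _ = fit_ml(Di, "M2")
        simulated.append(l2 - l1)                     # same statistic on data where M1 is true

    # If the observed statistic exceeds almost all simulated values, M1 is rejected.
    p_value = float(np.mean(np.array(simulated) >= observed))
    return observed, p_value
```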
  • Slide 25
  • Break
  • Slide 26
  • Probabilistic interpretation of various models Lars Hemel
  • Slide 27
  • Overview: review of last week's method (parsimony: assumptions, properties); probabilistic interpretation of parsimony; maximum likelihood distances; example: neighbour joining; more probabilistic interpretations (Sankoff & Cedergren, Hein's affine cost algorithm); conclusion / questions?
  • Slide 28
  • Review Parsimony = Finding a tree which can explain the observed sequences with a minimal number of substitutions
  • Slide 29
  • Parsimony. Remember the following assumptions: sequences are aligned; alignments do not have gaps; each site is treated independently. Furthermore, many families of substitution matrices are multiplicative, P(b | a, t+s) = Σ_c P(c | a, t) P(b | c, s), and reversible, q_a P(b | a, t) = q_b P(a | b, t).
  • Slide 30
  • Parsimony. Basic step: counting the minimal number of changes for one site; the final number of substitutions is obtained by summing over all sites. Weighted parsimony uses different weights for different substitutions.
  • Slide 31
  • Probabilistic interpretation of parsimony. Given a set of substitution probabilities P(b | a), in which we neglect the dependence on the length t, calculate substitution costs S(a, b) = −log P(b | a). Felsenstein [1981] showed that, using these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the likelihood.
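A minimal Python sketch of weighted parsimony for a single site (my own illustration of the standard dynamic-programming recursion, not code from the presentation; the tree encoding as nested tuples is an assumption):

```python
import math

def weighted_parsimony(tree, cost, alphabet):
    """Minimal cost of one site under weighted parsimony (Sankoff-style recursion).

    tree       : a leaf residue (str) or a (left, right) pair of subtrees
    cost[a][b] : substitution cost S(a, b), here S(a, b) = -log P(b | a)
    Returns {a: minimal cost of the subtree if its root is assigned residue a}.
    """
    if isinstance(tree, str):                       # leaf: only the observed residue is allowed
        return {a: (0.0 if a == tree else math.inf) for a in alphabet}
    left = weighted_parsimony(tree[0], cost, alphabet)
    right = weighted_parsimony(tree[1], cost, alphabet)
    return {a: min(cost[a][b] + left[b] for b in alphabet) +
               min(cost[a][b] + right[b] for b in alphabet)
            for a in alphabet}

# Toy example on a two-letter alphabet: cost of the site pattern (A, A, B, B)
# on the tree ((1,2),(3,4)); the minimal value approximates -log(likelihood).
P = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
S = {a: {b: -math.log(P[a][b]) for b in 'AB'} for a in 'AB'}
print(min(weighted_parsimony((('A', 'A'), ('B', 'B')), S, 'AB').values()))
```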
  • Slide 32
  • Probabilistic interpretation of parsimony. The performance of tree-building algorithms can be tested by generating sequences on a known tree by probabilistic sampling and then seeing how often a given algorithm reconstructs that tree correctly. Sampling is done as follows: pick a residue a at the root with probability q_a; accept a substitution to b along the edge down to node i with probability P(b | a, t_i); repeat this recursively down the tree. Sequences of length N are generated by N independent repetitions of this procedure. Maximum likelihood should reconstruct the correct tree for large N.
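A small Python sketch of this sampling procedure (my own illustration; the tree encoding, with each edge carrying its own substitution matrix edge_P[a][b] = P(b | a, t_edge), is an assumption):

```python
import random

def sample_leaves(node, residue, rng):
    """Sample one site's residues at the leaves below `node`, given the residue at `node`.
    A leaf is its name (str); an internal node is a list of (child, edge_P) pairs,
    where edge_P[a][b] = P(b | a, t_edge) for the edge leading down to that child."""
    if isinstance(node, str):
        return {node: residue}
    leaves = {}
    for child, edge_P in node:
        row = edge_P[residue]                                   # substitution probabilities given the parent residue
        b = rng.choices(list(row), weights=list(row.values()))[0]
        leaves.update(sample_leaves(child, b, rng))
    return leaves

def sample_alignment(root, q, N, seed=0):
    """Generate leaf sequences of length N: at every site pick the root residue a with
    probability q[a], then push substitutions independently down every edge."""
    rng = random.Random(seed)
    cols = []
    for _ in range(N):
        a = rng.choices(list(q), weights=list(q.values()))[0]
        cols.append(sample_leaves(root, a, rng))
    return {leaf: ''.join(col[leaf] for col in cols) for leaf in cols[0]}
```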
  • Slide 33
  • Probabilistic interpretation of parsimony. Suppose we have a tree T with edge lengths 0.09, 0.1 and 0.3, and a substitution matrix with substitution probability p = 0.3 on the edges to leaves 1 and 3 and p = 0.1 on the edges to leaves 2 and 4 (figure: four-leaf tree with leaves 1, 2, 4, 3).
  • Slide 34
  • Probabilistic interpretation of parsimony. A tree with n leaves has (2n-5)!! possible unrooted topologies; for n = 4 this gives 3!! = 3 topologies (figure: the three topologies (1,2|3,4), (1,3|2,4) and (1,4|2,3)).
  • Slide 35
  • Probabilistic interpretation of parsimony. Parsimony can construct the wrong tree even for large N. Counts of the three possible topologies reconstructed from samples of N sites:
    Parsimony:          N = 20: 419 / 339 / 242   N = 100: 638 / 204 / 158   N = 500: 904 / 61 / 35   N = 2000: 997 / 3 / 0
    Maximum likelihood: N = 20: 396 / 378 / 224   N = 100: 405 / 515 / 79    N = 500: 404 / 594 / 2   N = 2000: 353 / 646 / 0
  • Slide 36
  • Probabilistic interpretation of parsimony. Consider the following example: a tree with residues A, A, B, B at the leaves 1, 2, 3 and 4.
  • Slide 37
  • Probabilistic interpretation of parsimony. With parsimony the number of substitutions is counted for each topology: one topology requires 2 substitutions, the other only 1. Parsimony therefore constructs the tree requiring 1 substitution more often than the tree requiring 2.
  • Slide 38
  • Maximum Likelihood distances. Suppose a tree T, edge lengths t_1, t_2, … and sampled sequences x^1, …, x^n at the leaves. We'll try to compute the distance d_{ij} between two leaf sequences x^i and x^j.
  • Slide 39
  • Maximum Likelihood distances: by multiplicativity, as sketched in the derivation below.
  • Slide 40
  • By reversibility and multiplicativity
  • Slide 41
  • Maximum Likelihood distances
  • Slide 42
  • ML distances between leaf sequences are close to additive, given a large amount of data.
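The derivation that slides 39-42 walk through is, as far as it can be reconstructed here, the standard one for a reversible, multiplicative model (a is the unknown common ancestor of leaves i and j, joined to them by edges of lengths t_i and t_j; x^i_u denotes the residue of sequence i at site u):

```latex
\begin{align*}
P(x^i_u, x^j_u \mid t_i, t_j)
  &= \sum_a q_a\, P(x^i_u \mid a, t_i)\, P(x^j_u \mid a, t_j)
      && \text{sum over the ancestral residue}\\
  &= q_{x^i_u} \sum_a P(a \mid x^i_u, t_i)\, P(x^j_u \mid a, t_j)
      && \text{reversibility: } q_a P(b \mid a, t) = q_b P(a \mid b, t)\\
  &= q_{x^i_u}\, P\!\left(x^j_u \mid x^i_u,\; t_i + t_j\right)
      && \text{multiplicativity}
\end{align*}
```

The pairwise likelihood therefore depends only on the sum t_i + t_j, so the maximum likelihood distance d_{ij} estimates the total edge length on the path between leaves i and j, which is why ML distances are close to additive when there is enough data.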
  • Slide 43
  • Example: Neighbour joining (figure: tree with nodes i, j, k and m).
  • Slide 44
  • Use Maximum Likelihood distances: suppose we have a multiplicative, reversible model, plenty of data, and the underlying probabilistic model is correct; then neighbour joining will reconstruct any tree correctly.
  • Slide 45
  • Example: Neighbour joining using ML distances. It constructs the correct tree where parsimony failed. Counts of the three possible topologies reconstructed from samples of N sites: N = 20: 477 / 301 / 222; N = 100: 635 / 231 / 134; N = 500: 896 / 85 / 19; N = 2000: 997 / 5 / 0.
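A compact Python sketch of the neighbour-joining step itself (my own illustration of the standard algorithm, not code from the presentation); in practice the input distances would be the ML distances d_{ij}, replaced here by a small hypothetical additive example:

```python
import itertools

def neighbour_joining(names, dist):
    """Neighbour joining from pairwise distances dist[(a, b)].
    Returns a list of edges (node, node, branch length); internal nodes get fresh names."""
    d = dict(dist)
    d.update({(b, a): v for (a, b), v in dist.items()})        # make the matrix symmetric
    nodes = list(names)
    edges, fresh = [], itertools.count(1)

    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[(i, k)] for k in nodes if k != i) for i in nodes}
        # Join the pair minimising the NJ criterion (n - 2) * d_ij - r_i - r_j.
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        u = f"node{next(fresh)}"
        li = 0.5 * d[(i, j)] + (r[i] - r[j]) / (2 * (n - 2))   # branch length from i to the new node
        edges += [(i, u, li), (j, u, d[(i, j)] - li)]
        for k in nodes:
            if k not in (i, j):                                # distances from the new node
                d[(u, k)] = d[(k, u)] = 0.5 * (d[(i, k)] + d[(j, k)] - d[(i, j)])
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    a, b = nodes
    edges.append((a, b, d[(a, b)]))                            # connect the last two nodes
    return edges

# Hypothetical additive distances from the unrooted tree ((1:0.1, 2:0.2):0.5, 3:0.3, 4:0.2).
dist = {('1', '2'): 0.3, ('1', '3'): 0.9, ('1', '4'): 0.8,
        ('2', '3'): 1.0, ('2', '4'): 0.9, ('3', '4'): 0.5}
print(neighbour_joining(['1', '2', '3', '4'], dist))
```

On this additive example the returned edges recover the generating tree exactly; ML distances estimated from data are only approximately additive, but, as the counts above indicate, neighbour joining still finds the correct topology.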
  • Slide 46
  • More probabilistic interpretations. Sankoff & Cedergren: simultaneously aligning sequences and finding their phylogeny, using a character substitution model. It becomes probabilistic when the scores are interpreted as log probabilities and probabilities are summed instead of maximized (Allison, Wallace & Yee [1992]); but, like the original S&C method, it is not practical for most problems.
  • Slide 47
  • More probabilistic interpretations. Hein's affine cost algorithm: simultaneously aligning sequences and finding their phylogeny, using affine gap penalties. It likewise becomes probabilistic when the scores are interpreted as log probabilities and probabilities are summed instead of maximized.
