# Realistic evolutionary models Marjolijn Elsinga & Lars Hemel

### Text of Realistic evolutionary models Marjolijn Elsinga & Lars Hemel

• Slide 1
• Realistic evolutionary models Marjolijn Elsinga & Lars Hemel
• Slide 2
• Realistic evolutionary models. Contents: models with different rates at different sites; models which allow gaps; evaluating different models; break; probabilistic interpretation of parsimony; maximum likelihood distances
• Slide 3
• Unrealistic assumptions: (1) The same rate of evolution at each site in the substitution matrix. In reality, the structure of proteins and the base pairing of RNA result in different rates. (2) Ungapped alignments, which discard useful information given by the pattern of deletions and insertions
• Slide 4
• Different rates in matrix: maximum likelihood, sites are independent; $X_j$ for $j = 1, \dots, n$
• Slide 5
• Different rates in matrix (2): introduce a site-dependent variable $r_u$
• Slide 6
• Different rates in matrix (3): we don't know $r_u$, so we use a prior. Yang suggests a gamma distribution $g(r, \alpha, \alpha)$, with mean = 1 and variance = $1/\alpha$
• Slide 7
• Problem: the number of terms grows exponentially with the number of sequences, so the computation is slow. Solution: approximation. Replace the integral by a discrete sum; subdivide the domain into m intervals; let $r_k$ denote the mean of the gamma distribution in the kth interval
• Slide 8
• Solution: Yang found that m = 3 or 4 gives a good approximation. This takes only m times as much computation as for non-varying sites
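
The discrete approximation described on the two slides above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the m intervals are chosen to have equal probability 1/m (the slides only say "m intervals"), and the helper names `reg_lower_gamma`, `gamma_cdf`, `quantile` and `discrete_gamma_rates` are hypothetical. It uses only the standard library, evaluating the incomplete gamma function by its power series.

```python
import math

def reg_lower_gamma(a, x, terms=500):
    """Regularized lower incomplete gamma P(a, x), summed from its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_cdf(r, alpha):
    """CDF of g(r, alpha, alpha): shape alpha, rate alpha, hence mean 1."""
    return reg_lower_gamma(alpha, alpha * r)

def quantile(p, alpha):
    """Invert the CDF by bisection (assumed helper, not from the slides)."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, alpha) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, m):
    """Mean rate r_k inside each of m equal-probability intervals."""
    cuts = [0.0] + [quantile(k / m, alpha) for k in range(1, m)] + [math.inf]
    rates = []
    for a, b in zip(cuts, cuts[1:]):
        # Integral of r * g(r) over [a, b] equals P(alpha+1, alpha*b) - P(alpha+1, alpha*a).
        upper = 1.0 if math.isinf(b) else reg_lower_gamma(alpha + 1, alpha * b)
        lower = reg_lower_gamma(alpha + 1, alpha * a)
        rates.append(m * (upper - lower))
    return rates

rates = discrete_gamma_rates(alpha=0.5, m=4)
```

With these rates, the likelihood of a site would be the average over k of the ordinary site likelihood computed with all edge lengths scaled by $r_k$, which is where the factor of m in computation comes from.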
• Slide 9
• Evolutionary models with gaps (1). Idea 1: introduce "_" as an extra character of the alphabet of K residues and replace the (K x K) matrix with a (K+1) x (K+1) matrix. Drawback: there is no way to assign a lower cost to a gap that follows another gap, because gaps are now independent
• Slide 10
• Evolutionary models with gaps (2). Idea 2: Allison, Wallace & Yee introduce delete and insertion states to ensure affine-type gaps. Drawback: computationally intractable
• Slide 11
• Evolutionary models with gaps (3). Idea 3: Thorne, Kishino & Felsenstein use fragment substitution to get a degree of biological plausibility. Drawback: usable for only two sequences
• Slide 12
• Finally: find a way to use affine-type gap penalties in a computationally reasonable way. Mitchison & Durbin made a tree HMM which uses a profile HMM architecture and treats paths through the model as objects that undergo evolutionary change
• Slide 13
• Assumptions needed again: we will use an architecture much simpler than that of the profile HMM of Krogh et al.: it has only match and delete states. Match state: $M_k$; delete state: $D_k$; k = position in the model
• Slide 14
• Tree HMM with gaps (1): sequence y is an ancestor of sequence x. Both sequences are aligned to the model, so both follow a prescribed path through the model
• Slide 15
• Tree HMM with gaps (2): x emits residue $x_i$ at $M_k$; y emits residue $y_j$ at $M_k$. The probability of the substitution $y_j \to x_i$ is $P(x_i \mid y_j, t)$
• Slide 16
• Tree HMM with gaps (3): what if x takes a different path than y? x: $M_k \to D_{k+1}$ (= MD); y: $M_k \to M_{k+1}$ (= MM); probability $P(MD \mid MM, t)$
• Slide 17
• Tree HMM with gaps (4): x: $D_{k+1} \to M_{k+2}$ (= DM); y: $M_{k+1} \to M_{k+2}$ (= MM). We assume that the choice between DD and DM is controlled by a mutational process that operates independently from y
• Slide 18
• Substitution matrix: the probabilities of transitions of the path of x are given by priors: $D_{k+1} \to M_{k+2}$ has probability $q_{DM}$
• Slide 19
• How it works. At position k: $q_{y_j} P(x_i \mid y_j, t)$. Transition $k \to k+1$: $q_{MM} P(MD \mid MM, t)$. Transition $k+1 \to k+2$: $q_{MM} q_{DM}$
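
The three factors on this slide can be combined in a short sketch. It assumes the contributions multiply to give the joint probability of the two paths over positions k to k+2; every numeric value below is hypothetical and chosen only to make the arithmetic concrete, and the function name `path_pair_probability` is not from the slides.

```python
def path_pair_probability(q_yj, p_sub, q_mm, p_md_given_mm, q_dm):
    """Multiply the per-position and per-transition factors listed on the slide."""
    at_position_k = q_yj * p_sub           # q_{y_j} * P(x_i | y_j, t)
    step_k_to_k1 = q_mm * p_md_given_mm    # y goes MM, x changes to MD
    step_k1_to_k2 = q_mm * q_dm            # y goes MM, x goes DM independently of y
    return at_position_k * step_k_to_k1 * step_k1_to_k2

# Hypothetical values, not taken from the slides.
joint = path_pair_probability(q_yj=0.25, p_sub=0.9, q_mm=0.8,
                              p_md_given_mm=0.1, q_dm=0.6)
```

The last factor uses only priors because, per the preceding slide, x's choice after a delete state is assumed to be independent of y.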
• Slide 20
• Another example
• Slide 21
• Evaluating models: evidence. Comparing models is difficult. Compare the probabilities $P(D \mid M_1)$ and $P(D \mid M_2)$ by integrating over all parameters of each model. Parameters: $\theta$; prior probabilities: $P(\theta)$
• Slide 22
• Comparing two models: the natural way to compare $M_1$ and $M_2$ is to compute the posterior probability of $M_1$
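
The formula itself did not survive in the slide text. As a reminder of what such a comparison looks like, Bayes' rule for two candidate models gives the expression below. The model priors $P(M_1)$ and $P(M_2)$ and the parameter symbol $\theta$ are notation assumed here, not recovered from the slide.

$$
P(M_1 \mid D) = \frac{P(D \mid M_1)\,P(M_1)}{P(D \mid M_1)\,P(M_1) + P(D \mid M_2)\,P(M_2)},
\qquad
P(D \mid M) = \int P(D \mid M, \theta)\,P(\theta \mid M)\,d\theta
$$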
• Slide 23
• Parametric bootstrap: let $\hat{P}_1(D)$ be the maximum likelihood of the data D for the model $M_1$, and let $\hat{P}_2(D)$ be the maximum likelihood of the data D for the model $M_2$
• Slide 24
• Parametric bootstrap (2): simulate datasets $D_i$ with the values of the parameters of $M_1$ that gave the maximum likelihood for D. If $\Delta = \log \hat{P}_2(D) - \log \hat{P}_1(D)$ exceeds almost all values of $\Delta_i$ (the same quantity computed on $D_i$), then $M_2$ captures aspects of the data that $M_1$ did not mimic, and therefore $M_1$ is rejected
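
A toy version of this procedure is sketched below, under assumptions that are not in the slides. The data are substitution counts `(k, n)` for two classes of sites, $M_1$ gives both classes one shared change probability, and $M_2$ gives each class its own. The statistic is the log of the ratio of the two maximum likelihoods, and all function names are hypothetical.

```python
import math
import random

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def loglik(k, n, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)

def delta(data):
    """Return (log max-likelihood of M2 minus that of M1, ML parameter of M1)."""
    (k1, n1), (k2, n2) = data
    shared = (k1 + k2) / (n1 + n2)
    ll1 = loglik(k1, n1, shared) + loglik(k2, n2, shared)
    ll2 = loglik(k1, n1, k1 / n1) + loglik(k2, n2, k2 / n2)
    return ll2 - ll1, shared

def simulate(n1, n2, p, rng):
    """Draw one dataset D_i from M1 with its fitted parameter p."""
    def draw(n):
        return sum(rng.random() < p for _ in range(n))
    return (draw(n1), n1), (draw(n2), n2)

def parametric_bootstrap(data, replicates=200, seed=0):
    rng = random.Random(seed)
    observed, p_hat = delta(data)
    (_, n1), (_, n2) = data
    simulated = [delta(simulate(n1, n2, p_hat, rng))[0] for _ in range(replicates)]
    # Fraction of simulated datasets whose statistic reaches the observed one.
    fraction = sum(d >= observed for d in simulated) / replicates
    return observed, fraction

# Hypothetical counts: 5 of 100 sites changed in one class, 40 of 100 in the other.
observed, fraction = parametric_bootstrap(((5, 100), (40, 100)))
```

A small `fraction` means the observed statistic exceeds almost all simulated ones, which is the condition for rejecting $M_1$.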
• Slide 25
• Break
• Slide 26
• Probabilistic interpretation of various models Lars Hemel
• Slide 27
• Overview: review of last week's method (parsimony: assumptions, properties); probabilistic interpretation of parsimony; maximum likelihood distances; example: neighbour joining; more probabilistic interpretations (Sankoff & Cedergren, Hein's affine cost algorithm); conclusion / questions?
• Slide 28
• Review: parsimony = finding a tree which can explain the observed sequences with a minimal number of substitutions
• Slide 29
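• Parsimony. Remember the following assumptions: sequences are aligned; alignments do not have gaps; each site is treated independently. Furthermore, many substitution-matrix families have these properties. Multiplicativity: $\sum_b P(a \mid b, s)\,P(b \mid c, t) = P(a \mid c, s + t)$. Reversibility: $P(b \mid a, t)\,q_a = P(a \mid b, t)\,q_b$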
• Slide 30
• Parsimony. Basic step: count the minimal number of changes for one site. The total number of substitutions is obtained by summing over all sites. Weighted parsimony uses different weights for different substitutions
• Slide 31
• Probabilistic interpretation of parsimony. Given: a set of substitution probabilities $P(b \mid a)$ in which we neglect the dependence on the length t. Calculate substitution costs $S(a,b) = -\log P(b \mid a)$. Felsenstein showed that, with these substitution costs, the minimal cost at site u for the whole tree T obtained by the weighted parsimony algorithm can be regarded as an approximation to the likelihood
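
To make the relation concrete, the sketch below runs a weighted-parsimony recursion (a Sankoff-style minimum over ancestral residues) with costs $S(a,b) = -\log P(b \mid a)$ and compares it with the sum over all ancestral residues. The two-letter alphabet, the probabilities in `P` and the example tree are hypothetical, and the root prior is left out of both quantities for simplicity.

```python
import math

ALPHABET = "AB"
# Hypothetical substitution probabilities P[(a, b)] = P(b|a), edge length neglected.
P = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.9}
S = {pair: -math.log(prob) for pair, prob in P.items()}

def min_cost(node):
    """Minimal cost of the subtree below node, for each residue a at node."""
    if isinstance(node, str):  # a leaf holds its observed residue
        return {a: 0.0 if a == node else math.inf for a in ALPHABET}
    table = {a: 0.0 for a in ALPHABET}
    for child in node:
        below = min_cost(child)
        for a in ALPHABET:
            table[a] += min(S[a, b] + below[b] for b in ALPHABET)
    return table

def likelihood(node):
    """Same recursion with sums of products instead of minima of costs."""
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    table = {a: 1.0 for a in ALPHABET}
    for child in node:
        below = likelihood(child)
        for a in ALPHABET:
            table[a] *= sum(P[a, b] * below[b] for b in ALPHABET)
    return table

tree = (("A", "A"), ("B", "B"))  # one site, leaves A, A, B, B
best_cost = min(min_cost(tree).values())
total = sum(likelihood(tree).values())
```

Here `exp(-best_cost)` is the probability of the single best assignment of ancestral residues, while `total` sums over all assignments, so the parsimony cost approximates the likelihood by its largest term.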
• Slide 32
• Probabilistic interpretation of parsimony. The performance of tree-building algorithms can be tested by generating data probabilistically from a known tree by sampling, and then seeing how often a given algorithm reconstructs the tree correctly. Sampling is done as follows: pick a residue a at the root with probability $q_a$; accept a substitution to b along the edge down to node i with probability $P(b \mid a, t_i)$; repeat down the tree. Sequences of length N are generated by N independent repetitions of this procedure. Maximum likelihood should reconstruct the correct tree for large N
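
The sampling procedure can be sketched as below. The tree shape, the two-letter alphabet, the uniform root distribution and all change probabilities are hypothetical placeholders, and each edge stores its substitution probabilities directly rather than an edge length.

```python
import random

def flip_matrix(p):
    """Two-letter substitution probabilities P(b|a): change with probability p."""
    return {"A": {"A": 1 - p, "B": p}, "B": {"A": p, "B": 1 - p}}

# Hypothetical rooted tree: (name, P(b|a) on the edge above the node, children).
TREE = ("root", None, [
    ("u", flip_matrix(0.05), [("leaf1", flip_matrix(0.3), []),
                              ("leaf2", flip_matrix(0.1), [])]),
    ("v", flip_matrix(0.05), [("leaf3", flip_matrix(0.3), []),
                              ("leaf4", flip_matrix(0.1), [])]),
])
ROOT_PRIOR = {"A": 0.5, "B": 0.5}  # assumed root distribution

def draw(dist, rng):
    """Sample one residue from a {residue: probability} mapping."""
    residues = list(dist)
    return rng.choices(residues, weights=[dist[r] for r in residues])[0]

def sample_site(node, parent_residue, rng, out):
    """Sample one site down the tree and record the residues at the leaves."""
    name, matrix, children = node
    if matrix is None:
        residue = draw(ROOT_PRIOR, rng)
    else:
        residue = draw(matrix[parent_residue], rng)
    if not children:
        out[name] = residue
    for child in children:
        sample_site(child, residue, rng, out)
    return out

def sample_sequences(tree, n_sites, seed=0):
    """N independent repetitions give leaf sequences of length N."""
    rng = random.Random(seed)
    columns = [sample_site(tree, None, rng, {}) for _ in range(n_sites)]
    return {leaf: "".join(col[leaf] for col in columns) for leaf in columns[0]}

seqs = sample_sequences(TREE, n_sites=20)
```

Feeding such sequences to a tree-building method many times, and counting how often the generating topology comes back, is the kind of test the slide describes.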
• Slide 33
• Probabilistic interpretation of parsimony. Suppose we have a tree T with edge lengths 0.09, 0.1 and 0.3, and a substitution matrix with p = 0.3 for leaves 1 and 3 and p = 0.1 for leaves 2 and 4 [figure: tree with leaves 1, 2, 4, 3]
• Slide 34
• Probabilistic interpretation of parsimony. A set of n leaves has (2n-5)!! unrooted trees, so four leaves have three [figure: the three unrooted trees on leaves 1, 2, 3, 4]
• Slide 35
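• Probabilistic interpretation of parsimony. Parsimony can construct the wrong tree even for large N. Number of times each of the three trees was reconstructed, by sequence length N. Maximum likelihood: N = 20: 419, 339, 242; N = 100: 638, 204, 158; N = 500: 904, 61, 35; N = 2000: 997, 3, 0. Parsimony: N = 20: 396, 378, 224; N = 100: 405, 515, 79; N = 500: 404, 594, 2; N = 2000: 353, 646, 0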
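
Why parsimony can fail here may be checked by exact enumeration rather than simulation. The sketch below rests on an interpretation of the garbled example that is an assumption: a two-letter alphabet, the quoted p values read as per-edge change probabilities (0.3 to leaves 1 and 3, 0.1 to leaves 2 and 4, 0.09 on the internal edge), and a true tree that pairs leaf 1 with leaf 2. It computes the expected fraction of sites whose pattern supports each of the three possible splits.

```python
from itertools import product

def pattern_support(p1, p2, p3, p4, p_inner):
    """Expected fraction of sites supporting each split, true tree ((1,2),(3,4))."""
    probs = (p1, p2, p3, p4, p_inner)
    support = {"12|34": 0.0, "13|24": 0.0, "14|23": 0.0}
    for flips in product((0, 1), repeat=5):
        weight = 1.0
        for flipped, p in zip(flips, probs):
            weight *= p if flipped else 1 - p
        f1, f2, f3, f4, fi = flips
        # Leaf states relative to the ancestor of leaves 1 and 2 (0 = unchanged).
        x1, x2, x3, x4 = f1, f2, fi ^ f3, fi ^ f4
        if x1 == x2 and x3 == x4 and x1 != x3:
            support["12|34"] += weight
        elif x1 == x3 and x2 == x4 and x1 != x2:
            support["13|24"] += weight
        elif x1 == x4 and x2 == x3 and x1 != x2:
            support["14|23"] += weight
    return support

support = pattern_support(0.3, 0.1, 0.3, 0.1, 0.09)
```

Under these assumptions the split that joins the two long edges (leaves 1 and 3) is supported by slightly more sites than the true split, so adding more data only makes parsimony more likely to prefer the wrong tree.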
• Slide 36
• Probabilistic interpretation of parsimony. Suppose the following example: a tree with A, A, B, B at leaves 1, 2, 3 and 4 [figure: tree with leaves A, A, B, B]
• Slide 37
• Probabilistic interpretation of parsimony. With parsimony, the number of substitutions is calculated for each tree [figure: two trees on the leaves A, A, B, B, the left one needing 2 substitutions and the right one needing 1]. Parsimony constructs the right-hand tree with 1 substitution more often than the left-hand tree with 2
• Slide 38
• Maximum Likelihood distances. Suppose a tree T, edge lengths and sampled sequences at the leaves. We'll try to compute the distance between two of the leaf sequences
• Slide 39
• Maximum Likelihood distances: by multiplicativity
• Slide 40
• By reversibility and multiplicativity
• Slide 41
• Maximum Likelihood distances
• Slide 42
• ML distances between leaf sequences are close to additive, given a large amount of data
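
Additivity can be illustrated with a toy model that is an assumption here, not taken from the slides: a symmetric two-letter model in which an edge of length t changes the letter with probability $(1 - e^{-2t})/2$. Under that model the ML distance from a fraction f of differing sites is $-\tfrac{1}{2}\log(1 - 2f)$. The per-edge change probabilities below reuse the values quoted earlier in the talk, placed on a tree that pairs leaf 1 with leaf 2, which is also an assumption.

```python
import math

def ml_distance(f):
    """ML distance under the assumed two-letter model, from differing fraction f."""
    return -0.5 * math.log(1.0 - 2.0 * f)

def compose(*edge_probs):
    """Probability that the two ends of a path differ, given each edge's change probability."""
    differ = 0.0
    for p in edge_probs:
        differ = differ * (1 - p) + p * (1 - differ)
    return differ

# Assumed change probabilities: edges to leaves 1..4 and the internal edge.
p1, p2, p3, p4, p_inner = 0.3, 0.1, 0.3, 0.1, 0.09

# Expected (infinite-data) distances between pairs of leaves.
d12 = ml_distance(compose(p1, p2))
d34 = ml_distance(compose(p3, p4))
d13 = ml_distance(compose(p1, p_inner, p3))
d24 = ml_distance(compose(p2, p_inner, p4))
d14 = ml_distance(compose(p1, p_inner, p4))
d23 = ml_distance(compose(p2, p_inner, p3))
```

With expected fractions the distances are exactly additive along paths, for example `d12` equals `ml_distance(p1) + ml_distance(p2)`, and `d12 + d34` is the smallest of the three pairwise sums. With finite data the fractions are only estimates, so the distances are close to additive rather than exactly so.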
• Slide 43
• Example: Neighbour joining [figure: tree with nodes i, j, k, m]
• Slide 44
• Use Maximum Likelihood distances. Suppose we have a multiplicative, reversible model and plenty of data, and the underlying probabilistic model is correct. Then neighbour joining will construct any tree correctly.
• Slide 45
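• Example: Neighbour joining using ML distances. It constructs the correct tree where parsimony failed. Number of times each of the three trees was reconstructed, by sequence length N: N = 20: 477, 301, 222; N = 100: 635, 231, 134; N = 500: 896, 85, 19; N = 2000: 997, 5, 0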
• Slide 46
• More probabilistic interpretations: Sankoff & Cedergren. Simultaneously aligns sequences and finds their phylogeny, using a character substitution model. It becomes probabilistic when scores are interpreted as log probabilities and the procedure sums instead of maximizing (Allison, Wallace & Yee). But, like the original S&C method, it is not practical for most problems.
• Slide 47
• More probabilistic interpretations: Hein's affine cost algorithm. Simultaneously aligns sequences and finds their phylogeny, using affine gap penalties. Probabilistic when scores are interpreted as log probabilities and if the …