Parallel Bayesian Phylogenetic Inference
Xizhou Feng
Directed by Dr. Duncan Buell
Department of Computer Science and Engineering
University of South Carolina, Columbia
2002-10-11

Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical results
- Future research

Background
- Darwin: species are related through a history of common descent, and that history can be organized as a tree structure (a phylogeny).
- Modern species are placed at the leaf nodes.
- Ancient species are placed at the internal nodes.
- The time of divergence is described by the lengths of the branches.
- A clade is a group of organisms whose members share homologous features derived from a common ancestor.

Phylogenetic tree
[Figure: a phylogenetic tree with labeled parts: clade; branch; branch length; leaf node (current species); internal node (ancestral species)]

Applications
- Phylogenetic trees are fundamental to understanding evolution and diversity
  - An organizing principle for biological data
  - Central to the comparison of organisms
- Practical examples
  - Resolving the dispute over bacteria-to-human gene transfers (Nature 2001)
  - Tracing routes of infectious disease transmission
  - Identifying new pathogens
  - Phylogenetic distribution of biochemical pathways

Use DNA data for phylogenetic inference

Objectives of phylogenetic inference
[Figure: input and output of phylogenetic inference]
Major objectives include:
- Estimate the tree topology
- Estimate the branch lengths
- Describe the credibility of the result

Phylogenetic inference methods
- Algorithmic methods
  - Define a sequence of specific steps that lead to the determination of a tree
  - e.g. UPGMA (unweighted pair group method using arithmetic averages) and Neighbor-Joining
- Optimality criterion-based methods
  - 1) Define a criterion; 2) search for the tree with the best value
  - Maximum parsimony (minimize the total tree length)
  - Maximum likelihood (maximize the likelihood)
  - Maximum posterior probability (the tree with the highest probability of being the true tree)

Commonly used phylogeny methods
[Diagram, regrouped from its labels]
- Data set: distance matrix; character data
- Algorithm
  - Algorithmic methods: UPGMA, Neighbor-Joining, Fitch-Margoliash
  - Optimization methods: Maximum Parsimony, Maximum Likelihood, Bayesian methods (label: statistically supported)
- Search strategy
  - Exact search: exhaustive, branch & bound
  - Greedy search: stepwise addition, global arrangement, star decomposition
  - Divide & conquer: DCM, HGT, quartet
  - Stochastic search: GA, SA, MCMC

Aspects of phylogenetic methods
- Accuracy: Is the constructed tree the true tree? If not, what percentage of its edges are wrong?
- Complexity: Neighbor-Joining is $O(n^3)$; Maximum Parsimony is provably NP-hard; Maximum Likelihood is conjectured to be NP-hard
- Scalability: The methods work well for small trees; how do they perform on large trees?
- Robustness: If the model, the assumptions, or the data are not exactly correct, how is the result affected?
- Convergence rate: How long must the sequences be to recover the true tree?
- Statistical support: With what probability is the computed tree the true tree?

The computational challenge
- Compute the tree of life (source: http://www.npaci.edu/envision/v16.3/hillis.html)
- More than 1.7 million known species
- The number of possible trees increases exponentially as new species are added
- The complexity of evolution
- Data collection and computational systems

Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical results
- Future research

Bayesian Inference-1
- Both the observed data and the parameters of the models are random variables
- Set up the joint distribution
- When the data D are known, Bayes' theorem gives: posterior probability = (likelihood × prior probability) / unconditional probability of the data
- The parameters are the topology, the branch lengths, and the parameters of the models

Bayesian Inference-2
- P(T|D) can be interpreted as the probability that the tree is correct
- We need to do at least two things:
  - Approximate the posterior probability distribution
  - Evaluate the integral for P(T|D)
- Both can be done via the Markov chain Monte Carlo method
- Given the posterior probability distribution, the marginal probability of T is obtained by integrating over the branch lengths and the model parameters
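The two Bayesian Inference slides can be summarized in formulas. This is a sketch of the standard forms, not a copy of the slides' own equations: T (topology) and D (data) come from the slides, while the symbols $\tau$ (branch lengths), $\theta$ (model parameters), and the sample index s are assumptions introduced here.

```latex
% Bayes' theorem: posterior = likelihood x prior / probability of the data
P(T, \tau, \theta \mid D) =
  \frac{P(D \mid T, \tau, \theta)\, P(T, \tau, \theta)}{P(D)}

% Marginal posterior probability of a topology
P(T \mid D) = \int_{\tau} \int_{\theta} P(T, \tau, \theta \mid D)\, d\theta\, d\tau

% MCMC estimate from n sampled states (T^{(s)}, \tau^{(s)}, \theta^{(s)})
P(T \mid D) \approx \frac{1}{n} \sum_{s=1}^{n} \mathbf{1}\!\left[ T^{(s)} = T \right]
```

The last line is why sampling suffices: the fraction of samples that visit a topology estimates its marginal posterior probability, so the integral never has to be evaluated directly.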

Markov chain Monte Carlo (MCMC)
The basic idea of MCMC is:
- Construct a Markov chain such that
  - the parameters form the state space, and
  - the stationary distribution is the posterior probability distribution of the parameters
- Simulate the chain
- Treat the realization as a sample from the posterior probability distribution
MCMC = sampling + continued search

Markov chain
- A Markov chain is a sequence of random variables $\{X_0, X_1, X_2, \ldots\}$ whose transition kernel $T(X_t, X_{t+1})$ is determined only by the value of $X_t$ ($t \ge 0$).
- Stationary distribution: $\pi(x) = \sum_{x'} \pi(x')\, T(x', x)$ is invariant
- Ergodic property: $p_n(x)$ converges to $\pi(x)$ as $n \to \infty$
- A homogeneous Markov chain is ergodic if $\min_{x, x'} T(x, x') / \pi(x') > 0$
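The invariance and convergence properties on this slide can be checked numerically on a toy chain. The sketch below is only an illustration: the two-state transition matrix and the function names are made up for this example and do not come from the talk.

```python
def step(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def stationary(T, start, iters=200):
    """Approximate the stationary distribution by iterating the chain."""
    dist = list(start)
    for _ in range(iters):
        dist = step(dist, T)
    return dist


# Hypothetical two-state chain; each row sums to 1.
T = [[0.9, 0.1],
     [0.5, 0.5]]

pi_from_state_0 = stationary(T, [1.0, 0.0])
pi_from_state_1 = stationary(T, [0.0, 1.0])
print(pi_from_state_0, pi_from_state_1)
```

Both starting points converge to the same distribution, (5/6, 1/6), which follows from the balance condition 0.1 * pi_0 = 0.5 * pi_1.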

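Metropolis-Hastings algorithm-1
- Cornerstone of all MCMC methods: Metropolis (1953)
- Hastings proposed a generalized version in 1970
- The key point is how to define the acceptance probability:
  - Metropolis: $r(x, x') = \min\{1, \pi(x') / \pi(x)\}$
  - Hastings: $r(x, x') = \min\{1, \frac{\pi(x')\, T(x', x)}{\pi(x)\, T(x, x')}\}$
- The proposal probability $T(x, x')$ can take any form such that [condition not preserved in the transcript]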

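Metropolis-Hastings algorithm-2
1. Initialize $x_0$; set t = 0
2. Repeat:
   1) Sample $x'$ from $T(x_t, \cdot)$
   2) Draw U ~ uniform[0, 1]
   3) Update: set $x_{t+1} = x'$ if U is less than the acceptance probability, otherwise set $x_{t+1} = x_t$; then set t = t + 1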
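The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the talk's implementation: the target is a one-dimensional density given by its log, the proposal is a symmetric Gaussian random walk (so the Hastings ratio reduces to the Metropolis ratio), and the function and parameter names are invented for the example.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, rng=None):
    """Random-walk Metropolis sampler for a 1-D target given by its log density."""
    rng = rng or random.Random()
    x = x0
    samples = []
    for _ in range(n_steps):
        # 1) Sample a candidate from a symmetric proposal centered on x.
        x_new = x + rng.gauss(0.0, step_size)
        # Log acceptance ratio; the proposal terms cancel because it is symmetric.
        log_r = log_target(x_new) - log_target(x)
        # 2) Draw U ~ uniform[0, 1).
        u = rng.random()
        # 3) Accept the candidate with probability min(1, r), else keep x.
        if u < math.exp(min(0.0, log_r)):
            x = x_new
        samples.append(x)
    return samples


# Usage: sample a standard normal, whose log density is -x^2/2 up to a constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000,
                              rng=random.Random(1))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

Working with log densities avoids underflow, which matters for phylogenetic likelihoods that are products over thousands of sites.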

Problems of the M-H algorithm and improvements
- Problems:
  - The mixing rate is slow: small steps give little movement, larger steps give low acceptance
  - The chain can get stuck at local optima
  - The dimension of the state space may vary
- Improvements:
  - Metropolis-coupled MCMC
  - Multipoint MCMC
  - Population-based MCMC
  - Reversible jump MCMC

Metropolis-coupled MCMC (Geyer 1991)
- Run several MCMC chains with different distributions $\pi_i(x)$ (i = 1..m) in parallel
  - $\pi_1(x)$ is used for sampling
  - $\pi_i(x)$ (i = 2..m) are used to improve mixing
  - For example: $\pi_i(x) = \pi(x)^{1/(1+\lambda(i-1))}$
- After each iteration, attempt to swap the states of two chains i and j using a Metropolis-Hastings step with acceptance probability $\min\{1, \frac{\pi_i(x_j)\, \pi_j(x_i)}{\pi_i(x_i)\, \pi_j(x_j)}\}$

Illustration of Metropolis-coupled MCMC
[Figure: four chains with distributions $\pi_1(x)$ ($T_1 = 0$), $\pi_2(x)$ ($T_2 = 2$), $\pi_3(x)$ ($T_3 = 4$), $\pi_4(x)$ ($T_4 = 8$)]
Metropolis-coupled MCMC is also called parallel tempering.
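A small sketch may help show how heated chains and state swaps fit together. Everything specific here is an assumption made for illustration: the bimodal one-dimensional target, the heating schedule 1/(1 + lam * i) with lam = 1, the Gaussian random-walk proposal, and the choice to swap only adjacent chains. The swap ratio is the standard one, $\pi_i(x_j)\pi_j(x_i) / (\pi_i(x_i)\pi_j(x_j))$, written in log form.

```python
import math
import random


def log_target(x):
    """Log density (up to a constant) of an equal mixture of N(3, 1) and N(-3, 1)."""
    a = -0.5 * (x - 3.0) ** 2
    b = -0.5 * (x + 3.0) ** 2
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def mc3(log_pi, n_chains=4, n_iter=20000, lam=1.0, step=1.0, seed=0):
    """Metropolis-coupled MCMC; returns the samples of the cold chain."""
    rng = random.Random(seed)
    # Chain i targets pi(x) ** betas[i]; betas[0] = 1 is the cold chain.
    betas = [1.0 / (1.0 + lam * i) for i in range(n_chains)]
    states = [3.0] * n_chains
    cold_samples = []
    for _ in range(n_iter):
        # Ordinary Metropolis update within every chain.
        for i in range(n_chains):
            prop = states[i] + rng.gauss(0.0, step)
            log_r = betas[i] * (log_pi(prop) - log_pi(states[i]))
            if rng.random() < math.exp(min(0.0, log_r)):
                states[i] = prop
        # Attempt to swap the states of one randomly chosen adjacent pair.
        i = rng.randrange(n_chains - 1)
        j = i + 1
        log_r = (betas[i] - betas[j]) * (log_pi(states[j]) - log_pi(states[i]))
        if rng.random() < math.exp(min(0.0, log_r)):
            states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples


cold = mc3(log_target)
fraction_right_mode = sum(1 for x in cold if x > 0) / len(cold)
print(fraction_right_mode)
```

The heated chains cross the low-density region between the modes more easily, and accepted swaps pass those crossings down to the cold chain, which is the only chain used for inference.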

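Multiple Try Metropolis (Liu et al 2000)
[Figure: from $x_t$, trial points $y_1, \ldots, y_4$; from the chosen y, reference points $x_1^*, \ldots, x_4^*$; result $x_{t+1}$]
1. Sample $y_1, \ldots, y_4$ from $T(x_t, \cdot)$
2. Choose $y = y_i$
3. Sample $x_1^*, \ldots, x_4^*$ from $T(y, \cdot)$
4. Accept y or keep $x_t$ using an M-H step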
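The four steps can be sketched for a one-dimensional target. This is a hedged illustration of one common form of multiple-try Metropolis, not the talk's code: it assumes a symmetric Gaussian proposal and the weight choice w(y, x) = pi(y), and it uses k - 1 fresh reference points plus the current state as the k-th, which is the published form as understood here (the figure shows four reference points). All names are hypothetical.

```python
import math
import random


def mtm_step(x, target, k, step, rng):
    """One multiple-try Metropolis step with a symmetric proposal."""
    # 1) Draw k trial points around the current state and weight them by the target.
    trials = [x + rng.gauss(0.0, step) for _ in range(k)]
    w_trials = [target(y) for y in trials]
    if sum(w_trials) == 0.0:
        return x  # every trial has zero density: stay put
    # 2) Choose one trial with probability proportional to its weight.
    y = rng.choices(trials, weights=w_trials, k=1)[0]
    # 3) Draw k - 1 reference points around y; the current x is the k-th.
    refs = [y + rng.gauss(0.0, step) for _ in range(k - 1)] + [x]
    w_refs = [target(z) for z in refs]
    # 4) Generalized M-H acceptance ratio: sum of trial weights over reference weights.
    ratio = sum(w_trials) / sum(w_refs)
    return y if rng.random() < min(1.0, ratio) else x


def mtm_sample(target, x0, n_steps, k=4, step=2.0, seed=2):
    rng = random.Random(seed)
    x = x0
    out = []
    for _ in range(n_steps):
        x = mtm_step(x, target, k, step, rng)
        out.append(x)
    return out


# Usage: unnormalized standard normal target.
draws = mtm_sample(lambda v: math.exp(-0.5 * v * v), 0.0, 10000)
mtm_mean = sum(draws) / len(draws)
mtm_var = sum((d - mtm_mean) ** 2 for d in draws) / len(draws)
print(mtm_mean, mtm_var)
```

Trying several candidates per step allows a larger step size than single-try M-H tolerates, at the cost of 2k - 1 target evaluations per iteration.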

Population-based MCMC
- Metropolis-coupled MCMC uses only minimal interaction between multiple chains; why not more active interaction?
- Evolutionary Monte Carlo (Liang et al 2000)
  - Combines a genetic algorithm with MCMC
  - Used to simulate protein folding
- Conjugate Gradient Monte Carlo (Liu et al 2000)
  - Uses local optimization for adaptation
  - An improvement on ADS (Adaptive Direction Sampling)

Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
  - Choose the evolutionary model
  - Compute the likelihood
  - Design proposal mechanisms
- Parallel implementation
- Numerical results
- Future research

DNA substitution rate matrix
[Figure: substitutions among the purines (A, G) and the pyrimidines (C, T), labeled as transitions and transversions]
Considering the inference of unrooted trees and the computational complications, some simplified models are used (see next slide).

GTR family of substitution models
GTR: general time-reversible model, corresponding to a symmetric rate matrix.
[Diagram: hierarchy of the models GTR, TN93, SYM, HKY85, F84, K3ST, F81, K2P, JC69. Edge labels: equal base frequencies; three substitution types (1 transversion, 2 transitions); two substitution types (transition vs. transversion); single substitution type]

More complex models
- Substitution rates vary across sites
  - Invariable-sites models + gamma distribution
- Correlation in the rates at adjacent sites
- Codon models: 61 × 61 instantaneous rate matrix
- Secondary structure models

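Compute the conditional probability of a branch
- Given the substitution rate matrix Q, how do we compute p(b|a,t), the probability that a is substituted by b after time t?
- [Figure: a branch of length t from a to b]
- The transition probabilities are $P(t) = e^{Qt}$, computed from the eigenvalues of Q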
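As a concrete check of p(b|a,t), the sketch below computes the transition matrix for the Jukes-Cantor (JC69) model and compares it with the model's closed form. Note the swapped-in technique: to stay free of external libraries it sums a truncated Taylor series for the matrix exponential instead of using the eigendecomposition named on the slide. The rate alpha, the branch length, and the function names are assumptions for the example.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, terms=30):
    """P(t) = exp(Q t), approximated by the first `terms` Taylor terms."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in P]
    for k in range(1, terms):
        # term becomes (Qt)^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, Qt)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P


def jc69_rate_matrix(alpha):
    """JC69: every substitution has rate alpha; each row sums to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


alpha, t = 1.0, 0.1
P = transition_matrix(jc69_rate_matrix(alpha), t)
p_same = 0.25 + 0.75 * math.exp(-4.0 * alpha * t)   # closed form, a == b
p_diff = 0.25 - 0.25 * math.exp(-4.0 * alpha * t)   # closed form, a != b
print(P[0][0], p_same, P[0][1], p_diff)
```

The Taylor series is adequate only for short branches; for general Q and long branches, the eigendecomposition approach is the more usual choice because the decomposition is computed once and reused for every branch length.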

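Likelihood of a phylogenetic tree for one site
[Figure: tree with nodes x1, x2, x3, x4, x5 and branch lengths t1, t2, t3, t4]
- When x4 and x5 are known, the likelihood is a product of the probabilities along the branches.
- When x4 and x5 are unknown, that product is summed over all possible values of x4 and x5.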

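Likelihood calculation (Felsenstein 1981)
Given a rooted tree with n leaf nodes (species), where each leaf node is represented by a sequence $x_i$ of length N, the likelihood of the rooted tree is the product of the per-site likelihoods: $L = \prod_{u=1}^{N} P(x^u \mid T, t)$, where $x^u$ denotes the data at site u.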

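Likelihood calculation-2
Felsenstein's algorithm for likelihood calculation (1981). $P(L_k \mid a)$ is the probability of the leaves below node k given state a at node k, and $q_a$ is the probability of state a at the root.
- Initialization: set k = 2n - 1
- Recursion: compute $P(L_k \mid a)$ for all a as follows
  - If k is a leaf node: set $P(L_k \mid a) = 1$ if $a = x_k^u$; set $P(L_k \mid a) = 0$ if $a \neq x_k^u$
  - If k is not a leaf node: compute $P(L_i \mid a)$ and $P(L_j \mid a)$ for all a at its child nodes i, j, and set $P(L_k \mid a) = \sum_{b,c} p(b \mid a, t_i)\, P(L_i \mid b)\, p(c \mid a, t_j)\, P(L_j \mid c)$
- Termination: the likelihood at site u is $P(x^u \mid T, t) = \sum_a P(L_{2n-1} \mid a)\, q_a$
Note: algorithm modified from Durbin et al (1998)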
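The recursion can be sketched for a single site as follows. This is an illustrative sketch under stated assumptions, not the talk's implementation: it uses the JC69 closed-form transition probabilities, uniform root frequencies, and a hypothetical tree encoding in which a leaf is a base letter and an internal node is a pair of (child, branch_length) tuples.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t, alpha=1.0):
    """p(b | a, t) under JC69 with per-substitution rate alpha."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional_likelihoods(node):
    """Return [P(L_k | a) for a in BASES] for the subtree rooted at node."""
    if isinstance(node, str):
        # Leaf: 1 for the observed base, 0 for every other base.
        return [1.0 if a == node else 0.0 for a in BASES]
    (left, t_left), (right, t_right) = node
    lik_left = conditional_likelihoods(left)
    lik_right = conditional_likelihoods(right)
    result = []
    for a in BASES:
        # The double sum over (b, c) factorizes into a product of two sums.
        s_left = sum(jc69_prob(a, b, t_left) * lik_left[i]
                     for i, b in enumerate(BASES))
        s_right = sum(jc69_prob(a, c, t_right) * lik_right[i]
                      for i, c in enumerate(BASES))
        result.append(s_left * s_right)
    return result


def site_likelihood(root, base_freqs=(0.25, 0.25, 0.25, 0.25)):
    """Termination step: sum over root states weighted by their frequencies."""
    return sum(q * lik for q, lik in zip(base_freqs, conditional_likelihoods(root)))


# Usage: three leaves, with an internal node joining the first two.
inner = (("A", 0.1), ("C", 0.2))
tree = ((inner, 0.05), ("G", 0.3))
print(site_likelihood(tree))
```

The recursion visits every node once, which is what turns the sum over all internal-state assignments from exponential to linear in the number of nodes.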

Likelihood calculation-3
The likelihood calculation requires filling an N × M × S × R table:
- N: number of sequences
- M: number of sites
- S: number of character states
- R: number of rate categories
[Figure: table with rows Taxa-1, Taxa-2, Taxa-3, ..., Taxa-n and one column per site; each cell holds entries for A, C, G, T (1.0 or 0.0 at the leaves), repeated for rate 1, rate 2, ..., rate r]

Local update of the likelihood
If we change the topology and branch lengths of the tree only locally, we only need to refresh the table at the affected nodes. In the following example, only the nodes shown in red need to change their conditional likelihoods.
[Figure: original tree and proposed tree]

Proposal mechanism for trees
- Stochastic nearest-neighbor interchange (NNI): Larget et al (1999), Huelsenbeck (2000)
- (1) Choose a backbone; (2) change m and y randomly
[Figure: the tree before the move (nodes a, u, v, c, d; lengths m, x, y) and after the move (lengths m*, x*, y*)]

Proposal mechanisms for parameters
- Independent parameter, e.g. the transition/transversion ratio k
  - [Figure: interval from Min to Max showing the current value k, a window around k, and the proposed value k*]
- A set of parameters constrained to sum to a constant, e.g. the base frequency distribution
  - Draw a sample from the Dirichlet distribution
Larget et al. (1999)
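Both kinds of proposal can be sketched briefly. The details below are assumptions for illustration and are not taken from the slide: the window half-width `delta`, reflection at the interval bounds, and a Dirichlet proposal centered on the current frequencies with a hypothetical `concentration` parameter. A Dirichlet draw is built from normalized gamma variates, a standard construction.

```python
import random


def propose_sliding_window(k, delta, lo, hi, rng):
    """Uniform draw from [k - delta, k + delta], reflected back into [lo, hi]."""
    k_star = k + rng.uniform(-delta, delta)
    while k_star < lo or k_star > hi:
        if k_star < lo:
            k_star = 2.0 * lo - k_star
        else:
            k_star = 2.0 * hi - k_star
    return k_star


def propose_dirichlet(freqs, concentration, rng):
    """Draw new frequencies from Dirichlet(concentration * freqs)."""
    draws = [rng.gammavariate(concentration * f, 1.0) for f in freqs]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
kappa_star = propose_sliding_window(2.0, 0.5, 0.0, 10.0, rng)
new_freqs = propose_dirichlet([0.25, 0.25, 0.25, 0.25], 100.0, rng)
print(kappa_star, new_freqs)
```

The reflected window is a symmetric proposal, so it contributes nothing to the Hastings ratio. The Dirichlet proposal is not symmetric, so a full sampler would also need the ratio of the two proposal densities, which this sketch does not compute.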

Bayesian phylogenetic inference
[Diagram with labels: DNA data; evolutionary model; likelihood; prior probability; posterior probability; MCMC; starting tree; proposal; a sequence of samples; approximate the distribution; inference; phylogenetic tree]

Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
  - Challenges of serial computation
  - Difficulty:
    - MCMC is a serial algorithm
    - Multiple chains need to be synchronized
  - Choose an appropriate grid topology
  - Synchronize using random numbers
- Numerical results
- Future research

Computational challenge
- Computing the global likelihood needs $O(NMRS^2)$ multiplications
- Locally updating the topology and branch lengths needs $O(MRS^2 \log N)$
- Updating a model parameter needs $O(NMRS^2)$
- A local update needs all required data in memory
- Example: N = 1000 species, each sequence with M = 5000 sites, R = 5 rate categories, and a DNA nucleotide model with S = 4
  - Run 5 chains, each 100 million generations long
  - Needs ~400 days (assuming 1% global updates and 99% local updates)
  - And O(NMRSL × 2 × 2 × 8) ~ 32 gigabytes of memory
- So until more advanced algorithms are developed, parallel computation is the direct solution
- Using 32 processors with 1 gigabyte of memory each, we can compute the problem in ~2 weeks
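The operation counts for the example problem can be worked out directly. The sketch below only evaluates the big-O expressions with their constant factors taken as 1 and the logarithm taken as base 2, both of which are assumptions. It does not try to reproduce the ~400 days or ~32 gigabytes figures, which depend on machine speed and on the factor L that the slide does not define.

```python
import math

# Example problem size from the slide.
N = 1000   # species
M = 5000   # sites per sequence
R = 5      # rate categories
S = 4      # character states (DNA nucleotides)

# Multiplications for one global likelihood evaluation: N * M * R * S^2.
global_mults = N * M * R * S ** 2

# Multiplications for one local update: M * R * S^2 * log2(N) (base 2 assumed).
local_mults = M * R * S ** 2 * math.log2(N)

# How much cheaper a local update is than a global one.
ratio = global_mults / local_mults

print(global_mults, round(local_mults), round(ratio, 1))
```

A global evaluation costs 400 million multiplications, about two orders of magnitude more than a local update, which is why the mix of 1% global and 99% local updates matters so much for the running time.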

Characteristics of good parallel algorithms
- Balancing workload
  - Concurrency: identification, management, and granularity
- Reducing communication
  - Communication-to-computation ratio
  - Frequency, volume, balance
- Reducing extra work
  - Computing assignment
  - Redundant work

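Single-chain MCMC algorithm
[Flowchart, written out as steps]
1. Generate the initial state $S^{(0)}$; set $S^{(t)} = S^{(0)}$, t = 0
2. Propose a new state $S'$
3. Evaluate $S'$
4. Compute R and U
5. If U < R, set $S^{(t+1)} = S'$; otherwise set $S^{(t+1)} = S^{(t)}$
6. Set t = t + 1
7. If t > max generation, end; otherwise return to step 2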

Multiple-chain MCMC algorithm
[Flowchart with chains #1 to #4 side by side, written out as steps]
1. Each chain i (i = 1..4) generates $S_i^{(0)}$ and sets t = 0
2. Each chain i proposes and updates $S_i^{(t)}$
3. Choose two chains to swap
4. Compute R and U
5. U [the transcript ends here]
