Parallel Bayesian Phylogenetic Inference. Xizhou Feng. Directed by Dr. Duncan Buell, Department of Computer Science and Engineering, University of South Carolina, Columbia

  • Slide 1
  • Parallel Bayesian Phylogenetic Inference. Xizhou Feng. Directed by Dr. Duncan Buell. Department of Computer Science and Engineering, University of South Carolina, Columbia. 2002-10-11
  • Slide 2
  • Topics: Background; Bayesian phylogenetic inference and MCMC; Serial implementation; Parallel implementation; Numerical results; Future research
  • Slide 3
  • Background. Darwin: species are related through a history of common descent, and that history can be organized as a tree structure (a phylogeny). Modern species are placed on the leaf nodes; ancient species are placed on the internal nodes. The time of divergence is described by the length of the branches. A clade is a group of organisms whose members share homologous features derived from a common ancestor.
  • Slide 4
  • Phylogenetic tree (figure labels): clade; branch; branch length; leaf node (current species); internal node (ancestral species)
  • Slide 5
  • Applications. Phylogenetic trees are fundamental to understanding evolution and diversity: they provide a principle for organizing biological data and are central to organism comparison. Practical examples: resolving the dispute over bacteria-to-human gene transfers (Nature 2001); tracing routes of infectious disease transmission; identifying new pathogens; mapping the phylogenetic distribution of biochemical pathways.
  • Slide 6
  • Use DNA data for phylogenetic inference
  • Slide 7
  • Objectives of phylogenetic inference (figure: input and output). Major objectives include: estimate the tree topology; estimate the branch lengths; and describe the credibility of the result.
  • Slide 8
  • Phylogenetic inference methods. Algorithmic methods: define a sequence of specific steps that leads to the determination of a tree, e.g. UPGMA (unweighted pair group method using arithmetic averages) and neighbor-joining. Optimality criterion-based methods: 1) define a criterion; 2) search for the tree with the best value. Examples: maximum parsimony (minimize the total tree length); maximum likelihood (maximize the likelihood); maximum posterior probability (the tree with the highest probability of being the true tree).
  • Slide 9
  • Commonly used phylogeny methods (diagram). Data set: distance matrix; character data. Algorithm: algorithmic methods and optimization methods. Methods shown: UPGMA, neighbor-joining, Fitch-Margoliash, maximum parsimony, maximum likelihood, and Bayesian methods, with a "statistically supported" label. Search strategy: exact search (exhaustive; branch & bound); greedy search (stepwise addition; global arrangement; star decomposition); divide & conquer (DCM, HGT, quartet); stochastic search (GA, SA, MCMC).
  • Slide 10
  • Aspects of phylogenetic methods. Accuracy: is the constructed tree the true tree? If not, what percentage of the edges is wrong? Complexity: neighbor-joining is $O(n^3)$; maximum parsimony is provably NP-hard; maximum likelihood is conjectured to be NP-hard. Scalability: a method may work well for small trees, but how does it behave on large trees? Robustness: if the model, the assumptions, or the data are not exactly correct, how is the result affected? Convergence rate: how long a sequence is needed to recover the true tree? Statistical support: with what probability is the computed tree the true tree?
  • Slide 11
  • The computational challenge: compute the tree of life (source: http://www.npaci.edu/envision/v16.3/hillis.html). More than 1.7 million known species. The number of trees increases exponentially as new species are added. The complexity of evolution. Data collection and computational systems.
  • Slide 12
  • Topics: Background; Bayesian phylogenetic inference and MCMC; Serial implementation; Parallel implementation; Numerical results; Future research
  • Slide 13
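  • Bayesian Inference-1. Both the observed data and the parameters of the model are treated as random variables. Writing $T$ for the topology, $\tau$ for the branch lengths, and $\theta$ for the model parameters, the joint distribution is $P(D, T, \tau, \theta) = P(D \mid T, \tau, \theta)\, P(T, \tau, \theta)$. When the data $D$ are known, Bayes' theorem gives $P(T, \tau, \theta \mid D) = \dfrac{P(D \mid T, \tau, \theta)\, P(T, \tau, \theta)}{P(D)}$, where the left-hand side is the posterior probability, $P(D \mid T, \tau, \theta)$ is the likelihood, $P(T, \tau, \theta)$ is the prior probability, and $P(D)$ is the unconditional probability of the data.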
  • Slide 14
  • Bayesian Inference-2. $P(T \mid D)$ can be interpreted as the probability that the tree is correct. We need to do at least two things: approximate the posterior probability distribution, and evaluate the integral for $P(T \mid D)$. Both can be done via the Markov chain Monte Carlo method. Given the posterior probability distribution, we can compute the marginal probability of $T$ as $P(T \mid D) = \int\!\!\int P(T, \tau, \theta \mid D)\, d\tau\, d\theta$, where $\tau$ denotes the branch lengths and $\theta$ the model parameters.
  • Slide 15
  • Markov chain Monte Carlo (MCMC). The basic idea of MCMC is: construct a Markov chain that has the parameters as its state space and whose stationary distribution is the posterior probability distribution of the parameters; simulate the chain; treat the realization as a sample from the posterior probability distribution. MCMC = sampling + continued search.
  • Slide 16
  • Markov chain. A Markov chain is a sequence of random variables $\{X_0, X_1, X_2, \dots\}$ whose transition kernel $T(X_t, X_{t+1})$ is determined only by the value of $X_t$ ($t \ge 0$). Stationary distribution: $\pi(x') = \sum_x \pi(x)\, T(x, x')$ is invariant. Ergodic property: $p^n(x)$ converges to $\pi(x)$ as $n \to \infty$. A homogeneous Markov chain is ergodic if $\min_{x, x'} T(x, x') / \pi(x') > 0$.
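The invariance and convergence properties on this slide can be checked numerically on a toy chain. The sketch below is only an illustration: the two-state kernel `KERNEL` and the helper names are assumptions made for the example and do not come from the talk.

```python
def step(dist, kernel):
    """Apply one transition: new[y] = sum over x of dist[x] * kernel[x][y]."""
    n = len(dist)
    return [sum(dist[x] * kernel[x][y] for x in range(n)) for y in range(n)]


def stationary(kernel, iters=1000):
    """Approximate the stationary distribution by repeated transitions."""
    dist = [1.0] + [0.0] * (len(kernel) - 1)  # start with all mass on state 0
    for _ in range(iters):
        dist = step(dist, kernel)
    return dist


# Hypothetical two-state kernel; each row sums to 1.
KERNEL = [[0.9, 0.1],
          [0.5, 0.5]]
PI = stationary(KERNEL)
```

For this kernel the balance condition gives a stationary distribution of (5/6, 1/6), and applying `step` to `PI` leaves it unchanged up to rounding.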
  • Slide 17
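  • Metropolis-Hastings algorithm-1. The cornerstone of all MCMC methods, due to Metropolis (1953); Hastings proposed a generalized version in 1970. The key point is how to define the acceptance probability. Metropolis: $\alpha(x, x') = \min\left(1, \dfrac{\pi(x')}{\pi(x)}\right)$. Hastings: $\alpha(x, x') = \min\left(1, \dfrac{\pi(x')\, T(x', x)}{\pi(x)\, T(x, x')}\right)$. The proposal probability $T(x, x')$ can be any form such that …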
  • Slide 18
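  • Metropolis-Hastings algorithm-2. 1. Initialize $x_0$ and set $t = 0$. 2. Repeat: 1) sample $x'$ from $T(x_t, x')$; 2) draw $U \sim \mathrm{Uniform}[0, 1]$; 3) update: set $x_{t+1} = x'$ if $U$ is at most the acceptance probability $\alpha(x_t, x')$, otherwise set $x_{t+1} = x_t$; then increment $t$.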
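The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the target (an unnormalized standard normal), the uniform window proposal, and all function names are hypothetical choices for the example. The acceptance probability is the Hastings form, min(1, π(x')T(x', x) / (π(x)T(x, x'))), computed in log space.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_proposal, x0, n_steps, rng):
    """Generic Metropolis-Hastings loop.

    log_proposal(a, b) is the log density of proposing b from a.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose(x, rng)
        log_ratio = (log_target(x_new) + log_proposal(x_new, x)
                     - log_target(x) - log_proposal(x, x_new))
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:  # U < acceptance probability
            x = x_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_normal(x):
    """Unnormalized log density of a standard normal (assumed toy target)."""
    return -0.5 * x * x


def window_propose(x, rng):
    """Symmetric proposal: uniform window of half-width 1 around x."""
    return x + rng.uniform(-1.0, 1.0)


def window_log_proposal(x_from, x_to):
    """The window is symmetric, so the proposal terms cancel in the ratio."""
    return 0.0
```

With a symmetric proposal such as `window_propose`, the Hastings ratio reduces to the original Metropolis ratio π(x')/π(x).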
  • Slide 19
  • Problems of the MH algorithm and improvements. Problems: the mixing rate is slow, because a small step gives slow movement while a larger step gives low acceptance; the chain can get stuck at local optima; the dimension of the state space may vary. Improvements: Metropolis-coupled MCMC; multipoint MCMC; population-based MCMC; time-reversible jump MCMC.
  • Slide 20
  • Metropolis-coupled MCMC (Geyer 1991). Run several MCMC chains with different distributions $\pi_i(x)$ ($i = 1, \dots, m$) in parallel. $\pi_1(x)$ is used for sampling; $\pi_i(x)$ ($i = 2, \dots, m$) are used to improve mixing. For example: $\pi_i(x) = \pi(x)^{1/(1 + \lambda (i-1))}$. After each iteration, attempt to swap the states of two chains $i$ and $j$ using a Metropolis-Hastings step with acceptance probability $\min\left(1, \dfrac{\pi_i(x_j)\, \pi_j(x_i)}{\pi_i(x_i)\, \pi_j(x_j)}\right)$.
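The swap step can be sketched as follows, assuming the heating scheme given as the example on this slide (each chain i targets the posterior raised to the power 1/(1 + λ(i − 1)), with λ standing in for the heating parameter). The function names are hypothetical. Working in log space, the standard swap ratio π_i(x_j)π_j(x_i) / (π_i(x_i)π_j(x_j)) simplifies to (β_i − β_j)(log π(x_j) − log π(x_i)).

```python
import math


def heat(i, lam):
    """Exponent beta_i = 1 / (1 + lam * (i - 1)) for chain i (1-based)."""
    return 1.0 / (1.0 + lam * (i - 1))


def swap_acceptance(log_post_i, log_post_j, i, j, lam):
    """Probability of swapping the states of chains i and j.

    log_post_i and log_post_j are the unheated log posterior values of the
    current states of chain i and chain j.
    """
    beta_i, beta_j = heat(i, lam), heat(j, lam)
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return math.exp(min(0.0, log_ratio))
```

When the heated chain holds a state with higher posterior than the cold chain, the swap is always accepted, which is how good states found by the heated chains reach the sampling chain.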
  • Slide 21
  • Illustration of Metropolis-coupled MCMC (figure): $\pi_1(x)$ with $T_1 = 0$; $\pi_2(x)$ with $T_2 = 2$; $\pi_3(x)$ with $T_3 = 4$; $\pi_4(x)$ with $T_4 = 8$. Metropolis-coupled MCMC is also called parallel tempering.
  • Slide 22
  • Multiple-Try Metropolis (Liu et al. 2000). Figure steps: sample trial points $y_1, \dots, y_4$ from $T(x_t, \cdot)$; choose $y = y_i$; sample points $x_1^*, \dots, x_4^*$ from $T(y, \cdot)$; accept $y$ or keep $x_t$ using an M-H step to obtain $x_{t+1}$.
  • Slide 23
  • Population-based MCMC. Metropolis-coupled MCMC uses only minimal interaction between multiple chains; why not use more active interaction? Evolutionary Monte Carlo (Liang et al. 2000): combines a genetic algorithm with MCMC; used to simulate protein folding. Conjugate Gradient Monte Carlo (Liu et al. 2000): uses local optimization for adaptation; an improvement of ADS (Adaptive Direction Sampling).
  • Slide 24
  • Topics: Background; Bayesian phylogenetic inference and MCMC; Serial implementation (choose the evolutionary model; compute the likelihood; design proposal mechanisms); Parallel implementation; Numerical results; Future research
  • Slide 25
  • DNA substitution rate matrix (figure: purines A and G, pyrimidines C and T, with transitions and transversions between them). Considering the inference of unrooted trees and the computational complications, some simplified models are used (see next slide).
  • Slide 26
  • GTR family of substitution models. GTR: general time-reversible model, corresponding to a symmetric rate matrix. Models shown in the diagram: GTR, TN93, HKY85, F84, F81, JC69, K2P, K3ST, SYM. The diagram relates them through these restrictions: equal base frequencies; three substitution types (1 transversion, 2 transitions); two substitution types (transition vs. transversion); a single substitution type.
  • Slide 27
  • More complex models. Substitution rates vary across sites: invariable-sites models + gamma distribution. Correlation in the rates at adjacent sites. Codon models: 61×61 instantaneous rate matrix. Secondary structure models.
  • Slide 28
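  • Compute the conditional probability of a branch. Given the substitution rate matrix $Q$, how do we compute $p(b \mid a, t)$, the probability that $a$ is substituted by $b$ after time $t$? The transition probabilities are the entries of $P(t) = e^{Qt}$, which can be computed from the eigenvalues of $Q$.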
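For the simplest model in the GTR family, JC69, the eigenvalue route gives a closed form. The sketch below assumes a JC69 rate matrix with off-diagonal rate `alpha` (so the eigenvalues are 0 and −4·alpha) and compares the closed form with a truncated Taylor series for the matrix exponential. The function names and the choice of JC69 are assumptions for illustration; the slide itself does not fix a model.

```python
import math


def jc69_rate_matrix(alpha):
    """JC69 rate matrix: every off-diagonal rate is alpha, rows sum to 0."""
    return [[-3.0 * alpha if a == b else alpha for b in range(4)]
            for a in range(4)]


def jc69_transition(alpha, t):
    """Closed-form P(t) = exp(Qt) for JC69, from eigenvalues 0 and -4*alpha."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_series(Q, t, terms=40):
    """exp(Qt) by truncated Taylor series; adequate for small matrices and |Qt|."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Qt)^k / k!
        term = matmul(term, [[q * t / k for q in row] for row in Q])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result
```

For general models without a closed form, the same $P(t)$ is usually obtained from a numerical eigendecomposition of $Q$ rather than a series.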
  • Slide 29
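  • Likelihood of a phylogenetic tree for one site (figure: a tree with nodes $x_1, \dots, x_5$ and branch lengths $t_1, \dots, t_4$). When $x_4$ and $x_5$ are known, the likelihood is the product of the root probability and the transition probabilities along the branches. When $x_4$ and $x_5$ are unknown, the likelihood is the sum of that product over all possible states of $x_4$ and $x_5$.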
  • Slide 30
  • Likelihood calculation (Felsenstein 1981). Given a rooted tree with $n$ leaf nodes (species), where each leaf node is represented by a sequence $x_i$ of length $N$, the likelihood of the rooted tree is the product of the per-site likelihoods: $L = \prod_{u=1}^{N} L_u$, where $L_u$ is the likelihood at site $u$.
  • Slide 31
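  • Likelihood calculation-2. Felsenstein's algorithm for likelihood calculation (1981). Initialization: set $k = 2n - 1$. Recursion: compute $P(L_k \mid a)$ for all $a$ as follows. If $k$ is a leaf node: set $P(L_k \mid a) = 1$ if $a = x_k^u$; set $P(L_k \mid a) = 0$ if $a \ne x_k^u$. If $k$ is not a leaf node: compute $P(L_i \mid a)$ and $P(L_j \mid a)$ for all $a$ at its children nodes $i$, $j$, and set $P(L_k \mid a) = \sum_{b, c} p(b \mid a, t_i)\, P(L_i \mid b)\, p(c \mid a, t_j)\, P(L_j \mid c)$. Termination: the likelihood at site $u$ is $L_u = \sum_a P(L_{2n-1} \mid a)\, q_a$, where $q_a$ is the probability of state $a$ at the root. Note: algorithm modified from Durbin et al. (1998).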
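The recursion above can be sketched for a single site as follows. This is a minimal illustration under stated assumptions: JC69 transition probabilities with uniform root frequencies of 0.25, and a hypothetical nested-tuple tree encoding in which a leaf is a nucleotide string and an internal node is `(left, t_left, right, t_right)`. None of these names or encodings come from the talk.

```python
import math

STATES = "ACGT"


def p_jc69(b, a, t, alpha=1.0):
    """p(b | a, t) under JC69 with off-diagonal rate alpha (assumed model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node):
    """Return [P(L_k | a) for a in STATES] for the subtree rooted at node."""
    if isinstance(node, str):  # leaf: 1 for the observed state, 0 otherwise
        return [1.0 if a == node else 0.0 for a in STATES]
    left, t_left, right, t_right = node
    cond_i = conditional(left)
    cond_j = conditional(right)
    out = []
    for a in STATES:
        # The double sum over (b, c) factorizes into a product of two sums.
        sum_i = sum(p_jc69(b, a, t_left) * cond_i[k]
                    for k, b in enumerate(STATES))
        sum_j = sum(p_jc69(c, a, t_right) * cond_j[k]
                    for k, c in enumerate(STATES))
        out.append(sum_i * sum_j)
    return out


def site_likelihood(tree):
    """Likelihood at one site: sum over root states of q_a * P(L_root | a)."""
    return sum(0.25 * value for value in conditional(tree))
```

For example, `site_likelihood((("A", 0.1, "C", 0.2), 0.05, "G", 0.3))` evaluates a three-leaf tree. Summing the likelihood over all 64 possible leaf patterns of that tree gives 1, which is a convenient sanity check.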
  • Slide 32
  • Likelihood calculation-3. The likelihood calculation requires filling an N × M × S × R table, where N is the number of sequences, M the number of sites, S the number of character states, and R the number of rate categories. (Figure: a table with rows Taxa-1 to Taxa-n, columns Site 1 to Site m, entries of 1.0 or 0.0 for each of the states A, C, G, T, repeated for rate 1 to rate r.)
  • Slide 33
  • Local likelihood update. If we change the topology and branch lengths of the tree only locally, we only need to refresh the table at the affected nodes. In the example figure (original tree vs. proposed tree), only the nodes colored red need to change their conditional likelihoods.
  • Slide 34
  • Proposal mechanism for trees. Stochastic nearest-neighbor interchange (NNI): Larget et al. (1999), Huelsenbeck (2000). Figure steps: (1) choose a backbone; (2) change m and y randomly, giving m*, y*, and x* in the proposed tree.
  • Slide 35
  • Proposal mechanisms for parameters. Independent parameter, e.g. the transition/transversion ratio k (figure: a new value k* is proposed from a window between k− and k+ around the current k, within the range [Min, Max]). A set of parameters constrained to sum to a constant, e.g. the base frequency distribution: draw a sample from the Dirichlet distribution. Larget et al. (1999).
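Both proposal types can be sketched briefly. The details here are assumptions made for illustration: the window proposal reflects values that fall outside [Min, Max] back into the range, and the Dirichlet proposal is centered on the current frequencies with a hypothetical `concentration` tuning parameter. The slide does not specify either detail.

```python
import random


def reflect(value, low, high):
    """Fold a value back into [low, high] by reflecting at the bounds."""
    while value < low or value > high:
        if value < low:
            value = 2.0 * low - value
        if value > high:
            value = 2.0 * high - value
    return value


def propose_window(k, half_width, low, high, rng):
    """Propose k* uniformly from (k - half_width, k + half_width), reflected."""
    return reflect(k + rng.uniform(-half_width, half_width), low, high)


def propose_dirichlet(freqs, concentration, rng):
    """Draw new frequencies from Dirichlet(concentration * freqs).

    Uses the standard construction: normalize independent gamma draws.
    Assumes every entry of freqs is strictly positive.
    """
    draws = [rng.gammavariate(concentration * f, 1.0) for f in freqs]
    total = sum(draws)
    return [d / total for d in draws]
```

A Dirichlet proposal centered on the current state is not symmetric, so the Hastings ratio must include the proposal densities in both directions.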
  • Slide 36
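  • Bayesian phylogenetic inference (flow diagram). DNA data, a phylogenetic tree, and an evolutionary model give the likelihood; the likelihood and the prior probability give the posterior probability. MCMC, driven by a starting tree and a proposal mechanism, produces a sequence of samples that approximates the distribution and is used for inference.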
  • Slide 37
  • Topics: Background; Bayesian phylogenetic inference and MCMC; Serial implementation; Parallel implementation (challenges of serial computation; difficulty: MCMC is a serial algorithm and multiple chains need to be synchronized; choose an appropriate grid topology; synchronize using random numbers); Numerical results; Future research
  • Slide 38
  • Computational challenge. Computing the global likelihood needs $O(NMRS^2)$ multiplications; a local update of topology and branch length needs $O(MRS^2 \log N)$; updating a model parameter needs $O(NMRS^2)$. A local update needs all required data in memory. Example: N = 1000 species, M = 5000 sites per sequence, R = 5 rate categories, and S = 4 (DNA nucleotide model). Running 5 chains, each 100 million generations long, needs ~400 days (assuming 1% global updates and 99% local updates) and O(NMRSL×2×2×8) ≈ 32 gigabytes of memory. So until more advanced algorithms are developed, parallel computation is the direct solution: using 32 processors with 1 gigabyte of memory each, the problem can be computed in ~2 weeks.
  • Slide 39
  • Characteristics of good parallel algorithms. Balancing workload: identifying and managing concurrency, and choosing granularity. Reducing communication: communication-to-computation ratio; frequency, volume, and balance. Reducing extra work: computing the assignment; redundant work.
  • Slide 40
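  • Single-chain MCMC algorithm (flowchart). Generate the initial state $S^{(0)}$; set $S^{(t)} = S^{(0)}$ and $t = 0$. Propose a new state $S'$. Evaluate $S'$. Compute $R$ and $U$. If $U < R$, set $S^{(t+1)} = S'$; otherwise set $S^{(t+1)} = S^{(t)}$. Set $t = t + 1$. If $t$ > max generation, end; otherwise return to the proposal step.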
  • Slide 41
  • Multiple-chain MCMC algorithm (flowchart, chains #1 to #4). Each chain $i$ generates $S_i^{(0)}$ and sets $t = 0$, then proposes and updates $S_i^{(t)}$. Choose two chains to swap. Compute R and U. U