Parallel Bayesian Phylogenetic Inference
Xizhou Feng
Directed by Dr. Duncan Buell
Department of Computer Science and Engineering
University of South Carolina, Columbia
2002-10-11
Slide 2
Topics
Background
Bayesian phylogenetic inference and MCMC
Serial implementation
Parallel implementation
Numerical results
Future research
Slide 3
Background
Darwin: species are related through a history of common descent, and this history can be organized as a tree structure (a phylogeny).
Modern species are placed on the leaf nodes; ancestral species are placed on the internal nodes; the time of divergence is described by the lengths of the branches.
A clade is a group of organisms whose members share homologous features derived from a common ancestor.
Slide 4
Phylogenetic tree
[Figure: an example phylogenetic tree, labeling a clade, a branch and its branch length, the leaf nodes (current species), and the internal nodes (ancestral species).]
Slide 5
Applications
The phylogenetic tree is fundamental to understanding evolution and diversity: it is the organizing principle for biological data and is central to comparing organisms.
Practical examples:
Resolving the quarrel over bacteria-to-human gene transfers (Nature, 2001)
Tracing routes of infectious disease transmission
Identifying new pathogens
Mapping the phylogenetic distribution of biochemical pathways
Slide 6
Use DNA data for phylogenetic inference
Slide 7
Objectives of phylogenetic inference (input: sequence data; output: a tree)
Major objectives include:
Estimate the tree topology
Estimate the branch lengths
Describe the credibility of the result
Slide 8
Phylogenetic inference methods
Algorithmic methods: define a sequence of specific steps that leads to the determination of a tree, e.g. UPGMA (unweighted pair group method using arithmetic averages) and Neighbor-Joining.
Optimality criterion-based methods: 1) define a criterion; 2) search for the tree with the best value.
Maximum parsimony (minimize the total tree length)
Maximum likelihood (maximize the likelihood)
Maximum posterior probability (the tree with the highest probability of being the true tree)
Slide 9
Commonly used phylogeny methods
Data set and algorithm: the algorithmic methods work on a distance matrix (UPGMA, Neighbor-Joining, Fitch-Margoliash); the optimization methods work on character data and are statistically supported (maximum parsimony, maximum likelihood, Bayesian methods).
Search strategy:
Exact search: exhaustive search, branch & bound
Greedy search: stepwise addition, global rearrangement, star decomposition
Divide & conquer: DCM, HGT, quartet methods
Stochastic search: GA, SA, MCMC
Slide 10
Aspects of phylogenetic methods
Accuracy: is the constructed tree the true tree? If not, what percentage of the edges are wrong?
Complexity: Neighbor-Joining is O(n^3); maximum parsimony is provably NP-hard; maximum likelihood is conjectured NP-hard.
Scalability: good for small trees, but how about large trees?
Robustness: if the model, the assumptions, or the data are not exactly correct, how good is the result?
Convergence rate: how long a sequence is needed to recover the true tree?
Statistical support: with what probability is the computed tree the true tree?
Slide 11
The computational challenge: computing the tree of life
Source: http://www.npaci.edu/envision/v16.3/hillis.html
>1.7 million known species
The number of possible trees grows explosively as new species are added
The complexity of evolution
Data collection & computational systems
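As a concrete illustration of that growth (my own example, not from the slides): the number of distinct unrooted binary tree topologies on n labeled taxa is the double factorial (2n-5)!!.

```python
def num_unrooted_trees(n):
    """Count unrooted binary tree topologies for n >= 3 labeled taxa:
    (2n-5)!! = 3 * 5 * ... * (2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Already at 20 taxa the count exceeds 10^20, which is why exhaustive search is hopeless and stochastic methods such as MCMC are attractive.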
Slide 12
Topics
Background
Bayesian phylogenetic inference and MCMC
Serial implementation
Parallel implementation
Numerical results
Future research
Slide 13
Bayesian inference (1)
Both the observed data and the parameters of the models are random variables; set up their joint distribution.
When the data D is known, Bayes' theorem gives:
P(T | D) = P(D | T) P(T) / P(D)
posterior probability = likelihood x prior probability / unconditional probability of the data
Here T comprises the topology, the branch lengths, and the parameters of the models.
Slide 14
Bayesian inference (2)
P(T | D) can be interpreted as the probability that the tree is correct.
We need to do at least two things:
Approximate the posterior probability distribution
Evaluate the integral for P(T | D)
Both can be done via the Markov chain Monte Carlo (MCMC) method.
Having the posterior probability distribution, we can compute the marginal probability of T by integrating out the branch lengths and model parameters.
Slide 15
Markov chain Monte Carlo (MCMC)
The basic idea of MCMC is to construct a Markov chain such that:
the parameters form the state space, and
the stationary distribution is the posterior probability distribution of the parameters.
Simulate the chain and treat the realization as a sample from the posterior probability distribution.
MCMC = sampling + continued search
Slide 16
Markov chain
A Markov chain is a sequence of random variables {X_0, X_1, X_2, ...} whose transition kernel T(X_t, X_{t+1}) depends only on the value of X_t (t > 0).
Stationary distribution: pi is invariant, i.e. pi(x') = sum_x pi(x) T(x, x').
Ergodic property: p^n(x) converges to pi(x) as n -> infinity.
A homogeneous Markov chain is ergodic if min over x, x' of T(x, x') / pi(x') > 0.
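A tiny two-state worked example (my own illustration, not from the slides) makes the invariance and convergence conditions concrete:

```python
T = [[0.9, 0.1],
     [0.3, 0.7]]          # transition kernel T(x, x')
pi = [0.75, 0.25]         # candidate stationary distribution

# invariance: pi(x') = sum_x pi(x) T(x, x')
piT = [sum(pi[x] * T[x][y] for x in range(2)) for y in range(2)]
print(piT)                # approximately [0.75, 0.25], i.e. pi is invariant

# ergodicity: T^n converges to a matrix whose rows all equal pi
P = [row[:] for row in T]
for _ in range(50):
    P = [[sum(P[i][k] * T[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(P)
```

Whatever state the chain starts in, after many steps its distribution is pi; MCMC exploits exactly this by building T so that pi is the posterior.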
Slide 17
Metropolis-Hastings algorithm (1)
The Metropolis algorithm (Metropolis et al., 1953) is the cornerstone of all MCMC methods; Hastings proposed a generalized version in 1970.
The key point is how to define the acceptance probability:
Metropolis (symmetric proposal): alpha(x, x') = min(1, pi(x') / pi(x))
Hastings (general proposal): alpha(x, x') = min(1, [pi(x') T(x', x)] / [pi(x) T(x, x')])
The proposal probability T(x, x') can take any form, as long as the resulting chain can reach every state.
Slide 18
Metropolis-Hastings algorithm (2)
1. Initialize x_0, set t = 0.
2. Repeat:
1) Sample x' from T(x_t, .)
2) Draw U ~ Uniform[0, 1]
3) Update: if U < alpha(x_t, x'), set x_{t+1} = x'; otherwise set x_{t+1} = x_t. Increment t.
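The two-slide recipe above can be sketched in a few lines. This is a minimal symmetric-proposal Metropolis sampler of my own (target, step size, and seed are illustrative, not from the slides):

```python
import math
import random

def metropolis(log_target, x0, step, n_iter, seed=1):
    """Symmetric-proposal Metropolis sampler (minimal sketch)."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        x_new = x + rng.uniform(-step, step)   # proposal from T(x_t, .)
        # accept with probability min(1, pi(x') / pi(x)), done in log space
        if math.log(rng.random()) < log_target(x_new) - log_target(x):
            x = x_new
        chain.append(x)
    return chain

# target: standard normal, log pi(x) = -x^2/2 up to a constant
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 2.0, 20000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(mean, var)
```

The sample mean and variance approach 0 and 1, the moments of the target, which is the sense in which the realization is "a sample from the posterior".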
Slide 19
Problems of the M-H algorithm & improvements
Problems:
The mixing rate is slow: a small step gives low movement, while a larger step gives low acceptance.
The chain can get stuck at local optima.
The dimension of the state space may vary.
Improvements:
Metropolis-coupled MCMC
Multipoint MCMC
Population-based MCMC
Reversible-jump MCMC
Slide 20
Metropolis-coupled MCMC (Geyer, 1991)
Run several MCMC chains with different distributions pi_i(x) (i = 1..m) in parallel.
pi_1(x) = pi(x) is used for sampling; pi_i(x) (i = 2..m) are used to improve mixing.
For example: pi_i(x) = pi(x)^(1/(1 + lambda(i-1))).
After each iteration, attempt to swap the states of two chains using a Metropolis-Hastings step with acceptance probability
min(1, [pi_i(x_j) pi_j(x_i)] / [pi_i(x_i) pi_j(x_j)]).
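The swap step can be sketched as follows; the function names, heating parameter lam, and the example states are my own illustrative assumptions, not from the slides:

```python
import math
import random

def heat(log_pi, i, lam=0.5):
    """Log density of heated chain i: pi_i(x) = pi(x)^(1/(1 + lam*(i-1)))."""
    beta = 1.0 / (1.0 + lam * (i - 1))
    return lambda x: beta * log_pi(x)

def try_swap(log_pi, states, i, j, lam=0.5, seed=0):
    """Attempt to swap the states of chains i and j with acceptance
    probability min(1, [pi_i(x_j) pi_j(x_i)] / [pi_i(x_i) pi_j(x_j)])."""
    rng = random.Random(seed)
    li, lj = heat(log_pi, i, lam), heat(log_pi, j, lam)
    log_r = (li(states[j]) + lj(states[i])) - (li(states[i]) + lj(states[j]))
    if math.log(rng.random()) < log_r:
        states[i], states[j] = states[j], states[i]
    return states

# cold chain 1 stuck at x = 3, hot chain 2 sitting at the mode of N(0, 1);
# here log_r = 1.5 > 0, so the swap is always accepted
states = try_swap(lambda x: -0.5 * x * x, {1: 3.0, 2: 0.0}, 1, 2)
print(states)
```

The hot chain explores freely, and swaps hand its good states down to the cold chain, which is where the mixing improvement comes from.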
Slide 21
Illustration of Metropolis-coupled MCMC
[Figure: four chains pi_1(x)..pi_4(x) at temperatures T_1 = 0, T_2 = 2, T_3 = 4, T_4 = 8.]
Metropolis-coupled MCMC is also called parallel tempering.
Slide 22
Multiple-try Metropolis (Liu et al., 2000)
[Figure: from x_t, sample trial points y_1..y_4 from T(x_t, .); choose y = y_i; sample reference points x_1*..x_4* from T(y, .); accept y or keep x_t using an M-H step to obtain x_{t+1}.]
Slide 23
Population-based MCMC
Metropolis-coupled MCMC uses only minimal interaction between the multiple chains; why not use more active interaction?
Evolutionary Monte Carlo (Liang et al., 2000): combines genetic algorithms with MCMC; used to simulate protein folding.
Conjugate-gradient Monte Carlo (Liu et al., 2000): uses local optimization for adaptation; an improvement of ADS (adaptive direction sampling).
Slide 24
Topics
Background
Bayesian phylogenetic inference and MCMC
Serial implementation
Choose the evolutionary model
Compute the likelihood
Design proposal mechanisms
Parallel implementation
Numerical results
Future research
Slide 25
DNA substitution rate matrix
[Figure: a 4x4 rate matrix over the purines A, G and the pyrimidines C, T; changes within a class (A<->G, C<->T) are transitions, changes across classes are transversions.]
Considering the inference of unrooted trees and the computational complications, some simplified models are used (see next slide).
Slide 26
GTR family of substitution models
GTR: the general time-reversible model, corresponding to a symmetric rate matrix; the other members of the family are obtained by restricting it.
TN93: three substitution types (1 transversion, 2 transitions)
HKY85, F84: two substitution types (transition vs. transversion)
F81: a single substitution type
With equal base frequencies these reduce to SYM (from GTR), K3ST (three substitution types), K2P (two substitution types), and JC69 (a single substitution type, from F81).
Slide 27
More complex models
Substitution rates varying across sites: invariable-sites models + gamma-distributed rates
Correlation of the rates at adjacent sites
Codon models: a 61x61 instantaneous rate matrix
Secondary-structure models
Slide 28
Computing the conditional probability along a branch
Given the substitution rate matrix Q, how do we compute p(b | a, t), the probability that a is substituted by b after time t along a branch? Via the eigenvalues of Q: diagonalize Q and take P(t) = exp(Qt).
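For the simplest model, JC69, the matrix exponential has a closed form, which makes the eigenvalue idea concrete (this example and its parameterization are mine; for a general GTR-family Q one would diagonalize numerically):

```python
import math

def jc69_prob(same, a_rate, t):
    """p(b | a, t) under JC69: every off-diagonal entry of Q is a_rate and
    the diagonal is -3*a_rate, so Q has eigenvalues 0 and -4*a_rate and
    P(t) = exp(Q t) reduces to this closed form."""
    decay = math.exp(-4.0 * a_rate * t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

# each row of P(t) sums to 1; as t -> infinity every entry tends to 1/4
row_sum = jc69_prob(True, 0.25, 1.0) + 3 * jc69_prob(False, 0.25, 1.0)
print(row_sum)
```

At t = 0 the matrix is the identity (no substitution yet), and for large t the base is effectively resampled from the equilibrium frequencies.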
Slide 29
Likelihood of a phylogenetic tree at one site
[Figure: a small tree with observed leaves x_1, x_2, x_3 and internal nodes x_4, x_5 on branches of lengths t_1..t_4; the likelihood is written first for the case where x_4, x_5 are known, then summed over their states when they are unknown.]
Slide 30
Likelihood calculation (Felsenstein, 1981)
Given a rooted tree with n leaf nodes (species), each leaf represented by a sequence x_i of length N, the likelihood of the rooted tree is the product over sites of the site likelihoods, each obtained by summing over the states of the internal nodes.
Slide 31
Likelihood calculation (2)
Felsenstein's algorithm for the likelihood at one site u (1981):
Initialization: set k = 2n - 1 (the root; children are processed before their parent).
Recursion: compute L_k(a) for all states a as follows.
If k is a leaf node: set L_k(a) = 1 if a equals the observed character at leaf k; set L_k(a) = 0 otherwise.
If k is not a leaf node: compute L_i(b), L_j(c) for all states of its child nodes i, j, and set L_k(a) = (sum_b p(b | a, t_i) L_i(b)) (sum_c p(c | a, t_j) L_j(c)).
Termination: the likelihood at site u is sum_a pi(a) L_root(a).
Note: algorithm modified from Durbin et al. (1998)
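The recursion can be sketched directly; this is a minimal illustration of the pruning algorithm on a hard-coded 4-taxon tree of my own, with JC69 transition probabilities and uniform root frequencies (tree shape, branch lengths, and time scaling are illustrative assumptions, not from the slides):

```python
import math

BASES = "ACGT"

def p_sub(a, b, t):
    """JC69 p(b | a, t), with time scaled so the decay rate is 1
    (an illustrative choice)."""
    decay = math.exp(-t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def cond_lik(node):
    """Conditional likelihoods L_k(a) for the subtree rooted at node.
    A node is either a leaf (its observed base, a string) or a tuple
    (left_child, left_branch_length, right_child, right_branch_length)."""
    if isinstance(node, str):                 # leaf: indicator on observed base
        return {a: (1.0 if a == node else 0.0) for a in BASES}
    left, tl, right, tr = node                # internal node: combine children
    Ll, Lr = cond_lik(left), cond_lik(right)
    return {a: sum(p_sub(a, b, tl) * Ll[b] for b in BASES) *
               sum(p_sub(a, c, tr) * Lr[c] for c in BASES)
            for a in BASES}

# one site on the rooted tree ((A, C), (A, A)), uniform root frequencies
tree = (("A", 0.1, "C", 0.1), 0.2, ("A", 0.1, "A", 0.1), 0.2)
site_lik = sum(0.25 * L for L in cond_lik(tree).values())
print(site_lik)
```

Summing this site likelihood over every possible assignment of bases to the four leaves gives exactly 1, which is a useful correctness check on any implementation.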
Slide 32
Likelihood calculation (3)
The likelihood calculation requires filling an N x M x S x R table:
N: number of sequences; M: number of sites; S: number of character states; R: number of rate categories.
[Figure: the table laid out with taxa as rows and sites as columns, each cell holding a 0/1 vector over A, C, G, T, repeated for each rate category.]
Slide 33
Local update of the likelihood
If we change the topology and branch lengths of the tree only locally, we only need to refresh the table at the affected nodes. In the example, only the nodes marked in red need to recompute their conditional likelihoods.
[Figure: original tree vs. proposed tree, with the affected nodes highlighted in red.]
Slide 34
Proposal mechanism for trees
Stochastic nearest-neighbor interchange (NNI): Larget et al. (1999), Huelsenbeck (2000).
(1) Choose a backbone; (2) change m and y randomly.
[Figure: a tree with backbone nodes u, v, leaves a, c, d, and quantities m, x, y before the move and m*, x*, y* after.]
Slide 35
Proposal mechanisms for parameters
An independent parameter, e.g. the transition/transversion ratio k: propose k* in a window (k-, k+) within [Min, Max].
A set of parameters constrained to sum to a constant, e.g. the base frequency distribution: draw a sample from the Dirichlet distribution, Larget et al. (1999).
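A Dirichlet proposal for the base frequencies can be sketched as below; centering the Dirichlet on the current frequencies and the concentration value alpha0 are my own illustrative assumptions, not the specific scheme of Larget et al. (1999):

```python
import random

def propose_frequencies(freqs, alpha0=100.0, seed=7):
    """Draw freqs* ~ Dirichlet(alpha0 * freqs) via gamma draws.
    The result stays positive and sums to 1; larger alpha0 means
    smaller proposal steps around the current frequencies."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(alpha0 * f, 1.0) for f in freqs]
    total = sum(draws)
    return [d / total for d in draws]

new = propose_frequencies([0.25, 0.25, 0.25, 0.25])
print(new, sum(new))
```

Because this proposal is not symmetric, a correct M-H step must include the ratio of the Dirichlet densities in both directions in the acceptance probability.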
Slide 36
Bayesian phylogenetic inference
[Diagram: the DNA data, the phylogenetic tree, and the evolutionary model together give the likelihood; combined with the prior probability this gives the posterior probability. MCMC starts from a starting tree, applies proposals, and produces a sequence of samples that approximate the distribution, from which the inference is drawn.]
Slide 37
Topics
Background
Bayesian phylogenetic inference and MCMC
Serial implementation
Parallel implementation
Challenges of the serial computation
Difficulty: MCMC is a serial algorithm
Multiple chains need to be synchronized
Choosing an appropriate grid topology
Synchronizing using random numbers
Numerical results
Future research
Slide 38
Computational challenge
Computing the global likelihood needs O(NMRS^2) multiplications; locally updating the topology & branch lengths needs O(MRS^2 log N); updating a model parameter needs O(NMRS^2); a local update needs all of the required data in memory.
Given N = 1000 species, each sequence with M = 5000 sites, R = 5 rate categories, and the DNA nucleotide model (S = 4), running 5 chains each for 100 million generations:
needs ~400 days (assuming 1% global updates, 99% local updates), and
~32 gigabytes of memory (O(NMRSL x 2 x 2 x 8) bytes).
=> So until more advanced algorithms are developed, parallel computation is the direct solution: using 32 processors each with 1 gigabyte of memory, we can compute the problem in ~2 weeks.
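The memory figure can be sanity-checked with explicit assumptions; the exact factors behind the slide's O(NMRSL x 2 x 2 x 8) expression are my own interpretation (a conditional-likelihood table over ~2N tree nodes, one copy per chain plus a scratch copy for proposed updates):

```python
# Back-of-envelope check of the memory scale, under stated assumptions.
N, M, R, S = 1000, 5000, 5, 4      # species, sites, rate categories, states
chains = 5                         # L = 5 coupled chains
doubles = (2 * N) * M * R * S      # a rooted binary tree has ~2N nodes
bytes_total = doubles * 8 * chains * 2   # 8-byte doubles, x2 scratch copy
print(bytes_total / 2**30, "GiB")
```

This lands in the same tens-of-gigabytes range as the slide's estimate, far beyond a single 2002-era machine, which is the point of the slide.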
Slide 39
Characteristics of good parallel algorithms
Balancing the workload: identifying and managing concurrency, and choosing its granularity
Reducing communication: the communication-to-computation ratio; frequency, volume, and balance
Reducing extra work: the computation assignment; redundant work
Slide 40
Single-chain MCMC algorithm
[Flowchart:] Generate the initial state S(0), set S(t) = S(0), t = 0. Propose a new state S'. Evaluate S'. Compute R and U. If U < R, set S(t+1) = S'; otherwise S(t+1) = S(t). Set t = t + 1; if t > the maximum generation, end; otherwise propose again.
Slide 41
Multiple-chain MCMC algorithm
[Flowchart: chains #1-#4 each generate an initial state S_i(0) with t = 0, then repeatedly propose & update S_i(t); periodically two chains are chosen to swap states, computing R and U