Incomplete Lineage Sorting: Consistent Phylogeny Estimation From
Multiple Loci
& a couple of unrelated observations
Elchanan Mossel, UC Berkeley
Joint work with: Sebastien Roch, Microsoft Research
At Newton Institute Dec 07
Lecture Plan
• A simple observation about gene trees and population trees.
• A comment on “optimal” and “absolutely converging” tree reconstruction.
• A comment on: “Generic models”.
• A comment on: “Network Reconstruction”.
• Disclaimer: last talk – a bit philosophical (but I would be happy to provide hard technical proofs).
Gene Trees and Population Trees
• Main goal in phylogenetics: recovering species/population histories.
• Data: current genes.
• Issue: in recent populations, gene trees may differ from population trees.
• Model for the evolution of trees in populations – the coalescent:
  • Fixed-size population N.
  • Each individual chooses a random parent in the previous generation.
  • # generations = N × branch length (a toy simulation is sketched after this list).
• Main question: how to reconstruct population trees from gene trees?
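A toy simulation sketch (not from the talk; function and parameter names are illustrative) of the coalescent step just described: each sampled lineage picks a uniformly random parent in the previous generation, and lineages that pick the same parent merge.

```python
import random

def coalesce_generations(num_lineages, N, num_generations):
    """Trace `num_lineages` sampled lineages back through `num_generations`
    generations of a fixed-size-N Wright-Fisher population.
    Returns the number of distinct ancestral lineages remaining."""
    lineages = list(range(num_lineages))  # arbitrary labels for the sampled copies
    for _ in range(num_generations):
        # each surviving lineage picks a uniformly random parent in the previous generation
        parents = {lin: random.randrange(N) for lin in lineages}
        # lineages choosing the same parent coalesce into one
        lineages = list(set(parents.values()))
    return len(lineages)

# Example: 10 sampled gene copies, population size N = 1000,
# a branch of length 0.5 (in units of N generations) => N * 0.5 generations.
print(coalesce_generations(10, N=1000, num_generations=int(1000 * 0.5)))
```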
Gene Trees: The Engineering Approach
• Two common “engineering” approaches:
  • Approach 1: assume all genes come from a single tree.
    • Kubatko-Degnan: inconsistent.
  • Approach 2: build a tree for each gene on its own and take the majority tree.
    • Degnan-Rosenberg: inconsistent.
• Q: What should be done instead?
Gene Trees: A Rigorous Approach
• M-Roch: a consistent estimator of the molecular distance d(P1, P2) between two populations is
• D(P1, P2) = min { d_g(P1, P2) : g ∈ Genes } (a toy computation is sketched below).
• ⇒ distances between populations are identifiable.
• ⇒ the tree is identifiable.
• Under standard coalescence assumptions, get a good rate:
  • P(topology error) ≤ (# populations) × exp(−c · # genes)
  • c = shortest branch length.
• The estimator can be “plugged in” to any distance-based method for reconstructing trees.
• In M-Roch we use NJ, but the same works for:
  • short quartets (ESSW)
  • distorted metrics and forests (M)
  • etc.
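A minimal sketch of the minimum-over-genes estimator above, assuming the per-gene distances d_g(P1, P2) have already been estimated (all names and numbers below are illustrative); the resulting matrix can then be fed to NJ or any other distance-based method.

```python
import itertools

def min_gene_distance(gene_distances, populations):
    """gene_distances: one dict per gene, mapping frozenset({P1, P2}) -> d_g(P1, P2).
    Returns D(P1, P2) = min over genes of d_g(P1, P2) for every pair."""
    D = {}
    for p1, p2 in itertools.combinations(populations, 2):
        pair = frozenset({p1, p2})
        D[pair] = min(dg[pair] for dg in gene_distances)
    return D

# Toy example with two genes and three populations (numbers are made up):
genes = [
    {frozenset({"A", "B"}): 0.30, frozenset({"A", "C"}): 0.50, frozenset({"B", "C"}): 0.45},
    {frozenset({"A", "B"}): 0.25, frozenset({"A", "C"}): 0.55, frozenset({"B", "C"}): 0.40},
]
print(min_gene_distance(genes, ["A", "B", "C"]))
```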
Comments on Absolute Convergence
• Algorithmic paradigm: want to reconstruct a tree on n species using sequence length L and running time T.
• “Absolute Convergence”: L = poly(n); T = poly(n).
• Q: Is this the best we can do?
Resolution of Steel’s conjecture [M’04], [Daskalakis-M-Roch’06]
• (Figure: the link between ancestral reconstruction and phylogenetic reconstruction.)
• Short branches: sequence length L = c log n.
• Long branches: sequence length L = n^C.
• n = # species.
• Short branches := all branches < l_c; long branches := all branches > l_c. Here l_c depends on the mutation model but not on the tree, tree size, etc.
The algorithmic challenge
• Conjecture: for short branches, if the data is generated from the model, ML identifies the correct tree using L = O(log n) samples (the best known bound is L = exp(O(n))).
• Conclusion: in order to “beat” ML, one needs algorithms with L = O(log n).
• Challenge: the constant in the O is important!
• Challenge: deal with short/long branches (contract edges; output a forest).
• Challenge: general mutation models (not just CFN, JC).
• Comment: rigorous methods come with running-time guarantees.
• Comment: for L = poly(n), we know how to deal with all the challenges:
  • ESSW
  • M’07 (forests – long edges)
  • Gornieu et al. (short edges)
On generic parameters
• From Rhodes’ talk: “generic models are easier to identify”.
• Typically this refers to generic (numerical) parameters.
• How about generic trees?
Mixtures and Phenomena in High Dims
• The geometry of high dimensions: “almost every collection of k vectors is almost orthogonal in a high enough dimension n” (illustrated numerically below).
• M-Roch (in preparation): for every k, as n → ∞ the probability that a mixture of k trees on n leaves is identifiable goes to 1.
• Holds for most reasonable measures on the space of trees and most mutation models.
• Basic idea: In generic situations can (almost) cluster samples according to trees.
• Gives an efficient algorithm.
• Similar results hold for rates across sites.
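A quick numerical illustration (not from the talk; all numbers are arbitrary) of the geometric fact invoked above: independent random unit vectors become nearly orthogonal as the dimension n grows.

```python
import math
import random

def max_abs_inner_product(k, n, seed=0):
    """Draw k random unit vectors in R^n and return the largest
    absolute inner product over distinct pairs."""
    rng = random.Random(seed)
    vecs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return max(abs(sum(a * b for a, b in zip(u, w)))
               for i, u in enumerate(vecs) for w in vecs[i + 1:])

# The maximum inner product shrinks as the dimension grows.
for n in (10, 100, 10000):
    print(n, round(max_abs_inner_product(k=5, n=n), 3))
```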
A Comment on Dynamic Programming
• Q (Zhang): given a tree, is it possible to find the most informative k species?
  • In terms of Parsimony? In terms of ML?
• Note: if we know the Parsimony/ML score for the left/right subtrees, we know it for the root (a toy recursion is sketched after this list).
• Q: Can we use dynamic programming?
• A: Yes – but with the right “data structure”.
  • Information per node: a discrete version of the set of achievable distributions.
  • Called “density evolution” in coding theory / spin-glass theory.
  • Additive error = 1/poly(n).
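A standard bottom-up parsimony recursion (a Sankoff-style sketch; this is not the density-evolution data structure from the talk, and the tree and characters below are made up) illustrating how the score of a node is assembled from the scores of its two subtrees.

```python
def sankoff_parsimony(tree, leaf_states, states=("A", "C", "G", "T")):
    """Sankoff dynamic program: cost[s] = minimum # of changes in the subtree
    if the node is assigned state s.  `tree` maps internal node -> (left, right);
    leaves appear only in `leaf_states`."""
    def costs(node):
        if node in leaf_states:
            return {s: (0 if s == leaf_states[node] else float("inf")) for s in states}
        left, right = tree[node]
        cl, cr = costs(left), costs(right)
        # combine the two subtree score vectors into the score vector at this node
        return {s: min(cl[t] + (s != t) for t in states) +
                   min(cr[t] + (s != t) for t in states) for s in states}
    return min(costs("root").values())

# Toy example: tree ((x, y), z) with one observed character per leaf.
tree = {"root": ("int", "z"), "int": ("x", "y")}
print(sankoff_parsimony(tree, {"x": "A", "y": "A", "z": "C"}))  # -> 1
```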
Hardness of Distinguishing Network Models with Hidden Nodes
• Basic question: Is it possible to recover a network G from observation at a subset of the nodes?
• Easier question: Suppose we observe X1,…,Xr. Is it possible to determine if they come from nodes S in G1 or nodes T in G2?
• Problem: It may be that the two distributions are the same.
• Assume: The two distributions are different (large total variation distance)
• Q: Assuming the two distributions are different how hard is it to tell if it’s coming from G1 or G2?
• Related question: What is a computational model of a biologist?
The distinguishing problem for Trees
• Q: Assuming the two distributions are different how hard is it to tell if it’s coming from T1 or T2?
• Note: for trees the problem is easy:
  • Perform a likelihood test.
  • Easy to do efficiently (peeling/pruning, i.e. dynamic programming); a sketch follows below.
  • # samples needed: poly(n).
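A minimal sketch of the pruning/peeling recursion behind the likelihood test, assuming a two-state symmetric (CFN-like) model with a single flip probability shared by all edges; the tree and data are made up. In practice one would evaluate the likelihood under both T1 and T2 and compare.

```python
def cfn_likelihood(tree, leaf_states, flip_prob):
    """Felsenstein pruning for a 2-state symmetric model.
    tree: internal node -> (left, right); leaf_states: leaf -> 0 or 1.
    flip_prob: probability the state flips along an edge (constant for simplicity).
    Returns P(observed leaf states), assuming a uniform root state."""
    p, q = flip_prob, 1.0 - flip_prob

    def cond(node):
        # cond(node)[s] = P(leaf data below node | node is in state s)
        if node in leaf_states:
            return [1.0 if s == leaf_states[node] else 0.0 for s in (0, 1)]
        left, right = (cond(child) for child in tree[node])
        return [(q * left[s] + p * left[1 - s]) *
                (q * right[s] + p * right[1 - s]) for s in (0, 1)]

    root = cond("root")
    return 0.5 * (root[0] + root[1])

tree = {"root": ("int", "z"), "int": ("x", "y")}
print(cfn_likelihood(tree, {"x": 0, "y": 0, "z": 1}, flip_prob=0.1))
```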
Two Models of a Biologist
• The Computationally Limited Biologist: cannot solve hard computational problems; in particular, cannot sample from the distribution defined by a general graph G.
• The Computationally Unlimited Biologist: Can sample from any distribution.
• Related to the following problem: Can nature solve computationally hard problems?
From Shapiro at Weizmann
Hardness Results
• The Computationally Limited Biologist (Bogdanov-M): the distinguishing problem can be solved efficiently iff NP = RP.
• The Computationally Unlimited Biologist (Bogdanov-M): the problem is at least zero-knowledge hard.
• Zero-Knowledge Problem: can we decide whether samples from a computationally efficiently samplable distribution come from the uniform distribution?
• Related to cryptography.
Reconstructing Networks
• Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc.
• A network defines a distribution as follows:
  • G = (V, E) = graph on [n] = {1, 2, …, n}.
  • The distribution is defined on A^V, where A is some finite set.
  • To each clique C in G, associate a function ψ_C : A^C → ℝ₊, and set
    P[σ] ∝ ∏_C ψ_C(σ_C) (a toy computation is sketched below).
• Called a Markov random field, factorized distribution, etc.
• Directed models are also common.
• Markov property: if S separates A from B, then σ_A and σ_B are conditionally independent given σ_S.
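A minimal sketch (illustrative, not from the talk) of evaluating the factorized distribution just defined: the unnormalized weight of a configuration σ is the product of the clique potentials ψ_C(σ_C).

```python
from math import prod  # Python 3.8+

def unnormalized_weight(sigma, cliques):
    """sigma: dict node -> value in A.
    cliques: list of (nodes, psi) pairs, where psi maps a tuple of values
    to a positive real.  Returns prod_C psi_C(sigma_C)."""
    return prod(psi[tuple(sigma[v] for v in nodes)] for nodes, psi in cliques)

# Toy Markov random field on the single edge {1, 2} with A = {0, 1}:
psi_edge = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
cliques = [((1, 2), psi_edge)]
print(unnormalized_weight({1: 0, 2: 0}, cliques))  # -> 2.0
```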
Reconstructing Networks.
• Task 1: given samples of σ, find G.
• Task 2: given samples of σ restricted to a set S, find G.
• Will consider the problem when n large and maximum degree d is small.
• (Note that the specification of the model has size max(n, exp(max_C |C|)).)
Reconstructing Networks – A Trivial Algorithm
• Lower bound (Bresler-M-Sly): in order to recover G of max degree d, one needs at least c·d·log n samples.
  • Proof follows by counting the number of networks.
• Upper bound (Bresler-M-Sly): if the distribution is “non-degenerate”, c·d·log n samples suffice.
• Trivial algorithm (sketched below):
  • For each v ∈ V: enumerate over candidate neighborhoods N(v).
  • For each w ∈ V, check whether v is independent of w given N(v).
• Non-degeneracy:
  • For every v and every w ∈ N(v) there exist two assignments σ₁ and σ₂ to N(v) that differ at w with d_TV(P(σ_v | σ₁), P(σ_v | σ₂)) ≥ ε.
  • For the soft-core model it suffices to have, for every edge e = (u, v): max_{a,b,c,d} |ψ_e(c,a) − ψ_e(d,a) + ψ_e(c,b) − ψ_e(d,b)| > ε.
• Running time = O(n^{d+1} log n).
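A sketch of the trivial algorithm under simplifying assumptions: the degree bound d is known, and `cond_indep` is a hypothetical helper standing in for a statistical conditional-independence test run on the samples (in the toy usage below it is replaced by an oracle that knows the true graph).

```python
from itertools import combinations

def find_neighborhood(v, nodes, d, cond_indep):
    """Return the smallest candidate set S (|S| <= d) such that v is judged
    conditionally independent of every node outside S, given S."""
    others = [u for u in nodes if u != v]
    for size in range(d + 1):
        for candidate in combinations(others, size):
            if all(cond_indep(v, w, set(candidate))
                   for w in others if w not in candidate):
                return set(candidate)
    return set(others)  # fallback; should not happen when max degree <= d

def reconstruct_graph(nodes, d, cond_indep):
    """Trivial neighborhood-search reconstruction: O(n^{d+1}) independence tests."""
    edges = set()
    for v in nodes:
        for u in find_neighborhood(v, nodes, d, cond_indep):
            edges.add(frozenset({v, u}))
    return edges

# Toy usage with an oracle for a 4-cycle; a real run would estimate
# conditional independence from samples instead.
true_edges = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
def oracle(v, w, S):
    # crude stand-in: declare independence exactly when v and w are not adjacent
    return frozenset({v, w}) not in true_edges
print(reconstruct_graph(range(4), d=2, cond_indep=oracle))
```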
A Trivial Algorithm – Related Result
• Trivial algorithm (recalled):
  • For each v ∈ V: enumerate over candidate neighborhoods N(v).
  • For each w ∈ V, check whether v is independent of w given N(v).
• Related work:
  • The algorithm was suggested before.
  • Abbeel, Koller, Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; to get O(1) KL distance, poly samples are needed).
  • Wainwright, Ravikumar, Lafferty: use L1 regularization to get the true model for Ising models, with sampling complexity O(d^5 log n) – no running-time bounds.
• Other related work: assumes a special form of the potentials.
Variants of the Trivial Algorithm
• If the graph has exponential decay of correlations, Corr(u, v) ≤ exp(−c·d(u, v)), it suffices to enumerate candidate neighborhoods N(v) only among the w’s correlated with v.
• Running time: O(n² log n + n·f(d)).
• Missing nodes: suppose G is triangle-free; then a variant of the algorithm can find one hidden node.
  • Idea (with M. Biskup’s help): run the algorithm as if the node were not hidden.
• Noise: The algorithm tolerates small amounts of noise (statistical robustness).
• Q: What about higher amounts of noise? • (From Bresler-M-Sly)
Higher Noise & a Non-Identifiable Example
• Bresler-M-Sly: an example of non-identifiability. Consider
  • G1 = a path of length 2,
  • G2 = a triangle, plus noise.
• Assume an Ising model with random interactions and random noise.
• Then, with constant probability, the two models cannot be distinguished (a brute-force check of this kind of example is sketched below).
• Ising model: P[σ] ∝ ∏_{(u,v) ∈ E} exp(β_{u,v} σ(u) σ(v)).
• Intuitive reason: the dimension of the distribution is 3 in both cases.
• (Figure legend: hidden vs. observed nodes.)
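A brute-force sketch of how one could probe such an example: enumerate all ±1 configurations, weight them by the Ising potentials, and marginalize out the hidden nodes; the graph and coupling values below are illustrative, not the Bresler-M-Sly construction itself.

```python
from itertools import product
from math import exp

def observed_marginal(edges, beta, nodes, observed):
    """Brute-force marginal of the observed spins under an Ising model
    P[sigma] ~ prod_{(u,v) in E} exp(beta_uv * sigma(u) * sigma(v))."""
    marginal, Z = {}, 0.0
    for spins in product((-1, 1), repeat=len(nodes)):
        sigma = dict(zip(nodes, spins))
        weight = exp(sum(beta[(u, v)] * sigma[u] * sigma[v] for (u, v) in edges))
        Z += weight
        key = tuple(sigma[v] for v in observed)
        marginal[key] = marginal.get(key, 0.0) + weight
    return {k: v / Z for k, v in marginal.items()}

# Illustrative only: a 2-edge path a - h - b with h hidden; couplings are made up.
# Comparing such marginals for two candidate graphs tests distinguishability.
path = [("a", "h"), ("h", "b")]
print(observed_marginal(path, {("a", "h"): 0.7, ("h", "b"): 0.4},
                        nodes=["a", "h", "b"], observed=["a", "b"]))
```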
Thanks!!
• Sebastien Roch
• Costis Daskalakis
• Andrej Bogdanov
Fascinating workshop:
Principal Organiser: Professor Mike Steel (University of Canterbury, NZ) Organisers: Professor Vincent Moulton (University of East Anglia) and
Dr Katharina Huber (University of East Anglia) Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution
As part of a great program:
Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and
Professor D Huson (Tübingen)