Incomplete Lineage Sorting: Consistent Phylogeny Estimation From
Multiple Loci
& a couple of unrelated observations
Elchanan Mossel, UC Berkeley
Joint work with: Sebastien Roch, Microsoft Research
At Newton Institute Dec 07
Lecture Plan
• A simple observation about gene trees and population trees.
• A comment on “optimal” and “absolutely converging” tree reconstruction.
• A comment on: “Generic models”.
• A comment on: “Network Reconstruction”.
• Disclaimer: last talk – a bit philosophical (but I would be happy to provide hard technical proofs).
Gene Trees and Population Trees
• Main goal in phylogenetics: recovering species/population histories.
• Data: current genes.
• Issue: in recent populations, gene trees may differ from population trees.
• Model for the evolution of trees in populations – the coalescent:
  • Fixed-size population N.
  • Each individual chooses a random parent in the previous generation.
  • # generations = N × branch length (a toy simulation is sketched after this list).
• Main question: how to reconstruct population trees from gene trees?
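A toy simulation sketch (not from the talk; function and parameter names are illustrative) of the coalescent step just described: each sampled lineage picks a uniformly random parent in the previous generation, and lineages that pick the same parent merge.

```python
import random

def coalesce_generations(num_lineages, N, num_generations):
    """Trace `num_lineages` sampled lineages back through `num_generations`
    generations of a fixed-size-N Wright-Fisher population.
    Returns the number of distinct ancestral lineages remaining."""
    lineages = list(range(num_lineages))  # arbitrary labels for the sampled copies
    for _ in range(num_generations):
        # each surviving lineage picks a uniformly random parent in the previous generation
        parents = {lin: random.randrange(N) for lin in lineages}
        # lineages choosing the same parent coalesce into one
        lineages = list(set(parents.values()))
    return len(lineages)

# Example: 10 sampled gene copies, population size N = 1000,
# a branch of length 0.5 (in units of N generations) => N * 0.5 generations.
print(coalesce_generations(10, N=1000, num_generations=int(1000 * 0.5)))
```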
Gene Trees: The Engineering Approach
• Two common “engineering” approaches:
  • Approach 1: assume all genes come from a single tree.
    • Kubatko-Degnan: inconsistent.
  • Approach 2: build a tree for each gene on its own and take the majority tree.
    • Degnan-Rosenberg: inconsistent.
• Q: What should be done instead?
Gene Trees: A Rigorous Approach
• M-Roch: a consistent estimator of the molecular distance d(P1, P2) between two populations is
• D(P1, P2) = min { d_g(P1, P2) : g ∈ Genes } (a toy computation is sketched below).
• ⇒ distances between populations are identifiable.
• ⇒ the tree is identifiable.
• Under standard coalescence assumptions, get a good rate:
  • P(topology error) ≤ (# populations) × exp(−c · # genes)
  • c = shortest branch length.
• The estimator can be “plugged in” to any distance-based method for reconstructing trees.
• In M-Roch we use NJ, but the same works for:
  • short quartets (ESSW)
  • distorted metrics and forests (M)
  • etc.
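A minimal sketch of the minimum-over-genes estimator above, assuming the per-gene distances d_g(P1, P2) have already been estimated (all names and numbers below are illustrative); the resulting matrix can then be fed to NJ or any other distance-based method.

```python
import itertools

def min_gene_distance(gene_distances, populations):
    """gene_distances: one dict per gene, mapping frozenset({P1, P2}) -> d_g(P1, P2).
    Returns D(P1, P2) = min over genes of d_g(P1, P2) for every pair."""
    D = {}
    for p1, p2 in itertools.combinations(populations, 2):
        pair = frozenset({p1, p2})
        D[pair] = min(dg[pair] for dg in gene_distances)
    return D

# Toy example with two genes and three populations (numbers are made up):
genes = [
    {frozenset({"A", "B"}): 0.30, frozenset({"A", "C"}): 0.50, frozenset({"B", "C"}): 0.45},
    {frozenset({"A", "B"}): 0.25, frozenset({"A", "C"}): 0.55, frozenset({"B", "C"}): 0.40},
]
print(min_gene_distance(genes, ["A", "B", "C"]))
```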
Comments on Absolute Convergence
• Algorithmic paradigm: want to reconstruct a tree on n species using sequence length L and running time T.
• “Absolute Convergence”: L = poly(n); T = poly(n).
• Q: Is this the best we can do?
Resolution of Steel’s conjecture [M’04], [Daskalakis-M-Roch’06]
• (Figure: the link between ancestral reconstruction and phylogenetic reconstruction.)
• Short branches: sequence length L = c log n.
• Long branches: sequence length L = n^C.
• n = # species.
• Short branches := all branches < l_c; long branches := all branches > l_c. Here l_c depends on the mutation model but not on the tree, tree size, etc.
The algorithmic challenge
• Conjecture: for short branches, if the data is generated from the model, ML identifies the correct tree using L = O(log n) samples (the best known bound is L = exp(O(n))).
• Conclusion: in order to “beat” ML, one needs algorithms with L = O(log n).
• Challenge: the constant in the O is important!
• Challenge: deal with short/long branches (contract edges; output a forest).
• Challenge: general mutation models (not just CFN, JC).
• Comment: rigorous methods come with running-time guarantees.
• Comment: for L = poly(n), we know how to deal with all the challenges:
  • ESSW
  • M’07 (forests – long edges)
  • Gornieu et al. (short edges)
On generic parameters
• From Rhodes’ talk: “generic models are easier to identify”.
• Typically this refers to generic (numerical) parameters.
• How about generic trees?
Mixtures and Phenomena in High Dims
• The geometry of high dimensions: “almost every collection of k vectors is almost orthogonal in a high enough dimension n” (illustrated numerically below).
• M-Roch (in preparation): for every k, as n → ∞ the probability that a mixture of k trees on n leaves is identifiable goes to 1.
• Holds for most reasonable measures on the space of trees and most mutation models.
• Basic idea: In generic situations can (almost) cluster samples according to trees.
• Gives an efficient algorithm.
• Similar results hold for rates across sites.
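A quick numerical illustration (not from the talk; all numbers are arbitrary) of the geometric fact invoked above: independent random unit vectors become nearly orthogonal as the dimension n grows.

```python
import math
import random

def max_abs_inner_product(k, n, seed=0):
    """Draw k random unit vectors in R^n and return the largest
    absolute inner product over distinct pairs."""
    rng = random.Random(seed)
    vecs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return max(abs(sum(a * b for a, b in zip(u, w)))
               for i, u in enumerate(vecs) for w in vecs[i + 1:])

# The maximum inner product shrinks as the dimension grows.
for n in (10, 100, 10000):
    print(n, round(max_abs_inner_product(k=5, n=n), 3))
```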
A Comment on Dynamic Programming
• Q (Zhang): given a tree, is it possible to find the most informative k species?
  • In terms of Parsimony? In terms of ML?
• Note: if we know the Parsimony/ML score for the left/right subtrees, we know it for the root (a toy recursion is sketched after this list).
• Q: Can we use dynamic programming?
• A: Yes – but with the right “data structure”.
  • Information per node: a discrete version of the set of achievable distributions.
  • Called “density evolution” in coding theory / spin-glass theory.
  • Additive error = 1/poly(n).
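A standard bottom-up parsimony recursion (a Sankoff-style sketch; this is not the density-evolution data structure from the talk, and the tree and characters below are made up) illustrating how the score of a node is assembled from the scores of its two subtrees.

```python
def sankoff_parsimony(tree, leaf_states, states=("A", "C", "G", "T")):
    """Sankoff dynamic program: cost[s] = minimum # of changes in the subtree
    if the node is assigned state s.  `tree` maps internal node -> (left, right);
    leaves appear only in `leaf_states`."""
    def costs(node):
        if node in leaf_states:
            return {s: (0 if s == leaf_states[node] else float("inf")) for s in states}
        left, right = tree[node]
        cl, cr = costs(left), costs(right)
        # combine the two subtree score vectors into the score vector at this node
        return {s: min(cl[t] + (s != t) for t in states) +
                   min(cr[t] + (s != t) for t in states) for s in states}
    return min(costs("root").values())

# Toy example: tree ((x, y), z) with one observed character per leaf.
tree = {"root": ("int", "z"), "int": ("x", "y")}
print(sankoff_parsimony(tree, {"x": "A", "y": "A", "z": "C"}))  # -> 1
```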
Hardness of Distinguishing Network Models with Hidden Nodes
• Basic question: Is it possible to recover a network G from observation at a subset of the nodes?
• Easier question: Suppose we observe X1,…,Xr. Is it possible to determine if they come from nodes S in G1 or nodes T in G2?
• Problem: It may be that the two distributions are the same.
• Assume: The two distributions are different (large total variation distance)
• Q: Assuming the two distributions are different how hard is it to tell if it’s coming from G1 or G2?
• Related question: What is a computational model of a biologist?
The distinguishing problem for Trees
• Q: Assuming the two distributions are different how hard is it to tell if it’s coming from T1 or T2?
• Note: for trees the problem is easy:
  • Perform a likelihood test.
  • Easy to do efficiently (peeling/pruning, i.e. dynamic programming); a sketch follows below.
  • # samples needed: poly(n).
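A minimal sketch of the pruning/peeling recursion behind the likelihood test, assuming a two-state symmetric (CFN-like) model with a single flip probability shared by all edges; the tree and data are made up. In practice one would evaluate the likelihood under both T1 and T2 and compare.

```python
def cfn_likelihood(tree, leaf_states, flip_prob):
    """Felsenstein pruning for a 2-state symmetric model.
    tree: internal node -> (left, right); leaf_states: leaf -> 0 or 1.
    flip_prob: probability the state flips along an edge (constant for simplicity).
    Returns P(observed leaf states), assuming a uniform root state."""
    p, q = flip_prob, 1.0 - flip_prob

    def cond(node):
        # cond(node)[s] = P(leaf data below node | node is in state s)
        if node in leaf_states:
            return [1.0 if s == leaf_states[node] else 0.0 for s in (0, 1)]
        left, right = (cond(child) for child in tree[node])
        return [(q * left[s] + p * left[1 - s]) *
                (q * right[s] + p * right[1 - s]) for s in (0, 1)]

    root = cond("root")
    return 0.5 * (root[0] + root[1])

tree = {"root": ("int", "z"), "int": ("x", "y")}
print(cfn_likelihood(tree, {"x": 0, "y": 0, "z": 1}, flip_prob=0.1))
```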
Two Models of a Biologist
• The Computationally Limited Biologist: cannot solve hard computational problems; in particular, cannot sample from the distribution defined by a general graph G.
• The Computationally Unlimited Biologist: Can sample from any distribution.
• Related to the following problem: Can nature solve computationally hard problems?
From Shapiro at Weizmann
Hardness Results
• The Computationally Limited Biologist (Bogdanov-M): the distinguishing problem can be solved efficiently iff NP = RP.
• The Computationally Unlimited Biologist (Bogdanov-M): the problem is at least zero-knowledge hard.
• Zero-Knowledge Problem: can we decide whether samples from a computationally efficiently samplable distribution come from the uniform distribution?
• Related to cryptography.
Reconstructing Networks
• Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc.
• A network defines a distribution as follows:
  • G = (V, E) = graph on [n] = {1, 2, …, n}.
  • The distribution is defined on A^V, where A is some finite set.
  • To each clique C in G, associate a function ψ_C : A^C → ℝ₊, and set
    P[σ] ∝ ∏_C ψ_C(σ_C) (a toy computation is sketched below).
• Called a Markov random field, factorized distribution, etc.
• Directed models are also common.
• Markov property: if S separates A from B, then σ_A and σ_B are conditionally independent given σ_S.
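A minimal sketch (illustrative, not from the talk) of evaluating the factorized distribution just defined: the unnormalized weight of a configuration σ is the product of the clique potentials ψ_C(σ_C).

```python
from math import prod  # Python 3.8+

def unnormalized_weight(sigma, cliques):
    """sigma: dict node -> value in A.
    cliques: list of (nodes, psi) pairs, where psi maps a tuple of values
    to a positive real.  Returns prod_C psi_C(sigma_C)."""
    return prod(psi[tuple(sigma[v] for v in nodes)] for nodes, psi in cliques)

# Toy Markov random field on the single edge {1, 2} with A = {0, 1}:
psi_edge = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
cliques = [((1, 2), psi_edge)]
print(unnormalized_weight({1: 0, 2: 0}, cliques))  # -> 2.0
```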
Reconstructing Networks.
• Task 1: given samples of σ, find G.
• Task 2: given samples of σ restricted to a set S, find G.
• Will consider the problem when n large and maximum degree d is small.
• (Note that the specification of the model has size max(n, exp(max_C |C|)).)
Reconstructing Networks – A Trivial Algorithm
• Lower bound (Bresler-M-Sly): in order to recover G of max degree d, one needs at least c·d·log n samples.
  • Proof follows by counting the number of networks.
• Upper bound (Bresler-M-Sly): if the distribution is “non-degenerate”, c·d·log n samples suffice.
• Trivial algorithm (sketched below):
  • For each v ∈ V: enumerate over candidate neighborhoods N(v).
  • For each w ∈ V, check whether v is independent of w given N(v).
• Non-degeneracy:
  • For every v and every w ∈ N(v) there exist two assignments σ₁ and σ₂ to N(v) that differ at w with d_TV(P(σ_v | σ₁), P(σ_v | σ₂)) ≥ ε.
  • For the soft-core model it suffices to have, for every edge e = (u, v): max_{a,b,c,d} |ψ_e(c,a) − ψ_e(d,a) + ψ_e(c,b) − ψ_e(d,b)| > ε.
• Running time = O(n^{d+1} log n).
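A sketch of the trivial algorithm under simplifying assumptions: the degree bound d is known, and `cond_indep` is a hypothetical helper standing in for a statistical conditional-independence test run on the samples (in the toy usage below it is replaced by an oracle that knows the true graph).

```python
from itertools import combinations

def find_neighborhood(v, nodes, d, cond_indep):
    """Return the smallest candidate set S (|S| <= d) such that v is judged
    conditionally independent of every node outside S, given S."""
    others = [u for u in nodes if u != v]
    for size in range(d + 1):
        for candidate in combinations(others, size):
            if all(cond_indep(v, w, set(candidate))
                   for w in others if w not in candidate):
                return set(candidate)
    return set(others)  # fallback; should not happen when max degree <= d

def reconstruct_graph(nodes, d, cond_indep):
    """Trivial neighborhood-search reconstruction: O(n^{d+1}) independence tests."""
    edges = set()
    for v in nodes:
        for u in find_neighborhood(v, nodes, d, cond_indep):
            edges.add(frozenset({v, u}))
    return edges

# Toy usage with an oracle for a 4-cycle; a real run would estimate
# conditional independence from samples instead.
true_edges = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
def oracle(v, w, S):
    # crude stand-in: declare independence exactly when v and w are not adjacent
    return frozenset({v, w}) not in true_edges
print(reconstruct_graph(range(4), d=2, cond_indep=oracle))
```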
A Trivial Algorithm – Related Result
• Trivial algorithm (recalled):
  • For each v ∈ V: enumerate over candidate neighborhoods N(v).
  • For each w ∈ V, check whether v is independent of w given N(v).
• Related work:
  • The algorithm was suggested before.
  • Abbeel, Koller, Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; to get O(1) KL distance, poly samples are needed).
  • Wainwright, Ravikumar, Lafferty: use L1 regularization to get the true model for Ising models, with sampling complexity O(d^5 log n) – no running-time bounds.
• Other related work: assumes a special form of the potentials.
Variants of the Trivial Algorithm
• If the graph has exponential decay of correlations, Corr(u, v) ≤ exp(−c·d(u, v)), it suffices to enumerate candidate neighborhoods N(v) only among the w’s correlated with v.
• Running time: O(n² log n + n·f(d)).
• Missing nodes: suppose G is triangle-free; then a variant of the algorithm can find one hidden node.
  • Idea (with M. Biskup’s help): run the algorithm as if the node were not hidden.
• Noise: The algorithm tolerates small amounts of noise (statistical robustness).
• Q: What about higher amounts of noise? • (From Bresler-M-Sly)
Higher Noise & a Non-Identifiable Example
• Bresler-M-Sly: an example of non-identifiability. Consider
  • G1 = a path of length 2,
  • G2 = a triangle, plus noise.
• Assume an Ising model with random interactions and random noise.
• Then, with constant probability, the two models cannot be distinguished (a brute-force check of this kind of example is sketched below).
• Ising model: P[σ] ∝ ∏_{(u,v) ∈ E} exp(β_{u,v} σ(u) σ(v)).
• Intuitive reason: the dimension of the distribution is 3 in both cases.
• (Figure legend: hidden vs. observed nodes.)
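A brute-force sketch of how one could probe such an example: enumerate all ±1 configurations, weight them by the Ising potentials, and marginalize out the hidden nodes; the graph and coupling values below are illustrative, not the Bresler-M-Sly construction itself.

```python
from itertools import product
from math import exp

def observed_marginal(edges, beta, nodes, observed):
    """Brute-force marginal of the observed spins under an Ising model
    P[sigma] ~ prod_{(u,v) in E} exp(beta_uv * sigma(u) * sigma(v))."""
    marginal, Z = {}, 0.0
    for spins in product((-1, 1), repeat=len(nodes)):
        sigma = dict(zip(nodes, spins))
        weight = exp(sum(beta[(u, v)] * sigma[u] * sigma[v] for (u, v) in edges))
        Z += weight
        key = tuple(sigma[v] for v in observed)
        marginal[key] = marginal.get(key, 0.0) + weight
    return {k: v / Z for k, v in marginal.items()}

# Illustrative only: a 2-edge path a - h - b with h hidden; couplings are made up.
# Comparing such marginals for two candidate graphs tests distinguishability.
path = [("a", "h"), ("h", "b")]
print(observed_marginal(path, {("a", "h"): 0.7, ("h", "b"): 0.4},
                        nodes=["a", "h", "b"], observed=["a", "b"]))
```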
Thanks!!
• Sebastien Roch
• Costis Daskalakis
• Andrej Bogdanov
Fascinating workshop:
Principal Organiser: Professor Mike Steel (University of Canterbury, NZ) Organisers: Professor Vincent Moulton (University of East Anglia) and
Dr Katharina Huber (University of East Anglia) Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution
As part of a great program:
Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and
Professor D Huson (Tübingen)