How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences. Tanya Y. Berger-Wolf, Laboratory for High-Performance Algorithm Engineering and Computational Biology, Dept. of Computer Science, University of New Mexico. www.compbio.unm.edu
Phylogeny Reconstruction
[Figure: phylogenetic tree with leaves Orangutan, Chimpanzee, Human, Gorilla]
Phylogeny Reconstruction
1. Get an estimate of evolutionary distance between species
2. Treat the species as a set of points with pairwise distance measure
3. Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points
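To make step 3 concrete for the parsimony criterion, the score of a single candidate tree can be computed with Fitch's algorithm. The following is a minimal sketch, assuming rooted binary trees written as nested pairs and one character per sequence position. The function names and the toy sequences are illustrative assumptions, not taken from the slides.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping leaf name -> character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the Fitch change counts over all sites of equal-length sequences."""
    length = len(next(iter(sequences.values())))
    total = 0
    for site in range(length):
        column = {name: seq[site] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total


# Toy data (made up for illustration only).
seqs = {"Orangutan": "AAGT", "Gorilla": "ACGT",
        "Chimpanzee": "ACCT", "Human": "ACCA"}
tree_a = ("Orangutan", ("Gorilla", ("Chimpanzee", "Human")))
tree_b = (("Orangutan", "Human"), ("Gorilla", "Chimpanzee"))
score_a = parsimony_score(tree_a, seqs)
score_b = parsimony_score(tree_b, seqs)
```

On this toy input `tree_a` needs fewer changes than `tree_b`, so a parsimony search would prefer it. The hard part noted later in the talk is searching over all tree topologies, which this sketch does not attempt.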
Overview of My Research
• Computational Phylogeny
  – Comparison of methods that combine trees (greed is bad)
  – Topological accuracy of maximum parsimony
    • Is optimal necessary?
    • How to know when “good enough”?
  – Online consensus and other statistics
  – Heterogeneous data in phylogeny
• Controlled animal breeding strategies
Computational Pitfalls
• Resulting optimization problems are hard
• Existing heuristics expensive on large datasets
• Same score – many topologies
• True tree is unknown
⇓ When to stop and what to return?
Consensus Methods
[Figure: input trees on leaves A, B, C, D, E are combined (+) into a single consensus tree (=)]
“Consensus is what many people say in chorus but do not believe as individuals”
Abba Eban (1915–2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990
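Consensus Methods: Strict
McMorris et al. (83)
[Figure: three input trees on leaves A, B, C, D, E]
Clades of the input trees:
AB, CD, ABCD, ABCDE
AB, ABC, DE, ABCDE
BCD, ABCD, ABCDE
Strict: contains clades common to all trees
[Figure: strict consensus tree on leaves A, B, C, D, E]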
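Consensus Methods: Majority
Margush & McMorris (81), McMorris et al. (83), Barthelemy & McMorris (86)
[Figure: three input trees on leaves A, B, C, D, E]
Clades of the input trees:
AB, CD, ABCD, ABCDE
AB, ABC, DE, ABCDE
BCD, ABCD, ABCDE
Majority: contains clades common to a majority of the trees
Clades other than ABCDE, per tree: AB, CD, ABCD; AB, ABC, DE; BCD, ABCD
[Figure: majority consensus tree on leaves A, B, C, D, E]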
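The strict and majority rules can both be written as a count over clades. The sketch below assumes each tree is represented only by its set of clades (each clade a frozenset of leaf names) and uses the three example trees from the slides. The helper names `clades`, `strict_consensus`, and `majority_consensus` are illustrative.

```python
from collections import Counter


def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}


def clade_counts(trees):
    """Count in how many trees each clade appears."""
    return Counter(clade for tree in trees for clade in tree)


def strict_consensus(trees):
    """Clades present in every input tree."""
    counts = clade_counts(trees)
    return {c for c, m in counts.items() if m == len(trees)}


def majority_consensus(trees):
    """Clades present in more than half of the input trees."""
    counts = clade_counts(trees)
    return {c for c, m in counts.items() if 2 * m > len(trees)}


trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]
strict = strict_consensus(trees)      # only ABCDE is in all three trees
majority = majority_consensus(trees)  # AB and ABCD each appear in two trees
```

For these inputs the strict consensus is the unresolved star tree (only ABCDE), while the majority consensus also keeps AB and ABCD.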
Stopping Maximum Parsimony
(joint work with T. Williams, B.M.E. Moret, U. Roshan, T. Warnow)
If we return the majority consensus of the top-scoring trees, how early can we stop without changing the outcome? Which stopping criteria should we use?
Biological datasets:
• three567: “three-gene” (rbcL, atpB, and 18S) DNA sequences (Soltis et al., 2000)
• aster328: ITS RNA sequences from the plant family Asteraceae (Gutell Lab, ICMB, UT Austin)
• ocho854: rbcL DNA sequences (Goloboff, 1999)
• lipsc439: rDNA sequences of Eukaryotes (Goloboff, 1999)
• john921: Avian Cytochrome b DNA sequences (Johnson, 2001)
• eern476: Metazoan DNA sequences (Goloboff, 1999)
• will2000: Eukaryotic sRNA sequences (Gutell Lab, ICMB, UT Austin)
• rbcL500: rbcL DNA sequences (Rice et al., 1997)
• mari2594: rbcL DNA sequences (Källersjö et al., 1998)
Experiment Design
[Figure: flowchart starting from aligned DNA sequences]
1. Biological dataset
2. Run parsimony ratchet (PAUP*): 500 iterations, 5 repetitions. Save the tree at each iteration.
3. Compute two consensus trees:
   • Majority consensus of optimal trees (PAUP*), where optimal = best-scoring trees in all repetitions
   • Majority consensus of best and second best so far
4. Output consensus tree
Results
[Figures: one plot per dataset (rbcL500, aster328, ocho854, mari2594). x-axis: Iteration (0 to 500). Left y-axis: RF rate (%). Right y-axis: MP score (%), log scale. Series: Optimal-best MRC, Best-second best MRC, Score error (from optimal).]
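The RF rate on the plots compares two consensus trees by the clades they do not share. A minimal sketch follows, assuming trees are given as sets of clades. Normalizing by the total number of clades in both trees is an assumption made here; the slides do not state which normalization was used.

```python
def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}


def rf_rate(clades_a, clades_b):
    """Percentage of clades that occur in exactly one of the two trees.

    Assumed normalization: the size of the symmetric difference divided
    by the total number of clades in both trees.
    """
    total = len(clades_a) + len(clades_b)
    if total == 0:
        return 0.0
    return 100.0 * len(clades_a ^ clades_b) / total


tree_1 = clades("AB", "CD", "ABCD")
tree_2 = clades("AB", "ABC", "ABCD")
same = rf_rate(tree_1, tree_1)
different = rf_rate(tree_1, tree_2)  # CD and ABC are unshared: 2 of 6 clades
```

Under this definition identical trees score 0%, and the curves in the plots would fall toward 0 as the intermediate consensus approaches the final one.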
Online Consensus
Input: T1, T2, …, Tk with n leaves, one at a time
Output: Majority Consensus tree Mi of T1, …, Ti
Solution: Maintain a set of clades C with counters
When Ti arrives, we need to consider only the clades in Ti and M(i-1), a total of 2n

Data structure               Time        Space
Self-balancing binary tree   O(n lg n)   O(|C|)
Hash table, h = O(n^2)       O(n)        O(n^2)
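The update rule can be sketched with a hash table of counters. This is a minimal illustration, assuming trees arrive as sets of clades and using Python's built-in `Counter` as the hash table. The class name `OnlineMajority` is hypothetical. The key observation from the slide is that a clade absent from both Ti and M(i-1) cannot be in Mi, so only those candidates are re-examined.

```python
from collections import Counter


def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}


class OnlineMajority:
    """Maintain the majority consensus clades as trees arrive one at a time."""

    def __init__(self):
        self.counts = Counter()  # clade -> number of trees seen that contain it
        self.seen = 0            # number of trees seen so far (i)
        self.majority = set()    # clades of the current consensus M_i

    def add_tree(self, tree_clades):
        self.seen += 1
        self.counts.update(tree_clades)
        # Only clades in the new tree or in the previous consensus can qualify.
        candidates = set(tree_clades) | self.majority
        self.majority = {c for c in candidates
                         if 2 * self.counts[c] > self.seen}
        return self.majority


online = OnlineMajority()
history = [set(online.add_tree(t)) for t in [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]]
```

After the third tree the maintained set equals the batch majority consensus of all three trees, while each update touches only the candidate clades rather than the whole table.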
Conclusions and Future
• Evidence that we can stop the parsimony search early
• Need simulations and more data to verify
• Collect statistics other than consensus
• Other stopping criteria
• Different representation of final sets of trees
• Other methods
Wait! There is more!
Part II: Heterogeneous Data
(joint work with Tandy Warnow)
Heterogeneous Data
Molecular data: DNA and genomes
Pros
• Have distance measure
• Unambiguous
• Many characters
Cons
• No data for extinct species
• Difficulties with ancient evolutionary events
• Recombination, repeated evolution
Heterogeneous Data
Paleontological, morphological, geographical, historical data
Pros
• Easy to sample
• Sometimes is the only available information
• Has been used for a century
Cons
• Character states hard to determine
• Genetic basis not known
• No distance measure
• Subjective
Data As Constraints
Constraints, not distance!
• Positive: these species are together (phylogenetic trees, presence of a morphological character)
• Negative: these species are not together (above + geography, fossils)
• Temporal: these events happened in this order (fossils, history)
• Frequency: this event happens more often than another (adaptation mechanisms)
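Consensus Methods: Greedy
[Figure: three input trees on leaves A, B, C, D, E]
Clades of the input trees:
AB, CD, ABCD, ABCDE
AB, ABC, DE, ABCDE
BCD, ABCD, ABCDE
Greedy: resolves majority by adding compatible clades
[Figure: trees on leaves A, B, C, D, E shown with the clade sets AB, CD, ABCD and AB, ABC, DE]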
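The greedy rule can be sketched as follows: order the clades by how many input trees contain them, then add each clade that is compatible with everything already chosen. This is a minimal illustration on clade sets. The alphabetical tie-break among equally frequent clades is an arbitrary assumption, and the function names are illustrative.

```python
from collections import Counter


def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}


def compatible(a, b):
    """Two clades can share a tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def greedy_consensus(trees):
    counts = Counter(clade for tree in trees for clade in tree)
    # Most frequent first; the alphabetical tie-break is an arbitrary choice.
    order = sorted(counts, key=lambda c: (-counts[c], "".join(sorted(c))))
    chosen = []
    for clade in order:
        if all(compatible(clade, other) for other in chosen):
            chosen.append(clade)
    return set(chosen)


trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]
greedy = greedy_consensus(trees)
```

With this tie-break the result adds ABC to the majority clades AB, ABCD, and ABCDE. A different tie-break would add CD instead, which illustrates that the greedy result depends on the order in which equally supported clades are tried.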
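Consensus Methods: AMT
Phillips & Warnow (95)
[Figure: three input trees on leaves A, B, C, D, E]
Clades of the input trees:
AB, CD, ABCD, ABCDE
AB, ABC, DE, ABCDE
BCD, ABCD, ABCDE
Asymmetric Median Tree: maximum (weighted) collection of compatible clades
[Figure: the distinct clades AB, ABC, ABCD, BCD, DE, CD]
Compatible clade collections shown:
AB, CD, ABCD, ABCDE
AB, ABC, ABCD, ABCDE
AB, CD, ABCD, ABCDE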
Consensus of Positive Constraints
Formalize the constraint, go through the existing consensus methods, and see whether each satisfies it or can be extended to satisfy it
[Table: columns Positive Constraints, Strict + res, Maj + res, Grdy, AMT, Input; per-method entries not recoverable. Rows:]
• All input have isomorphic T ⇒ all output have T
• One input has isomorphic T, no contradictions ⇒ output have T
• All input have clade π ⇒ all output have π
• One input has clade π, no contradictions ⇒ output have π
Partially from Steel et al. 2000
Consensus of Negative Constraints
1. a and b are separated by C
2. C is closer to a than b – same as positive
[Table: columns Negative Constraints, Strict + res, Maj + res, Grdy, AMT, Input; per-method entries not recoverable. Rows:]
• All input have 1 ⇒ all output have 1
• One input has 1, no contradictions ⇒ output have 1
Conclusions and Future (Part II)
• Existing methods are insufficient
• (Consensus with respect to temporal, frequency constraints)
• Developing new methods that preserve 4 types of constraints
• Network phylogeny
• Error measure and evaluation of quality
Even Bigger Future
• Phylogeny
  • Getting good reconstructions fast
  • Heterogeneous data
  • Network phylogeny
• Epidemiology
  • Flu SIR model, combining data
  • Vaccination strategies
• Population biology
  • Discrete methods for small populations (esp. conservation)
Work is supported by the National Science Foundation postdoctoral fellowship grant EIA 02-03584
Thank you
Controlled Breeding
(joint work with Cris Moore and Jared Saia)
Given an initial population of animals, design a mating strategy that achieves a breeding goal (within the shortest time)
Controlled Breeding: Background
• Conservation Biology and Agriculture
• Breeding strategies: designed and evaluated empirically or using stochastic time-step modeling
• Empirical evaluation – too slow!
• Stochastic modeling – mathematically and biologically inappropriate.
• Classic algorithm design problem
Breeding All Possible Animals
Given k binary strings of length n, design an algorithm that produces all possible strings with the smallest expected number of matings
Greedy: mate the two animals with the highest probability of producing a new string
Upper bound: 2.32 · 2^n
Breeding a Target Animal
Given k strings of length n, design an algorithm that produces a target string with the smallest expected number of matings
Alg 1: breed for one trait at a time: O(n lg n)
Alg 2: breed the animals closest to the target: O(n^2)
Algorithm: One Trait at a Time
AddOneTrait(11…100…0, 00…010…0)
  x = 11…100…0    (ones in bits 1..i)
  y = 00…010…0    (a single one in bit i+1)
  While (y has < i+1 ones) do
    Mate x and y twice
    y = string with 1 in bit (i+1)
  Return y

The Algorithm(e1, e2, …, en)
  x = e1
  For i = 2..n do
    x = AddOneTrait(x, ei)
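The one-trait-at-a-time procedure can be simulated directly. The sketch below rests on assumptions the slides do not spell out: each offspring bit is inherited independently and uniformly from one of the two parents, the target is the all-ones string, and among the strings carrying the new trait the one with the most ones is kept. Bit positions are 0-based here, and all function names are illustrative.

```python
import random


def mate(x, y, rng):
    """Assumed model: each bit comes from a uniformly random parent."""
    return tuple(rng.choice(pair) for pair in zip(x, y))


def add_one_trait(x, y, i, rng):
    """x has ones in bits 0..i-1 and y has a one in bit i.

    Returns (string with ones in bits 0..i, number of matings used).
    """
    matings = 0
    while sum(y[: i + 1]) < i + 1:
        offspring = [mate(x, y, rng), mate(x, y, rng)]
        matings += 2
        # Keep only strings that carry trait i; y itself always qualifies.
        candidates = [s for s in offspring + [y] if s[i] == 1]
        y = max(candidates, key=sum)
    return y, matings


def breed_all_ones(n, rng):
    """Start from the n unit strings e_1..e_n and breed the all-ones string."""
    unit = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]
    x, total = unit[0], 0
    for i in range(1, n):
        x, matings = add_one_trait(x, unit[i], i, rng)
        total += matings
    return x, total


target, total_matings = breed_all_ones(8, random.Random(1))
```

The number of matings varies with the random seed, but the loop always ends with the all-ones string. Averaging `total_matings` over many seeds and several values of n is one way to compare the simulated cost with the O(n lg n) bound stated for Alg 1.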
More Realistic Breeding
• Gender
• Variable probability of outcome
• Deaths
• Minimize number of generations
• Goal: maximum diversity
• On-line: maintain the distribution