Barking up the wrong tree: gaps in current phylogenetic methodology and a dawg-gone good solution

Reed A. Cartwright
Department of Genetics
University of Georgia

3.12.2005 RA Cartwright [email protected] - http://scit.us/

Phylogenies

Phylogenies are not known. They are inferred.

The accuracy of the inference is dependent on the quality of the data and quality of the methodology.

Estimating Phylogenies

Retrieve sequences from a database or a sequencer.

Align sequences.

Estimate the phylogeny from the alignment.

However, bad alignments give bad phylogenies.

How do we know it works?

Intuition

Concordance

Using rare instances of known phylogenies.

Simulations

Why Simulate Phylogenies?

Techniques are often based on certain models of evolution.

Simulating sequence evolution based on these models produces an ideal situation to test the techniques.

Using other models can test how robust a technique is.

Testing Procedure

1. Start with a “known” tree.

(Figure: tree with tips A, B, C, D.)

2. Simulate sequence sets based on the tree.

A AATTCTTTGAGTTAA
B AATTCTTTGAGTTAA
C AATTCTTAAAGTTAA
D AATTCTTAAAGTTAA

A AAAAGATAAAGCAAA--A
B GAAAGATAAAGCAAA--A
C GAAAGATAAAGAAAAACA
D GAAAGATAAAGAAAAACA

3. Estimate the trees of the simulated data.

(Figure: estimated trees, each with tips A, B, C, D.)

4. Compare estimated trees to the original tree.

Measuring the accuracy of estimates

What do you want to measure, topology or branch lengths?

Simple binary system: correct or wrong.

More flexible system: accuracy of clade estimation.

Clade Estimation

Sensitivity = TP/(TP+FN)
Specificity = TN/(FP+TN)
Positive Predictive Accuracy = TP/(TP+FP)
Negative Predictive Accuracy = TN/(FN+TN)

                      Estimated Clade
                      Yes               No
Actual Clade   Yes    True Positive     False Negative
               No     False Positive    True Negative
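
The four ratios on this slide can be computed once the clades of the known tree and of an estimated tree are written as sets of tip names. The following is only a sketch under assumptions that are not in the talk: the trees are rooted, a clade is a set of 2 to n-1 tips, and the true negatives are counted over every such candidate subset. The function names are hypothetical.

```python
from itertools import combinations


def candidate_clades(tips):
    """All tip subsets that could be a non-trivial clade of a rooted tree."""
    tips = sorted(tips)
    return {frozenset(c)
            for size in range(2, len(tips))
            for c in combinations(tips, size)}


def clade_confusion(actual, estimated, candidates):
    """Count (TP, FN, FP, TN) over the candidate clades."""
    actual, estimated = set(actual), set(estimated)
    tp = len(actual & estimated)      # clades present in both trees
    fn = len(actual - estimated)      # real clades the estimate missed
    fp = len(estimated - actual)      # estimated clades that are not real
    tn = len(set(candidates) - actual - estimated)
    return tp, fn, fp, tn


def clade_metrics(tp, fn, fp, tn):
    """The four ratios from the slide; NaN when a denominator is zero."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "positive_predictive_accuracy": ratio(tp, tp + fp),
        "negative_predictive_accuracy": ratio(tn, fn + tn),
    }


# Known tree ((A,B),(C,D)) versus an estimate (((A,B),C),D).
actual = {frozenset("AB"), frozenset("CD")}
estimated = {frozenset("AB"), frozenset("ABC")}
counts = clade_confusion(actual, estimated, candidate_clades("ABCD"))
metrics = clade_metrics(*counts)
print(counts, metrics)
```

Because the number of candidate clades grows quickly with the number of tips, the true-negative count dominates specificity and negative predictive accuracy on larger trees.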

How to best combine different clades into one metric?

Look at each clade separately?

Lump all clades of the same size together?

Lump all clades of different sizes together?

How to do it?

Alignment Techniques

The quality of a phylogeny depends on the quality of an alignment.

There are two different classes of alignment techniques.

Pairwise alignments

Multiple alignments

Pairwise alignments

Pairwise alignments align pairs of sequences.

They typically use a spreadsheet-like technique (dynamic programming over a score table) with penalties for mismatches and gaps.
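
As an illustration of the spreadsheet-like technique, here is a minimal sketch of global pairwise alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example, and the gap penalty is a flat per-column cost, the simplest possible choice.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences by filling a score table."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # (mis)match
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the bottom-right cell to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_align("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Each cell of the table depends only on its three neighbours above and to the left, which is why the method resembles filling in a spreadsheet.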

Multiple alignments

Align multiple sequences.

Cannot directly use the techniques used for pairwise alignments.

Typical implementations use a guide tree derived from sequence similarity scores of pairwise alignments.

Alignment Models

The typical model of alignment is the affine gap model.

In this model the cost of a gap is a linear function of the size of the gap: C(L) = O + E*L, where O is the gap-opening cost and E is the gap-extension cost.

This corresponds to a geometric-exponential model of gap sizes.

Power-Law

The only problem with this is that indels do not obey this model.

Several studies, and some theory, have found that indel sizes in nucleic acids and proteins obey a power law.

The appropriate cost model for a power law is logarithmic: C(L) = O + E*log(L).
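
The link between a cost model and a gap-size distribution can be checked numerically. This sketch assumes that a cost is read as a negative log-weight, so a gap of length L has relative weight exp(-C(L)); the opening and extension values are arbitrary illustration values, not recommended settings.

```python
import math


def affine_cost(length, gap_open=10.0, gap_extend=2.0):
    """Affine gap cost: linear in the gap length."""
    return gap_open + gap_extend * length


def log_cost(length, gap_open=10.0, gap_extend=2.0):
    """Logarithmic gap cost: linear in the log of the gap length."""
    return gap_open + gap_extend * math.log(length)


def implied_weight(cost):
    """Relative weight of a gap when cost is taken as a negative log-weight."""
    return math.exp(-cost)


for length in (1, 2, 4, 8, 16):
    print(length,
          implied_weight(affine_cost(length)),
          implied_weight(log_cost(length)))

# Affine: each extra position multiplies the weight by exp(-E) (geometric).
step_ratio = implied_weight(affine_cost(5)) / implied_weight(affine_cost(4))
# Logarithmic: doubling the length multiplies the weight by 2**-E (power law).
doubling_ratio = implied_weight(log_cost(8)) / implied_weight(log_cost(4))
```

Under the affine cost the weight falls by a constant factor per added position, whereas under the logarithmic cost it falls by a constant factor per doubling, so long gaps are penalized far less severely.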

So what?

Only one program does logarithmic alignments, and it only works on protein pairs: Monotone by Richard Mott.

Clustal uses the affine gap model.

On top of that, Clustal uses a simple evolutionary model to estimate a guide tree for aligning multiple sequences.

Dealing with gaps

Sequences are typically aligned before they are phylogenized.

This is silly. We should estimate alignments at the same time we estimate phylogenies.

For now we are stuck doing it in pieces, but must be wary of introducing biases into our phylogenies when aligning.

Dealing with gaps

Gaps contain phylogenetic signal which is usually ignored by researchers.

Can we look at how gaps influence phylogenetics?

Dealing with gaps

To study how gaps can influence phylogenies we need a program that can simulate molecular evolution with indels.

However, existing programs model indel formation rather poorly if they do at all.

Dawg is the solution to this gap

Dawg stands for “DNA Assembly With Gaps.”

A portable and robust program for simulating molecular evolution.

Development Website: http://scit.us/dawg/

Comparing Software

Feature                    Seq-Gen   Evolver   Rose   Dawg
Indels                     -         -         Yes    Yes
Indel Parameter Estimator  -         -         -      Yes
Recombination              Yes       -         -      Yes
Substitution               GTR       GTR       PAM    GTR
Rate Heterogeneity         Γ+I       Γ         Γ+I    Γ+I
Input Format               Switch    File      File   File
Unix                       Yes       Yes       Yes    Yes
Mac OS X                   Yes       Yes       Yes    Yes
Win32                      Yes       Yes       -      Yes

Parameters

Tree           phylogeny
TreeScale      coefficient to scale branch lengths by
Sequence       root sequences
Length         length of generated root sequences
Rates          rate of evolution of each root nucleotide
Model          model of evolution: GTR|JC|K2P|K3P|HKY|F81|F84|TN
Freqs          nucleotide (ACGT) frequencies
Params         parameters for the model of evolution
Width          block width for indels and recombination
Scale          block position scales
Gamma          coefficients of variance for rate heterogeneity
Alpha          shape parameters
Iota           proportions of invariant sites
GapModel       models of indel formation: NB|PL|US
Lambda         rates of indel formation
GapParams      parameter for the indel model
Reps           number of data sets to output
File           output file
Format         output format: Fasta|Nexus|Phylip|Clustal
GapSingleChar  output gaps as a single character
GapPlus        distinguish insertions from deletions in alignment
LowerCase      output sequences in lowercase
Translate      translate output sequences to amino acids
NexusCode      text or file to include between datasets in Nexus format
Seed           PRNG seed (integers)

Sample Input File

# example.dawg
Tree = ((AY727331:0.001359,AY727330:0.001359):0.084512,
        (AY727327:0.006116,AY727326:0.006116):0.079756);
Model = "GTR"
Params = {1.08031, 2.45581, 0.44452, 1.09145, 4.06519, 1.00000}
Freqs = {0.353470, 0.143681, 0.178206, 0.324643}
Length = 300
Lambda = 0.143120
GapModel = "NB"
GapParams = {1, 0.753247}
Format = "Clustal"
File = "example.aln"
Seed = 1981

CLUSTAL multiple sequence alignment (Created by DAWG Version 1.0.0)

AY727326 TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAAAAAGATAAAGCAAA--A
AY727327 TTCGAAAATATGTTAGTACTCAATATGAATTCTTTGAGTTAAGAAAGATAAAGCAAA--A
AY727330 TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA
AY727331 TTCAAAAATATGCTAGGACTGAATATGAATTCTTAAAGTTAAGAAAGATAAAGAAAAACA

AY727326 ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT
AY727327 ATACATAATGTGATTTCAATATTCCAATTACCTAACAATACGGCTATCAATTAAACGATT
AY727330 GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT
AY727331 GTACATAATGTAAA----TTATTGCAA---------AAAACGGCTAACAATTAGACGATT

AY727326 TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTAACATCATGTTGTATTTAGAT
AY727327 TTAGGATTACACCGACAAATATTAGGCCGATATGAATTTACCATCATGTTGTATTTAGAT
AY727330 TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT
AY727331 TTAGGATTACGCTGACAAATATTAGGATGATATTAATTTA------TCTTGTATTTAGAT

AY727326 GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGCATTTAAGAAGTACAT
AY727327 GCTGTCTTTTATTAACATTCATCATTAAAT-TTGGAACCTTTTGTATTTAAGAAGTACAT
AY727330 GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT
AY727331 GCTGTCTTTTATCAACATTCATCACTAGATATTGGAACCTATTGCATCTAAGAAGTACAT

AY727326 GTTTAATAGTGTTTAAAA-TATATATGAAATTGATCATAAGGA---TCTATAAATGCGGT
AY727327 GTTTAATAGTGTTTATAA-TATATATGAAATTGATCGTAAGGA---TCTATAAATGCAGT
AY727330 GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC
AY727331 GTTTAATAGGGTT-AAAACTATATATGAAGTCGATTATAAGGAATTTCTATAAATGTAGC

AY727326 TCTTCAATTTCTTG
AY727327 TCTTCAATTTCTTG
AY727330 TCTTCAATTTCCTA
AY727331 TCTTCAATTTCCTA

Estimating Indel Rate

Dawg would be of little benefit if biologists could not estimate parameters of indel formation from real data.

Dawg’s indel model allows such estimation, which is implemented in a Perl script, lambda.pl.

Lambda.pl

Take an alignment and a phylogeny.

The number of unique gaps in this alignment is approximately distributed as a Poisson with mean λLT.

λ = rate of indel formation
L = average sequence length
T = total branch length

Therefore the rate of indel formation can be estimated as λ = N/(LT).

N = number of unique gaps in alignment
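
The estimator λ = N/(LT) is simple enough to sketch in a few lines. This is not lambda.pl itself (which is a Perl script whose internals are not shown in the talk) but an illustration under two assumptions: a "unique gap" is taken to be a distinct run of gap characters identified by its start and end columns, so a gap shared by several sequences counts once, and L is the mean ungapped sequence length. The toy alignment and the total branch length are made up.

```python
import re


def unique_gaps(aligned_seqs):
    """Distinct gap runs, identified by (start, end) alignment columns."""
    gaps = set()
    for seq in aligned_seqs:
        for run in re.finditer(r"-+", seq):
            gaps.add((run.start(), run.end()))
    return gaps


def estimate_lambda(aligned_seqs, total_branch_length):
    """Estimate the indel rate as N / (L * T)."""
    n = len(unique_gaps(aligned_seqs))
    lengths = [len(seq.replace("-", "")) for seq in aligned_seqs]
    mean_length = sum(lengths) / len(lengths)
    return n / (mean_length * total_branch_length)


# Toy alignment: one gap shared by two sequences, one gap private to another.
toy = ["ACGT--ACGT",
       "ACGT--ACGT",
       "ACGTTTACGT",
       "AC-TTTACGT"]
rate = estimate_lambda(toy, total_branch_length=0.5)
print(sorted(unique_gaps(toy)), rate)
```

Counting a shared gap once reflects the idea that it most likely arose from a single indel event on an internal branch of the tree.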

Example Usage: Confidence Interval of Indel Rate

I aligned the sequences of chloroplast trnK introns from two Hibiscus and two Prunus species.

Using PAUP*, I estimated the phylogeny and substitution parameters.

Using lambda.pl, I estimated the indel formation parameters.

Example Usage

From these estimated parameters of evolution, I constructed an input file for Dawg.

From the input file Dawg produced a thousand simulated sequence sets.

The rate of indel formation was then estimated for each of the simulated sequence sets.
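
An input file for such a run could be assembled programmatically. The sketch below only mirrors the syntax visible on the sample input slide (key = value, strings in double quotes, vectors in braces, the tree written bare) and adds Reps = 1000 from the parameter list. The helper names are hypothetical, and whether Dawg accepts exactly this combination of settings is an assumption, not something the talk states.

```python
def dawg_value(key, value):
    """Format one value following the conventions of the sample input slide."""
    if key == "Tree":                       # the tree is written without quotes
        return str(value)
    if isinstance(value, str):              # strings are double-quoted
        return '"{}"'.format(value)
    if isinstance(value, (list, tuple)):    # vectors are wrapped in braces
        return "{" + ", ".join(str(v) for v in value) + "}"
    return str(value)


def dawg_input(settings):
    """Render (key, value) pairs as 'key = value' lines."""
    lines = ["{} = {}".format(k, dawg_value(k, v)) for k, v in settings]
    return "\n".join(lines) + "\n"


# Values copied from the sample input slide, plus Reps for 1000 data sets.
settings = [
    ("Tree", "((AY727331:0.001359,AY727330:0.001359):0.084512,"
             "(AY727327:0.006116,AY727326:0.006116):0.079756);"),
    ("Model", "GTR"),
    ("Params", [1.08031, 2.45581, 0.44452, 1.09145, 4.06519, 1.0]),
    ("Freqs", [0.353470, 0.143681, 0.178206, 0.324643]),
    ("Length", 300),
    ("Lambda", 0.143120),
    ("GapModel", "NB"),
    ("GapParams", [1, 0.753247]),
    ("Reps", 1000),
    ("Seed", 1981),
]
text = dawg_input(settings)
print(text)
```

Keeping the settings as an ordered list of pairs makes it easy to vary one parameter, such as Lambda, across a batch of simulations.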

Results

The estimated rate of indel formation was 0.143120.

Bootstrapping gave a 95% CI of 0.078530 to 0.213560.

Biologically this is 8 to 21 indels per 100 substitutions.
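
The interval comes from the spread of the rate estimates across the simulated sets. The sketch below shows only the percentile step, and it replaces the Dawg simulations with a much cruder stand-in: gap counts drawn directly from a Poisson with mean λLT. The values L = 300 and T = 0.179218 (the sum of the branch lengths in the sample input tree) are assumptions used for illustration, so the interval it prints is not expected to match the one reported on the slide.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson variate by Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulated_rates(rate, seq_length, tree_length, reps=1000, seed=1981):
    """Stand-in for simulate-then-estimate: N ~ Poisson(rate*L*T), rate = N/(L*T)."""
    rng = random.Random(seed)
    scale = seq_length * tree_length
    return [poisson_draw(rng, rate * scale) / scale for _ in range(reps)]


def percentile_interval(values, level=0.95):
    """Central interval taken from the sorted values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    low = ordered[math.floor(tail * (len(ordered) - 1))]
    high = ordered[math.ceil((1.0 - tail) * (len(ordered) - 1))]
    return low, high


rates = simulated_rates(0.143120, seq_length=300, tree_length=0.179218)
low, high = percentile_interval(rates)
print(low, high)
```

A full analysis would feed each Dawg-simulated alignment through the same estimator used on the real data, so that alignment length and gap overlap affect the interval as they do in practice.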

Thanks