
Linkage Analysis II: Phylogenetic trees

Goals

Given marker genes, we want to know the position of genes linked to a given phenotype.

We use genes with multiple alleles as markers, for example a gene in which a SNP occurred and is present in part of the population.

We assume a Hardy-Weinberg (H.W.) distribution of the two alleles in the population, and correlate the presence or absence of the marker alleles in pedigrees with the known phenotype.

Example: In families where the father carries the marker alleles A and A’ (a heterozygous father), the phenotype occurs in 35% of the children carrying A and 12% of the children carrying A’. What can we say about the distance between the gene causing the disease and the marker?

Linkage Analysis

Assuming that there is gamete equilibrium at the A and B loci, in parent 1 there is a probability of 1/2 that alleles A1 and B1 will be coupled, and a probability of 1/2 that they will be in repulsion.

In other words, we do not know which alleles lie together on the same chromosome.

Linkage Analysis

(1) A1 and B1 are coupled. The probability that parent (1) provides each of the gametes A1B1 and A2B2 is $(1-\theta)/2$ and the probability that this parent provides each of the gametes A1B2 and A2B1 is $\theta/2$. The probability that the couple will have a child of type (1) or (2) is $(1-\theta)/2$, and that of their having a type (3) or type (4) child is $\theta/2$.

The probability of finding n1 children of type (1), n2 of type (2), n3 of type (3) and n4 of type (4) is therefore:

$[(1-\theta)/2]^{n_1+n_2} \times (\theta/2)^{n_3+n_4}$.

Linkage Analysis

(2) A1 and B1 are in a state of repulsion. The probability that parent (1) provides each of the gametes A1B2 and A2B1 is $(1-\theta)/2$ and the probability that this parent provides each of the gametes A1B1 and A2B2 is $\theta/2$.

The probability of the previous observation is therefore:

$(\theta/2)^{n_1+n_2} \times [(1-\theta)/2]^{n_3+n_4}$.

Linkage analysis

With no additional information about the A1 and B1 phase, and assuming that the alleles at the A and B loci are in a state of coupling equilibrium, the probability of finding n1, n2, n3 and n4 children in categories (1), (2), (3), (4) is:

$p(n_1,n_2,n_3,n_4 \mid \theta) = \tfrac{1}{2}\{[(1-\theta)/2]^{n_1+n_2} (\theta/2)^{n_3+n_4} + (\theta/2)^{n_1+n_2} [(1-\theta)/2]^{n_3+n_4}\}$

So the likelihood of $\theta$ for an observation n1, n2, n3, n4 can be written:

$L(\theta \mid n_1,n_2,n_3,n_4) = \tfrac{1}{2}\{[(1-\theta)/2]^{n_1+n_2} (\theta/2)^{n_3+n_4} + (\theta/2)^{n_1+n_2} [(1-\theta)/2]^{n_3+n_4}\}$

Special case: number of children n = 1

Regardless of the category to which this child belongs:

$L(\theta) = \tfrac{1}{2}[(1-\theta)/2] + \tfrac{1}{2}[\theta/2] = 1/4$

The likelihood of this observation for the family does not depend on $\theta$. We can say that such a family is not informative for $\theta$.
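The phase-unknown likelihood above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `phase_unknown_likelihood` is hypothetical, and the counts n1..n4 follow the child categories (1) to (4) defined above.

```python
def phase_unknown_likelihood(theta, n1, n2, n3, n4):
    """Likelihood of theta for one family when the A1/B1 phase is unknown."""
    parental = (1 - theta) / 2   # probability of each non-recombinant child type
    recomb = theta / 2           # probability of each recombinant child type
    # Phase (1): A1 and B1 coupled, so types (1) and (2) are non-recombinant.
    coupled = parental ** (n1 + n2) * recomb ** (n3 + n4)
    # Phase (2): A1 and B1 in repulsion, so types (3) and (4) are non-recombinant.
    repulsion = recomb ** (n1 + n2) * parental ** (n3 + n4)
    # Each phase has prior probability 1/2.
    return 0.5 * (coupled + repulsion)


# A single child gives 1/4 whatever theta is: the family is not informative.
single_child = [phase_unknown_likelihood(t, 1, 0, 0, 0) for t in (0.0, 0.1, 0.3, 0.5)]

# Two children already make the likelihood depend on theta.
two_children = [phase_unknown_likelihood(t, 2, 0, 0, 0) for t in (0.1, 0.3)]
```

With two children of the same type the two values in `two_children` differ, which is why more than one child is needed for a family to carry information about $\theta$.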

Informative families

An "informative family" is a family for which the likelihood varies with $\theta$.

One essential condition for a family to be informative is, therefore, that it has more than one child. Furthermore, at least one of the parents must be heterozygous.

Definition: if one of the parents is doubly heterozygous and the other is

- a double homozygote, we have a backcross;
- a single homozygote, we have a simple backcross;
- a double heterozygote, we have a double intercross.

Definition Of The "Lod Score" Of A Family

Take a family of which we know the genotypes at the A and B loci of each of the members.

Let $L(\theta)$ be the likelihood of a recombination fraction $\theta$, with $0 \le \theta < 1/2$. $L(1/2)$ is the likelihood of $\theta = 1/2$, that is, of independent segregation at A and B. The lod score of the family at $\theta$ is:

$Z(\theta) = \log_{10}[L(\theta)/L(1/2)]$

Z can be taken to be a function of $\theta$ defined over the range [0, 1/2]. The likelihood of a value of $\theta$ for a sample of independent families is the product of the likelihoods of each family, and so the lod score of the whole sample will be the sum of the lod scores of each family.
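As a rough sketch of how lod scores add up over families, the snippet below reuses the phase-unknown likelihood of the previous slides. The function names and the three families' child counts are hypothetical illustrations, not data from the lecture, and the grid starts at 0.01 because the likelihood can be zero at $\theta = 0$.

```python
import math


def family_likelihood(theta, counts):
    """Phase-unknown likelihood for one family; counts = (n1, n2, n3, n4)."""
    n1, n2, n3, n4 = counts
    parental, recomb = (1 - theta) / 2, theta / 2
    return 0.5 * (parental ** (n1 + n2) * recomb ** (n3 + n4)
                  + recomb ** (n1 + n2) * parental ** (n3 + n4))


def lod_score(theta, counts):
    """Z(theta) = log10[L(theta) / L(1/2)] for one family."""
    return math.log10(family_likelihood(theta, counts)
                      / family_likelihood(0.5, counts))


def sample_lod(theta, families):
    """Lod score of a sample of independent families: the sum of family lods."""
    return sum(lod_score(theta, counts) for counts in families)


# Hypothetical child counts for three families (illustration only).
families = [(3, 2, 0, 0), (2, 2, 1, 0), (0, 1, 3, 2)]

# Evaluate Z on a grid of theta values and keep the maximizing value.
grid = [i / 100 for i in range(1, 51)]
best_theta = max(grid, key=lambda t: sample_lod(t, families))
```

By construction `sample_lod(0.5, families)` is 0, since every family contributes log10(1).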

Test For Linkage

Several methods have been proposed to detect linkage: "U scores" were suggested by Bernstein in 1931, "the sib pair test" by Penrose in 1935, "likelihood ratios" by Haldane and Smith in 1947, and "the lod score method" by Morton in 1955 (1). Morton’s method is the one most commonly used at present.

The test procedure in the lod score method is sequential (Wald, 1947 (2)). Information, i.e. the number of families in the sample, is accumulated until it is possible to decide between the hypotheses H0 and H1:

H0: genetic independence, $\theta = 1/2$
H1: linkage at $\theta_1$, $0 < \theta_1 < 1/2$

The lod score of the sample at $\theta_1$, $Z(\theta_1) = \log_{10}[L(\theta_1)/L(1/2)]$, indicates the relative probabilities of observing the sample under H1 and under H0. Thus, a lod score of 3 means that the probability of observing the sample is 1000 times greater under H1 than under H0 ("lod = logarithm of the odds").

Test For Linkage

The decision thresholds of the test are usually set at -2 and +3, so that if:

$Z(\theta_1) > 3$: H0 is rejected, and linkage is accepted.
$Z(\theta_1) < -2$: linkage at $\theta_1$ is rejected.
$-2 < Z(\theta_1) < 3$: it is impossible to decide between H0 and H1. It is necessary to go on accumulating information.

For the thresholds chosen, -2 and +3, we can show that:

The type I error (false positive), $\alpha$, is $< 10^{-3}$.
The type II error (false negative), $\beta$, is $< 10^{-2}$.
The reliability is $> 0.95$.

$\alpha$ is the probability that we conclude that H1 is true when H0 holds.

Recombination Fraction For A Disease Locus And A Marker Locus

Let us assume we are dealing with a disease carried by a single gene, determined by an allele, g0, located at a locus G (g0: harmful allele, G0: normal allele).

We would like to be able to situate locus G relative to a marker locus T, which is known to occupy a given locus on the genome. To do this, we can use families with one or several individuals affected and in which the genotype of each member of the family is known with regard to the marker T.

In order to be able to use the lod scores method described above, what is needed is to be able to extrapolate from the phenotype of the individuals (affected, not affected) to their genotype at locus G (or their genotypical probability at locus G).

Disease and Marker Locus

What we need to know is:

- the frequency of g0;
- the penetrance vector f1, f2, f3:
  f1 = Pr(affected | g0g0)
  f2 = Pr(affected | g0G0)
  f3 = Pr(affected | G0G0)

It will often happen that the information available for the marker is not genotypic either, but phenotypic in nature. Once again, all possible genotypes must be envisaged.

Disease and Marker Locus

As a general rule, the information available about a family concerns the phenotype. To calculate the likelihood of $\theta$ we must envisage all the possible genotype configurations at each of the loci for this family, write the likelihood of $\theta$ for each configuration, and weight it by the probability of this configuration given the phenotypes of the individuals at A and B.

Knowledge of the genetic parameters at each of the loci (gene frequency, penetrance values) is therefore necessary before we can estimate $\theta$.

Linkage Analysis For Three Loci : Interference

Now let us consider three loci A, B and C. Let the recombination fraction between A and B be $\theta_1$, that between B and C be $\theta_2$ and that between A and C be $\theta_3$.

Let us consider the double recombinant event, firstly between A and B, and secondly between B and C. Let $R_{12}$ be the probability of this event. If the crossings-over occur independently in segments AB and BC, then:

$R_{12} = \theta_1 \theta_2$

Interference

If this is not the case, an interference phenomenon is occurring and

$R_{12} = C\,\theta_1 \theta_2$ where $C \neq 1$.

If C < 1 the interference is said to be positive, and crossings-over in segment AB inhibit those in segment BC.

If C > 1 the interference is said to be negative, and crossings-over in segment AB promote those in segment BC.

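Let us consider the case of a triple heterozygous individual with haplotypes A1B1C1 and A2B2C2.

Such an individual can provide 8 types of gametes:

Gametes              Probability of each gamete
A1B1C1, A2B2C2       $(1 - \theta_1 - \theta_2 + R_{12})/2$
A1B2C2, A2B1C1       $(\theta_1 - R_{12})/2$
A1B1C2, A2B2C1       $(\theta_2 - R_{12})/2$
A1B2C1, A2B1C2       $R_{12}/2$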

Interference

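We can write that:

$\theta_3 = \theta_1 + \theta_2 - 2R_{12} = \theta_1 + \theta_2 - 2C\,\theta_1\theta_2$

If C = 1:

$\theta_3 = \theta_1 + \theta_2 - 2\theta_1\theta_2$

The recombination fraction is a non-additive measurement.

However, we can write $(1 - 2\theta_3) = (1 - 2\theta_1)(1 - 2\theta_2)$. If $x(\theta) = k \ln(1 - 2\theta)$, then we have $x(\theta_3) = x(\theta_1) + x(\theta_2)$, and for $k = -1/2$, $x(\theta) \approx \theta$ for small values of $\theta$.

$x(\theta) = -\tfrac{1}{2} \ln(1 - 2\theta)$ is an additive measurement. It is known as the genetic distance, and is measured in Morgans. It can be shown that x measures the mean number of crossings-over.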
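The additivity of the genetic distance can be checked numerically. The sketch below assumes no interference (C = 1); the function names are illustrative, and the mapping $x(\theta) = -\tfrac{1}{2}\ln(1 - 2\theta)$ is the one usually called Haldane's map function.

```python
import math


def genetic_distance(theta):
    """x(theta) = -1/2 ln(1 - 2 theta), in Morgans (valid for theta < 1/2)."""
    return -0.5 * math.log(1 - 2 * theta)


def recombination_fraction(x):
    """Inverse mapping: theta = (1 - exp(-2x)) / 2."""
    return 0.5 * (1 - math.exp(-2 * x))


def combine_without_interference(theta1, theta2):
    """theta3 = theta1 + theta2 - 2 theta1 theta2 (the case C = 1)."""
    return theta1 + theta2 - 2 * theta1 * theta2


theta1, theta2 = 0.10, 0.20
theta3 = combine_without_interference(theta1, theta2)   # 0.26, not 0.30

# Recombination fractions do not add, but genetic distances do.
x_sum = genetic_distance(theta1) + genetic_distance(theta2)
x_direct = genetic_distance(theta3)
```

Here `x_sum` and `x_direct` agree up to rounding, while `theta3` is smaller than `theta1 + theta2`.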

Test for the presence of interference

Let us consider a sample of families genotyped at loci A, B and C. Let $L_C$ be the greatest likelihood for $\theta_1$, $\theta_2$ and $\theta_3$, and $L_1$ the greatest likelihood when we impose the constraint C = 1 (i.e. $\theta_3 = \theta_1 + \theta_2 - 2\theta_1\theta_2$).

Then $-2 \ln(L_1/L_C)$ follows a $\chi^2$ distribution with one degree of freedom.

Genetic Heterogeneity Of Localization

The analysis of genetic linkage can be complicated by the fact that mutations of several genes, located at different places on the genome, can give rise to the same disorder. This is known as genetic heterogeneity of localization.

One of the following two tests is used to identify heterogeneity of this type:

- the "Predivided sample test"
- the "Admixture test"

The first test is usually only appropriate if there is a good family stratification criterion or if each family individually has high informativity.

The Predivided Sample Test

This test is intended to demonstrate linkage heterogeneity in different sub-groups of a sample of families. The aim is to test whether the genetic linkage between a disease and its marker(s) is the same in all sub-groups. These groups are formed ad hoc on the basis of clinical or geographical criteria, etc.

Let us assume that the total sample of families has been divided into n sub-groups (it is possible to test for the existence of as many sub-groups as families). $\theta_i$ denotes the true value of the recombination fraction of sub-group i.

We want to test the null hypothesis H0: $\theta_1 = \theta_2 = \theta_3 = \dots = \theta_n$ against the alternative hypothesis H1: the values of $\theta_i$ are not all equal.

Therefore, the quantity

$Q = 2 \ln(10) \left[ \sum_i Z_i(\hat{\theta}_i) - Z(\hat{\theta}) \right]$

follows a $\chi^2$ distribution with (n-1) degrees of freedom, where $Z_i(\hat{\theta}_i)$ is the maximum lod score of sub-group i and $Z(\hat{\theta})$ is the maximum lod score of the whole sample.
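A minimal sketch of the statistic Q, assuming the lod score of each sub-group has already been evaluated on a common grid of $\theta$ values. The function name and the lod values below are made up for illustration only.

```python
import math


def heterogeneity_q(lod_curves):
    """Q = 2 ln(10) [sum_i max Z_i - max of the pooled lod curve].

    lod_curves[i][k] is the lod score of sub-group i at the k-th grid value
    of theta; all sub-groups must use the same grid.
    """
    sum_of_maxima = sum(max(curve) for curve in lod_curves)
    # The lod score of the whole sample is the sum of the sub-group lods.
    pooled = [sum(values) for values in zip(*lod_curves)]
    return 2 * math.log(10) * (sum_of_maxima - max(pooled))


# Hypothetical lod curves for two sub-groups on a three-point theta grid.
curves = [
    [1.0, 2.0, 0.5],   # sub-group 1 peaks at the second grid value
    [0.2, 1.5, 3.0],   # sub-group 2 peaks at the third grid value
]
q = heterogeneity_q(curves)   # compare with a chi-square with n - 1 = 1 d.f.
```

When all sub-groups peak at the same $\theta$ the two terms cancel and Q is 0; the further apart the peaks, the larger Q becomes.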

The Admixture Test

The "admixture test" is not based on an ad hoc subdivision of the families. It is assumed that among all the families studied, genetic linkage between the disease and the marker is found only in a proportion $\alpha$ of the families, with a recombination fraction $\theta < 1/2$. In the remaining $(1-\alpha)$ families, it is assumed that there is no linkage with the marker ($\theta = 1/2$).

For each family i of the sample, the likelihood is calculated

$L_i(\alpha, \theta) = \alpha L_i(\theta) + (1-\alpha) L_i(1/2)$,

where $L_i(\theta)$ is the likelihood of $\theta$ for family i. The likelihood of the couple $(\alpha, \theta)$ is defined by the product of the likelihoods associated with all the families:

$L(\alpha, \theta) = \prod_i L_i(\alpha, \theta)$

We test whether $\alpha$ is significantly different from 1 by comparing $L_{max}(\alpha = 1, \theta)$, the maximized likelihood for $\theta$ assuming homogeneity, and $L_{max}(\alpha, \theta)$, the maximized likelihood for the two parameters $\alpha$ and $\theta$ (nested models). The variable $Q = 2[\ln L_{max}(\alpha, \theta) - \ln L_{max}(\alpha = 1, \theta)]$ follows a $\chi^2$ distribution with one degree of freedom.

Generalization Of The Admixture Test

In some single-gene diseases, several genes have been shown to exist at different locations. This is true, for example of multiple exostosis disease, for which 3 genes have been identified successively on 3 different chromosomes.

The "admixture test" is then extended to determine the proportion of families in which each of the three genes is implicated , and the possibility that there is a fourth gene.

The three locations on chromosomes 8, 19 and 11 were reported as E1, E2 and E3, and the proportions of families concerned as $\alpha_1$, $\alpha_2$ and $\alpha_3$ respectively. $\alpha_4$ was used to represent the proportion of the families in which another location was involved.

Generalization Of The Admixture Test

For each family i of the sample, the likelihood was calculated using the observed segregation within the family of the markers available in each of the three regions, according to the clinical status of each of its members.

$L_i(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F_i) = \alpha_1 \frac{L(E_1 \mid F_i)}{L(E_1 = 1/2 \mid F_i)} + \alpha_2 \frac{L(E_2 \mid F_i)}{L(E_2 = 1/2 \mid F_i)} + \alpha_3 \frac{L(E_3 \mid F_i)}{L(E_3 = 1/2 \mid F_i)} + \alpha_4$

For all the families, $L(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F) = \prod_i L_i(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F_i)$. Each $\alpha_i$ can be tested to see if it is equal to 0, and then the corresponding non-null $\alpha_i$ and $E_i$ values are estimated.

Generalization of The Admixture Test -Results

It is also possible to calculate the probability that the gene implicated is at E1, E2 or E3 for each of the families in the sample. The post hoc probability makes use of the estimated $\alpha_i$ proportions, but also of the specific observations in this family.

The sample investigated has been shown to consist of three types of families: in 48% of families, the gene is located on chromosome 8, in 24% of them on chromosome 19, and in 28% of families the gene is located on chromosome 11. There was no evidence of a fourth location in this sample.

The post hoc probabilities of belonging to one of these 3 sub-groups were then estimated: the probability that the gene implicated would be on chromosome 8 was over 90% for 5 families, that it would be on chromosome 19 for 3 of them, and that it would be on chromosome 11 for 4 families. For the other families, the situation was less clear-cut: the post-hoc probabilities are similar to the ad hoc probabilities because of the paucity of information provided by the markers used.

Phylogenetic trees

Carolus Linnaeus (1707-1778)

Classifying Organisms

Nomenclature is the science of naming organisms.

Names allow us to talk about groups of organisms.

- Scientific names were originally descriptive phrases; not practical

Binomial nomenclature

- Developed by Linnaeus, a Swedish naturalist
- Names are in Latin, formerly the language of science
- Binomials: names consisting of two parts
- The generic name is a noun.
- The epithet is a descriptive adjective.
- Thus a species' name is two words, e.g. Homo sapiens

Classifying Organisms

Taxonomic Classification of Man (Homo sapiens)

Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Subspecies: sapiens

[Figure: the ranks above are arranged along an axis labelled "Evolutionary distance".]

Taxonomy is the science of the classification of organisms

Taxonomy deals with the naming and ordering of taxa. The Linnaean hierarchy:

1. Kingdom
2. Division
3. Class
4. Order
5. Family
6. Genus
7. Species

Willi Hennig (1913-1976)

Phylogenetics

Evolutionary theory states that groups of similar organisms are descended from a common ancestor.

Phylogenetic systematics is a method of taxonomic classification of organisms based on their evolutionary history.

It was developed by Hennig, a German entomologist, in 1950.

Phylogenetics

Who uses phylogenetics? Some examples:

Evolutionary biologists (e.g. reconstructing the tree of life)

Systematists (e.g. classification of groups)

Anthropologists (e.g. origin of human populations)

Forensics (e.g. transmission of HIV to a rape victim)

Parasitologists (e.g. phylogeny of parasites, co-evolution)

Epidemiologists (e.g. reconstruction of disease transmission)

Genomics/Proteomics (e.g. homology comparison of new proteins)

Phylogenetic Trees

Node: a branchpoint in a tree (a presumed ancestral OTU)

Branch: defines the relationship between the taxa in terms of descent and ancestry

Topology: the branching patterns of the tree

Branch length (scaled trees only): represents the number of changes that have occurred in the branch

Root: the common ancestor of all taxa

Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all their descendents

Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains

Phylogenetic Trees

[Figure: a rooted tree of Species A, B, C, D and E with the root, a node, a branch and a clade labelled.]

There are many ways of drawing a tree

[Figure: the same tree of taxa A to E drawn in several different but equivalent ways.]

[Figure: a tree of taxa A to E with only bifurcations next to a tree containing a trifurcation; the two trees are not equivalent.]

Bifurcation versus Multifurcation (e.g. Trifurcation)

Multifurcation (also called polytomy): a node in a tree that connects more than three branches. If the tree is rooted, then one of the branches represents an ancestral lineage and the remaining branches represent descendent lineages. A multifurcation may represent a lack of resolution because of too few data available for inferring the phylogeny (in which case it is said to be a soft multifurcation) or it may represent the hypothesized simultaneous splitting of several lineages (in which case it is said to be a hard multifurcation).

Phylogenetic Trees

Trees can be rooted or unrooted

[Figure: a rooted tree and an unrooted tree of taxa A to E.]

Phylogenetic Trees

Trees can be scaled or unscaled (with or without branch lengths)

[Figure: trees of taxa A to E drawn unscaled and scaled; the scaled trees carry a scale bar labelled "unit".]

Phylogenetic Trees

Possible evolutionary trees

Number of rooted trees for n taxa: $(2n-3)! / (2^{n-2}(n-2)!)$
Number of unrooted trees for n taxa: $(2n-5)! / (2^{n-3}(n-3)!)$

Taxa (n)   Rooted        Unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025
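The two counting formulas can be evaluated directly. The sketch below is only an illustration (the function names are not from the slides); for n = 2 the unrooted formula would need (-1)!, so that case is returned as 1 by hand.

```python
from math import factorial


def rooted_trees(n):
    """(2n-3)! / (2^(n-2) (n-2)!) rooted bifurcating trees for n >= 2 taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """(2n-5)! / (2^(n-3) (n-3)!) unrooted bifurcating trees for n >= 3 taxa."""
    if n == 2:
        return 1  # two taxa joined by a single branch
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Rebuild the table for 2 to 10 taxa as (n, rooted, unrooted) rows.
table = [(n, rooted_trees(n), unrooted_trees(n)) for n in range(2, 11)]
```

The number of unrooted trees for n taxa equals the number of rooted trees for n - 1 taxa, which is why the two columns of the table are shifted copies of each other.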

Phylogenetic Trees

• Genes vs. Species.

• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.

• Your sequence data may not have the same phylogenetic history as the species from which they were isolated.

• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).


Phylogenetic Inference

Different genes will be best suited to solve different problems:

- The RNA genomes of HIV viruses change so quickly that every person infected carries a different strain.

- Certain enzymes may evolve relatively fast to allow for phylogeographic studies of species distribution post-glaciation.

- mitochondrial DNA has a relatively fast substitution rate (evolves quickly) – can be used to establish relatively recent divergence.

- For establishing ‘deep phylogeny’ we need genes that change very slowly (highly conserved ones).


Phylogenetic Inference

- Different sequences accumulate changes at different rates: choose a level of variation that is appropriate to the group of organisms being studied.

- Proteins (or protein coding DNAs) are constrained by natural selection: some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions).

- Different regions within a single gene can evolve at different rates (conserved vs. variable domains).

Genetic Distance

          Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
Rat       0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse     0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit    0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human     0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum   0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken   0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog      0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000

Tree building methods

Genetic Distance

- Unweighted Pair Group (UPGMA)
- Neighbor-Joining
- Fitch & Margoliash

Character-State

- Maximum Parsimony
- Maximum Likelihood

Tree Building (Distance Based)

UPGMA

- The simplest of the distance methods is the UPGMA (Unweighted pair group method using arithmetic averages)

- Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences which is then used to guide the multiple alignment algorithm

UPGMA step 1: combine B and C

      A     B     C     D     E
A     0    10    12    10     7
B           0     4     4    13
C                 0     6    15
D                       0    13
E                             0

[Figure: the five taxa a, b, c, d, e before clustering.]

UPGMA step 2: combine BC and D

      A    BC     D     E
A     0    11    10     7
BC          0     5    14
D                 0    13
E                       0

[Figure: b and c joined, each on a branch of length 2.]

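UPGMA step 3: combine A and E

      A     BCD    E
A     0    10.5    7
BCD          0    13.5
E                  0

[Figure: d joined to the bc cluster; b and c on branches of length 2, d on a branch of length 2.5, and a branch of length 0.5 between the bc node and the bcd node.]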

UPGMA step 4: combine AE and BCD

      AE    BCD
AE     0    12
BCD          0

[Figure: a and e joined, each on a branch of length 3.5, next to the bcd cluster (branch lengths 2, 2, 2.5 and 0.5).]

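UPGMA Result

      A     B     C     D     E
A     0    10    12    10     7
B           0     4     4    13
C                 0     6    15
D                       0    13
E                             0

[Figure: the final tree. Branch lengths: a and e 3.5 each; b and c 2 each; d 2.5; 0.5 between the bc node and the bcd node; the labels on the two branches leading to the root are not recoverable from the extraction.]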
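The clustering steps above can be reproduced with a short sketch. One point is worth flagging: the worked example averages the distances of the two merged clusters with equal weight (A to BCD = (11 + 10)/2 = 10.5), and the code follows that convention so that it matches the matrices shown. An average weighted by cluster size, as in strict UPGMA, would give (10 + 12 + 10)/3 ≈ 10.67 at that step. The function name and the dictionary layout are illustrative assumptions.

```python
def cluster_by_pairwise_average(labels, dist):
    """Agglomerative clustering that averages the two merged rows equally.

    dist maps frozenset({x, y}) to the distance between clusters x and y.
    Returns the merges as (cluster_a, cluster_b, node_height) tuples.
    """
    clusters = list(labels)
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        pairs = [(clusters[i], clusters[j])
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Closest pair; ties go to the pair listed first (B,C before B,D).
        a, b = min(pairs, key=lambda p: d[frozenset(p)])
        merged = a + b
        merges.append((a, b, d[frozenset((a, b))] / 2))
        for c in clusters:
            if c not in (a, b):
                d[frozenset((merged, c))] = (d[frozenset((a, c))]
                                             + d[frozenset((b, c))]) / 2
        clusters[clusters.index(a)] = merged   # keep the position of a
        clusters.remove(b)
    return merges


raw = {("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
       ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
       ("C", "D"): 6, ("C", "E"): 15, ("D", "E"): 13}
distances = {frozenset(pair): value for pair, value in raw.items()}
merges = cluster_by_pairwise_average(["A", "B", "C", "D", "E"], distances)
```

Each node height is half the distance between the two clusters it joins, which gives the branch lengths 2, 2.5 and 3.5 seen in the figures.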

Maximum Parsimony

Site         1  2  3  4  5  6  7  8  9 10
Species 1    A  G  G  G  T  A  A  C  T  G
Species 2    A  C  G  A  T  T  A  T  T  A
Species 3    A  T  A  A  T  T  G  T  C  T
Species 4    A  A  T  G  T  T  G  T  C  G

How many possible unrooted trees?

[Figure: the three possible unrooted trees for four species, with the tips paired as (1,3 | 2,4), (1,2 | 3,4) and (1,4 | 3,2).]

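Maximum Parsimony

How many substitutions?

[Figure: a site with states A, A, G, G at the tips of a tree. With internal node states A and G the tree needs 1 change; with internal node states G and A it needs 5 changes. The most parsimonious (MP) tree is the one that needs the fewest changes.]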

Maximum Parsimony

For each site, count the minimum number of changes required on each of the three trees and keep a running tally:

Site 1 (A, A, A, A): 0 changes on every tree.

Site 2 (G, C, T, A): all four states differ, so 3 changes on every tree.

Site 3 (G, G, A, T): 2 changes on every tree.

Site 4 (G, A, A, G): 1 change on the tree that groups species 1 with 4 and species 2 with 3, and 2 changes on the other two trees.

[Figures: the three unrooted trees with the states of sites 2 and 4 placed on the tips and internal nodes, and the running tallies 0 3 2 2, 0 3 2 1 and 0 3 2 2 after four sites.]

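Maximum Parsimony

Changes required at each site and in total, for each of the three trees:

Tree           1  2  3  4  5  6  7  8  9 10   Total
(1,2 | 3,4)    0  3  2  2  0  1  1  1  1  2    13
(1,4 | 2,3)    0  3  2  1  0  1  2  1  2  2    14
(1,3 | 2,4)    0  3  2  2  0  1  2  1  2  2    15

Site 10 (G, A, T, G) has three distinct states, so it needs 2 changes on any tree.

The most parsimonious tree is the one with the fewest changes: the tree that groups species 1 with 2 and species 3 with 4.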
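The per-site counts can be checked with a small sketch of the Fitch counting rule (intersect the state sets of two joined lineages; if the intersection is empty, take the union and count one change). The function names are illustrative. For the four-species alignment of this example, site 10 (G, A, T, G) has three distinct states and so needs two changes on every topology, giving totals of 13, 14 and 15.

```python
def fitch_join(x, y):
    """Combine two state sets; return (new set, number of changes added)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def site_changes(a, b, c, d):
    """Minimum changes for one site on the unrooted tree ((a,b),(c,d))."""
    left, cost_left = fitch_join({a}, {b})
    right, cost_right = fitch_join({c}, {d})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle


alignment = {
    1: "AGGGTAACTG",
    2: "ACGATTATTA",
    3: "ATAATTGTCT",
    4: "AATGTTGTCG",
}


def tree_length(topology):
    """Total number of changes over all sites for ((a,b),(c,d))."""
    (a, b), (c, d) = topology
    rows = (alignment[a], alignment[b], alignment[c], alignment[d])
    return sum(site_changes(*column) for column in zip(*rows))


topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
lengths = {topology: tree_length(topology) for topology in topologies}
best = min(lengths, key=lengths.get)   # the most parsimonious topology
```

With only four taxa the three topologies can be enumerated exhaustively; for larger data sets the number of trees grows too quickly for this, and heuristic searches are used instead.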

Assessing Phylogenetic Data

How much support is there for a particular clade?

[Figure: two trees of taxa including Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Spirostomum, Tetrahymena, Euplotes, Tracheloraphis and Gruberia, with support values on the branches.]

Bootstrapping/Jack-knifing:

Lots of randomized data sets are produced by sampling the real data with replacement (or, in jack-knifing, by removing some random proportion of the data).

Frequencies of occurrence of groups are a measure of support for those groups.

Problems:

- Bootstrap proportions aren’t easily interpretable
- They give no indication of how good the data are, only of how well the tree fits the data

Maximum Likelihood

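Find the most likely tree under a given model of evolution by screening all possible trees.

Sum the log likelihood over all sites.

The model involves:

- the probability of finding different bases at the root;
- the probability of different kinds of changes between bases;
- substitution rate(s);
- length of time.

[Figure: for one site with tip states C, C, C, G, the likelihood L of the tree is the sum of the probabilities of every assignment of states to the internal nodes: L = Prob(internal nodes A, A) + ... + Prob(internal nodes T, T).]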

Computing the Likelihood of a Tree

Need a model of evolutionary change. To keep it simple, assume:

- independence of evolution at different sites;
- the expected number of substitutions along any branch is a function of the substitution rate and the length of the branch in evolutionary time;
- the substitution rate is the same throughout the tree (makes branch lengths equivalent to evolutionary time);
- no consideration of deletions and insertions.

For given states at the internal nodes, the likelihood of the tree is the product of the probabilities of change in each tree segment:

$L = \pi_{s_0} P_{s_0 s_6}(v_6) P_{s_6 s_1}(v_1) P_{s_6 s_2}(v_2) P_{s_0 s_8}(v_8) P_{s_8 s_3}(v_3) P_{s_8 s_7}(v_7) P_{s_7 s_4}(v_4) P_{s_7 s_5}(v_5)$

where $s_i$ is the state at point i on the tree, the v's are the lengths of the segments, and $\pi_{s_0}$ is the probability of state $s_0$ at the root.
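The product above, summed over every assignment of states to the internal points 0, 6, 7 and 8, can be written out by brute force. The slides do not name a substitution model, so the sketch below assumes the Jukes-Cantor model (equal base frequencies of 1/4 and a single substitution rate) purely for illustration; the function names and the branch lengths are hypothetical.

```python
import itertools
import math

BASES = "ACGT"


def jc69(i, j, v):
    """P(i -> j) along a branch of length v under Jukes-Cantor (assumed model)."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def site_likelihood(tips, v):
    """Likelihood of one site; tips = {1..5: base}, v = {1..8: branch length}."""
    s1, s2, s3, s4, s5 = (tips[k] for k in (1, 2, 3, 4, 5))
    total = 0.0
    # Sum over all states at the root (0) and the internal points 6, 7, 8.
    for s0, s6, s7, s8 in itertools.product(BASES, repeat=4):
        total += (0.25  # pi_{s0} under Jukes-Cantor
                  * jc69(s0, s6, v[6]) * jc69(s6, s1, v[1]) * jc69(s6, s2, v[2])
                  * jc69(s0, s8, v[8]) * jc69(s8, s3, v[3])
                  * jc69(s8, s7, v[7]) * jc69(s7, s4, v[4]) * jc69(s7, s5, v[5]))
    return total


branch_lengths = {k: 0.1 for k in range(1, 9)}   # hypothetical lengths
example = site_likelihood({1: "C", 2: "C", 3: "C", 4: "G", 5: "G"}, branch_lengths)
```

Under the independence assumption, the likelihood of a whole alignment is the product of the site likelihoods, so in practice the logarithms are summed over sites. Real programs avoid the explicit sum over internal states by working from the tips towards the root, but the brute-force version mirrors the formula most directly.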

Tree evaluation

Bootstrapping:

- Characters are resampled with replacement to create many bootstrap replicate data sets
- Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)
- Agreement among the resulting trees is summarized with a majority-rule consensus tree
- Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups

[Figure: a tree of taxa including Ochromonas, Symbiodinium, Tetrahymena and Gruberia with bootstrap proportions on the branches, and the corresponding majority-rule consensus tree.]

Jack-knifing:

- Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy
- Jack-knifing is not as widely available or widely used as bootstrapping
- It tends to produce broadly similar results
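The two resampling strategies differ only in how columns (characters) are drawn, which the following sketch makes explicit. The function names and the 50% retention fraction for the jack-knife are assumptions for illustration; tree building on each replicate is left out.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same number of sites)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def jackknife_replicate(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of the columns, without replacement."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = max(1, round(n_sites * keep_fraction))
    columns = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


alignment = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}
rng = random.Random(0)   # fixed seed so the replicates are reproducible
boot = [bootstrap_replicate(alignment, rng) for _ in range(100)]
jack = jackknife_replicate(alignment, rng)
```

Each replicate would then be analysed with the chosen tree-building method, and the proportion of replicates in which a group appears is its support value.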


Is there signal in the data?

Possible approach: Random Permutations

- Random permutation reduces any correlation among characters to that expected by chance alone
- It preserves the number of taxa, characters and character states in each character (and the theoretical maximum and minimum tree lengths)

[Figure: two matrices of taxa (a to d) by characters (1 to 8). One shows the original structured data with strong correlations among characters; the other shows randomly permuted data with any correlation among characters due to chance.]


Matrix Randomization Tests

Compare some measure of data quality/hierarchical structure for the real and many randomly permuted data sets.

This allows us to define a test statistic for the null hypothesis that the real data are not better structured than randomly permuted and phylogenetically uninformative data.
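One way to read the permutation step is to shuffle the states of each character independently across the taxa, which keeps the number of taxa, the number of characters and the state counts of every character unchanged. The sketch below illustrates only that step; the function name and the small matrix are hypothetical.

```python
import random


def permute_characters(matrix, rng):
    """Shuffle each character (column) independently across taxa (rows)."""
    columns = [list(column) for column in zip(*matrix)]
    for column in columns:
        rng.shuffle(column)          # state counts in the column are preserved
    return ["".join(row) for row in zip(*columns)]


# Hypothetical matrix: 4 taxa (rows) by 8 characters (columns).
matrix = [
    "AATTGGCC",
    "AATTGGCA",
    "CCTAGTCA",
    "CCGAATAT",
]
rng = random.Random(1)
permuted = [permute_characters(matrix, rng) for _ in range(99)]
```

A measure of data quality, such as the length of the shortest tree, would then be computed for the real matrix and for each permuted matrix, and the real value compared with the distribution from the permuted data.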

PTP (permutation tail probability) test

Null Hypothesis: The length of the shortest tree is what you would see given random data.

How it works: Reject the null if the real data have a shorter tree (the real data are more internally consistent than random data).

Comments: Even a little bit of signal can lead you to reject the null; rejecting it does not mean that the signal is phylogenetic.

[Figure: frequency distribution of a measure of data quality (e.g. tree length, ML ...) for the permuted data sets, running from good to bad, with a 95% cutoff. Real data beyond the cutoff pass the test (reject the null hypothesis); otherwise they fail the test.]

g1 (skewness of tree length distribution)

Null Hypothesis: The data have no signal and the shortest tree is just the shortest tree from a symmetric distribution of tree lengths (the distribution of tree lengths won’t be skewed if there is no signal).

How it works: Significant left (negative) skew rejects the null of no signal.

Comments: Even a little bit of signal can lead you to reject the null; rejecting it does not mean that the signal is phylogenetic.

[Figure: two frequency distributions of tree length, each with the shortest tree marked.]

Random data showed that the distribution of tree lengths tends to be normal (random data: g1 = -0.100478).

Phylogenetically informative data are expected to have a strongly skewed distribution with few shortest trees and few trees nearly as short (real data: g1 = -0.951947).
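The g1 statistic is the usual moment-based skewness, the third central moment divided by the second central moment raised to the power 3/2. A minimal sketch follows, with hypothetical tree-length samples that are not taken from the slides.

```python
def g1_skewness(tree_lengths):
    """g1 = m3 / m2**1.5, where mk is the k-th central moment of the sample."""
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n
    return m3 / m2 ** 1.5


# Hypothetical tree lengths (illustration only).
symmetric = [10, 11, 12, 13, 14]            # no skew
left_skewed = [10, 17, 18, 18, 19, 19, 19]  # a long tail towards short trees

g1_symmetric = g1_skewness(symmetric)
g1_left = g1_skewness(left_skewed)
```

A clearly negative value, as for `left_skewed`, corresponds to the situation described above: a few trees much shorter than the bulk of the distribution.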
