
March 2006 Vineet Bafna

CSE280b: Population Genetics

Vineet Bafna/Pavel Pevzner

www.cse.ucsd.edu/classes/sp05/cse291


Population Genetics

• Individuals in a species (population) are phenotypically different.

• Often these differences are inherited (genetic).

• Studying these differences is important!

• Q: How predictive are these differences?


EX: Population Structure

• 377 locations (loci) were sampled in 1000 people from 52 populations.

• 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions (Rosenberg et al., Science 2002)

• Genetic differences can predict ethnicity.

[Figure: population clusters by region: Africa, Eurasia, East Asia, America, Oceania]


Scope of these lectures

• Basic terminology
• Key principles
  – Sources of variation
  – HW equilibrium
  – Linkage
  – Coalescent theory
  – Recombination/Ancestral Recombination Graph
  – Haplotypes/Haplotype phasing
  – Population sub-structure
  – Structural polymorphisms
  – Medical genetics basis: Association mapping/pedigree analysis


Alleles

• Genotype: genetic makeup of an individual
• Allele: A specific variant at a location
  – The notion of alleles predates the concepts of the gene and of DNA.
  – Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed)
  – Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
• Humans are diploid: they have 2 copies of each chromosome.
  – They may have heterozygosity/homozygosity at a location
  – Other organisms (plants) have higher forms of ploidy.
  – Additionally, some sites might have 2 allelic forms, or even many allelic forms.


What causes variation in a population?

• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (gene conversion)
• Structural Polymorphisms


Single Nucleotide Polymorphisms

000001010111000110100101000101010010000000110001111000000101100110

Infinite Sites Assumption: Each site mutates at most once


Short Tandem Repeats

GCTAGATCATCATCATCATTGCTAG   (4 repeats)
GCTAGATCATCATCATTGCTAGTTA   (3 repeats)
GCTAGATCATCATCATCATCATTGC   (5 repeats)
GCTAGATCATCATCATTGCTAGTTA   (3 repeats)
GCTAGATCATCATCATTGCTAGTTA   (3 repeats)
GCTAGATCATCATCATCATCATTGC   (5 repeats)


STR can be used as a DNA fingerprint

• Consider a collection of regions with variable length repeats.

• Variable length repeats will lead to variable length DNA

• Vector of lengths is a finger-print

[Figure: matrix of repeat lengths, one row per individual and one column per locus; sample entries: 4 23 35 13 23 15 3]


Recombination

[Figure: parental chromosomes 00000000 and 11111111 recombine to give 00011111]


Gene Conversion

• Gene Conversion versus crossover
  – Hard to distinguish in a population


Structural polymorphisms

• Large scale structural changes (deletions/insertions/inversions) may occur in a population.


Topic 1: Basic Principles

• In a ‘stable’ population, the distribution of alleles obeys certain laws
  – Not really, and the deviations are interesting
• HW Equilibrium
  – (due to mixing in a population)
• Linkage (dis)-equilibrium
  – Due to recombination


Hardy Weinberg equilibrium

• Consider a locus with 2 alleles, A, a
• p (respectively, q) is the frequency of A (resp. a) in the population
• 3 Genotypes: AA, Aa, aa
• Q: What is the frequency of each genotype?

If various assumptions are satisfied (such as random mating, no natural selection), then
• $P_{AA} = p^2$
• $P_{Aa} = 2pq$
• $P_{aa} = q^2$
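
The genotype frequencies above can be checked numerically. The following is a minimal Python sketch; the function name `hardy_weinberg` and the example value p = 0.7 are illustrative choices, not part of the lecture.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at a bi-allelic locus under HW.

    p is the frequency of allele A; q = 1 - p is the frequency of a.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# Example: p = 0.7 (illustrative value)
freqs = hardy_weinberg(0.7)
for genotype, freq in freqs.items():
    print(genotype, round(freq, 4))
```

The three frequencies always sum to 1, since p² + 2pq + q² = (p + q)².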


Hardy Weinberg: why?

• Assumptions:
  – Diploid
  – Sexual reproduction
  – Random mating
  – Bi-allelic sites
  – Large population size, …

• Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.


Hardy Weinberg: Generalizations

• Multiple alleles with frequencies $\theta_1, \theta_2, \ldots, \theta_H$
  – By HW,
    $\Pr[\text{homozygous genotype } i] = \theta_i^2$
    $\Pr[\text{heterozygous genotype } i, j] = 2\theta_i\theta_j$
• Multiple loci?


Hardy Weinberg: Implications

• The allele frequency does not change from generation to generation. Why?

• It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease mutation?

• Males are 100 times more likely to have the “red” type of color blindness than females. Why?

• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
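
The first two questions can be worked through with the HW formulas. This is a sketch under the HW assumptions, treating all disease mutations as a single recessive allele with frequency q; the function names are illustrative.

```python
import math


def next_allele_frequency(p):
    """Frequency of A among the gametes of a population in HW proportions.

    Each AA individual contributes only A, each Aa contributes A half the time.
    """
    q = 1.0 - p
    return p * p + 0.5 * (2 * p * q)


def carrier_fraction(disease_prevalence):
    """Fraction of heterozygous carriers for a recessive disease.

    Under HW, prevalence = q^2, so q = sqrt(prevalence) and carriers = 2pq.
    """
    q = math.sqrt(disease_prevalence)
    p = 1.0 - q
    return 2 * p * q


unchanged = next_allele_frequency(0.3)   # equals 0.3 up to rounding
carriers = carrier_fraction(1 / 10_000)  # q = 0.01, so 2pq = 0.0198
print(unchanged, carriers)
```

So although only 1 in 10,000 individuals is affected, roughly 2% of the population would carry the mutation under these assumptions.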


Recombination

[Figure: parental chromosomes 00000000 and 11111111 recombine to give 00011111]


What if there were no recombinations?

• Life would be simpler
• Each individual sequence would have a single parent (even for higher ploidy)
• The relationship is expressed as a tree.


The Infinite Sites Assumption

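[Figure: tree of haplotypes. The root 00000000 acquires a mutation at site 3, giving 00100000; its descendants acquire mutations at site 8 (00100001) and at site 5 (00101000).]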

• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.

• Some phenotypes could be linked to the polymorphisms
• Some of the linkage is “destroyed” by recombination


Infinite sites assumption and Perfect Phylogeny

• Each site is mutated at most once in the history.

• All descendants must carry the mutated value, and all others must carry the ancestral value

[Figure: a tree with mutation i on one edge; leaves below that edge have 1 in position i, all other leaves have 0 in position i]


Perfect Phylogeny

• Assume an evolutionary model in which no recombination takes place, only mutation.

• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.


The 4-gamete condition

• A column i partitions the set of species into two sets i0, and i1

• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.

• EX: i is heterogeneous w.r.t. {A,D,E}

     i
A    0
B    0
C    0
D    1
E    1
F    1

i0 = {A, B, C}
i1 = {D, E, F}


4 Gamete Condition

• 4 Gamete Condition
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. i0 or i1.
  – Equivalent to
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all exist
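
The second form of the condition translates directly into a pairwise check. Below is a minimal Python sketch, assuming the data is a list of equal-length 0/1 rows (one row per species); the function name and the two small matrices are illustrative.

```python
from itertools import combinations


def four_gamete_violations(rows):
    """Return all column pairs (i, j) in which all four gametes occur."""
    n_cols = len(rows[0])
    violations = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:  # (0,0), (0,1), (1,0), (1,1) all present
            violations.append((i, j))
    return violations


# A matrix that satisfies the condition (illustrative data)
compatible = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
# A matrix that violates it in its only column pair
incompatible = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(four_gamete_violations(compatible))
print(four_gamete_violations(incompatible))
```

The check costs O(m² n) for n rows and m columns, which is fine for illustration but is not the fastest known test.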


4-gamete condition: proof (only if)

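• Depending on the edge on which mutation j occurs, j must be homogeneous w.r.t. either i0 or i1.

• (only if) Every perfect phylogeny satisfies the 4-gamete condition

• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

[Figure: a tree in which edge i separates the species into i0 and i1, with mutation j on an edge inside one of the two sides]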


Handling recombination

• A tree is not sufficient as a sequence may have 2 parents

• Recombination leads to loss of correlation between columns


Linkage (Dis)-equilibrium (LD)

• Consider sites A & B
• Case 1: No recombination
• Each new individual chromosome chooses a parent from the existing ‘haplotype’

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0

New chromosome: 1 0


Linkage (Dis)-equilibrium (LD)

• Consider sites A & B
• Case 2: diploidy and recombination
• Each new individual chooses a parent from the existing alleles

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0

New chromosome: 1 1


Linkage (Dis)-equilibrium (LD)

• Consider sites A & B
• Case 1: No recombination
• Each new individual chooses a parent from the existing ‘haplotype’
  – Pr[A,B = (0,1)] = 0.25
• Linkage disequilibrium
• Case 2: Extensive recombination
• Each new individual simply chooses an allele at each site independently
  – Pr[A,B = (0,1)] = 0.125
• Linkage equilibrium

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
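
The two probabilities can be recomputed from the eight haplotypes in the table. A small Python sketch follows; the helper names are illustrative.

```python
# The eight (A, B) haplotypes from the table
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]


def joint_prob(haps, a, b):
    """Case 1: copy a whole haplotype, so count rows equal to (a, b)."""
    return sum(1 for h in haps if h == (a, b)) / len(haps)


def product_of_marginals(haps, a, b):
    """Case 2: pick the allele at each site independently."""
    pa = sum(1 for h in haps if h[0] == a) / len(haps)
    pb = sum(1 for h in haps if h[1] == b) / len(haps)
    return pa * pb


no_recombination = joint_prob(haplotypes, 0, 1)                   # 2/8
extensive_recombination = product_of_marginals(haplotypes, 0, 1)  # (4/8) * (2/8)
print(no_recombination, extensive_recombination)
```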


LD

• In the absence of recombination,
  – Correlation between columns
  – The joint probability Pr[A=a,B=b] is different from P(a)P(b)
• With extensive recombination
  – Pr(a,b)=P(a)P(b)


Measures of LD

• Consider two bi-allelic sites with alleles marked with 0 and 1

• Define
  – $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
  – $P_{0*}$ = Pr[Allele 0 in locus 1]

• Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$

• $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \ldots$
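
The claim that every allele pairing gives the same D can be checked on a small sample. This Python sketch assumes haplotypes are given as (locus 1, locus 2) pairs; the data and the function name `ld_coefficient` are illustrative.

```python
def ld_coefficient(haps, a=0, b=0):
    """|P_ab - P_a* P_*b| for alleles a (locus 1) and b (locus 2)."""
    n = len(haps)
    p_ab = sum(1 for x, y in haps if x == a and y == b) / n
    p_a = sum(1 for x, _ in haps if x == a) / n
    p_b = sum(1 for _, y in haps if y == b) / n
    return abs(p_ab - p_a * p_b)


# Illustrative sample of eight two-site haplotypes
sample = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

d_values = {(a, b): ld_coefficient(sample, a, b) for a in (0, 1) for b in (0, 1)}
print(d_values)
```

For this sample all four pairings give D = 0.125.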


LD over time

• With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
  – Let $D^{(t)}$ = LD at time t
  – $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
  – $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (HW)
  – $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
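
The recursion can be iterated and compared with the closed form. This is a sketch that takes D as a signed quantity and holds the allele frequencies fixed, as HW implies; the starting frequencies and r = 0.1 are illustrative values.

```python
def simulate_ld(p00, p0_, p_0, r, generations):
    """Iterate P00(t) = (1 - r) P00(t-1) + r P0* P*0 and record D(t)."""
    trace = []
    for _ in range(generations + 1):
        trace.append(p00 - p0_ * p_0)       # signed D at this generation
        p00 = (1 - r) * p00 + r * p0_ * p_0  # marginals p0_, p_0 stay fixed
    return trace


r = 0.1
trace = simulate_ld(p00=0.25, p0_=0.5, p_0=0.75, r=r, generations=50)
closed_form = [(1 - r) ** t * trace[0] for t in range(51)]
print(trace[0], trace[10], closed_form[10])
```

With r = 0.1 the disequilibrium halves roughly every 6 to 7 generations, since (0.9)^6.6 ≈ 0.5.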


LD over distance

• Assumption
  – Recombination rate increases linearly with distance
  – LD decays exponentially with distance.

• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity

• This simple fact is the basis of disease association mapping.


LD and disease mapping

• Consider a mutation that is causal for a disease.
• The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
• Consider every polymorphism, and check:
  – There might be too many polymorphisms
  – Multiple mutations (even at a single locus) that lead to the same disease

• Instead, consider a dense sample of polymorphisms that span the genome


LD can be used to map disease genes

• LD decays with distance from the disease allele.

• By plotting LD, one can short list the region containing the disease gene.

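[Figure: LD plotted along the chromosome; example marker alleles 0 1 1 0 0 1 in six individuals with disease status D N N D D N]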


LD and disease gene mapping problems

• Marker density?
• Complex diseases
• Population sub-structure


Population Genetics

• Often we look at these equilibria (Linkage/HW) and their deviations in specific populations

• These deviations offer insight into evolution.

• However, what is Normal?
• A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.


Topic 2: Simulating population data

• We described various population genetic concepts (HW, LD), and their applicability

• The values of these parameters depend critically upon the population assumptions.
  – What if we do not have infinite populations
  – No random mating (Ex: geographic isolation)
  – Sudden growth
  – Bottlenecks
  – Ad-mixture

• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?


Wright Fisher Model of Evolution

• Fixed population size from generation to generation

• Random mating
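
A forward simulation of this model can be sketched in a few lines of Python. The sketch assumes haploid chromosomes, a single bi-allelic site, and no mutation; the function names and the population of 20 chromosomes are illustrative.

```python
import random


def next_generation(alleles, rng):
    """One Wright-Fisher step: every child copies a uniformly random parent."""
    n = len(alleles)  # population size stays fixed
    parents = [rng.randrange(n) for _ in range(n)]
    return [alleles[p] for p in parents], parents


def simulate(alleles, generations, rng):
    """Return the list of populations, generation 0 first."""
    history = [list(alleles)]
    for _ in range(generations):
        alleles, _parents = next_generation(alleles, rng)
        history.append(alleles)
    return history


rng = random.Random(0)  # fixed seed so the run is repeatable
history = simulate([0] * 10 + [1] * 10, generations=50, rng=rng)
freq_of_1 = [sum(generation) / len(generation) for generation in history]
print(freq_of_1[:5])
```

The `parents` list returned by each step is the genealogy ("who begat whom"), which is the object the coalescent reasons about.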


Coalescent model

• Insight 1:
  – Separate the genealogy from allelic states (mutations)
  – First generate the genealogy (who begat whom)
  – Assign an allelic state (0) to the ancestor. Drop mutations on the branches.


Coalescent theory

• Insight 2:
  – Much of the genealogy is irrelevant, because it disappears.
  – Better to go backwards


Coalescent theory (Kingman)

• Input
  – (Fixed population (N individuals), random mating)
• Consider 2 individuals.
  – Probability that they coalesce in the previous generation (have the same parent) = $\frac{1}{N}$
• Probability that they do not coalesce after t generations = $\left(1 - \frac{1}{N}\right)^t \cong e^{-t/N}$


Coalescent theory

• Consider k individuals.
  – Probability that no pair coalesces after 1 generation $\cong 1 - \frac{\binom{k}{2}}{N}$
  – Probability that no pair coalesces after t generations
    $\left(1 - \frac{\binom{k}{2}}{N}\right)^t \cong e^{-\binom{k}{2} t/N} = e^{-\binom{k}{2}\tau}$
• $\tau = t/N$ is time in units of N generations


Coalescent approximation

• Insight 3:
  – Topology is independent of coalescent times
  – If you have n individuals, generate a random binary topology
• Iterate (until one individual)
  – Pick a pair at random, and coalesce
• Insight 4:
  – To generate coalescent times, there is no need to go back generation by generation


Coalescent approximation

• At any step, there are 1 <= k <= n individuals
• To generate time to coalesce (k to k-1 individuals)
  – Pick a number from exponential distribution with rate k(k-1)/2
  – Mean time to coalescence = 2/(k(k-1))
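
Combining the random pairing with the exponential waiting times gives a compact simulator. The following Python sketch measures time in units of N generations; the event tuple layout and the function name are illustrative choices.

```python
import random


def simulate_coalescent(n, rng):
    """Simulate a coalescent genealogy for n sampled individuals.

    Returns a list of (time, child_a, child_b, new_node) events.
    Leaves are numbered 0..n-1, internal nodes n..2n-2.
    """
    lineages = list(range(n))
    next_id = n
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)  # waiting time, mean 2/(k(k-1))
        a, b = rng.sample(lineages, 2)          # pick a pair at random
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)                # the pair's common ancestor
        events.append((t, a, b, next_id))
        next_id += 1
    return events


events = simulate_coalescent(6, random.Random(42))
for event in events:
    print(event)
```

A sample of n individuals always produces exactly n - 1 coalescence events, and N never appears because time is already scaled by it.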


Typical coalescents

• 4 random examples with n=6 (Note that we do not need to specify N. Why?)

• Expected time to coalesce?


Coalescent properties

• Expected time for the last step = 1
• The last step is half of the total time to coalesce
• Studying larger number of individuals does not change numbers tremendously
• EX: Number of mutations in a population is proportional to the total branch length of the tree
  – $E(T_{tot})$
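
These expectations follow from the mean waiting time 2/(k(k-1)) while k lineages remain. The Python sketch below sums them; the total branch length is computed by weighting each interval by its k lineages, which is my own derivation from that mean and not a formula recovered from the slide.

```python
def expected_tmrca(n):
    """Expected time to the most recent common ancestor, in units of N generations."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))


def expected_total_branch_length(n):
    """Expected sum of all branch lengths: k lineages persist for each interval."""
    return sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))


for n in (2, 6, 100):
    print(n, expected_tmrca(n), expected_total_branch_length(n))
```

The first sum telescopes to 2(1 - 1/n), so it approaches 2 as n grows, while the last step alone contributes 1. The second sum equals 2(1 + 1/2 + ... + 1/(n-1)), which grows only logarithmically in n.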


Variants (exponentially growing populations)

• If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?

• With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.


Simulating population data

• Generate a coalescent (Topology + Branch lengths)

• For each branch, drop mutations at the mutation rate, in proportion to the branch length

• Generate sequence data
• Note that the resulting sequence data admits a perfect phylogeny.
• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)

• Also, note that all pairs of positions are correlated (should have high LD).
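
The steps above can be sketched end to end in Python. The sketch assumes the infinite sites model (every mutation creates a new site) and a mutation rate of `theta / 2` per lineage per unit of scaled time; the parameter name `theta` and that scaling are assumptions, since the slide's rate symbol did not survive.

```python
import random


def simulate_sequences(n, theta, rng):
    """Return n binary sequences generated on a random coalescent tree.

    theta must be positive. The ancestor carries state 0 at every site.
    """
    # Each active lineage is represented by the set of samples below it.
    lineages = [{i} for i in range(n)]
    sites = []  # one set of carrier samples per mutation
    while len(lineages) > 1:
        k = len(lineages)
        dt = rng.expovariate(k * (k - 1) / 2)  # time until the next coalescence
        for leaves in lineages:
            # Poisson process of rate theta/2 along this branch segment
            elapsed = rng.expovariate(theta / 2)
            while elapsed < dt:
                sites.append(set(leaves))  # all descendants inherit the 1
                elapsed += rng.expovariate(theta / 2)
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return [[1 if i in carriers else 0 for carriers in sites] for i in range(n)]


sequences = simulate_sequences(5, theta=2.0, rng=random.Random(0))
for seq in sequences:
    print("".join(map(str, seq)))
```

Because every site's carriers form a clade of one tree, any two columns are either nested or disjoint, which is the perfect phylogeny property noted above.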


Coalescent with Recombination

• An individual may have one parent, or 2 parents


ARG: Coalescent with recombination

• Given: mutation rate, recombination rate, population size 2N (diploid), sample size n.

• How can you generate the ARG (topology+branch lengths) efficiently?

• How will you generate sequences for n individuals?

• Given sequence data, can you reconstruct the ARG (topology)?


Recombination

• Define r as the probability of recombining.
  – Note that the parameter is a scaled value which will be defined later
• Assume k individuals in a generation. The following might happen:
  1. An individual arises because of a recombination event between two individuals (It will have 2 parents).
  2. Two individuals coalesce
  3. Neither (Each individual has a distinct parent)
  4. Multiple events (low probability)


Recombination

• We ignore the case of multiple (> 1) events in one generation

• Pr (No recombination) = $1 - kr$
• Pr (No coalescence) = $1 - \frac{\binom{k}{2}}{2N}$

• Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate $2Nkr$, and decreases with rate $\binom{k}{2}$

• The value 2rN is usually small, and therefore, the process will ultimately coalesce to a single individual (MRCA)


ARG

• Let k = n
• Define $\rho = 4rN$
• Iterate until k = 1
  – Choose time from an exponential distribution with rate $\frac{k\rho}{2} + \binom{k}{2}$
  – Pick event as recombination with probability $\frac{\rho}{\rho + (k-1)}$
  – If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce.
  – Update k, and continue

What is the flaw in this procedure?
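
The event loop of this procedure can be sketched in Python. The sketch tracks only the number of lineages and the event types, taking `rho` as the scaled recombination rate 4rN with recombination rate k·rho/2 and coalescence rate k(k-1)/2 in units of 2N generations; it does not choose individuals or positions, and it does not address the flaw asked about above.

```python
import random


def simulate_arg_events(n, rho, rng):
    """Return the (time, event_type) sequence until a single lineage remains."""
    k = n
    t = 0.0
    events = []
    while k > 1:
        recomb_rate = k * rho / 2
        coal_rate = k * (k - 1) / 2
        total = recomb_rate + coal_rate
        t += rng.expovariate(total)            # time to the next event
        if rng.random() < recomb_rate / total:  # equals rho / (rho + k - 1)
            events.append((t, "recombination"))
            k += 1                              # one lineage splits into two parents
        else:
            events.append((t, "coalescence"))
            k -= 1                              # two lineages merge
    return events


events = simulate_arg_events(5, rho=0.5, rng=random.Random(3))
n_recomb = sum(1 for _, kind in events if kind == "recombination")
print(len(events), "events,", n_recomb, "recombinations")
```

Every recombination must be undone by one extra coalescence, so the number of coalescences always exceeds the number of recombinations by n - 1.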


Simulating sequences on the ARG

• Generate topology and branch lengths as before

• For each recombination, generate a position.

• Next generate mutations at random on branch lengths
  – For a mutation, select a position as well.


Recombination events and ρ

• Given the scaled recombination rate ρ and n, can you compute the expected number of recombination events?
• It can be shown that E(n, ρ) = ρ log(n)
• The question that people are really interested in
  – Given a set of sequences from a population, compute the recombination rate ρ
  – Given a population, reconstruct the most likely history (as an ancestral recombination graph)
• We will address this question in subsequent lectures


An algorithm for constructing a perfect phylogeny

• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.

• In any tree, each node (except the root) has a single parent.
  – It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of the nodes containing multiple children.

• Stop if all columns have been considered.


Inclusion Property

• For any pair of columns i,j
  – i < j if and only if i1 ⊇ j1
• Note that if i<j then the edge containing i is an ancestor of the edge containing j

[Figure: edge i drawn above edge j in the tree]


Example

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

[Figure: root r with children A, B, C, D, E]

Initially, there is a single clade r, and each node has r as its parent


Sort columns

• Sort columns according to the inclusion property (note that the columns are already sorted here).

• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
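
The sorting step can be sketched in Python by reading each column as a binary number with row 1 as the most significant bit. The function name `sort_columns` is illustrative, and column indices are 0-based here.

```python
def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value."""
    n_cols = len(matrix[0])

    def column_value(j):
        # First row is the most significant bit
        return int("".join(str(row[j]) for row in matrix), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)


# Rows A..E of the example matrix
matrix = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
order = sort_columns(matrix)
print(order)
```

The five columns have values 21, 20, 10, 4, and 2, so the example is indeed already sorted. A column whose set of 1s contains another column's set always has the larger value, which is why this ordering respects the inclusion property.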


Add first column

• In adding column i
  – Check each edge and decide which side you belong.
  – Finally add a node if you can resolve a clade

    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

[Figure: after adding column 1, a new node u below r holds A, C, E; B and D remain attached to r]


Adding other columns

• Add other columns on edges using the ordering property

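    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0

[Figure: tree rooted at r. Edge 1 leads to the clade {A, C, E}; below it, edge 2 leads to {A, C}, and below that, edge 4 leads to C. Edge 3 leads to the clade {B, D}; below it, edge 5 leads to D.]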
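
One way to realize the construction is to record, for each column, the column on the edge directly above it. The Python sketch below is a compact variant in the style of Gusfield's algorithm rather than a line-by-line transcription of the slides; it assumes 0 is the ancestral state, uses 0-based column indices, and all names are illustrative.

```python
def build_perfect_phylogeny(matrix, names):
    """Return (parent, leaf_edge), or None if no perfect phylogeny exists.

    parent[j]    = column on the edge above column j's edge (None = root)
    leaf_edge[s] = last column on the path from the root to species s
    """
    n_cols = len(matrix[0])
    # Sort columns by decreasing binary value (first row = most significant bit)
    order = sorted(
        range(n_cols),
        key=lambda j: int("".join(str(row[j]) for row in matrix), 2),
        reverse=True,
    )
    parent = {}
    leaf_edge = {}
    for name, row in zip(names, matrix):
        previous = None  # start at the root
        for j in order:
            if row[j] == 1:
                # Column j must hang below the same edge in every row
                if parent.setdefault(j, previous) != previous:
                    return None
                previous = j
        leaf_edge[name] = previous
    return parent, leaf_edge


matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
result = build_perfect_phylogeny(matrix, "ABCDE")
print(result)
```

In 1-based terms the result places edge 2 below edge 1, edge 4 below edge 2, and edge 5 below edge 3, with edges 1 and 3 attached to the root.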


Unrooted case

• Switch the values in each column, so that 0 is the majority element.

• Apply the algorithm for the rooted case
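
The relabelling step can be sketched as follows in Python; ties (exactly half 1s) are left unflipped, which is an arbitrary choice of this sketch, and the function name is illustrative.

```python
def majority_to_zero(matrix):
    """Flip every column in which 1 is the strict majority value."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    return [[1 - v if flip[j] else v for j, v in enumerate(row)] for row in matrix]


matrix = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
relabelled = majority_to_zero(matrix)
for row in relabelled:
    print(row)
```

Here only the first column has a majority of 1s (three of five rows), so it is the only column that gets flipped.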
