ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA

Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon


Traditional methods for building phylogeny

Requirements:

• High coverage

• Assembly

• Detection of putative orthologous genes

• Alignment

• Phylogeny is built from only a tiny portion of the whole genome

• Genome-scale multiple sequence alignment is difficult

Alignment-free methods for building phylogeny

• Typically from assembled genomes

• De novo assembly with short reads?

• Mainly on closely related prokaryotic genomes

• No confidence assessment (e.g. bootstrapping)

Overview

• Assembly and Alignment-Free method (AAF)

• Calculate phylogenetic distances using whole genome short read sequencing data

• Method validation

  • Genome complexity

  • Different genome sizes

  • Sequencing errors

  • Range of sequencing coverage

• 12 mammal species

• 21 tropical tree species

• Comparison with andi

AAF method

• Calculate pairwise genetic distances between samples from the number of evolutionary changes between their genomes, which is represented by the number of k-mers that differ between genomes.

• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix
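The counting step behind the two bullets above can be sketched in Python. This is a minimal illustration and not the AAF implementation: the function names are hypothetical, the sequences are toy strings rather than reads, and merging each k-mer with its reverse complement (canonical k-mers) is an assumption made here.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_set(seq, k):
    """Set of canonical k-mers found in one sequence."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def shared_and_different(samples, k):
    """For every pair of samples, count shared k-mers and k-mers found in only one."""
    sets = {name: kmer_set(seq, k) for name, seq in samples.items()}
    result = {}
    for a, b in combinations(sorted(sets), 2):
        shared = len(sets[a] & sets[b])
        different = len(sets[a] ^ sets[b])
        result[(a, b)] = (shared, different)
    return result

# Toy input (hypothetical): two sequences that differ at one position.
samples = {"A": "ACGTTGCATGCCGATA", "B": "ACGTTGCATGACGATA"}
print(shared_and_different(samples, k=5))
```

The pairwise counts would then be converted to distances and passed to any distance-based tree-building method.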

AAF method - Evolutionary model

• The probability that no mutation occurs within a given k-mer between species A and B is exp(−kd), where d is the distance between A and B in mutations per nucleotide.

• If only substitutions occur and all k-mers are unique, then all species have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt, where ns is the number of shared k-mers.

• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt

• Insertion of length l: loss of (k − 1) or gain of (l + k − 1) k-mers

• Deletion of length l: loss of (l + k − 1) or gain of (k − 1) k-mers

• Greater effect
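The k-mer counts on this slide can be checked with a small Python sketch. It is an illustration under stated assumptions, not the authors' code: the genome is a list of unique integers standing in for nucleotides, so every k-mer is unique as the model assumes, and the function names are hypothetical. The distance function simply solves exp(−kd) = ns/nt for d.

```python
import math

def kmers(seq, k):
    """Set of k-mers (as tuples) of a sequence of hashable symbols."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def lost_gained(before, after, k):
    """Number of k-mers lost from `before` and gained in `after`."""
    a, b = kmers(before, k), kmers(after, k)
    return len(a - b), len(b - a)

def distance(ns, nt, k):
    """Solve exp(-k * d) = ns / nt for d."""
    return -math.log(ns / nt) / k

k, l = 5, 3
genome = list(range(100))              # unique symbols, so all k-mers are unique
novel = [1000 + i for i in range(l)]   # l symbols not present in the genome

substituted = genome[:50] + [999] + genome[51:]
inserted = genome[:50] + novel + genome[50:]
deleted = genome[:50] + genome[50 + l:]

print(lost_gained(genome, substituted, k))  # (5, 5): one substitution changes k k-mers
print(lost_gained(genome, inserted, k))     # (4, 7): loss of k - 1, gain of l + k - 1
print(lost_gained(genome, deleted, k))      # (7, 4): loss of l + k - 1, gain of k - 1

ns = len(kmers(genome, k) & kmers(substituted, k))
nt = len(kmers(genome, k))
print(distance(ns, nt, k))
```

With indels the two genomes no longer have the same total number of k-mers, so the choice of nt has to be made explicitly; the sketch only applies the distance to the substitution case.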

K-mer sensitivity and homoplasy

• No assembly -> not all indels identified

• If a k-mer covers multiple substitutions, they are counted as a single difference

• Shorter k-mers -> better sensitivity

• Shorter k-mers -> the same k-mers arise from evolutionarily different regions (homoplasy)

K-mer homoplasy

• k=15

• Genome size > 5 × 10^8 => the same k-mers occur randomly in other species

• May incorrectly inflate the proportion of shared k-mers

• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size
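To see why the threshold depends on genome size, it helps to count how many distinct k-mers exist for a given k. The sketch below assumes that a k-mer and its reverse complement are treated as the same k-mer (an assumption, not something stated on the slide); the function name is hypothetical. Under that assumption the count for k = 15 is about 5.4 × 10^8, which is consistent with the genome size quoted above.

```python
def canonical_kmer_space(k):
    """Number of distinct k-mers when a k-mer and its reverse complement are merged."""
    total = 4 ** k
    if k % 2 == 1:
        # Odd k: no k-mer equals its own reverse complement.
        return total // 2
    # Even k: 4^(k/2) k-mers are their own reverse complement.
    palindromes = 4 ** (k // 2)
    return (total + palindromes) // 2

for k in (11, 15, 19, 21, 25):
    print(k, canonical_kmer_space(k))
```

Once the genome holds more positions than there are possible k-mers, most k-mers are expected to turn up by chance alone, which is the situation a larger k avoids.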

ph

• Prediction of the ratio ns/nt

• Large genomes and small k: ph = 1

• All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.

• GC content

• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
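A rough feel for how quickly chance matches vanish as k grows can be had from a simple occupancy model. This is not the formula behind ph in the talk; it is a simplified sketch that assumes a uniformly random genome with equal base frequencies and about 4^k / 2 possible k-mers, and the function name is hypothetical. Biased GC content would make chance matches more likely than this model suggests.

```python
import math

def chance_occurrence(genome_size, k):
    """Approximate probability that a given k-mer occurs by chance in a random genome.

    Assumes uniform base frequencies and roughly 4^k / 2 possible k-mers.
    """
    n_possible = 4 ** k / 2
    # 1 - exp(-G / N), written with expm1 for accuracy when the ratio is tiny.
    return -math.expm1(-genome_size / n_possible)

genome_size = 3e9  # hypothetical genome size, roughly that of a mammal
for k in (11, 15, 19, 21, 25):
    print(k, chance_occurrence(genome_size, k))
```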

Figure: mathematical prediction. Panels: random ancestral sequence; real (non-random) sequence.

Assembly-free

• Sampling error caused by low genome coverage

• The actual number of k-mers will be under-represented given low sequencing coverage

• Sequencing errors

  • Loss of true k-mers and the gain of false k-mers

  • Filtering = remove singletons
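Singleton filtering can be sketched as follows. The example is a minimal Python illustration with hypothetical function names and toy reads; it counts k-mers on one strand only and is not the counting tool used by AAF.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_singletons(counts, min_count=2):
    """Keep k-mers seen at least `min_count` times; singletons are treated as likely errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}

# Toy reads (hypothetical): the third read carries one sequencing error (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
print(sorted(filter_singletons(counts)))  # ['ACGT', 'CGTA', 'GTAC']
```

The error introduces two k-mers that appear only once (CGTT and GTTC), and both are dropped. The trade-off is that a true k-mer sampled only once is dropped too, which is why filtering needs enough coverage.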

Seq errors p=observed/true

Coverage 5-8 sufficient to observe all true k-mers when filtering

=> Tip corrections

Filter only singletons?


Bootstrapping

Nonparametric bootstrap

1) Resample original reads with replacement

2) “Block bootstrap”: take rows with probability 1/k

OR

Two-stage parametric bootstrap

• Estimate the variances in distances between species caused by sampling and evolutionary variation

• Independent of genome size
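The two nonparametric steps listed under "Nonparametric bootstrap" can be sketched in Python. This is an illustration under assumptions, not the AAF code: the function names are hypothetical, and "rows" is taken to mean the rows of a table with one row per k-mer and one presence column per sample. The two-stage parametric bootstrap is not sketched here.

```python
import random

def resample_reads(reads, rng):
    """Step 1: draw as many reads as in the original set, with replacement."""
    return rng.choices(reads, k=len(reads))

def block_bootstrap_rows(rows, k, rng):
    """Step 2: keep each row of the k-mer table independently with probability 1/k."""
    return [row for row in rows if rng.random() < 1.0 / k]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

reads = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
print(resample_reads(reads, rng))

# Hypothetical k-mer table: (k-mer, present in sample A, present in sample B).
rows = [("ACGT", 1, 1), ("CGTA", 1, 0), ("GTAC", 1, 1), ("TACG", 0, 1)]
print(block_bootstrap_rows(rows, k=2, rng=rng))
```

Each resampled data set would be run through the distance and tree-building steps again, and support for a node is the share of replicates in which it appears.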

Figures (labeled taxa include bushbaby (galago) and tarsier): recently published phylogeny of primates; assembled genomes, k=19; assembled genomes, k=21; simulated reads.

Real data – tropical trees

Intsia palembanica

Advantages

• Low coverage requirements

• Low computational demands

  • 12 primates: 25 GB RAM, 12 threads

Limitations

• Loss of k-mer sensitivity

• Deep nodes

• Location of mutations

Distance computing for 73 Escherichia strains

• AAF: 32 + 76 min = 1 h 48 min

• andi: 21 min

Figure panels: AAF; andi.
