ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon
Traditional methods for building phylogeny
Requirements:
• High coverage
• Assembly
• Detection of putative orthologous genes
• Alignment
• Phylogeny from tiny portion of the whole genome
• Genome scale multi-sequence alignment is difficult
Alignment-free methods for building phylogeny
• Typically from assembled genomes
• De novo assembly with short reads?
• Mainly on closely related prokaryotic genomes
• No confidence assessment (e.g. bootstrapping)
Overview
• Assembly and Alignment-Free method (AAF)
• Calculate phylogenetic distances using whole genome short read sequencing data
• Method validation
• Genome complexity
• Different genome sizes
• Sequencing errors
• Range of sequencing coverage
• 12 mammal species
• 21 tropical tree species
• Comparison with andi
AAF method
• Calculate pairwise genetic distances between samples using the number of evolutionary changes between their genomes, represented by the number of k-mers that differ between genomes.
• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix
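The k-mer comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the AAF implementation: the function names `kmers` and `shared_kmer_table` are hypothetical, each sample is assumed to be a plain list of read strings, and reverse complements are ignored.

```python
from itertools import combinations


def kmers(reads, k):
    """Return the set of distinct k-mers found in a list of reads."""
    found = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def shared_kmer_table(samples, k):
    """Count distinct k-mers per sample and shared k-mers per sample pair.

    `samples` maps a sample name to its list of reads (an assumed layout).
    """
    sets = {name: kmers(reads, k) for name, reads in samples.items()}
    totals = {name: len(s) for name, s in sets.items()}
    shared = {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }
    return totals, shared


if __name__ == "__main__":
    toy = {"A": ["ACGTACGT"], "B": ["ACGTTCGT"]}
    print(shared_kmer_table(toy, k=3))
```

The per-pair counts produced this way are the raw input to a distance calculation; a tree-building step on the resulting matrix is not shown here.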
AAF method - Evolutionary model
• The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd).
• If only substitutions occur and all k-mers are unique, then all species have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt.
• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt
• Insertion: loss of (k – 1) or gain of (l + k – 1) k-mers
• Deletion: loss of (l + k – 1) or gain of (k – 1) k-mers
• Greater effect
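Since the slide gives ns/nt as the estimate of exp(−kd), solving for d gives d = −(1/k) ln(ns/nt). The sketch below only rearranges that expression; the function name `aaf_distance` and the handling of the zero-overlap case are assumptions made for illustration, and how nt is chosen when the two species have different k-mer totals is not specified here.

```python
import math


def aaf_distance(n_shared, n_total, k):
    """Solve exp(-k * d) = n_shared / n_total for d.

    Returns infinity when no k-mers are shared (an assumed convention).
    """
    if n_total <= 0 or n_shared < 0 or n_shared > n_total:
        raise ValueError("need 0 <= n_shared <= n_total and n_total > 0")
    if n_shared == 0:
        return math.inf
    return -math.log(n_shared / n_total) / k


if __name__ == "__main__":
    # Half of the k-mers shared, k = 21 (toy numbers, not real data).
    print(aaf_distance(50, 100, 21))
```

Because k divides the logarithm, the same shared fraction implies a smaller distance for larger k, which is consistent with a longer k-mer being more likely to contain a mutation.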
K-mer sensitivity and homoplasy
• No assembly -> not all indels identified
• If k-mer covers multiple substitutions
• Shorter k-mers -> better sensitivity
• Shorter k-mers -> same k-mers from evolutionarily different regions
• Homoplasy
K-mer homoplasy
• k=15
• Genome size > 5 × 10^8 => same k-mers occur randomly in other species
• May incorrectly inflate the proportion of shared k-mers
• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size
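A rough way to see why k = 15 is too short for a genome of about 5 × 10^8 bases is to ask how likely a given k-mer is to turn up by chance. The sketch below assumes a random genome with equal base frequencies and uses a Poisson approximation; the function name `chance_kmer_fraction` is hypothetical and this is not the prediction formula used on the slides.

```python
import math


def chance_kmer_fraction(genome_size, k):
    """Approximate probability that a specific k-mer occurs at least once
    in a random genome of `genome_size` bases with uniform base composition.

    There are 4**k possible k-mers, so the expected number of occurrences
    of any one of them is about genome_size / 4**k (Poisson approximation).
    """
    expected_hits = genome_size / 4 ** k
    return 1.0 - math.exp(-expected_hits)


if __name__ == "__main__":
    for k in (11, 15, 19, 25):
        print(k, chance_kmer_fraction(5e8, k))
```

Under these assumptions the chance of a random match is about 0.37 at k = 15 for a 5 × 10^8 base genome and falls below one in a million at k = 25. Biased GC content would make chance matches more likely than this uniform model suggests.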
ph
• Prediction of the ratio ns/nt
• Large genomes and small k: ph = 1
• All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
• GC content
• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
Mathematical prediction
Random ancestral sequence
Real (non-random) sequence
Assembly-free
• Sampling error caused by low genome coverage
• The actual number of k-mers will be under-represented given low sequencing coverage
• Sequencing errors
• Loss of true k-mers and the gain of false k-mers
• Filtering = remove singletons
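Singleton filtering can be illustrated with a simple counter: k-mers seen only once across all reads of a sample are treated as likely sequencing errors and dropped. This is a minimal sketch with hypothetical function names (`count_kmers`, `filter_singletons`), not the counting code used by AAF.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a list of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts, min_count=2):
    """Keep only k-mers observed at least `min_count` times."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # The third read carries one changed base, standing in for an error.
    reads = ["ACGTA", "ACGTA", "ACGAA"]
    print(filter_singletons(count_kmers(reads, k=4)))
```

In the toy input the k-mers unique to the third read are removed. The same threshold would also remove true k-mers that happen to be sampled only once, which is why coverage matters.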
Seq errors p=observed/true
Coverage 5-8 sufficient to observe all true k-mers when filtering
=> Tip corrections
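One way to get a feel for the coverage figures above is a simple Poisson model: if the number of times a true k-mer is sampled is Poisson distributed with mean equal to the k-mer coverage, a singleton filter keeps that k-mer only when it is seen at least twice. This model, the function name `prob_kmer_retained`, and the use of k-mer coverage rather than sequencing coverage are all assumptions for illustration and ignore sequencing errors.

```python
import math


def prob_kmer_retained(kmer_coverage, min_count=2):
    """P(X >= min_count) for X ~ Poisson(kmer_coverage).

    Models the chance that a true k-mer survives a filter that discards
    k-mers seen fewer than `min_count` times.
    """
    p_below = sum(
        math.exp(-kmer_coverage) * kmer_coverage ** i / math.factorial(i)
        for i in range(min_count)
    )
    return 1.0 - p_below


if __name__ == "__main__":
    for coverage in (1, 2, 5, 8):
        print(coverage, prob_kmer_retained(coverage))
```

Under this model roughly 96% of true k-mers survive filtering at a mean k-mer coverage of 5 and more than 99% at 8, while at a coverage of 1 most are lost.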
Filter only singletons?
Bootstrapping
Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap” – take rows with probability 1/k
OR
Two-stage parametric bootstrap
• Estimate the variances in distances between species caused by sampling and evolutionary variation
• Independent of genome size
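The two nonparametric options listed above can be sketched as follows. Both functions are hypothetical illustrations: the first resamples reads with replacement, and the second reflects one reading of "take rows with probability 1/k", namely keeping each row of a k-mer table independently with probability 1/k. The two-stage parametric bootstrap is not shown.

```python
import random


def bootstrap_reads(reads, rng):
    """Draw len(reads) reads with replacement from the original reads."""
    return rng.choices(reads, k=len(reads))


def block_bootstrap_rows(rows, k, rng):
    """Keep each k-mer table row independently with probability 1/k.

    This is an assumed interpretation of the block bootstrap on the slide;
    thinning by 1/k roughly accounts for k overlapping k-mers per site.
    """
    keep_probability = 1.0 / k
    return [row for row in rows if rng.random() < keep_probability]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    print(bootstrap_reads(["r1", "r2", "r3", "r4"], rng))
    print(len(block_bootstrap_rows(list(range(10000)), 20, rng)))
```

Each replicate would then be passed through the same distance and tree-building steps as the original data, and support for a node taken as the share of replicates that recover it.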
Bushbaby (galago)
Tarsier
Recently published phylogeny of primates
Assembled genomes, k=19
Assembled genomes, k=21
Simulated reads
Real data – tropical trees
Intsia palembanica
Advantages
• Low coverage requirements
• Low computational demands
• 12 primates: 25 GB RAM, 12 threads
Limitations
• Loss of k-mer sensitivity
• Deep nodes
• Location of mutations
Distance computing for 73 Escherichia strains
• AAF: 32 + 76 min = 1 h 48 min
• andi: 21 min
AAF andi