Page 1

ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA

Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon

Page 2

Traditional methods for building phylogeny

Requirements:
• High coverage
• Assembly
• Detection of putative orthologous genes
• Alignment

• Phylogeny from a tiny portion of the whole genome
• Genome-scale multi-sequence alignment is difficult

Page 3

Alignment-free methods for building phylogeny

• Typically from assembled genomes
• De novo assembly with short reads?
• Mainly on closely related prokaryotic genomes
• No confidence assessment (e.g. bootstrapping)

Page 4

Overview

• Assembly and Alignment-Free method (AAF)
• Calculate phylogenetic distances using whole genome short read sequencing data
• Method validation
  • Genome complexity
  • Different genome sizes
  • Sequencing errors
  • Range of sequencing coverage
• 12 mammal species
• 21 tropical tree species
• Comparison with andi

Page 5

AAF method

• Calculate pairwise genetic distances between each pair of samples using the number of evolutionary changes between their genomes, which are represented by the number of k-mers that differ between genomes.
• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix.
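The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the AAF implementation: the helper names (`kmer_set`, `kmer_differences`, `pairwise_counts`) and the toy sequences are hypothetical, and the sketch works on complete sequences rather than on short reads.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_differences(kmers_a, kmers_b):
    """Return (shared, only_in_a, only_in_b) k-mer counts for one pair."""
    shared = len(kmers_a & kmers_b)
    only_a = len(kmers_a - kmers_b)
    only_b = len(kmers_b - kmers_a)
    return shared, only_a, only_b


def pairwise_counts(samples, k):
    """Compare every pair of samples; keys are (name_a, name_b) tuples."""
    sets = {name: kmer_set(seq, k) for name, seq in samples.items()}
    return {
        (a, b): kmer_differences(sets[a], sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


# Toy sequences (hypothetical): B differs from A by a single substitution.
samples = {
    "A": "ACGTACGGTCA",
    "B": "ACGTACGTTCA",
    "C": "TTGACCGGATC",
}
counts = pairwise_counts(samples, k=4)
for pair, (shared, only_a, only_b) in counts.items():
    print(pair, shared, only_a, only_b)
```

Even in this toy case a single substitution removes several k-mers at once, which is why k-mer counts need a model to be turned into distances.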

Page 6

AAF method - Evolutionary model

• The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd).
• If only substitutions occur and all k-mers are unique, then all species will have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt.
• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt
• Insertion: loss of (k − 1) or gain of (l + k − 1) k-mers
• Deletion: loss of (l + k − 1) or gain of (k − 1) k-mers
• Greater effect
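Setting exp(−kd) equal to its estimate ns/nt and solving for d gives d = −(1/k) ln(ns/nt). A minimal Python sketch of that rearrangement follows. The function name `kmer_distance` is hypothetical, and the sketch covers only the substitution-only case from the slide, with no correction for indels, coverage, or sequencing errors.

```python
import math


def kmer_distance(ns, nt, k):
    """Distance d solving exp(-k * d) = ns / nt.

    ns: number of k-mers shared by the two species
    nt: total number of k-mers
    k:  k-mer length
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 < ns <= nt:
        raise ValueError("need 0 < ns <= nt")
    return -math.log(ns / nt) / k


# Hypothetical counts: half of the k-mers are shared at k = 20.
d = kmer_distance(ns=500, nt=1000, k=20)
print(d)
```

When ns equals nt the distance is zero, and it grows without bound as the shared fraction approaches zero.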

Page 7

K-mer sensitivity and homoplasy

• No assembly -> not all indels identified
• If k-mer covers multiple substitutions
• Shorter k-mers -> better sensitivity
• Shorter k-mers -> same k-mers from evolutionarily different regions
• Homoplasy

Page 8

K-mer homoplasy

• k = 15
• Genome size > 5 × 10^8 => same k-mers randomly in other species
• May incorrectly inflate the proportion of shared k-mers
• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size
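The trade-off between k and genome size can be illustrated with a rough calculation. The Python sketch below is not the slides' own prediction. It assumes a random genome with equal base frequencies, uses a Poisson approximation, and optionally treats a k-mer and its reverse complement as one (ignoring palindromes). The function names and the 1% threshold are hypothetical.

```python
import math


def chance_kmer_fraction(genome_size, k, canonical=True):
    """Approximate probability that a given k-mer occurs by chance
    in a random genome of the given size (equal base frequencies)."""
    n_possible = 4 ** k
    if canonical:
        # Assumption: a k-mer and its reverse complement count as one.
        n_possible = n_possible / 2
    return 1.0 - math.exp(-genome_size / n_possible)


def smallest_k(genome_size, max_chance=0.01, canonical=True):
    """Smallest k for which chance occurrence falls to max_chance or below."""
    k = 1
    while chance_kmer_fraction(genome_size, k, canonical) > max_chance:
        k += 1
    return k


genome_size = 5e8
print(chance_kmer_fraction(genome_size, 15))
print(smallest_k(genome_size))
```

Under these assumptions, chance matches at k = 15 are common for a genome of 5 × 10^8 bases, and a few extra bases of k-mer length suppress them quickly because the number of possible k-mers grows fourfold with each added base.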

Page 9

ph

• Prediction of the ratio ns/nt
• Large genomes and small k: ph = 1
  • All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
• GC content
• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

Page 10

Mathematical prediction

Page 11

Random ancestral sequence

Page 12

Real (non-random) sequence

Page 13

Assembly-free

• Sampling error caused by low genome coverage
• The actual number of k-mers will be under-represented given low sequencing coverage
• Sequencing errors
  • Loss of true k-mers and the gain of false k-mers
  • Filtering = remove singletons
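Singleton filtering can be sketched as counting every k-mer across the reads and discarding those seen only once. The Python below is a minimal illustration with hypothetical function names and toy reads; it does not reproduce the k-mer counter that AAF actually uses.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_kmers(counts, min_count=2):
    """Keep k-mers seen at least min_count times (2 = drop singletons)."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


# Toy reads (hypothetical): the second read ends in an erroneous base.
reads = ["ACGTA", "ACGTT", "CGTAC"]
counts = count_kmers(reads, k=4)
kept = filter_kmers(counts)
print(sorted(kept))
```

The filter also discards true k-mers that happen to be sampled only once, which is why it interacts with sequencing coverage.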

Page 14

Seq errors p=observed/true

Coverage 5-8 sufficient to observe all true k-mers when filtering

=> Tip corrections

Page 15

Filter only singletons?

Page 16

Filter only singletons?

Page 17

Bootstrapping

Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap” – take rows with probability 1/k

OR

Two-stage parametric bootstrap
• Estimate the variances in distances between species caused by sampling and evolutionary variation
• Independent of genome size
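The two nonparametric steps can be sketched as follows. This Python is only an illustration of the resampling idea: the function names are hypothetical, "rows" is assumed here to mean rows of a k-mer table (one row per k-mer), and the slide does not say how the thinned table is processed afterwards. The two-stage parametric bootstrap is not covered.

```python
import random


def resample_reads(reads, rng):
    """Step 1: draw len(reads) reads with replacement."""
    return [rng.choice(reads) for _ in reads]


def thin_rows(rows, k, rng):
    """Step 2: keep each row independently with probability 1/k."""
    return [row for row in rows if rng.random() < 1.0 / k]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy reads and a toy table of (k-mer, presence per sample) rows.
reads = ["ACGTA", "ACGTT", "CGTAC", "GTACG"]
rows = [("ACGT", (1, 1)), ("CGTA", (1, 0)), ("GTAC", (0, 1))]

boot_reads = resample_reads(reads, rng)
boot_rows = thin_rows(rows, k=4, rng=rng)
print(len(boot_reads), len(boot_rows))
```

Each bootstrap replicate would then go through the same distance and tree-building steps as the original data, and the support for a node is the fraction of replicates that recover it.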

Page 18

Bushbaby (galago)

Page 19

Tarsier

Page 20

Recently published phylogeny of primates

Page 21

Assembled genomes, k=19

Page 22

Assembled genomes, k=21

Page 23

Simulated reads

Page 24

Simulated reads

Page 25

Real data – tropical trees

Intsia palembanica

Page 26

Advantages
• Low coverage requirements
• Low computational demands
  • 12 primates: 25 GB RAM, 12 threads

Limitations
• Loss of k-mer sensitivity
• Deep nodes
• Location of mutations

Page 27

Distance computing for 73 Escherichia strains

• AAF
  • 32 + 76 min = 1 h 48 min
• andi
  • 21 min

Page 28

AAF andi