Upload
erv
View
56
Download
2
Tags:
Embed Size (px)
DESCRIPTION
SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas. Phylogeny (evolutionary tree). Orangutan. Human. Gorilla. Chimpanzee. From the Tree of the Life Website, University of Arizona. How did life evolve on earth?. - PowerPoint PPT Presentation
Citation preview
SEPP and TIPP for metagenomic analysis
Tandy WarnowDepartment of Computer Science
University of Texas
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website,University of Arizona
Phylogeny (evolutionary tree)
How did life evolve on earth?
Courtesy of the Tree of Life project
Metagenomics:
Venter et al., Exploring the Sargasso Sea:
Scientists Discover One Million New Genes in Ocean Microbes
Major Challenges• Phylogenetic analyses: standard methods have poor
accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
• Metagenomic analyses: methods for species classification of short reads have poor sensitivity. Efficient high throughput is necessary (millions of reads).
Computational Phylogenetics and Metagenomics
Courtesy of the Tree of Life project
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
…ACGGTGCAGTTACC-A……AC----CAGTCACCTA…
The true multiple alignment – Reflects historical substitution, insertion, and deletion
events– Defined using transitive closure of pairwise alignments
computed on edges of the true tree
…ACGGTGCAGTTACCA…
SubstitutionDeletion
…ACCAGTCACCTA…
Insertion
AGAT TAGACTT TGCACAA TGCGCTTAGGGCATGA
U V W X Y
U
V W
X
Y
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
Phase 1: Multiple Sequence Alignment
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1
S4
S2
S3
Simulation Studies
S1 S2
S3S4
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--S3 = TAG-C--T-----GACCGC--S4 = T---C-A-CGACCGA----CA
Compare
True tree and alignment
S1 S4
S3S2
Estimated tree and alignment
Unaligned Sequences
1000 taxon models, ordered by difficulty (Liu et al., 2009)
Problems• Large datasets with high rates of evolution are hard to
align accurately, and phylogeny estimation methods produce poor trees when alignments are poor.
• Many phylogeny estimation methods have poor accuracy on large datasets (even if given correct alignments)
• Potentially useful genes are often discarded if they are difficult to align.
These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)
Major Challenges
• Current phylogenetic datasets contain hundreds to thousands of taxa, with multiple genes.
• Future datasets will be substantially larger (e.g., iPlant plans to construct a tree on 500,000 plant species)
• Current methods have poor accuracy or cannot run on large datasets.
Phylogenetic “boosters” (meta-methods)
Goal: improve accuracy, speed, robustness, or theoretical guarantees of base methods
Examples:• DCM-boosting for distance-based methods (1999)• DCM-boosting for heuristics for NP-hard problems (1999)• SATé-boosting for alignment methods (2009)• SuperFine-boosting for supertree methods (2011) • DACTAL-boosting for all phylogeny estimation methods (2011)• SEPP-boosting for metagenomic analyses (2012)
• SATé: Simultaneous Alignment and Tree Estimation (Liu et al., Science 2009, and Liu et al. Systematic Biology, 2011)
• SEPP: SATé-enabled Phylogenetic Placement (Mirarab, Nguyen and Warnow, Pacific Symposium on Biocomputing 2012)
• TIPP: Taxon Identification using Phylogenetic Placement (Nguyen, Mirarab, and Warnow, in preparation - TIPP+Metaphyler collaboration with Mihai Pop and Bo Liu)
Today’s Talk
Part 1: SATé
Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 19 June 2009, pp. 1561-1564.
Liu et al., Systematic Biology, 2011, 61(1):90-106
Public software distribution (open source) through the University of Kansas, in use, world-wide
Two-phase estimationAlignment methods• Clustal• POY (and POY*)• Probcons (and Probtree)• Probalign• MAFFT• Muscle• Di-align• T-Coffee • Prank (PNAS 2005, Science 2008)• Opal (ISMB and Bioinf. 2007)• FSA (PLoS Comp. Bio. 2009)• Infernal (Bioinf. 2009)• Etc.
Phylogeny methods• Bayesian MCMC • Maximum parsimony • Maximum likelihood • Neighbor joining• FastME• UPGMA• Quartet puzzling• Etc.
RAxML: heuristic for large-scale ML optimization
1000 taxon models, ordered by difficulty (Liu et al., 2009)
SATé Algorithm
Tree
Obtain initial alignment and estimated ML tree
SATé Algorithm
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
SATé Algorithm
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
SATé Algorithm
Estimate ML tree on new alignment
Tree
Obtain initial alignment and estimated ML tree
Use tree to compute new alignment
Alignment
If new alignment/tree pair has worse ML score, realign using a different decomposition
Repeat until termination condition (typically, 24 hours)
A
B D
C
Merge subproblems
Estimate ML tree on merged
alignment
Decompose based on input
treeA B
C DAlign
subproblems
A B
C D
ABCD
One SATé iteration (really 32 subsets)
e
1000 taxon models, ordered by difficulty
1000 taxon models, ordered by difficulty
24 hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
1000 taxon models ranked by difficulty
Part II: SEPP
• SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow
• Pacific Symposium on Biocomputing, 2012 (special session on the Human Microbiome)
Metagenomic data analysisNGS data produce fragmentary sequence dataMetagenomic analyses include unknown
species
Taxon identification: given short sequences, identify the species for each fragment
Applications: Human MicrobiomeIssues: accuracy and speed
Phylogenetic PlacementInput: Backbone alignment and tree on full-
length sequences, and a set of query sequences (short fragments)
Output: Placement of query sequences on backbone tree
Phylogenetic placement can be used for taxon identification, but it has general applications for phylogenetic analyses of NGS data.
Phylogenetic Placement● Align each query sequence to
backbone alignment
● Place each query sequence into backbone tree, using extended alignment
Align Sequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AAS2 = TAG-CTATCAC--GACCGC--GCAS3 = TAG-CT-------GACCGC--GCTS4 = TAC----TCAC--GACCGACAGCTQ1 = TAAAAC
Align Sequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AAS2 = TAG-CTATCAC--GACCGC--GCAS3 = TAG-CT-------GACCGC--GCTS4 = TAC----TCAC--GACCGACAGCTQ1 = -------T-A--AAAC--------
Place Sequence
S1
S4
S2
S3Q1
S1 = -AGGCTATCACCTGACCTCCA-AAS2 = TAG-CTATCAC--GACCGC--GCAS3 = TAG-CT-------GACCGC--GCTS4 = TAC----TCAC--GACCGACAGCTQ1 = -------T-A--AAAC--------
Phylogenetic Placement• Align each query sequence to backbone alignment
– HMMALIGN (Eddy, Bioinformatics 1998)– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
• Place each query sequence into backbone tree– Pplacer (Matsen et al., BMC Bioinformatics, 2011)– EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA use maximum likelihood
HMMER vs. PaPaRa Alignments
Increasing rate of evolution
0.0
Insights from SATé
Insights from SATé
Insights from SATé
Insights from SATé
Insights from SATé
SEPP Parameter Exploration Alignment subset size and placement
subset size impact the accuracy, running time, and memory of SEPP
10% rule (subset sizes 10% of backbone) had best overall performance
SEPP (10%-rule) on simulated data
0.0
0.0
Increasing rate of evolution
SEPP (10%) on Biological Data
For 1 million fragments:
PaPaRa+pplacer: ~133 days
HMMALIGN+pplacer: ~30 days
SEPP 1000/1000: ~6 days
16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments
SEPP (10%) on Biological Data
For 1 million fragments:
PaPaRa+pplacer: ~133 days
HMMALIGN+pplacer: ~30 days
SEPP 1000/1000: ~6 days
16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments
Part III: Taxon Identification•Objective: identify the taxonomy (species, genus, etc.) for each short read (a classification problem)
Taxon Identification● Objective: identify species, genus, etc., for each
short read
● Leading methods: Metaphyler (Univ Maryland), Phylopythia, PhymmBL, Megan
Megan vs MetaPhyler on 60bp rpsB gene
OBSERVATIONS
• MEGAN is very conservative • MetaPhyler makes more correct predictions than MEGAN• Other methods (Liu et al, BMC Bioinformatics 2011) not as sensitive (on these 31 marker genes) as MetaPhyler
Thus, the best taxon identification methods have high precision (make accurate predictions), but low sensitivity (i.e., they fail to classify a large portion of reads) even at higher taxonomy levels.
TIPP: Taxon Identification by Phylogenetic Placement
•ACT..TAGA..A• (species5)
•AGC...ACA• (species4)
•TAGA...CTT• (species3)
•TAGC...CCA•(species2)
•AGG...GCAT• (species1)
•ACCG•CGAG•CGG•GGCT•TAGA•GGGGG•TCGAG•GGCG•GGG•.•.•.•ACCT
•(60-200 bp long)
•Fragmentary Unknown Reads: •Known Full length Sequences,•and an alignment and a tree
•(500-10,000 bp long)
TIPP: Taxon Identification using Phylogenetic Placement - Version 1
Given a set Q of query sequences for some gene, a taxonomy T, and a set of full-length sequences for the gene,
• Compute reference alignment and tree on the full-length sequences, using SATé
• Use SEPP to place each query sequence into the taxonomy (alignment subsets computed on the reference alignment/tree, then inserted into taxonomy T)
TIPP version 2- considering uncertainty
TIPP version 1 too aggressive (over-classification)TIPP version 2 dramatically reduces false positive rate with small reduction in true
positive rate, by considering uncertainty, using statistical techniques:
• Find 2-4 alignments of reference sequences and construct tree (SATé, ClustalW, MAFFT, others)
• For each reference alignment/tree pair, compute many extended alignments (using statistical support computed using HMMER to cover 95% of the probability).
• For each extended alignment, use pplacer statistical support values to place fragment into taxonomy, so that the clade below the placement contains 99% of the probability.
Classify each fragment at the LCA of all placements obtained for the fragment
TIPP+Metaphyler
• Use Metaphyler to perform initial placement of read into taxonomy
• Use TIPP to modify the placement, moving the read further into the clade identified by Metaphyler
Results on rpsB gene (60 bp)
• SATé: co-estimation of alignments and trees
• SEPP/TIPP: phylogenetic analysis of fragmentary data
Algorithmic strategies: divide-and-conquer and iteration to improve the accuracy and scalability of a base method
Phylogenetic “Boosters”
Summary
• SATé gives better alignments and trees than standard alignment estimation methods
• SEPP can enable alignment of short (fragmentary) sequences into alignments of full-length sequences, and phylogenetic placement into gene trees or taxonomies
• TIPP enables taxon identification of short reads -- not limited to 31 marker genes, and no training is needed.
Overall message
• When data are difficult to analyze, develop better methods - don’t throw out the data.
Acknowledgments• Guggenheim Foundation Fellowship, Microsoft Research
New England, National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants, and David Bruton Jr. Professorship
• NSERC support to Siavash Mirarab
• Collaborators: – SATé: Kevin Liu, Serita Nelesen, Sindhu Raghavan, and
Randy Linder– SEPP: Siavash Mirarab and Nam Nguyen– TIPP: Siavash Mirarab, Nam Nguyen, Bo Liu, and Mihai
Pop