Phylogenetic analysis for molecular sequence data. João C. Setubal, University of São Paulo. August 2012 (8/23/2012)



  • Phylogenetic analysis for molecular sequence data

    João C. Setubal University of São Paulo

    August 2012

    1 8/23/2012 J. C. Setubal

  • Outline

    1. What is the biological question?
    2. What input sequences should be used?
    3. Analysis pipeline: steps and components
    4. Output visualization
    5. Output interpretation

  • Biological questions

    • How do oomycete species relate to one another and to other species?

    • What is the history of a particular gene?
      – Gene trees vs. species trees
      – Lateral Gene Transfer

    • Other questions


  • Image credit: www.apsnet.org

  • Taxonomy is not phylogeny: class Oomycota

    • Kingdom: Chromalveolata
    • Phylum: Heterokontophyta
    • Class: Oomycota
    • Orders (& families)
      • Lagenidiales
        – Lagenidiaceae
        – Olpidiopsidaceae
        – Sirolpidiaceae
      • Leptomitales
        – Leptomitaceae
      • Peronosporales
        – Albuginaceae
        – Peronosporaceae
        – Pythiaceae
      • Rhipidiales
        – Rhipidiaceae
      • Saprolegniales
        – Ectrogellaceae
        – Haliphthoraceae
        – Leptolegniellaceae
        – Saprolegniaceae
      • Thraustochytriales

    Phytophthora


  • Input sequences

    • They should belong to the same homologous family (Cf. Friday lecture)


  • Pipeline

    1. Multiple sequence alignment (MSA)
    2. Alignment editing
    3. Phylogeny reconstruction
    4. Visualization

  • Multiple Sequence Alignment


  • Multiple Sequence Alignment

    • Generalization of pairwise alignment
      – Optimum vs. approximation
      – All practical programs for MSA produce approximations
    • DNA or amino acids
      – DNA is more sensitive, but the 3rd codon position is less informative
      – Amino acids allow more distant proteins to be included
    • Scoring matrices: BLOSUM, PAM
    • Aligned sites (a column) should be homologous
    • Output formats: Clustal, FASTA, MSF, NEXUS, PHYLIP
      – http://molecularevolution.org/resources/fileformats/converting
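
As a rough illustration of how an alignment can be scored column by column, the sketch below computes a sum-of-pairs score for a small MSA. The match, mismatch and gap values are arbitrary placeholders rather than BLOSUM or PAM entries, and the function name `sum_of_pairs` is hypothetical; real MSA programs use more elaborate objective functions and heuristics.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an MSA by summing pairwise scores over every column."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # two gaps facing each other contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

msa = ["ACG-T", "ACGAT", "AGG-T"]
print(sum_of_pairs(msa))
```

Finding the alignment that maximizes such a score exactly is impractical for many sequences, which is why practical programs return approximations.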

  • Programs for MSA

    • MUSCLE
      – Edgar, R.C. (2004) Nucleic Acids Res. 32(5):1792-1797
      – www.drive5.com/muscle
    • MAFFT
      – Katoh, Misawa, Kuma, Miyata (2002) Nucleic Acids Res. 30:3059-3066
      – mafft.cbrc.jp/alignment/software/
    • ClustalW/X
    • Cobalt (NCBI)
    • T-Coffee

  • Input sequences

    • Should be related to each other
    • Cannot be too long (less than ~10 kb)
    • Not too many (less than ~100)
    • (numbers vary depending on program and on computer)
    • FASTA format is best
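
The rough limits above can be checked before running an aligner. The following is a minimal sketch, assuming plain FASTA text as input; the helper names `read_fasta` and `check_input` and the default thresholds are illustrative only, since the actual limits depend on the program and the computer.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def check_input(records, max_seqs=100, max_len=10_000):
    """Return warnings for inputs that exceed the rough limits."""
    warnings = []
    if len(records) > max_seqs:
        warnings.append(f"{len(records)} sequences (more than {max_seqs})")
    for name, seq in records.items():
        if len(seq) > max_len:
            warnings.append(f"{name}: {len(seq)} residues (more than {max_len})")
    return warnings

example = ">seq1\nACGT\nACGT\n>seq2\nACG\n"
records = read_fasta(example)
print(records)
print(check_input(records, max_seqs=1, max_len=5))
```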

  • Alignment editing


    Credit: R. Dixon

  • Alignment editing

    • Certain columns may be uninformative
    • Sometimes humans can see better alignments
    • Manual editing
      – Jalview: www.jalview.org
        • Waterhouse et al. (2009) Bioinformatics 25(9):1189-1191
      – Seaview: http://pbil.univ-lyon1.fr/software/seaview.html
        • Gouy M., Guindon S. & Gascuel O. (2010) Molecular Biology and Evolution 27(2):221-224
    • Automatic editing: Gblocks
      – http://molevol.cmima.csic.es/castresana/Gblocks_server.html
      – Castresana, J. (2000) Molecular Biology and Evolution 17, 540-552
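
One simple form of automatic editing is to drop columns that are mostly gaps. The sketch below is only an illustration of that idea and is not the Gblocks algorithm, which uses more elaborate conservation criteria; the function name and the 0.5 threshold are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is at most max_gap_fraction."""
    n = len(alignment)
    kept = [col for col in zip(*alignment)
            if col.count("-") / n <= max_gap_fraction]
    # Transpose the kept columns back into rows (one string per sequence).
    return ["".join(chars) for chars in zip(*kept)] if kept else [""] * n

msa = ["AC-GT", "A--GT", "ACTGT"]
print(drop_gappy_columns(msa))
```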

  • JALVIEW http://www.jalview.org/


  • Phylogeny reconstruction


    Credit: R. Dixon

  • Topology and branch lengths: A tree and a cladogram

    [Figure: panels A and B, a tree and its cladogram version]

    Branch lengths: # of substitutions per site

    Credit: Wattam et al. 2011

  • Unrooted tree (no outgroup)

    http://itol.embl.de

  • Rooted tree: needs outgroup


  • Phylogeny reconstruction methods

    • Distance
      – Distance matrix
    • Parsimony
      – Minimize mutations along branches
    • Maximum likelihood
      – Searches for the most likely tree under a probabilistic model
    • Bayesian inference
      – Also probabilistic, but using a Bayesian approach
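
To make the distance-matrix idea concrete, here is a minimal sketch that computes pairwise distances from an alignment. It uses the proportion of differing sites (p-distance) with the Jukes-Cantor (JC69) correction; the function names are hypothetical, and skipping gapped columns is just one possible convention.

```python
import math

def p_distance(a, b):
    """Fraction of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor correction: expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(alignment):
    """alignment: dict of {name: aligned DNA sequence}."""
    names = list(alignment)
    return {(i, j): jc69_distance(p_distance(alignment[i], alignment[j]))
            for i in names for j in names}

aln = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACCTAGGA"}
for pair, d in distance_matrix(aln).items():
    print(pair, round(d, 4))
```

A matrix like this is the input to clustering methods such as neighbor-joining or UPGMA.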

  • Running time considerations

    • In the last century, distance and parsimony methods were dominant – the others were too slow

    • Now Maximum Likelihood has become a “standard”


  • Models of evolution

    • All methods except distance methods must rely on models of sequence evolution


  • Evolution of models for DNA evolution


    http://authors.library.caltech.edu/5456/1/hrst.mit.edu/hrs/evolution/public/models/sequence.html

  • Protein evolution

    • Amino acid substitution matrices
      – PAM
      – BLOSUM
      – WAG
        • Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699

  • Models in PhyML

    • DNA
      – JC69, K80, F81, F84, HKY85, TN93, GTR, custom
    • Amino acids
      – LG, WAG, Dayhoff, JTT, Blosum62, mtREV, rtREV, cpREV, DCMut, VT, mtMAM, custom
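
As a worked example of what such a model specifies, the simplest DNA model in the list, JC69, can be written out as follows. The symbols are assumptions of this sketch: $\alpha$ is the rate of each specific nucleotide change, $t$ is time, $d$ is the expected number of substitutions per site, and $p$ is the observed proportion of differing sites.

```latex
\begin{align}
P_{ii}(t) &= \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}
  && \text{(same nucleotide after time } t\text{)} \\
P_{ij}(t) &= \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}, \quad i \neq j
  && \text{(one specific different nucleotide)} \\
d &= 3\alpha t, \qquad
p = 3\,P_{ij}(t) = \tfrac{3}{4}\left(1 - e^{-4d/3}\right) \\
d &= -\tfrac{3}{4}\,\ln\!\left(1 - \tfrac{4}{3}\,p\right)
\end{align}
```

The other DNA models relax JC69's assumptions of equal base frequencies and equal rates for all substitution types.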

  • Phylogeny reconstruction programs

    • PHYLIP
      – Joe Felsenstein
      – http://evolution.genetics.washington.edu/phylip.html
    • PAUP
      – David Swofford
      – http://paup.csit.fsu.edu/
    • Distance
      – Neighbor-joining, UPGMA
    • Parsimony
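
To show what a parsimony method counts, the sketch below applies Fitch's algorithm to one alignment column on a fixed rooted binary tree and returns the minimum number of changes. The nested-tuple tree encoding and the function name are assumptions made for illustration; programs such as PHYLIP and PAUP additionally search over tree topologies.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree:   a leaf name (str) or a (left, right) tuple of subtrees.
    states: dict mapping each leaf name to its character at this site.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        lset, lcost = visit(left)
        rset, rcost = visit(right)
        common = lset & rset
        if common:
            return common, lcost + rcost
        # No shared state: one change is needed somewhere below this node.
        return lset | rset, lcost + rcost + 1
    return visit(tree)[1]

tree = (("A", "B"), ("C", "D"))
site = {"A": "G", "B": "G", "C": "T", "D": "G"}
print(fitch_score(tree, site))
```

Summing this score over all columns gives the parsimony length of the tree for the whole alignment.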

  • Maximum likelihood

    • RAxML
      – A. Stamatakis
      – http://www.exelixis-lab.org/
    • PhyML
      – O. Gascuel et al. Systematic Biology, 59(3):307-21, 2010
      – http://www.atgc-montpellier.fr/phyml/
    • FastTree
      – Morgan N. Price in Adam Arkin’s group
      – http://www.microbesonline.org/fasttree/
      – “FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory”

  • A performance data point

    • An ML tree for about 500 protein sequences, each about 300 aa in length
    • RAxML or PhyML took about 10 hours
    • FastTree took less than 1 hour

  • Bayesian inference

    • MrBayes
    • Ronquist and Huelsenbeck. Bioinformatics. 2003 19(12):1572-4.
    • http://mrbayes.sourceforge.net/
    • Slower compared to RAxML and PhyML

  • Tree visualization: formats

    • Newick, NEXUS
    • (((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);
    • http://molecularevolution.org/resources/treeformats
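
The Newick format stores the tree as nested parentheses, with each leaf written as name:branch_length. As a small illustration, the sketch below pulls the leaf names and their branch lengths out of a Newick string with a regular expression. It is not a full Newick parser (quoted labels, comments and leaves without lengths are not handled), and the function name is hypothetical.

```python
import re

# A leaf is a label that directly follows "(" or "," and is followed by ":length".
LEAF = re.compile(r"(?<=[(,])\s*([^\s(),:;]+)\s*:\s*([0-9.eE+-]+)")

def leaf_lengths(newick):
    """Return {leaf name: branch length} for a simple Newick string."""
    return {name: float(length) for name, length in LEAF.findall(newick)}

example = "((A:0.1,B:0.2):0.05,C:0.3);"
print(leaf_lengths(example))
```

Internal branch lengths follow a closing parenthesis, so the pattern skips them.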

  • Tree visualization

    • Interactive Tree of Life http://itol.embl.de
    • http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software


  • All-in-one: phylogeny.fr


  • Phylogeny.fr (2)


  • Building your tree locally: SeaView


  • Interpretation

    • Trees are just hypotheses
    • They can suffer from GIGO (garbage in, garbage out)
    • The most likely tree may not be the true tree
    • Confidence in the topology
      – Bootstrap values
        • Should be above 0.7 (70%)
      – Costly to compute
      – PhyML provides approximate bootstrap values that are much faster to compute
    • It’s always a good idea to try more than one reconstruction method
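
Bootstrap values come from rebuilding the tree on many resampled alignments and counting how often each grouping reappears, which is why they are costly. The sketch below shows only the two bookkeeping steps, resampling columns with replacement and computing a support fraction; the tree-building step is left to an external program. The function names and the representation of a tree as a set of clades (frozensets of leaf names) are assumptions.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in alignment]

def support(replicate_clades, clade):
    """Fraction of replicate trees (each a set of clades) that contain `clade`."""
    hits = sum(clade in clades for clades in replicate_clades)
    return hits / len(replicate_clades)

msa = ["ACGTAC", "ACGTTC", "AGGTAC"]
print(bootstrap_replicate(msa, random.Random(0)))

# Hypothetical clades recovered from four replicate trees.
trees = [{frozenset({"A", "B"})}, {frozenset({"A", "C"})},
         {frozenset({"A", "B"})}, {frozenset({"A", "B"})}]
print(support(trees, frozenset({"A", "B"})))
```

In this toy case the clade {A, B} appears in three of four replicates, giving a support of 0.75, just above the 0.7 guideline.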

  • Supermatrix approach

    • Good for obtaining robust species tree when complete or nearly complete genomes are available (phylogenomics)

    • Find all families that have exactly one representative from each genome

    • MSA for each family
    • Concatenate all MSAs
    • Build tree based on concatenated alignment
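
The selection and concatenation steps can be sketched as follows, assuming each family has already been aligned. The data layout (one dict per family, mapping genome name to its aligned sequences) and the function name `supermatrix` are assumptions for illustration only.

```python
def supermatrix(families, genomes):
    """Concatenate per-family alignments into one long alignment per genome.

    families: list of dicts; each maps genome name -> list of aligned
              sequences from that genome in the family.
    genomes:  list of genome names that must each appear exactly once.
    """
    concatenated = {g: [] for g in genomes}
    used = 0
    for family in families:
        single_copy = (set(family) == set(genomes)
                       and all(len(seqs) == 1 for seqs in family.values()))
        if not single_copy:
            continue  # skip families with missing or duplicated genomes
        for g in genomes:
            concatenated[g].append(family[g][0])
        used += 1
    return {g: "".join(parts) for g, parts in concatenated.items()}, used

families = [
    {"g1": ["AC-"], "g2": ["ACG"]},        # one representative each: kept
    {"g1": ["TT", "TA"], "g2": ["TT"]},    # two copies in g1: skipped
    {"g1": ["G"], "g2": ["C"]},            # kept
]
print(supermatrix(families, ["g1", "g2"]))
```

The concatenated alignment is then passed to any of the tree-building programs listed earlier.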

  • Ciccarelli et al, Science, 2006

  • Eisen & Wu, Genome Biology, 2008

  • The bane of species trees: Horizontal Gene Transfer

    • Likely when gene tree differs from species tree
    • Can be detected by other methods
      – Sequence composition deviation
      – Genomic islands
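
One very simple reading of "sequence composition deviation" is to flag genes whose GC content is unusual for the genome. The sketch below does this with a z-score; the function names and the threshold of 2 standard deviations are assumptions, and an outlier is only a hint to follow up, not evidence of transfer on its own.

```python
from statistics import mean, pstdev

def gc_content(seq):
    """Fraction of G and C in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_outliers(genes, threshold=2.0):
    """Names of genes whose GC content lies more than `threshold`
    standard deviations from the mean over all genes."""
    values = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return []
    return [n for n, v in values.items() if abs(v - mu) / sigma > threshold]

genes = {f"gene{i}": "ATGC" for i in range(9)}
genes["odd"] = "GGCC"
print(gc_outliers(genes))
```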

  • Network models for gene sharing

    • Current research topic


    Kloesges et al, Molecular Biology and Evolution, 2011

    Review: Tal Dagan. Phylogenomic networks. Trends in Microbiology, 19(10), 483-491, 2011

  • Additional Resource

    • http://www.megasoftware.net/


  • Books

    • Bioinformatics. Baxevanis and Ouellette (Eds.) Wiley-Interscience, 2005 (3rd edition), ch. 14

    • D. Mount. Bioinformatics. CSHL Press, 2004 (2nd edition), ch. 7

    • The phylogenetic handbook. Lemey, Salemi and Vandamme (Eds.) Cambridge University Press, 2009 (2nd edition)

