Building a Foundation to Enable Semantic Technologies for Phylogenetically-Based Comparative Analyses

Maryam Panahiazar 1, Arlin Stoltzfus 2, Rutger Vos 3, Enrico Pontelli 4 and Jim Leebens-Mack 1

1 University of Georgia, USA; 2 NIST & University of Maryland, USA; 3 University of Reading, UK; 4 New Mexico State University

Phyloinformatics 07/03/22

iEvobIO



Motivation

“Nothing in biology makes sense except in the light of evolution” (Theodosius Dobzhansky, 1973) ... and nothing in evolution makes sense except in the light of phylogeny.


Example 1 – Prediction of gene and protein function

1. Choose gene of interest

2. Identify homologs

3. Align sequences

4. Calculate gene tree

5. Overlay known functions onto tree

6. Hypothesize function for all genes

7. Reconcile gene and species trees

After Jonathan A. Eisen, 1998, Genome Research, 8:163-167
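
Overlaying known functions on a gene tree and hypothesizing functions for the remaining genes can be sketched in a few lines. This is a minimal illustration, not Eisen's implementation. It assumes a rooted binary gene tree written as nested tuples, hypothetical gene names and function labels, and Fitch parsimony as a stand-in rule for projecting known functions onto unannotated genes.

```python
# Hypothetical rooted gene tree as nested tuples; leaves are gene names.
tree = ((("geneA", "geneB"), "geneC"), ("geneD", "geneE"))

# Known functions overlaid on the tree; geneB and geneE are unannotated.
known = {"geneA": "kinase", "geneC": "kinase", "geneD": "phosphatase"}
all_labels = frozenset(known.values())


def fitch_up(node):
    """Bottom-up pass: return (candidate function set, annotated subtree)."""
    if isinstance(node, str):
        # Unannotated leaves are treated as missing data (any label allowed).
        states = frozenset([known[node]]) if node in known else all_labels
        return states, (node, states, None)
    left, right = (fitch_up(child) for child in node)
    shared = left[0] & right[0]
    states = shared if shared else left[0] | right[0]
    return states, (None, states, (left[1], right[1]))


def fitch_down(annotated, parent_state, out):
    """Top-down pass: pick one label per node, preferring the parent's label."""
    name, states, children = annotated
    state = parent_state if parent_state in states else sorted(states)[0]
    if children is None:
        out[name] = state
    else:
        for child in children:
            fitch_down(child, state, out)
    return out


def predict_functions(gene_tree):
    _, annotated = fitch_up(gene_tree)
    return fitch_down(annotated, None, {})


predictions = predict_functions(tree)
for gene in sorted(predictions):
    status = "known" if gene in known else "hypothesized"
    print(gene, predictions[gene], status)
```

Ties at the root are broken alphabetically here, which is an arbitrary choice made only to keep the sketch deterministic.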


Example 2 – Testing congruence among phylogeographic analyses

Knowles 2009, after Avise 1992

1. Compile results of phylogeographic analyses for multiple species from the same geographic region

2. Apply demographic models to account for variation in generation times and substitution rates

After Knowles 2009, Annu. Rev. Ecol. Evol. Syst.
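
To see why generation times and substitution rates have to be accounted for before comparing species, consider the simple rate conversion below. It is not a demographic model and is not taken from Knowles 2009. It only assumes that divergence accumulates on both lineages at a constant per-generation rate and ignores ancestral polymorphism. The species names and numbers are hypothetical.

```python
def divergence_time_years(divergence, rate_per_site_per_generation,
                          generation_time_years):
    """Convert pairwise sequence divergence into a split time in years.

    Differences accumulate along both lineages, so
    generations = divergence / (2 * rate), and
    years = generations * generation time.
    """
    generations = divergence / (2.0 * rate_per_site_per_generation)
    return generations * generation_time_years


# Hypothetical co-distributed species with identical sequence divergence.
species = {
    "species_1": {"divergence": 0.02, "rate": 1e-8, "generation_time": 1.0},
    "species_2": {"divergence": 0.02, "rate": 2e-8, "generation_time": 4.0},
}

times = {
    name: divergence_time_years(p["divergence"], p["rate"], p["generation_time"])
    for name, p in species.items()
}
for name, years in times.items():
    print(name, round(years))
```

With these made-up inputs, equal divergence translates into split times that differ by a factor of two, so raw divergence alone would not be a fair test of congruence.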

Applying Semantics to Bioinformatics

Integrative bioinformatics experimentation cycle

1. Problem Definition

2. Experimental Design

3. Data Integration

4. Data Analysis

5. Interpretation

Products passed around the cycle: biological hypothesis, protocol, raw integration result, analysis result, knowledge

Semantic web steps in data integration, ending in the raw integration result:

1. Import or create data and knowledge models

2. Use data models to transform raw data to RDF data

3. Link data models to knowledge models

4. Select common domain

5. Construct and run semantic query

Lennart J.G. Post, Marco Roos, M. Scott Marshall, Roel van Driel and Timo M. Breit. A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data. Bioinformatics, Vol. 23 no. 22, 2007, pages 3080–3087. doi:10.1093/bioinformatics/btm461
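
The data-integration steps (transform raw data to triples, link them to a knowledge model, run a query across both) can be illustrated without any RDF library. The sketch below uses plain Python tuples as stand-ins for RDF triples and a small pattern matcher as a stand-in for a SPARQL query. The `ex:` vocabulary, the records, and the function names are all hypothetical and are not from Post et al.

```python
# Raw tabular data (hypothetical records).
raw_rows = [
    {"gene": "geneA", "species": "Arabidopsis thaliana", "tree": "tree1"},
    {"gene": "geneB", "species": "Oryza sativa", "tree": "tree1"},
    {"gene": "geneC", "species": "Oryza sativa", "tree": "tree2"},
]

# Data model: maps each column to a predicate in a made-up vocabulary.
data_model = {"species": "ex:fromSpecies", "tree": "ex:inTree"}


def to_triples(rows, model):
    """Transform raw rows into (subject, predicate, object) triples."""
    triples = set()
    for row in rows:
        subject = "ex:" + row["gene"]
        triples.add((subject, "rdf:type", "ex:Gene"))
        for column, predicate in model.items():
            triples.add((subject, predicate, row[column]))
    return triples


# Knowledge model linked to the data through shared species names.
knowledge = {
    ("Arabidopsis thaliana", "ex:memberOf", "ex:Eudicots"),
    ("Oryza sativa", "ex:memberOf", "ex:Monocots"),
}

graph = to_triples(raw_rows, data_model) | knowledge


def match(triples, patterns, binding=None):
    """Yield variable bindings that satisfy every pattern; '?x' is a variable."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(triples, rest, candidate)


# Query spanning data and knowledge: which genes in tree1 come from monocots?
query = [
    ("?gene", "ex:inTree", "tree1"),
    ("?gene", "ex:fromSpecies", "?sp"),
    ("?sp", "ex:memberOf", "ex:Monocots"),
]
monocot_genes = sorted(b["?gene"] for b in match(graph, query))
print(monocot_genes)
```

A real pipeline would use URIs for species and an actual triple store; the point here is only that the query succeeds because data and knowledge share identifiers.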


Requirements for data reuse in comparative analyses:

• Easy access to machine-readable trees, data matrices and meta-data (e.g. sample characteristics including sample locality)

• A minimum reporting standard for phylogenetic analyses (MIAPA).

• A controlled vocabulary for describing components of phylogenetic workflows


Proposed components of a minimum reporting standard for phylogenetic analyses (figure not preserved; see Leebens-Mack et al. 2006, OMICS)


Developing an ontology for describing phylogenetic workflows:

1. Catalogue published methods of phylogenetic analysis (https://www.nescent.org/sites/evoio/MIAPA/PhyloWays)

2. Develop an ontology that accommodates published phylogenetic workflows

3. Evaluate the utility of the ontology for describing published phylogenetic workflows

4. Use the ontology to construct NeXML files with annotated trees and data matrices

5. Elicit feedback from the systematics community
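
As a rough picture of what an ontology-annotated NeXML tree could look like, the sketch below attaches literal `meta` annotations to a `tree` element using Python's standard library. It assumes the NeXML namespace `http://www.nexml.org/2009` and the `nex:LiteralMeta` annotation type. The `phylont:` namespace URI and the property names are placeholders invented for illustration, not actual PhylOnt terms, and the output is a fragment rather than a schema-valid NeXML document.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
PHYLONT = "http://example.org/phylont#"  # placeholder URI, not the real one

ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)


def build_annotated_tree(tree_id, annotations):
    """Return a NeXML-style tree element carrying literal meta annotations."""
    tree = ET.Element("{%s}tree" % NEX, {"id": tree_id})
    # Declare the prefix used inside the 'property' attribute values.
    tree.set("xmlns:phylont", PHYLONT)
    for index, (prop, value) in enumerate(annotations.items(), start=1):
        ET.SubElement(tree, "{%s}meta" % NEX, {
            "id": "meta%d" % index,
            "{%s}type" % XSI: "nex:LiteralMeta",
            "property": "phylont:" + prop,   # hypothetical property name
            "content": value,
        })
    return tree


# Illustrative annotation values.
tree = build_annotated_tree("tree1", {
    "tree_estimation_program": "RAxML",
    "substitution_model": "GTRGAMMA",
})
xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Keeping annotations as property/content pairs means the same file can be read by tools that ignore the ontology and by tools that resolve the terms.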


PhyloWays entry:

Publication: Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS, et al. 2011. Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot 2011:ajb.1000404. - http://www.amjbot.org/cgi/reprint/ajb.1000404v1

Data: concatenated alignments for a superset of 14 loci/17 genes (nucleotide sequences) sampled from 640 species. Genes included 18S rDNA (nuc), 26S rDNA (nuc), atpB (cp), atp1 (mito), matK (cp), matR (mito), nad5 (mito), ndhF (cp), psbBTNH (cp 4 gene region), rbcL (cp), rpoC2 (cp), rps16 (cp), rps3 (mito), and rps4 (cp).

Alignment method: MAFFT was used to align each of the 14 loci; "adjustments were made by eye when there were obvious alignment errors due to particularly divergent or “gappy” sequences". Sites (columns) with > 50% missing data (including gaps due to indels) were removed using Phyutility (Smith and Dunn, 2008). All or subsets of the gene alignments were concatenated for phylogenetic analysis.

Tree estimation: ML analyses were performed on the following data matrices: nuclear rDNA genes; cp genes; mito genes; nuclear+cp genes; all 17 genes. 10 independent runs were made for each data matrix.

• Program: RAxML (vers. 7.1; Stamatakis, 2006)

• Model of sequence evolution: GTRGAMMA with parameters estimated separately (unlinked) for each gene partition

• Method for evaluating support: 100-300 bootstrap replicates
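
A free-text entry like this becomes easier to query and check once it is held as a structured record. The sketch below is one possible shape, with values copied from the entry above; the class and field names are hypothetical and do not correspond to PhylOnt or MIAPA terms.

```python
from dataclasses import dataclass


@dataclass
class TreeEstimation:
    program: str
    version: str
    optimality_criterion: str
    substitution_model: str
    partitioning: str
    support_method: str
    support_replicates: tuple  # (minimum, maximum)
    independent_runs: int


@dataclass
class WorkflowRecord:
    publication: str
    n_taxa: int
    loci: list
    alignment_program: str
    site_filter: str
    tree_estimation: TreeEstimation

    def missing_fields(self):
        """Names of top-level fields left empty, as a simple completeness check."""
        return [name for name, value in vars(self).items()
                if value in ("", None, [])]


record = WorkflowRecord(
    publication="Soltis et al. 2011, Am J Bot, ajb.1000404",
    n_taxa=640,
    loci=["18S rDNA", "26S rDNA", "atpB", "atp1", "matK", "matR", "nad5",
          "ndhF", "psbBTNH", "rbcL", "rpoC2", "rps16", "rps3", "rps4"],
    alignment_program="MAFFT",
    site_filter="columns with > 50% missing data removed with Phyutility",
    tree_estimation=TreeEstimation(
        program="RAxML",
        version="7.1",
        optimality_criterion="maximum likelihood",
        substitution_model="GTRGAMMA",
        partitioning="parameters unlinked per gene partition",
        support_method="bootstrap",
        support_replicates=(100, 300),
        independent_runs=10,
    ),
)

print(len(record.loci), record.tree_estimation.program, record.missing_fields())
```

A completeness check of this kind is roughly what a minimum reporting standard would formalize: a fixed set of fields that every published analysis is expected to fill in.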


Current components of PhylOnt, an ontology for describing phylogenetics workflows:

• Tree estimation program

• Method of analysis

• Construction of data matrix

• Alignment….

• Tree estimation

• optimality criterion….

• branch swapping…

• support assessment…


Tree estimation program ontology


Data analysis ontology diagram


Models for character state transitions (e.g. nucleotide substitution model)


Next steps:

1. Complete PhylOnt

2. Develop NeXML file builder that uses PhylOnt concepts

3. Formalize Minimum Information about Phylogenetic Analyses (MIAPA) reporting standard

4. Evaluate and refine PhylOnt for construction of MIAPA-compliant NeXML files