
[AromaDeg extended manual] aromadeg.siona.helmholtz-hzi.de/database/AromaDeg_MANUAL.pdf


Background

Understanding prokaryotic transformation of recalcitrant pollutants and the in-situ metabolic networks requires the integration of a massive amount of biological data. Decades of biochemical studies together with novel next-generation sequencing data have exponentially increased the information on aerobic aromatic degradation pathways. However, the majority of proteins in public databases have not been experimentally characterized, and homology-based methods are still the most routinely used approach to assign protein function, which allows misannotations to propagate.

What is AromaDeg?

AromaDeg is a web-based resource (http://aromadeg.siona.helmholtz-hzi.de) targeting aerobic degradation of aromatic compounds. It comprises manually curated databases constructed using a phylogenomic approach. AromaDeg allows query and data mining of novel genomic, metagenomic or metatranscriptomic data sets to identify protein sequences of key catabolic protein families of aerobic aromatic degradation. Essentially, each query sequence that belongs to a protein family considered in AromaDeg is aligned with other members of that family and thereby associated with a specific cluster of the respective phylogenetic tree. Further functional annotation and/or substrate specificity may then be inferred from neighbouring cluster members of experimentally validated function. This approach allows a detailed characterization of individual protein superfamilies as well as high-throughput functional classifications. AromaDeg thus addresses the deficiencies of homology-based function prediction and provides more accurate annotations of new biological data related to aerobic aromatic biodegradation pathways.

How to query the database?

1. The web interface (Fig. 1) allows querying the database with a candidate set of amino acid sequences in FASTA format. Paste the sequences in the text window or upload a FASTA file (up to 20 MB).

2. Provide a job title and a valid email address, as the link to the result page will be sent via email.

3. In the "Select Database" section (Fig. 2), it is possible to select subsets of the database. If nothing is selected, the default settings (including all protein families comprised in AromaDeg, as shown in Fig. 2) will be used.

   In a first step, AromaDeg compares the user query with the database using BLAST to assign the query to a family before running the phylogenetic pipeline. The default settings are 50% sequence identity and a minimum length of 50 amino acids (Fig. 3), but the user can edit these parameters. Some sequences that belong to a given family may exhibit lower sequence identity, in which case the user may relax the BLAST parameters. However, lowering the minimum sequence identity (down to 30%) may also result in the inclusion of false positives. These false positives are, nonetheless, visible as outliers in the phylogenetic trees. If low identity thresholds are used, we strongly recommend a careful evaluation of the obtained results.

4. Launch the query by clicking on the "Submit" button.
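The effect of the two BLAST parameters in step 3 can be illustrated with a minimal sketch. It is not AromaDeg's code: the `Hit` record, the hit values and the function name are hypothetical, and it assumes that the minimum length refers to the alignment length and that both thresholds are inclusive.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record, for illustration only)."""
    query_id: str
    subject_id: str
    percent_identity: float  # e.g. 62.5 means 62.5 %
    alignment_length: int    # aligned positions, in amino acids


def passes_thresholds(hit, min_identity=50.0, min_length=50):
    """Return True if the hit meets both thresholds (assumed inclusive)."""
    return (hit.percent_identity >= min_identity
            and hit.alignment_length >= min_length)


# Made-up hits: one clear match, one low-identity match, one short match.
hits = [
    Hit("query_1", "family_member_A", 71.2, 410),
    Hit("query_2", "family_member_B", 34.0, 380),
    Hit("query_3", "family_member_C", 88.0, 32),
]

default_pass = [h.query_id for h in hits if passes_thresholds(h)]
relaxed_pass = [h.query_id for h in hits
                if passes_thresholds(h, min_identity=30.0)]

print(default_pass)  # ['query_1']
print(relaxed_pass)  # ['query_1', 'query_2']
```

With the relaxed 30% threshold, query_2 is additionally retained; such low-identity matches are the ones that should be checked as possible outliers in the resulting tree.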


Fig. 1 – Screenshot of the Query section of the AromaDeg web interface with a query example: a protein sequence (GenBank: AAD12607) of Ralstonia sp. U2.

Fig. 2 – Screenshot of the Query section of the AromaDeg web interface showing database selection options.


Fig. 3 – Screenshot of the Query section of the AromaDeg web interface showing the editable BLAST options.

How does the pipeline work?

Upon submission, your sequences are uploaded to the AromaDeg server, the validity of the FASTA format is verified and a confirmation message is displayed (Fig. 4).

NOTE: the header names of your query will be modified by replacing the non-alphanumeric characters [<()*/\{}|.,] with "_".
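To anticipate how headers will look in the result files, the renaming described in the NOTE can be reproduced locally. The following is a minimal sketch, assuming that exactly the characters listed in the NOTE are replaced, each by a single "_"; the function name and the example header are hypothetical.

```python
import re

# Character set copied from the NOTE: < ( ) * / \ { } | . ,
_REPLACED = re.compile(r"[<()*/\\{}|.,]")


def sanitize_header(header: str) -> str:
    """Replace each listed character with an underscore."""
    return _REPLACED.sub("_", header)


# Hypothetical header text (without the leading '>' of the FASTA record).
example = "Query_test AAD12607.1 (Ralstonia sp. U2)"
print(sanitize_header(example))  # Query_test AAD12607_1 _Ralstonia sp_ U2_
```

Characters outside the listed set, such as spaces, are left unchanged in this sketch.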

Fig. 4 – Message displayed upon successful query submission.

The query runs through the following steps:

1. A BLAST comparison between the query and the curated database is performed to identify which query sequences have significant homology to sequences in the database and, if so, in which protein family the homologous sequences are located.

2. Once the candidate sequences from the query are identified, they are added to the matching sequence set (protein family), and a global multiple alignment is performed using MAFFT. This produces a file with the candidate sequences aligned to the target set.


3. The multiple alignment is then used to build a phylogenetic tree using the Neighbour-Joining algorithm and the Ka/Ks model, with bootstrap values calculated to provide branch support (from 25 iterations). The tree file produced in Newick format is then used by the Newick Utilities program suite to generate SVG images of the final tree and independent clusters.
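The bootstrap mentioned in step 3 relies on resampling alignment columns with replacement. The sketch below shows only that resampling step for 25 replicates, as in a standard bootstrap; it is not AromaDeg's implementation, the function name and toy alignment are hypothetical, and building a tree from each replicate and counting how often a branch recurs are omitted.

```python
import random


def bootstrap_alignments(aligned, replicates=25, seed=0):
    """Yield pseudo-alignments built by sampling columns with replacement.

    `aligned` maps sequence names to aligned sequences of equal length.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    for _ in range(replicates):
        # Draw as many column indices as the alignment has columns.
        columns = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(aligned[name][c] for c in columns)
               for name in names}


# Toy alignment with made-up sequences.
toy = {"query": "MK-LV", "ref_1": "MKALV", "ref_2": "MR-LI"}
replicates = list(bootstrap_alignments(toy))
print(len(replicates))              # 25
print(len(replicates[0]["query"]))  # 5
```

Each replicate keeps the same sequences and the same alignment length; only the choice of columns varies, which is what allows branch support to be estimated.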

Upon completion of the query, the website will send you an email with the web address of your result page (Fig. 5), which will contain the result files (Fig. 6). If your query has no match in the AromaDeg database, a failure message will be sent instead.

NOTE: the provided link to the web address will only be valid for 24 hours.

Fig. 5 – Email sent to the user containing the web address to access the "Index of Results".

Fig. 6 – Screenshot of the Index of results obtained with a test query in AromaDeg.


Which files are available for download in the Result page?

Candidates.faa file

The candidates.faa FASTA file contains the headers and corresponding protein sequences of all query sequences that matched a protein family of AromaDeg. In the given example, the query contained only 1 protein sequence, which matched a protein family of AromaDeg, as shown in Fig. 7.
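A downloaded candidates.faa file can be inspected with a few lines of code. The reader below is a minimal sketch, assuming a standard FASTA layout (a ">" header line followed by one or more sequence lines); the function name is hypothetical and the example sequence is a made-up placeholder, not the real AAD12607 sequence.

```python
def read_fasta(lines):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join wrapped sequence lines into one string per record.
    return {h: "".join(parts) for h, parts in records.items()}


# Placeholder content standing in for a downloaded candidates.faa file.
example = """>Query_test_AAD12607_Ralstonia_sp_U2
MSEPQRL
KTWQN
""".splitlines()

records = read_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))  # Query_test_AAD12607_Ralstonia_sp_U2 12
```

For a real file, an open file handle can be passed instead of the list of lines, e.g. `read_fasta(open("candidates.faa"))`. If two records share the same header, this sketch keeps only the last one.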

Fig. 7 – Screenshot of the candidates.faa file obtained with a test query in AromaDeg.

CSV file

The candidates_trees.csv file is a text file that can be regarded as a preliminary result. It summarizes, in comma-separated values format, which candidate sequences of the query are homologous to proteins of one of the families covered by AromaDeg, as found by the BLAST search, and which protein family they match. In the example, Query_test_AAD12607_Ralstonia_sp_U2 matched the salicylate family of Rieske non-heme iron oxygenases, with sequence AAD12607 from Ralstonia sp. U2 as nearest neighbor (Fig. 8).
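For larger queries, the CSV summary can be grouped by matching tree. The sketch below assumes three comma-separated columns in the order query name, best BLAST hit, tree name, and no header row; the function name and the exact field values in the example row are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_tree(handle):
    """Group (query, best_hit) pairs by the name of the matching tree."""
    by_tree = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, best_hit, tree = (field.strip() for field in row[:3])
        by_tree[tree].append((query, best_hit))
    return dict(by_tree)


# Illustrative row; the real file may format these fields differently.
example = io.StringIO(
    "Query_test_AAD12607_Ralstonia_sp_U2,AAD12607,Salicylate\n"
)
print(group_by_tree(example))
# {'Salicylate': [('Query_test_AAD12607_Ralstonia_sp_U2', 'AAD12607')]}
```

If the downloaded file turns out to start with a header row, that row would need to be skipped before grouping.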

Fig. 8 – Screenshot of the candidates_trees.csv file obtained with a test query in AromaDeg. It contains 3 columns: the query sequence names, the best BLAST hit, and the name of the matching tree.

SVG file [e.g. Salicylate_tree.svg]

Whenever a query protein is homologous to members of a family covered by AromaDeg, AromaDeg generates a phylogenetic tree covering all members of that protein family, with the matched query sequences marked with a red dot. The tree is available as a Scalable Vector Graphics file (an XML-based vector image format for two-dimensional graphics with support for interactivity and animation) (Figs. 9, 10).


The trees showing the phylogenetic relationship among members of a protein family comprise several branches supported by bootstrap analysis (see colored clusters, Fig. 9). For each specific cluster or branch that contains 1 or more protein sequences of the query, an additional SVG file is created. In this case, the SVG file name comprises the protein family and a Roman numeral that identifies the cluster [e.g. Salicylate-II.svg] (Fig. 11).

Fig. 9 – Phylogenomic tree of the α-subunits of the salicylate family of Rieske non-heme iron oxygenases generated by AromaDeg including the query sequence (marked with a red dot). All phylogenetic trees generated by AromaDeg were inspected for evident branches (clusters) supported by bootstrap analysis and for proteins of documented function.


Fig. 10 – Detail of Fig. 9, showing part of cluster II of the α-subunits of the salicylate family of Rieske non-heme iron oxygenases generated by AromaDeg, including the query (marked with a red dot). The protein headers contain the tree cluster in Roman numerals (e.g. II), the accession number (AAD12607), the name of the organism in which the protein was described (Ralstonia sp. U2) and a 3- or 4-letter code that indicates the enzyme function and/or substrate specificity (see Fig. 11).


Fig. 11 – Phylogenomic tree of cluster II of the α-subunits of Rieske non-heme iron oxygenases of the salicylate family generated by AromaDeg, including the query (marked with a red dot).

PDF file [e.g. Salicylate.pdf]

For each matched protein family, AromaDeg generates a PDF file that contains a table describing all clusters of that family (including function and substrate description). In the given example, only one table/PDF file was generated (Fig. 12), as the single protein query matched one tree: the salicylate family of Rieske non-heme iron oxygenases. The query is related to cluster II enzymes that, as shown in the PDF table, are defined as salicylate 5-hydroxylases.

FASTA and NWK files [e.g. candidates_Salicylate_aligned.fa; candidates_Salicylate_nwk]

Besides the SVG file, AromaDeg also provides, for each matched phylogenetic tree, a FASTA file of the multiple sequence alignment (including the query) (Fig. 13) and the corresponding Newick format tree file (Fig. 14). The aligned FASTA files and the Newick format tree files allow the user to build and visualize phylogenetic trees using other software.
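As a quick check before loading a Newick file into other software, the leaf names of the tree can be listed. The sketch below is a simplified reader, not a full Newick parser: it assumes unquoted labels without embedded comments, and the function name and toy tree are hypothetical.

```python
import re

# A leaf label directly follows '(' or ',' and runs until ':', ',', ')' or ';'.
# Internal node labels (e.g. bootstrap values) follow ')' and are not matched.
_LEAF = re.compile(r"(?<=[(,])\s*([^\s(),:;][^(),:;]*)")


def leaf_names(newick: str):
    """Return the leaf labels of a Newick string in order of appearance."""
    return [label.strip() for label in _LEAF.findall(newick)]


# Toy tree with made-up labels, branch lengths and one support value.
toy = "((ref_A:0.10,my_query:0.00)100:0.25,ref_B:0.40);"
print(leaf_names(toy))  # ['ref_A', 'my_query', 'ref_B']
```

Listing the leaves in this way makes it easy to confirm that the query sequences are present in the tree file under their modified header names.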


Fig. 12 – Contents of the generated PDF file: a table describing the phylogenomic clusters of the α-subunits of the salicylate family of Rieske non-heme iron oxygenases and a list of the 3- or 4-letter codes that indicate the experimentally validated function and/or substrate of members of this family.


Fig. 13 – Screenshot of the FASTA file that contains the multiple sequence alignment of the test query and the matched tree of AromaDeg.


Fig. 14 – Screenshot of the NWK file that contains the Newick format tree of the test query and the matched tree of AromaDeg.