1
BASys: A Web Server for Automated Bacterial Genome Annotation Gary Van Domselaar , Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli Dong, Paul Lu, Duane Szafron, Russ Greiner, and David S. Wishart Departments of Computing Science and Biological Sciences University of Alberta Edmonton AB T6E 2E9 [email protected] [email protected] Abstract BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal, plasmid, and contig) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual and hyperlinked image output. BASys uses more than 30 programs to determine nearly 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3- D structure, reactions, and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colourful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in approximately 24 hours for an average bacterial chromosome (5 Megabases). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at: http:/ / wishart.biology.ualberta.ca/ basys Genomic Sequence Data (Optional) Gene Identification Data Head Node SWISSPROT CCDB Reference DB Similarity Search Data Submission BASys supplies a web form for uploading chromosome, plasmid, or contig sequence data. Optional gene identification data can be provided, or BASys can predict protein coding regions from the genomic data using Glimmer [1]. E. coli D. melanogaster H. sapiens C. elegans S. cerevisiae Model Organism Similarity Search Compute Node Compute Node KEGG Metabolite Analysis Sequence Analysis Pfam PROSITE PredictSPTM etc. Data Scheduling BASys is implemented as a distributed system. The head node monitors and manages the job scheduling. Annotation and report generation are carried out by the compute nodes. Annotation Reports BASys uses CGView [3] to generate clickable genome maps for navigating the genome data. An HTML- formatted tabular summary is also provided. The genome maps are prerendered as a series of hyperlinked PNG image files. Each gene label is hyperlinked to its corresponding HTML- formatted “gene card”. The card is hyperlinked where applicable to external references. Text- only gene cards are also provided. BASys also supplies an 'evidence card' describing how each annotation was generated. The gene cards, evidence cards, and graphical genome maps are downloadable for offline viewing. References 1. Delcher AL et al. (1999) Nucleic Acid Res. 27 :4636- 41. 2. Ilioupoulos I et al. (2003) Bioinformatics 19 :717- 26. 3. Stothard P. and Wishart DS (2005) Bioinformatics 21 :537- 39. Report Generation CGview Annotation Reports Search Capability BASys supports online keyword searches and sequence similarity searches Search results contain hyperlinks to their gene cards and graphical genome maps. BASys Annotation Pipeline The BASys annotation engine combines database comparison and computational sequence analysis in its annotation pipeline. Translated coding sequences are initially compared using BLAST to the expertly annotated reference databases UniProt and the CyberCell comprehensive molecular database on Escherichia coli . The similarity score between the query and database sequence is compared to the threshold value for each annotation type and qualifying annotations are transitively applied to the query sequence. BASys attempts to fill the remaining annotations with additional similarity searches and sequence analyses. BLAST searches are conducted against the protein sequences of C. elegans , human, yeast, and Drosophila ; a non-redundant database of bacterial protein sequences, the PDB , KEGG, and COG databases. Various sequence analyses are also performed including Pfam, PROSITE, signal peptide and transmembrane domain predictions, and predicted secondary structure with PSIPRED. If the sequence has sufficient similarity to a sequence represented in the PDB database, then BASys may use HOMODELLER to generate a homology model and subsequently perform a structural analysis using VADAR. Several additional annotations, such as protein molecular weight, isoelectric point, and operon structure are calculated directly from the chromosomal, protein-coding nucleotide, and translated protein sequence data. In all collection of nearly 60 distinct annotations is generated for each gene. Validation BASys annotations were compared to a set of expertly annotated proteins from C. trachomatis [2]. BASys annotations agreed with the expert annotations 762 times out of 894. The sensitivity is 94% ; the specificity is 73% . Structure Analysis Homodeller VADAR PDB

BASys: A Web Server for Automated Bacterial Genome Annotation · and sequence similarity searches Search results contain hyperlinks to their gene cards and graphical genome maps

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BASys: A Web Server for Automated Bacterial Genome Annotation · and sequence similarity searches Search results contain hyperlinks to their gene cards and graphical genome maps

BASys: A Web Server for Automated Bacterial Genome Annotation Gary Van Domselaar†, Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli Dong, Paul Lu, Duane

Szafron, Russ Greiner, and David S. Wishart‡

Departments of Computing Science and Biological SciencesUniversity of Alberta

Edmonton AB T6E 2E9

[email protected][email protected]

AbstractBASys (Bacterial Annotation System) is a web server that supports automated, in- depth annotation of bacterial genomic (chromosomal, plasmid, and contig) sequences. It accepts raw DNA sequence data and an optional list of gene identif icat ion information and provides extensive textual and hyperlinked image output. BASys uses more than 30 programs to determine nearly 60 annotation subfields for each gene, including gene/ protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3- D structure, reactions, and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colourful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all result ing gene annotations. The textual annotations and images that are provided by BASys can be generated in approximately 24 hours for an average bacterial chromosome (5 Megabases). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at: http:/ / wishart.biology.ualberta.ca/ basys

Genomic Sequence Data

(Optional) Gene

Identif icat ionData

Head Node

SWISSPROT

CCDB

Reference DB Similarity Search

Data SubmissionBASys supplies a web form for uploading chromosome, plasmid, or contig sequence data. Optional gene identif icat ion data can be provided, or BASys can predict protein coding regions from the genomic data using Glimmer [1].

E. col iD. melanogaster

H. sapiensC. elegans

S. cerevisiae

Model Organism Similarity Search

Compute Node

Compute Node

KEGG

Metabolite Analysis Sequence Analysis

Pfam

PROSITE

PredictSPTM

etc.

Data SchedulingBASys is implemented as a distributed system. The head node monitors and manages the job scheduling. Annotation and report generation are carried out by the compute nodes.

Annotation ReportsBASys uses CGView [3] to generate clickable genome maps for navigat ing the genome data. An HTML- formatted tabular summary is also provided. The genome maps are prerendered as a series of hyperlinked PNG image f iles. Each gene label is hyperlinked to its corresponding HTML- formatted “gene card”. The card is hyperlinked where applicable to external references. Text-only gene cards are also provided. BASys also supplies an 'evidence card' describing how each annotation was generated. The gene cards, evidence cards, and graphical genome maps are downloadable for off line viewing.

References1. Delcher AL et al . (1999) Nucleic Acid Res.

27 :4636- 41.2. Ilioupoulos I et al . (2003) Bioinform at ics

19 :717- 26.3. Stothard P. and Wishart DS (2005)

Bioinformat ics 21 :537- 39.

Report Generation

CGview

Annotation Reports

Search CapabilityBASys supports online keyword searches and sequence similarity searches Search results contain hyperlinks to their gene cards and graphical genome maps.

BASys Annotation Pipeline

The BASys annotat ion engine combines database comparison and computat ional sequence analysis in its annotat ion pipeline. Translated coding sequences are init ially compared using BLAST to the expert ly annotated reference databases UniProt and the CyberCell comprehensive molecular database on Escher ichia col i . The similarity score between the query and database sequence is compared to the threshold value for each annotation type and qualifying annotat ions are transit ively applied to the query sequence. BASys attempts to f ill the remaining annotat ions with addit ional similarity searches and sequence analyses. BLAST searches are conducted against the protein sequences of C. elegans, human, yeast, and Drosophi la; a non- redundant database of bacterial protein sequences, the PDB , KEGG, and COG databases. Various sequence analyses are also performed including Pfam, PROSITE, signal peptide and transmembrane domain predict ions, and predicted secondary structure with PSIPRED. If the sequence has suff icient similarity to a sequence represented in the PDB database, then BASys may use HOMODELLER to generate a homology model and subsequently perform a structural analysis using VADAR. Several addit ional annotations, such as protein molecular weight, isoelectric point, and operon structure are calculated direct ly from the chromosomal, protein- coding nucleot ide, and translated protein sequence data. In all collect ion of nearly 60 dist inct annotat ions is generated for each gene.

ValidationBASys annotat ions were compared to a set of expert ly annotated proteins from C. trachomatis [2]. BASys annotat ions agreed with the expert annotations 762 t imes out of 894. The sensit ivity is 94% ; the specif icity is 73% .

Structure Analysis

Homodeller

VADAR

PDB