BASys: A Web Server for Automated Bacterial Genome Annotation
Gary Van Domselaar†, Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli Dong, Paul Lu, Duane Szafron, Russ Greiner, and David S. Wishart‡
Departments of Computing Science and Biological Sciences, University of Alberta
Edmonton AB T6E 2E9
†gary.vandomselaar@ualberta.ca ‡david.wishart@ualberta.ca
Abstract
BASys (Bacterial Annotation System) is a web server that
supports automated, in-depth annotation of bacterial genomic
(chromosomal, plasmid, and contig) sequences. It accepts
raw DNA sequence data and an optional list of gene
identification information and provides extensive textual and
hyperlinked image output. BASys uses more than 30
programs to determine nearly 60 annotation subfields for
each gene, including gene/protein name, GO function, COG
function, possible paralogues and orthologues, molecular
weight, isoelectric point, operon structure, subcellular
localization, signal peptides, transmembrane regions,
secondary structure, 3-D structure, reactions, and pathways.
The depth and detail of a BASys annotation matches or
exceeds that found in a standard SwissProt entry. BASys
also generates colourful, clickable and fully zoomable maps
of each query chromosome to permit rapid navigation and
detailed visual analysis of all resulting gene annotations. The
textual annotations and images that are provided by BASys
can be generated in approximately 24 hours for an average
bacterial chromosome (5 Megabases). BASys annotations
may be viewed and downloaded anonymously or through a
password-protected access system. The BASys server and
databases can also be downloaded and run locally. BASys is
accessible at:
http://wishart.biology.ualberta.ca/basys
[Figure: pipeline diagram. Labels: Genomic Sequence Data; (Optional) Gene Identification Data; Head Node; Reference DB Similarity Search (SWISSPROT, CCDB).]
Data Submission
BASys supplies a web form for uploading chromosome, plasmid, or contig sequence data. Optional gene identification data can be provided, or BASys can predict protein coding regions from the genomic data using Glimmer [1].
[Figure: pipeline diagram, continued. Labels: Model Organism Similarity Search (E. coli, D. melanogaster, H. sapiens, C. elegans, S. cerevisiae); Compute Nodes; KEGG; Metabolite Analysis; Sequence Analysis (Pfam, PROSITE, PredictSPTM, etc.).]
Data Scheduling
BASys is implemented as a distributed system. The head node monitors and manages the job scheduling. Annotation and report generation are carried out by the compute nodes.
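The head node / compute node split described above can be pictured with a small sketch. This is only an illustration under stated assumptions: a thread pool stands in for the compute nodes, and the names `annotate_gene` and `run_head_node` are hypothetical, not part of BASys.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_gene(gene_id: str) -> tuple[str, str]:
    """Hypothetical stand-in for the annotation work done on one compute node."""
    return gene_id, f"report for {gene_id}"


def run_head_node(gene_ids: list[str], n_compute_nodes: int = 4) -> dict[str, str]:
    """Hand each gene to a worker and collect the finished reports.

    Worker threads play the role of compute nodes here; a real distributed
    system would dispatch jobs to separate machines instead.
    """
    with ThreadPoolExecutor(max_workers=n_compute_nodes) as pool:
        return dict(pool.map(annotate_gene, gene_ids))


reports = run_head_node(["gene_0001", "gene_0002", "gene_0003"])
```

The point of the design is that scheduling and per-gene annotation are decoupled, so adding workers shortens the run without changing the annotation code.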
Annotation Reports
BASys uses CGView [3] to generate clickable genome maps for navigating the genome data. An HTML-formatted tabular summary is also provided. The genome maps are prerendered as a series of hyperlinked PNG image files. Each gene label is hyperlinked to its corresponding HTML-formatted “gene card”. The card is hyperlinked where applicable to external references. Text-only gene cards are also provided. BASys also supplies an “evidence card” describing how each annotation was generated. The gene cards, evidence cards, and graphical genome maps are downloadable for offline viewing.
References
1. Delcher AL et al. (1999) Nucleic Acids Res. 27:4636-41.
2. Iliopoulos I et al. (2003) Bioinformatics 19:717-26.
3. Stothard P and Wishart DS (2005) Bioinformatics 21:537-39.
[Figure: pipeline diagram, report generation stage. Labels: Report Generation (CGView); Annotation Reports.]
Search Capability
BASys supports online keyword searches and sequence similarity searches. Search results contain hyperlinks to the corresponding gene cards and graphical genome maps.
BASys Annotation Pipeline
The BASys annotation engine combines database comparison and
computational sequence analysis in its annotation pipeline. Translated
coding sequences are initially compared using BLAST to the expertly
annotated reference databases UniProt and the CyberCell
comprehensive molecular database on Escherichia coli. The similarity
score between the query and database sequence is compared to the
threshold value for each annotation type and qualifying annotations are
transitively applied to the query sequence. BASys attempts to fill the
remaining annotations with additional similarity searches and sequence
analyses. BLAST searches are conducted against the protein
sequences of C. elegans, human, yeast, and Drosophila; a
non-redundant database of bacterial protein sequences; and the PDB,
KEGG, and COG databases. Various sequence analyses are also performed,
including Pfam and PROSITE searches, signal peptide and transmembrane
domain predictions, and secondary structure prediction with PSIPRED. If the
sequence has sufficient similarity to a sequence represented in the PDB
database, then BASys may use HOMODELLER to generate a homology
model and subsequently perform a structural analysis using VADAR.
Several additional annotations, such as protein molecular weight,
isoelectric point, and operon structure, are calculated directly from the
chromosomal, protein-coding nucleotide, and translated protein
sequence data. In all, a collection of nearly 60 distinct annotations is
generated for each gene.
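The threshold-based transfer step described above might look roughly like the following sketch. Everything specific here is an assumption for illustration: the `Hit` type, the field names, the 0-to-1 score scale, and the threshold values are hypothetical and are not taken from BASys.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    score: float                 # similarity score on a hypothetical 0-1 scale
    annotations: dict[str, str]  # annotation fields carried by the database entry


# Hypothetical per-annotation-type thresholds (not the BASys values).
THRESHOLDS = {"protein_name": 0.9, "go_function": 0.7, "cog_function": 0.5}


def transfer_annotations(hits: list[Hit],
                         thresholds: dict[str, float] = THRESHOLDS) -> dict[str, str]:
    """Copy each annotation field from the best-scoring hit that clears its threshold."""
    result: dict[str, str] = {}
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        for name, value in hit.annotations.items():
            if name in result:
                continue  # already filled by a higher-scoring hit
            if name in thresholds and hit.score >= thresholds[name]:
                result[name] = value
    return result


hits = [
    Hit("ref_A", 0.95, {"protein_name": "example kinase", "go_function": "kinase activity"}),
    Hit("ref_B", 0.60, {"go_function": "ATP binding", "cog_function": "T"}),
]
annotations = transfer_annotations(hits)
```

Fields that remain unfilled after this step are the ones the later similarity searches and sequence analyses would attempt to complete.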
Validation
BASys annotations were compared to a set of expertly annotated proteins from C. trachomatis [2]. BASys annotations agreed with the expert annotations 762 times out of 894. The sensitivity is 94%; the specificity is 73%.
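As a worked check of the figures above, the agreement rate follows directly from the reported counts. The sensitivity and specificity functions below use the standard definitions; the poster does not report the underlying true/false positive and negative counts, so they are shown only as formulas and are not applied to invented data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Standard definition: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Standard definition: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Counts reported on the poster.
agreed, total = 762, 894
agreement_rate = agreed / total
print(f"Agreement: {agreement_rate:.1%}")  # prints "Agreement: 85.2%"
```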
[Figure: pipeline diagram, structure analysis stage. Labels: Structure Analysis (Homodeller, VADAR); PDB.]