Rachel Adams, Jerry Choate, Nathan Harrelson, Divya Mistry, and Whitney Smith
BINF 4360, Fall 2007
Overview
Goals
Implementation
Interface
Images
Final product
Conclusions
Goals
Create a dynamic map of the Shewanella oneidensis MR-1 genome
Populate local database with relevant information from web-based databases
Provide an efficient searching algorithm for key terms
Implement user-friendly navigation and readability
Implementation
SQL Schema
Parsing
Databases
Parsing
XPath
XPath was used to quickly parse XML documents generated from NCBI’s SOAP interface.
use XML::XPath;
my $xp = XML::XPath->new(filename => $file);
# Get the gene name and locus tag from each Gene-ref element
foreach my $var ($xp->find('//Gene-ref')->get_nodelist) {
    $name  = $var->find('Gene-ref_locus')->string_value;
    $locus = $var->find('Gene-ref_locus-tag')->string_value;
}
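The same extraction can be sketched in Python for comparison with the Perl snippet. This is a minimal illustration using the standard library's ElementTree, not the project's code. The sample XML, its root element, and the gene name and locus tag values are made-up placeholders. Only the element names Gene-ref, Gene-ref_locus, and Gene-ref_locus-tag come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real NCBI output contains many more elements.
SAMPLE_XML = """<Entrezgene-Set>
  <Entrezgene>
    <Gene-ref>
      <Gene-ref_locus>demoA</Gene-ref_locus>
      <Gene-ref_locus-tag>SO_0001</Gene-ref_locus-tag>
    </Gene-ref>
  </Entrezgene>
  <Entrezgene>
    <Gene-ref>
      <Gene-ref_locus-tag>SO_0002</Gene-ref_locus-tag>
    </Gene-ref>
  </Entrezgene>
</Entrezgene-Set>"""


def extract_gene_refs(xml_text):
    """Return a (name, locus_tag) pair for every Gene-ref element."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() walks the whole tree, like the '//Gene-ref' XPath expression.
    for gene_ref in root.iter("Gene-ref"):
        # findtext() returns the default when the child element is missing.
        name = gene_ref.findtext("Gene-ref_locus", default="")
        locus_tag = gene_ref.findtext("Gene-ref_locus-tag", default="")
        records.append((name, locus_tag))
    return records


GENE_REFS = extract_gene_refs(SAMPLE_XML)
```

Unlike the loop on the slide, which overwrites the name and locus tag on each pass, this sketch collects one pair per gene so the results can be loaded into a table afterwards.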
LWP::Simple
LWP::Simple was used to fetch content from a URL so that it could be written to an XML file.
Regular Expressions
Regular expressions were used to parse HTML files, match specific string patterns, and manipulate text.
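As an illustration of regex-based HTML parsing, the sketch below pulls image-map area tags out of an HTML fragment and keeps the four attributes that the area table in the schema stores (href, title, target, coords). It is written in Python rather than the project's Perl. The sample HTML and its URLs are invented for the example, and the assumption that attribute values are always double-quoted may not hold for real pages.

```python
import re

# Hypothetical image-map fragment; real pathway pages are larger and messier.
SAMPLE_HTML = """
<map name="mapdata">
<area shape="rect" coords="10,20,56,37" href="/entry/SO_0001" title="SO_0001" />
<area shape="circle" coords="100,120,4" href="/entry/demo" title="demo compound">
</map>
"""

# Capture everything between "<area" and the closing ">".
AREA_TAG = re.compile(r"<area\b([^>]*)>", re.IGNORECASE)
# Capture name="value" pairs (assumes double-quoted attribute values).
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

WANTED = ("href", "title", "target", "coords")


def parse_areas(html):
    """Return one dict per <area> tag with the attributes listed in WANTED."""
    areas = []
    for tag in AREA_TAG.finditer(html):
        attrs = {name.lower(): value for name, value in ATTRIBUTE.findall(tag.group(1))}
        # Missing attributes become None so every row has the same keys.
        areas.append({key: attrs.get(key) for key in WANTED})
    return areas


AREAS = parse_areas(SAMPLE_HTML)
```

Regular expressions work here because the tags are flat and regular. For arbitrary HTML a real parser is usually the safer choice.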
Schema
kegg
area
  area_id integer
  href text
  title text
  target text
  coords text
  img_id integer
area_area_id_seq
  sequence_name name
  last_value bigint
  increment_by bigint
  max_value bigint
  min_value bigint
  cache_value bigint
  log_cnt bigint
  is_cycled boolean
  is_called boolean
img_img_id_seq
  sequence_name name
  last_value bigint
  increment_by bigint
  max_value bigint
  min_value bigint
  cache_value bigint
  log_cnt bigint
  is_cycled boolean
  is_called boolean
img
  img_id integer
  map varchar(5)
imgplacement
  img_id integer
  tilex integer
  tiley integer
  gene_id text
  kegg_id text
ncbi_genes
  id integer
  name text
  locus_tag text
  month integer
  day integer
  year integer
  location text
  description text
  function text
  cog_id text
  gi text
  img_id text
  pdbid text
  pdb text
ncbi_proteins
  locus_tag text
  date date
  defintion text
  description text
  gene text
Databases
NCBI
Local databases were populated using information retrieved from the gene, protein, and 3D domain web-based databases.
COG
Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages.
IMG
The Integrated Microbial Genomes (IMG) system's goal is to facilitate the visualization and exploration of genomes from a functional and evolutionary perspective.
KEGG
KEGG, the Kyoto Encyclopedia of Genes and Genomes, stores knowledge-based methods for uncovering higher-order systemic behaviors of the cell and the organism from genomic information.
More Databases
MIST
The Microbial Signal Transduction database contains the signal transduction proteins for 591 complete bacterial and archaeal organisms.
ORNL
The Genome Analysis and System Modeling Group of the Life Sciences Division of ORNL provides bioinformatics and analytic services and resources to collaborators, predicts prospective gene and protein models for analysis, and provides user services for the general community.
PDB
The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease.
ShewCyc
ShewCyc, part of BioCyc (a collection of 371 Pathway/Genome Databases), describes the genome and metabolic pathways of Shewanella oneidensis MR-1.
Interface
Functions provided by the Google Maps API were used to display pathways of the Shewanella genome.
A small overview map gives a bird’s-eye view of the entire image. The current view is indicated with a translucent box.
The user can view the pathways at five zoom levels.
Text balloons show information relevant to the user’s selected target.
The user can either pan over the images and click on areas of interest or enter a query in the search bar to find specific information.
If the user submits a query, relevant targets are marked on the map with colored pins.
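The keyword search behind the search bar could look something like the following sketch. It is illustrative only: it uses Python's built-in sqlite3 as a stand-in for the project's database, keeps just a few of the ncbi_genes columns from the schema, and fills the table with made-up rows. The choice of columns to search and the case-insensitive substring match are assumptions, not the project's documented algorithm.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for ncbi_genes with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE ncbi_genes '
        '(id INTEGER, name TEXT, locus_tag TEXT, description TEXT, "function" TEXT)'
    )
    conn.executemany(
        "INSERT INTO ncbi_genes VALUES (?, ?, ?, ?, ?)",
        [
            (1, "demoA", "SO_0001", "hypothetical glycogen synthase", "glycogen biosynthesis"),
            (2, "demoB", "SO_0002", "hypothetical ATP-binding protein", "transport"),
            (3, "demoC", "SO_0003", "hypothetical regulator", "transcription"),
        ],
    )
    return conn


def search_genes(conn, term):
    """Return (locus_tag, name) for genes whose text columns contain the term."""
    pattern = f"%{term.lower()}%"
    # Bound parameters keep the user's query out of the SQL text itself.
    rows = conn.execute(
        'SELECT locus_tag, name FROM ncbi_genes '
        'WHERE lower(name) LIKE ? OR lower(description) LIKE ? OR lower("function") LIKE ? '
        'ORDER BY locus_tag',
        (pattern, pattern, pattern),
    )
    return rows.fetchall()


DEMO_DB = build_demo_db()
GLYCOGEN_HITS = search_genes(DEMO_DB, "glycogen")
ATP_HITS = search_genes(DEMO_DB, "ATP")
```

Each returned locus tag would then be looked up for its position on the map so a pin can be placed there. The sketch does not escape the LIKE wildcards % and _ in the user's term, which a production version would need to handle.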
Images
ImageMagick is a free software suite for creating, editing, and composing bitmap images. The main functions we took advantage of were resizing, sharpening, padding, and stitching together images.
We created a composite image by combining 212 separate images.
Placing the images within a 16384 by 16384 pixel canvas required strategic manipulation and tedious offset calculation.
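To give a feel for the offset arithmetic, the sketch below maps a pixel position on the 16384 by 16384 composite to a tile column and row, the kind of value the tilex and tiley columns in the schema would hold. It rests on two assumptions that the slides do not state: tiles are 256 pixels square, and each step out in zoom halves the image resolution. The function names are hypothetical.

```python
FULL_SIZE = 16384  # composite image width and height in pixels (from the slide)
TILE_SIZE = 256    # assumed tile edge length in pixels


def tiles_per_side(levels_out):
    """Number of tiles along one edge after zooming out `levels_out` times."""
    scaled = FULL_SIZE >> levels_out          # halve the size per level
    return max(1, -(-scaled // TILE_SIZE))    # ceiling division


def pixel_to_tile(x, y, levels_out=0):
    """Map a full-resolution pixel (x, y) to its (tilex, tiley) at a zoom level."""
    if not (0 <= x < FULL_SIZE and 0 <= y < FULL_SIZE):
        raise ValueError("pixel lies outside the composite image")
    scale = 2 ** levels_out
    return (x // scale) // TILE_SIZE, (y // scale) // TILE_SIZE


# Tile counts per edge for five zoom levels, from full resolution outward.
GRID_SIZES = [tiles_per_side(level) for level in range(5)]
FULL_RES_TILE = pixel_to_tile(5000, 9000)
ZOOMED_OUT_TILE = pixel_to_tile(5000, 9000, levels_out=2)
```

Under these assumptions the full-resolution image is a 64 by 64 grid of tiles, and the pixel at (5000, 9000) falls in tile (19, 35) at full resolution and tile (4, 8) two levels out.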
Final Product
Zoomed image
Query for glycogen
Query for ATP
Conclusions
Using Google Maps, we were able to create a searchable map of pathways in the Shewanella genome.
Efficient parsing methods made collecting and querying data far simpler.
With more time, additional improvements could be implemented to increase the usability of this application.
Currently we offer links to images, but it would be better to have thumbnails of the pictures themselves readily viewable.
Google Web Toolkit has several functions that would make more information available to the user.
Tabs on text balloons could separate data into topical subgroups.
Overlaying a transparent map on top of the current map could be a useful tool for comparing two pathways.
Additionally, the overall scope of the project would be enhanced by deeper zoom levels that let the user see the actual amino acid and nucleotide sequences.