FramePlot: a new implementation of the Frame analysis for predicting protein-coding regions in bacterial DNA with a high G+C content


Jun Ishikawa *, Kunimoto Hotta

Department of Bioactive Molecules, National Institute of Infectious Diseases, Tokyo 162-8640, Japan

Received 8 February 1999; received in revised form 10 March 1999; accepted 10 March 1999

Abstract

FramePlot is a web-based tool for predicting protein-coding regions in bacterial DNA with a high G+C content, such as Streptomyces. The graphical output provides for easy distinction of protein-coding regions from non-coding regions. The plot is a clickable map. Clicking on an ORF provides not only the nucleotide sequence but also its deduced amino acid sequence. These sequences can then be compared to the NCBI sequence database over the Internet. The program is freely available for academic purposes at http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. © 1999 Federation of European Microbiological Societies. Published by Elsevier Science B.V. All rights reserved.

Keywords: High G+C content; Streptomyces

Genes of bacteria with high-G+C genomic DNA, such as Streptomyces, have biased codon usage. This results in an extremely high G+C content at the third position of each codon. Streptomyces genes have an average third-letter G+C content of 92%, calculated from 2008 Streptomyces genes recorded in the CUTG database [1], which is based on GenBank release 108. This characteristic enables the prediction of protein-coding regions in such bacteria. The Frame analysis was first developed by Bibb et al. [2] and was implemented on a VAX system. Although it is currently one of the essential analyses for studying Streptomyces genetics, the software to perform the analysis has been implemented on only a few platforms.
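The third-letter G+C content mentioned above can be illustrated with a short calculation. The following Python sketch is not part of FramePlot (which is written in Perl); the function name `third_position_gc` is hypothetical, and the sketch assumes the input is a coding sequence that starts in frame.

```python
def third_position_gc(cds: str) -> float:
    """Return the fraction of codons whose third base is G or C.

    Assumes `cds` starts at the first base of a codon; trailing bases
    that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # length covered by complete codons
    thirds = cds[2:usable:3]           # every third base: codon position 3
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


# Codons ATG GCC GAC TGA have third bases G, C, C, A.
print(third_position_gc("ATGGCCGACTGA"))
```

Applied to a typical Streptomyces gene, a value near 0.9 would be expected from the average quoted above, whereas an out-of-frame reading of the same DNA would generally give a different value.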

FramePlot is a new implementation of the Frame analysis, with many improvements. The program interface is provided by three web pages: a query sequence submission page, a results page, and a feature page. FramePlot calculates third-letter G+C content within a window of a set 'window size' (default 40 codons) and plots the data in the middle of the window. The window is moved along the sequence by a set 'step size' (default 5 codons). These running parameters permit a trade-off between speed and resolution of the plot. A higher value of 'step size' as well as 'window size' yields greater speed, but also a lowered resolution of the plot. In almost all cases, default values will give sufficient results. FramePlot also finds all open reading frames (ORFs) starting

0378-1097/99/$20.00 © 1999 Federation of European Microbiological Societies. Published by Elsevier Science B.V. All rights reserved. PII: S0378-1097(99)00150-0


* Corresponding author. Tel.: +81 (3) 5285 1111; Fax: +81 (3) 5285 1272; E-mail: [email protected]

FEMS Microbiology Letters 174 (1999) 251-253

with a selected 'start codon(s)' in the sequence. The ORFs are plotted as bars with potential start and stop codons. The data for each frame are indicated by different colors (color mode) or line styles (black and white mode). Fig. 1 shows the result of the analysis of the sequence containing the kan gene [3]. There is an apparent ORF with extremely high third-letter G+C content at the 434-1288 region


Fig. 1. Result of the analysis of the sequence containing the kan gene. A: Data of each frame are indicated by a different line-style. Putative ORFs are plotted as bars with potential start (`s ') and stop (`|') codons. B: Deduced amino acid sequence generated by clicking on the ORF at positions 434-1288 is indicated. The button for searching the NCBI BLAST server is also shown below the sequence.


(Fig. 1A). Clicking on the ORF yields the nucleotide sequence and its deduced amino acid sequence, together with the frame number, nucleotide position, length, and analysis date (Fig. 1B). Furthermore, the sequences can be compared to the GenBank database over the Internet by using the NCBI BLAST server [4]. This feature helps in finding a new gene.
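The windowed calculation and the ORF search described above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the FramePlot source: the function names `frame_gc_profile` and `find_orfs` are hypothetical, only the forward strand is scanned, the start codons ATG and GTG are an assumed selection, and the stop codons are the standard TAA, TAG, and TGA.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons


def frame_gc_profile(seq: str, frame: int, window: int = 40,
                     step: int = 5) -> List[Tuple[int, float]]:
    """Third-letter G+C percentage per window for one forward frame.

    `frame` is 0, 1, or 2; `window` and `step` are given in codons
    (defaults as stated in the text). Each point is reported at the
    0-based nucleotide index in the middle of its window.
    """
    thirds = seq.upper()[frame + 2::3]  # third base of each complete codon
    points = []
    for start in range(0, len(thirds) - window + 1, step):
        chunk = thirds[start:start + window]
        gc_percent = 100.0 * sum(b in "GC" for b in chunk) / window
        midpoint = frame + 3 * start + (3 * window) // 2
        points.append((midpoint, gc_percent))
    return points


def find_orfs(seq: str, starts: Tuple[str, ...] = ("ATG", "GTG"),
              min_codons: int = 1) -> List[Tuple[int, int, int]]:
    """Forward-strand ORFs as (frame, start, end), 0-based, end exclusive.

    An ORF runs from the first selected start codon after the previous
    stop through the next in-frame stop codon. `min_codons` is a
    hypothetical filter on the number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((frame, orf_start, i + 3))
                orf_start = None
    return orfs


demo = "AAG" * 10 + "AAA" * 10
print(frame_gc_profile(demo, frame=0, window=10, step=10))
print(find_orfs("CCATGGCCGACTGACC"))
```

With a larger `step`, fewer windows are evaluated, which is the speed and resolution trade-off noted above. A coding region is expected to appear as a stretch in which one frame shows a markedly higher value than the other two and coincides with an ORF found in that frame.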

In the course of the development of the program, we found a new gene. There is a small ORF with high third-letter G+C content downstream of the kan gene. The deduced amino acid sequence of the small ORF shows homology to hyaluronidase and chondroitinase. Further sequencing revealed that this small ORF was the 3′ region of a larger ORF. The ORF could encode a 77.3-kDa protein consisting of 721 amino acids, with 94.2% G+C in the third position of the codons. Furthermore, the 4.2-kb SphI fragment containing the ORF conferred on S. lividans TK21 the ability to grow on hyaluronic acid as a sole carbon source. These results suggest that the gene encodes a hyaluronidase.

FramePlot can accept sequence data in any format. In order to analyze low quality sequence data, for example high throughput genomic sequences, all alphabetic characters are acceptable in addition to A, C, G, and T, while non-alphabetic characters, for example digits, spaces, etc., can be removed by the program.
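The input handling described above amounts to discarding every non-alphabetic character. A minimal Python sketch follows; the function name `clean_sequence` is hypothetical, and the conversion to upper case is an assumption made here for convenience rather than documented FramePlot behavior.

```python
import re


def clean_sequence(raw: str) -> str:
    """Keep only alphabetic characters from pasted sequence data.

    Digits, spaces, and line breaks are dropped; all letters are kept,
    including ambiguity codes such as N, R, or Y. Upper-casing is an
    assumption of this sketch.
    """
    return re.sub(r"[^A-Za-z]", "", raw).upper()


# A numbered, space-separated fragment collapses to plain letters.
print(clean_sequence("1 atgc nryk\n61 GGCC"))
```

Because every letter is retained, any descriptive header text pasted along with the sequence would also be kept by this sketch.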

FramePlot is freely accessible for academic purposes at http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. The source code is also available at the same site. To install the program, Perl version 5.0 and Fly version 1.6 or later are required [5]. Commercial users should contact the author for licensing details.

Acknowledgments

We would like to thank Dr. H. Ikeda for thorough tests of the program. We give special thanks to Dr. M.J. Bibb for encouraging comments, Dr. J.A. Gil for critical reading of the manuscript, and Mr. and Mrs. Summers for correcting the manuscript.

References

[1] Nakamura, Y., Gojobori, T. and Ikemura, T. (1997) Codon usage tabulated from the international DNA sequence databases. Nucleic Acids Res. 25, 244-245.

[2] Bibb, M.J., Findlay, P.R. and Johnson, M.W. (1984) The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene 30, 157-166.

[3] Ishikawa, J. and Hotta, K. (1991) Nucleotide sequence and transcriptional start point of the kan gene encoding an aminoglycoside 3-N-acetyltransferase from Streptomyces griseus SS-1198PR. Gene 108, 127-132.

[4] http://www.ncbi.nlm.nih.gov/BLAST/

[5] Gleeson, M. http://www.unimelb.edu.au/fly/fly.html

