Upload
nakulgv
View
18
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Report on Gene recognition
Citation preview
Gene Recognition
A project report submitted to
M S Ramaiah Institute of TechnologyAn Autonomous Institute, Affiliated to
Visvesvaraya Technological University, Belgaum
in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science & Engineering
Submitted by
Mudra Hegde 1MS07CS052Nakul G V 1MS07CS053
Under the guidance of
Veena G SAssistant Professor
Computer Science and EngineeringM S Ramaiah Institute of Technology
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
M.S.RAMAIAH INSTITUTE OF TECHNOLOGY(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054www.msrit.edu
May 2011
Gene Recognition
A project report submitted to
M. S. Ramaiah Institute of TechnologyAn Autonomous Institute, Affiliated to
Visvesvaraya Technological University, Belgaum
in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science & Engineering
Submitted by
Mudra Hegde 1MS07CS052Nakul G V 1MS07CS053
Under the guidance of
Veena G SAssistant Professor
Computer Science and EngineeringM S Ramaiah Institute of Technology
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
M. S. RAMAIAH INSTITUTE OF TECHNOLOGY(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054www.msrit.edu
May 2011
Department of Computer Science & Engineering
M. S. Ramaiah Institute of Technology(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054
CERTIFICATE
This is to certify that the project work titled Gene Recognition is carried out by
1MS07CS052 Mudra Hegde and 1MS07CS053 Nakul G V in partial fulfillment for the
award of degree of Bachelor of Engineering in Computer Science and Engineering during the
year 2011. The Project report has been approved as it satisfies the academic requirements
with respect to the project work prescribed for Bachelor of Engineering Degree. To the best
of our understanding the work submitted in this report has not been submitted, in part or full,
for the award of any diploma or degree of this or any other University.
Veena G S (Dr.R. Selvarani) Guide Head, Dept of CSE
(External Examiner)
Department of Computer Science & Engineering
M. S. Ramaiah Institute of Technology(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054
DECLARATION
We hereby declare that the entire work embodied in this report has been carried out by us at
M. S. Ramaiah Institute of Technology under the supervision of Veena G S. This report has
not been submitted in part or full for the award of any diploma or degree of this or any other
University.
1MS07CS052 MUDRA HEGDE1MS07CS053 NAKUL G V
Abstract
A gene consists of three regions, viz, a promoter region, a coding region and a terminator
region. Prokaryotic genes are relatively easy to find compared to Eukaryotic genes because
they lack introns. Genes that are expressed usually have introns that interrupt the coding
sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in
mature mRNA (exons) interrupted by introns. The recognition of the promoter regions in a
eukaryotic genome is a daunting task. There is a lot of sequencing data that has been
generated and they need to be annotated. It helps in having a better understanding of the
organism, in drug discovery and also for finding a cure for various genetic disorders.
Gene Finding or gene prediction in DNA has become one of the foremost computational
biology problems for two reasons. Firstly, because completely sequenced genomes have
become readily available and most importantly, because of the need to extract actual
biological knowledge from this data to explain the molecular interactions that occur in cells
and to define important cellular pathways. Discovering the location of genes on the genome
is the first step towards building such a body of knowledge.
In our work we try to find the coding and non-coding regions of an unlabeled string of DNA
nucleotides using Hidden Markov model. A Hidden Markov Model is a generalization of a
Markov chain, in which each (“internal”) state is not directly observable (hence the term
hidden) but produces (“emits”) an observable random output (“external”) state, also called
“emission”, according to a given stationary probability law. HMM employs dynamic
programming algorithms like Viterbi, Forward-Backward algorithms to aid in gene
recognition.
i
ACKNOWLEDGEMENT
We consider ourselves privileged to express gratitude and respect towards all those who
guided us through the completion of this project.
Firstly, we would like to thank Dr. R. Selva Rani, Head of Department, Department of
Computer Science and Engineering, MSRIT, Bangalore for giving us the opportunity to do
this project and also for the excellent lab facilities provided.
We would like to express our gratitude to our project guide, Mrs. Veena G.S; we are
privileged to experience a sustained enthusiastic and involved interest from her side. We
would also like to express our gratitude to Dr K.G Srinivasa, Professor, Department of
Computer Science and Engineering for his support.
We would also like to sincerely thank Mr. Shashidhara H S, Associate Professor, Information
Science, for all the additional inputs he has given and also helped us gather more information
on the various aspects involved in this project. We would like to thank the faculty of the
Department of Computer Science and Engineering and the institute for extending a helping
hand at every juncture of need and making this possible.
Mudra Hegde
Nakul G.V
ii
LIST OF FIGURES
Figure 3.1 Use-case Diagram........................................................................................ 14
Figure 3.2 Flowchart..................................................................................................... 15
Figure 4.1 System Architecture..................................................................................... 18
Figure 4.2 Input Component.......................................................................................... 20
Figure 4.3 Preprocessing.................................................................................................21
Figure 4.4 Output Component........................................................................................22
Figure 4.5 Input Screen Shot1.........................................................................................23
Figure 4.6 Input Screen Shot2.........................................................................................24
Figure 4.7 Output Screen Shot........................................................................................25
Figure 5.1 Gene Model................................................................................................... 30
iii
Contents
Abstract i
Acknowledgements ii
List of Figures iii
Contents iv
1 Introduction
1.1 General Introduction………………………………………………011.2 Statement of the Problem………………………………………….011.3 Objectives of the project…………………………………………....021.4 Current Scope……………………………………………………...021.5 Future Scope……………………………………………………....02
2 Literature Survey
2.1 Prokaryotic Gene Structure……………………………………….032.2 Eukaryotic Gene Structure………………………………………..052.3 Hidden Markov Models…………………………………………...072.4 GENSCAN Algorithm…………………………………………....09
3 Software Requirements Specification
3.1 Introduction………….3.1.1 Purpose…….3.1.2 Scope of the Project…..
3.2 General Description….3.2.1 Project Perspective…3.2.2 End User Expectation….3.2.3 General Constraints…3.2.4 Assumptions and Dependencies…
3.3 Specific Requirements3.3.1 Functional Requirements…3.3.2 Non Functional Requirements….3.3.3 Software Requirements…3.3.4 Hardware Requirements..
3.4 Interface Requirements…3.4.1 User Interface…
3.5 Performance…
4 System Design4.1 Introduction and Design Overview4.2 System Architectural Design
4.2.1 Chosen System Architecture4.2.2 Discussion of Alternative Designs
4.3 Detailed Description of Components4.3.1 Input Component4.3.2 Preprocessing4.3.3 Computational Component4.3.4 Output Component
4.4 User Interface Design4.4.1 Description of User Interface4.4.2 Screen Images4.4.3 Objects and Actions
4.5 Test Plan4.5.1 Features to be Tested
5 Implementation5.1 Hidden Markov Models in Gene Recognition
6 Testing6.1 Introduction
6.1.1 System Overview6.1.2 Test Approach
6.2 Test Cases6.2.1 Case I
6.2.1.1 Purpose6.2.1.2 Input6.2.1.3 Expected Output & Pass/ Fail Criteria6.2.1.4 Test Procedure6.2.1.5 Test Results
6.2.2 Case II6.2.2.1 Purpose6.2.2.2 Input6.2.2.3 Expected Output & Pass/ Fail Criteria6.2.2.4 Test Procedure6.2.2.5 Test Results
7 Conclusion & Scope for Future Work8 Bibliography & References
Chapter 1
INTRODUCTION
1.1 General introduction
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have
a well-defined nucleus and they have a single chromosome which is contained within a
nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a
well-defined nucleus with many chromosomes which are large and linear.
A gene consists of three regions, viz, a promoter region, a coding region and a terminator
region. A promoter is a region of DNA that facilitates the transcription of a particular gene.
Promoters are located near the genes they regulate, on the same strand and typically upstream
(towards the 5' region of the sense strand). In order for the transcription to take place, the
enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a
gene. Promoters contain specific DNA sequences and response elements which provide a
secure initial binding site for RNA polymerase and for proteins called transcription factors
that recruit RNA polymerase.
The coding region of a gene is that portion of a gene's DNA or RNA, composed of exons,
that codes for protein. The region is bounded nearer the 5' end by a start codon and nearer the
3' end with a stop codon. The coding region in mRNA is bounded by the five prime
untranslated region and the three prime untranslated region, which are also parts of the exons.
1.2 Statement of the Problem
This project aims at finding the annotations of a genomic sequence i.e the coding regions of a
gene given an input DNA sequence which is not annotated.
1.3 Objective of the Problem
To find the coding and non-coding regions of an unlabelled string of DNA nucleotides.
1.4 Current Scope
Gene Recognition is still in the stages of research and certain algorithms like GENSCAN and
GrailEXP implement it with certain constraints.
1.5 Future Scope
The biggest challenge that the field of bioinformatics is facing is the huge amount of
unannotated data that it has to deal with. It has huge databases of sequences with it but does
not know what they represent. So, if the genes are annotated it can lead to a lot of mind-
boggling inventions. Newer and better drugs can be developed for various genetic disorders.
Any hereditary diseases can be detected in the offspring at a very early stage. It will also lead
to the better understanding of various organisms.
Chapter 2
LITERATURE SURVEY
2.1 Prokaryotic Gene Structure
The organisms can be broadly classified into Prokaryotes and Eukaryotes. Prokaryotes are
organisms that lack nucleus and membrane-bound organelles. Prokaryotes have a single
chromosome, contained within a nucleoid region rather than a membrane-bound nucleus, but
may also have various small circular pieces of DNA called plasmids spread throughout the
cell. Their gene structure is much simpler than the gene structure of Eukaryotic DNA.
The gene is the functional unit of the DNA. A gene is a unit of heredity in a living organism.
It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain
that has a function in the organism. All living things depend on genes, as they specify all
proteins and functional RNA chains. Genes hold the information to build and maintain an
organism's cells and pass genetic traits to offspring. A modern working definition of a gene is
"a locatable region of genomic sequence, corresponding to a unit of inheritance, which is
associated with regulatory regions, transcribed regions, and or other functional sequence
regions ". The prokaryotic gene is made up of three regions viz.
Promoter Region
Coding Region
Terminator Region
A promoter region is the part of the gene that facilitates the transcription process. Promoters
are located near the genes that they regulate, on the same strand and upstream (5’ region of
the sense strand). For transcription to take place, the enzyme, RNA polymerase has to bind to
a location near the gene. The promoter region contains specific DNA sequences that form the
binding site for the RNA polymerase. The promoter region in the prokaryotes consists of two
sequences. The first one, known as the Pribnow box, is the sequence of six nucleotides
TATAAT. The other sequence consists of the seven nucleotides TTGACAT.
The coding region is the exons (prokaryotic genes are devoid of introns) which form the part
of the DNA that is translated. The coding region starts with the initiation codon (ATG) and
end with the termination codon (TAG or TAA or TGA).
The terminator region is the region that marks the end of the gene or the operon on genomic
DNA for transcription (An operon is a functioning unit of the genomic material containing a
cluster of genes under the control of a single promoter).
Prokaryotic genes generally overlap with each other which make the detection of translation
initiation sites and the predictions of prokaryotic genes difficult.
Many gene finding programs for the prokaryotic genes have been developed, the earlier ones
being ECOPARSE, ORPHEUS, GeneMark.hmm and the more recent ones such as
GeneMark, GeneMarkS, EasyGene and GLIMMER. The GeneMark uses the Bayesian
method. GeneMark.hmm uses the modification of the Viterbi algorithm of HMM (Hidden
Markov Model) with duration to identify the most likely global path through hidden
functional states given the DNA sequence. The extension of the GeneMark algorithm
GeneMark.fba uses the forward-backward algorithm for local posterior decoding used in the
HMM theory. Given the DNA sequence, this program determines an a posteriori probability
for each nucleotide to belong to coding or non-coding region. Also, for any open reading
frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden
states that traverse the ORF as a coding region. Markov models of several orders were
combined in the ‘interpolated’ model for gene prediction in the Glimmer algorithm.
2.2 Eukaryotic Gene Structure
The eukaryotic organism, as opposed to a prokaryotic organism, has a well-defined nucleus
and membrane-bound organelles. In contrast to the prokaryotes, the eukaryotes have many
chromosomes that are large and linear. The chromosomes in eukaryotes are also packaged by
proteins into a structure known as the chromatin due to which very long chromosomes fit
into the nucleus.
The Eukaryotic gene is much more complex than a prokaryotic gene. The eukaryotic gene
consists of
Promoter Region
Exons
Introns
Terminator Region
The Start Site contains a sequence of 7 bases (TATAAAA) called the TATA box. The basal
or core promoter is found in all protein-coding genes. This is in sharp contrast to the
upstream promoter whose structure and associated binding factors differ from gene to gene.
Many different genes and many different types of cells share the same transcription factors
— not only those that bind at the basal promoter but even some of those that bind upstream.
What turns on a particular gene in a particular cell is probably the unique combination of
promoter sites and the transcription factors that are chosen.
Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically
lie upstream of the gene and can have regulatory elements several kilobases away from the
transcriptional start site (enhancers). In eukaryotes, the transcriptional complex can cause the
DNA to bend back on itself, which allows for placement of regulatory sequences far from the
actual site of transcription. Many eukaryotic promoters, between 10 and 20% of all genes,
contain a TATA box (sequence TATAAA), which in turn binds a TATA binding protein
which assists in the formation of the RNA polymerase transcriptional complex. The TATA
box typically lies very close to the transcriptional start site (often within 50 bases).
Eukaryotic promoter regulatory sequences typically bind proteins called transcription
factors which are involved in the formation of the transcriptional complex.
An exon is a nucleic acid sequence that is represented in the mature form of
an RNA molecule after either portions of a precursor RNA (introns) have been removed
by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-
splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-
coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the
sequence in the DNA or its RNA transcript.
Genes that are expressed usually have introns that interrupt the coding sequences. A typical
eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called
exons) interrupted by introns. The regions between genes are likewise not expressed, but may
help with chromatin assembly, contain promoters, and so forth.
Intron sequences contain some common features. Most introns begin with the sequence GT
(GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among
them. Intron sequences may be large relative to coding sequences; in some genes, over 90
percent of the sequence between the 5′ and 3′ ends of the mRNA is introns. RNA polymerase
transcribes intron sequences. This means that eukaryotic mRNA precursors must be
processed to remove introns as well as to add the caps at the 5′ end and polyadenylic acid
(poly A) sequences at the 3′ end.
2.3 Hidden Markov Models
A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties
of observed real world data. A good HMM accurately models the real world source of the
observed data and has the ability to simulate the source. Machine Learning techniques based
on HMMs have been successfully applied to problems including speech recognition, optical
character recognition, and problems in computational biology. The main computational
biology problems with HMM-based solutions are protein family profiling, protein binding
site recognition and the gene finding in DNA.
Gene Finding or gene prediction in DNA has become one of the foremost computational
biology problems for two reasons. Firstly, because completely sequenced genomes have
become readily available and most importantly, because of the need to extract actual
biological knowledge from this data to explain the molecular interactions that occur in cells
and to define important cellular pathways. Discovering the location of genes on the genome
is the first step towards building such a body of knowledge.
A basic Markov model of a process is a model where each state corresponds to an observable
event and the state transition probabilities depend only on the current and predecessor state.
This model is extended to a Hidden Markov model for application to more complex
processes, including speech recognition and computational gene finding. A generalized
Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output
symbols, a set of state transition probabilities and a set of emission probabilities. The
emission probabilities specify the distribution of output symbols that may be emitted from
each state. Therefore in a hidden model, there are two stochastic processes; the process of
moving between states and the process of emitting an output sequence. The sequence of state
transitions is a hidden process and is observed through the sequence of emitted symbols.
The field of computational biology involves the application of computer science theories and
approaches to biological and medical problems. Computational biology is motivated by
newly available and abundant raw molecular datasets gathered from a variety of organisms.
Though the availability of this data marks a new era in biological research, it alone does not
provide any biologically significant knowledge. The goal of computational biology is then to
elucidate additional information regarding protein coding, protein function and many other
cellular mechanisms from the raw datasets. This new information is required for drug design,
medical diagnosis, medical treatment and countless fields of research.
Many effective tools based on HMMs have been created for the purpose of gene finding.
Among the most successful tools are Genie, GeneID and HMMGene. Though each tool has a
slightly different model, they each use the technique of combining several specialized
submodels into a larger framework. The submodels correspond directly to different regions
of DNA defined according to their function in the process of gene transcription. Most of the
gene finding tools is hybrid models that include neural network components. In these tools,
instead of an HMM, a neural network models certain regions, such as splice sites. The overall
framework of an HMM-based gene finder combines the submodels into a larger model
corresponding to the organization of a gene in DNA and its functional roles.
2.4 GENSCAN Algorithm
Identifying genes in DNA sequences by computational methods is a topic on which a lot of
research has been made in the past few years. This problem deals with the precise sequence
determinants of transcription, translation and RNA splicing. Softwares for exon prediction
have become common in genome sequencing laboratories to identify genes in newly
sequenced regions.
Early approaches to the gene recognition concentrated on the prediction of individual
functional elements such as promoters, coding regions, splice sites and so on. But, the recent
approaches to gene finding focus on integrating all these factors. Some examples of such
approaches are: FGENEH, GENMARK, Gene ID, Genie, GeneParser and GRAIL II. Two
important limitations of the currently existing algorithms are that they make an assumption
that the input sequence has exactly one gene and the accuracy that is measured by
independent control sets may be less than what was actually presumed. The accuracy is such
that only 50% of the exons are actually identified. GENSCAN uses a general probabilistic
model for the human genomic sequences. The overall architecture that the model uses is the
Generalized Hidden Markov Model. This algorithm differs from the other algorithms in the
followings aspects:
i) A double-stranded DNA sequence is considered with potential genes on both the sides of the
DNA which are analyzed simultaneously and in an integrated fashion.
ii) The assumption that other algorithms have made, that the input sequence has exactly one
complete gene is not made here. This model considers the fact that an input sequence may
have a partial gene, a complete gene, multiple complete genes or no gene at all.
iii) It introduces a new method, Maximum Dependence Decomposition, to model the functional
signals in DNA sequences
Chapter 3
SOFTWARE REQUIREMENTS
SPECIFICATION
3.1 Introduction
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have
a well-defined nucleus and they have a single chromosome which is contained within a
nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a
well-defined nucleus with many chromosomes which are large and linear.
A gene consists of three regions, viz, a promoter region, a coding region and a terminator
region. Gene Recognition is a particularly difficult problem in Bioinformatics. The
complexity that the DNA sequences involve makes the task even more daunting. The
solution to this problem can be found to a certain extent by using Hidden Markov Models.
A Hidden Markov Model is a generalization of a Markov chain, in which each (“internal”)
state is not directly observable (hence the term hidden) but produces (“emits”) an observable
random output (“external”) state, also called “emission”, according to a given stationary
probability law. In this case, the time evolution of the internal states can be induced only
through the sequence of the observed output states. If the number of internal states is N, the
transition probability law is described by a matrix with N times N values; if the number of
emissions is M, the emission probability law is described by a matrix with N times M values.
A model is considered defined once given these two matrices and the initial distribution of
the internal states.
The most used algorithms in Hidden Markov Models are:
1. The Forward Algorithm: To find the probability of emission distribution (given a model)
starting from the beginning of the sequence.
2. The Backward Algorithm: find the probability of emission distribution (given a model)
starting from the end of the sequence.
3. Viterbi algorithm: To find the sequence of internal states that has, as a whole, the highest
probability. The most used algorithm is the Viterbi algorithm.
3.1.1 Purpose of the Project
The purpose of this project is to annotate the unannotated DNA sequences of various
species such as Saccharomyces cerevisiae, Homo sapiens etc.
3.1.2 Scope of the Project
This project aims to recognize the genes thus helping in better drug discovery and a
better understanding of the organisms.
3.2 General Description
3.2.1 Project Perspective
In this project we aim recognize genes in an unannotated sequence of DNA. The gene will be
recognized with the start and the stop codon appropriately marked.
3.2.2 End User Expectation
The output to be expected will be the recognized genomic sequence of a particular species
with all the parts of the gene properly identified such as the promoter, the introns, exons and
the terminator region.
3.2.3 General Constraints
We come across a large number of constraints in gene recognition. Since the start codon,
ATG, codes for the amino acid methionine it may appear even in the coding region. This
poses a problem in order to identify the start of the gene. The promoters help in identifying
the start of the gene. But, in eukaryotic organisms the promoter sequences are many and
quite complex. The stop codons, TAA, TGA and TAG, also occur in the intergenic region.
This also is considered as a constraint. Another constraint that we may come across is that
the model cannot be made to work for sequences of all the species.
3.2.4 Assumptions and Dependencies
The assumptions that we are making in our project is that the input sequence is complete and
does not contain any partial genes.
3.3 Specific Requirements
3.3.1 Functional Requirements
The functional requirements are as follows:
i) Take an unannotated input sequence and annotate it.
ii) Visualize the annotated output sequence.
3.3.2 Non-Functional Requirements
The performance required from the Forward and Backward Algorithm must be O( N2L)
where L is the length of the sequence and N stands for the number of states in the
model.
3.3.3 Software System Requirements
This gene recognition tool requires that the operating system be a Linux operating system
with Apache server.
3.3.4 Hardware System Requirements
Processor: Intel Pentium (1.8 GHz)
RAM: 1GB
Hard Disk: 80 GB
Cache: 2 MB L2 Cache
3.4 Interface Requirements
3.4.1 User Interface
We need one GUI for accepting the input sequence and more GUI for visualizing the output
annotated sequence.
Fig. 3.1 Use-case Diagram
Fig 3.2 Flowchart
3.5 Performance
The efficiency expected of the forward and backward algorithm is of the order of
O(N2L). At run time, warnings are issued when iterators follow an inconsistent or out-
of- bounds path, and when negative probabilities are encountered. Also, to improve the
efficiency, constant transition probabilities are pre-computed outside the main loop
whenever possible. Finally, all loops over transitions and states are unrolled, reducing
the number of run-time decisions and lookups.
Chapter 4
SYSTEM DESIGN
4.1 Introduction and Design Overview
This project deals with one of the most challenging problems in Bioinformatics i.e. Gene
Recognition. Gene prediction in DNA has become one of the foremost computational
biology problems for two reasons. Firstly, because completely sequenced genomes have
become readily available and most importantly, because of the need to extract actual
biological knowledge from this data to explain the molecular interactions that occur in cells
and to define important cellular pathways. Discovering the location of genes on the genome
is the first step towards building such a body of knowledge.
This document deals with the design of the architecture used for this project. The problem of
Gene Recognition has been solved by using numerous methods. But, by far, Hidden Markov
Models are most widely used. Hidden Markov Models do not present the problems that occur
while Gene Recognition is being performed by pattern recognition.
The input will be a DNA sequence in FASTA format that is not annotated. It then undergoes
preprocessing in order to be checked for the correct format and to be copied on to a
temporary file without the description of the sequence. After this is done, the sequence goes
through the computation phase where the Viterbi, forward-backward algorithms are used to
find the optimal path, hence finding the gene. This output is then displayed to the user.
4.2 System Architectural Design
4.2.1 Chosen System Architecture
Fig 4.1 System Architecture
The system has four components:1. Input Component
2. Preprocessing
3. Computational Component
4. Output Component
4.2.2 Discussion of Alternative Designs
One of the alternative designs that were proposed for the recognition of genes was that of
pattern matching using parallel programming. It involves the recognition of start codons,
promoter sequences, terminator codons and intrinsic terminator sequences in a given DNA
sequence, hence recognizing the gene. But this design failed due to several reasons and it was
also found to be an inefficient way of locating genes.
The first step involved identifying the terminator codons in parallel by dividing the input
sequence into a fixed number of nucleotides. But, the major drawback here was that certain
sequences in the intergenic region also matched the terminator codons.
Another major drawback is that, the start codon, ATG, also codes for the protein methionine.
Hence, the sequence ATG may be present in the coding region also. This makes it difficult to
recognize the start of the gene using pattern matching.
Another difficulty that was encountered was regarding the promoter sequences. Prokaryotic
genes have two fixed promoter sequences that mark the start of a gene. But, when eukaryotic
genes are considered the complexity of the promoter sequences increases. The sequences are
extremely diverse and difficult to characterize. They lie several kilobases away from the
transcriptional start site which makes it highly inefficient to search for it using pattern
matching. The nucleotides in the consensus sequence may also vary from gene to gene.
Hence, it becomes very difficult for the consensus sequence to be searched for in the given
DNA sequence.
The termination of a gene may also be identified using sequences known as intrinsic
terminators. The intrinsic terminators are a sequence of inverted repeat (5’ CAGTTA|
TAACTG 3’) followed by up to six thymine nucleotides (TTTTTT). But, the complexity
involved in this would be that the inverted repeat may be of any length.
Partial genes would also pose a problem to recognize a gene using pattern recognition.
Hence, we planned to use Hidden Markov Models for gene recognition.
4.3 Detailed Description of Components
4.3.1 Input Component
The input component takes a text file containing DNA sequences in FASTA format as input.
This acts as the sequence that has to be annotated.
Fig 4.2 Input Component
4.3.2 Preprocessing
The description in the input file is removed and the remaining input, which is the DNA
sequence, is copied onto another temporary file. If the user enters data is any format other
than FASTA then appropriate error messages are displayed.
Fig 4.3 Preprocessing
4.3.3 Computational Component
The computational component involved in this project is the application of Hidden Markov
Models. The preprocessed sequence is taken as input and the Viterbi algorithm and forward-
backward algorithm are applied to it. The Viterbi algorithm is a dynamic programming
algorithm for finding the most likely sequence of hidden states, called the Viterbi path, which
will generate a given output sequence given the model parameters. The forward-backward
algorithm is an inference algorithm for hidden Markov models which computes the posterior
marginals of all hidden state variables given a sequence of observations/emissions.
4.3.4 Output Component
This component takes the annotated sequence from the computational component and
displays the gene.
Fig 4.4 Output Component
4.4 User Interface Design
4.4.1 Description of the User Interface
The user interface for this project involves page where the user may input the DNA sequence
or may upload a file containing the sequence. The output is also displayed on the same page.
4.4.2 Screen Images
Fig 4.5 Input Screen Shot 1
The user can enter his DNA sequence in the box provided with the label “Enter the input
sequence in FASTA format”. If the user inputs a sequence that is not in FASTA format a
dialog box opens with an error message asking the user to input the sequence in the correct
format.
Fig 4.6 Input Screen Shot 2
If the user wants to upload a file containing the input sequence instead of directly inputting
the sequence he may do so by clicking the “Upload File” button. This opens a box wherein
the user may type the name of the file that contains the input sequence.
Fig 4.4 Output Screen Shot
Once the user inputs the sequence he may click “Submit” to obtain the annotated sequence in
the box provided below “The annotated sequence is”.
4.4.3 Objects and Actions
The objects present on the user interface are:
Text Area1: To enter the input sequence directly. The text area is preceded by a statement
“Enter the input sequence in FASTA format”.
“Upload File” button: If the user does not want to input the DNA sequence directly he may
use this button to upload a file that contains the input sequence. When this button is clicked
a text area appears where the user may enter the file name.
“Submit” button: This button may be clicked when the sequence or the file has to be
submitted for annotating the sequence that they contain. Once this button is clicked the
sequence gets annotated and the output is displayed.
Text Area2: This text area is used to display the output i.e. the annotated sequence. It is
preceded by “The annotated output sequence is”.
“Exit” button: This button is used to exit the screen.
4.5 Test Plan
4.4.1 Features to be Tested
The features to be tested are as follows:
Features to be
tested
Input Expected Output
Input in FASTA format Sequence entered by the user No- Error messages should
be displayed
Yes- Proceed
Preprocessing Sequence entered by the user No- Error
Yes- Temporary file
containing only the sequence
should be created
Start codon identification Temporary file with
sequence
ATG sequence should be
identified at the proper
position
Terminator codon
identification
Temporary file with
sequence
TAA/TAG/TGA should be
identified at the correct
position
Exons and Introns
identification
Temporary file with
sequence
Exons and introns should
follow the AG-GT rule and
be identified at the proper
position
Table 4.1 Features to be Tested
Chapter 5
IMPLEMENTATION
5.1 Hidden Markov Models in Gene Recognition
A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties
of observed real world data. A good HMM accurately models the real world source of the
observed data and has the ability to simulate the source. Machine Learning techniques based
on HMMs have been successfully applied to problems including speech recognition, optical
character recognition, and problems in computational biology. The main computational
biology problems with HMM-based solutions are protein family profiling, protein binding
site recognition and the problem that is the topic of this paper, gene finding in DNA. Gene
finding or gene prediction in DNA has become one of the foremost computational biology
problems for two reasons. Firstly, because completely sequenced genomes have become
readily available and most importantly, because of the need to extract actual biological
knowledge from this data to explain the molecular interactions that occur in cells and to
define important cellular pathways. Discovering the location of genes on the genome is the
first step towards building such a body of knowledge.
A basic Markov model of a process is a model where each state corresponds to an observable
event and the state transition probabilities depend only on the current and predecessor state.
This model is extended to a Hidden Markov model for application to more complex
processes, including speech recognition and computational gene finding. A generalized
Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output
symbols, a set of state transition probabilities and a set of emission probabilities. The
emission probabilities specify the distribution of output symbols that may be emitted from
each state. Therefore in a hidden model, there are two stochastic processes; the process of
moving between states and the process of emitting an output sequence.
The problem of finding genes in DNA has been studied for many years. It was one of the first
problems tackled once sufficient genomic data became available. The problem is given a
sequence of DNA, determine the locations of genes, which are the regions containing
information that code for proteins. At a very general level, nucleotides can be classified as
belonging to coding regions in a gene, non-coding regions in a gene or intergenic regions.
The problem of gene finding can then be stated as follows:
Input: A sequence of DNA X = (x1….xn) € ∑*, where ∑= A,C, G, T.
Output: Correct labeling of each element in X as belonging to a coding region, non-coding
region or intergenic region.
Gene finding becomes complicated when the problem is approached in more biological
detail. A eukaryotic gene contains coding regions called exons which may be interrupted by
non-coding regions called introns. The exons and introns are separated by splice sites.
Regions outside genes are called intergenic. The goal of gene finding is then to annotate the
sets of genomic data with the location of genes and within these genes, specific areas such as
promoter regions, introns and exons.
The Hidden Markov Model that we have used in our project is a shown below:
Start Background Start Codon Gene Stop Endfull bgstart fullgenestop
extend
bgend
full
bgbg
emit background emit start emit codon emit stop empty
BLOCK 1 BLOCK 2 BLOCK 3
Fig 5.1 Gene Model
The Gene Model shown above consists of six states, viz, Start, Background, Start Codon,
Gene, Stop and End. These six states are divided into three blocks, viz, Block 1, Block 2 and
Block 3. Block 1 consists of just the Start state which is the start of scanning the sequence.
Block 2 consists of 4 states, Background, Start Codon, Gene and Stop. Block 3 consists of a
single state, End. These states constitute the finite set of states of the Hidden Markov Model.
The alphabet of output symbols in this HMM are the nucleotides A, T, G, and C.
The sequence moves from one state to another using state transition probabilities. The
transition from the Start state to the Background State has ‘full’ probability, i.e. 1.0. The
Background state has emission probability ‘emitbackground’. The Background state takes
care of the intergenic region. Hence the nucleotides encountered may be A, T, G or C.
Therefore, the emission probability for Background is emitbackground= 0.25. The model
remains in the Background state until a start codon is encountered. The transition probability
to remain in the Background state is ‘bgbg’ i.e. full-bgend-bgstart. Once the start codon ATG
is found, the model moves from Background state to Start Codon state.
The transition from Background state to Start Codon state has the probability ‘bgstart= Gene
Density’. The emission probability of Start Codon state is ‘emitstart’. If an ATG is
encountered then the emission probability is 1.0 otherwise it is considered as 0.0. Once a start
codon is encountered it leads to a gene, hence the model moves from the Start Codon state to
the Gene state. The transition probability for this transition is ‘full’ i.e. 1.0 since once the
start codon is found, the gene is found. There is more than a single codon that is encountered
when we are in the Gene state. Hence, the probability of staying in the Gene state is ‘extend=
1.0- (1.0/ Gene Length)’.
The emission probability for the Gene state is ‘emitcodon’. This state emits any of the 61
codons (except TAA, TGA and TAG) that code for the amino acids that form the protein.
Hence, if these codons are found the probability is 1/61 otherwise if stop codons are found
the emission probability is 0.0. Then, a transition to the next state is made.
The next state is the Stop Codon state which is entered if any of the stop codons (TAA, TAG
or TGA) are encountered. The transition probability from Gene state to the Stop state is
‘genestop= 1.0/ Gene Length’. Once the stop codons are encountered it marks the end of that
particular gene. But the DNA sequence may have more nucleotides. Hence, the transition
takes place from Stop state to Background state. This transition probability is ‘full’ i.e. 1.0.
The Stop state emits the stop codons, hence the emission probability of the state is 1/3 if a
stop codon is found otherwise it is 0.0.
Once the transition has taken place from Stop to Background the entire sequence is scanned
and finally a transition from Background to End state takes place with a transition probability
of ‘bgend=0.0001’. The End state belongs to Block 3 of the model and its emission is
‘empty’.
The algorithms used in the implementation of this project are Forward-Backward Algorithm
and the Viterbi Algorithm.
The forward-backward algorithm is an inference algorithm for hidden Markov models which
computes the posterior marginals of all hidden state variables given a sequence of
observations/emissions , i.e. it computes, for all hidden state variables
, the distribution . The algorithm makes use of the
principle of dynamic programming to efficiently compute the values that are required to
obtain the posterior marginal distributions in two passes. The first pass goes forward in time
while the second goes backward in time; hence the name forward-backward algorithm.
The forward algorithm is used to find the next likely state in the given finite set of states. In
our implementation the forward algorithm is implemented in a function called Forward
which returns the next likely state. In this function the 8 transitions are defined according to
our gene model. It scans the sequence in the forward direction and finds the likely states.
The backward algorithm is also used to find the next likely state in a given finite set of states
but it scans the sequence in the reverse direction. In our implementation the backward
algorithm has been implemented in the function called Backward. This function also returns
the next likely state.
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely
sequence of hidden states, called the Viterbi path, that results in a sequence of observed
events, especially in the context of Markov information sources, and more generally, hidden
Markov models.
The algorithm makes a number of assumptions.
First, both the observed events and hidden events must be in a sequence. This
sequence often corresponds to time.
Second, these two sequences need to be aligned, and an instance of an observed event
needs to correspond to exactly one instance of a hidden event.
Third, computing the most likely hidden sequence up to a certain point t must depend
only on the observed event at point t, and the most likely sequence at point t − 1.
In the implementation we have implemented Viterbi algorithm in 2 functions, Viterbi_trace
and Viterbi_recurse. Viterbi_trace finds the most likely path in the forward direction whereas
Viterbi_recurse finds the path in the reverse direction. These two functions return the most
likely path followed by the sequence. Another function called addEdge is used to assign an
edge from the ‘from’ state to the ‘to’ state with transitions, probability and emissions as
parameters.
Chapter 6
TESTING
6.1 Introduction
6.1.1 System Overview
The implementation of our code has been done on the Linux operating system. The language
that has been used for coding is C++ . The front end has been developed using PHP.
6.1.2 Test Approach
The input sequences have been taken from the GENBANK database. The output is tested
manually in comparison to the details from the GENBANK database.
6.2 Test Cases
6.2.1 Case-1
6.2.1.1 Purpose
To test if the gene has been correctly identified with the start and stop codon appropriately
marked.
6.2.1.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a
file of .fasta format. The sequence is:
>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACA
TGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAA
GCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACA
ACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATT
AGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCG
TCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGT
ACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCAC
AACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAA
CTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGA
AGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATC
TACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTAC
TATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCC
CCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGT
AGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAG
TTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTC
AATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTT
GTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTAT
GGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGAC
CGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCG
CCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATA
AACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTT
TCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAAT
AGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTAT
CTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGG
GTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAAT
CCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTG
TCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGT
TCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTAC
TAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAG
TGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAG
CTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACG
TCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAA
ATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTT
CATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAG
TAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTA
CCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTAC
ACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATT
GGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGA
TGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAG
AAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGG
TCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTG
TCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACT
TTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGG
ACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAAC
GTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAAT
CACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTT
TTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAA
ATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGA
TTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATAC
CCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTT
TGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAA
ATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAA
CCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCT
TCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCT
GTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTG
TCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCG
TCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTA
AAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGT
GCGTTAATTTAGAAAGAACATCCAGTATAAGTTCTTCTATATAGTCAATTAAAGCAGGATGCCTAT
TAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGG
TATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCT
ACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCG
CAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAAC
AGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAA
ATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTT
ATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAG
AATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAAT
AACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCAT
CACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAAT
CATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGA
ATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGC
TCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAG
CTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTT
CAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTT
TTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTT
CTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTG
ATC
6.2.1.3 Expected Output & Pass/ Fail Criteria
The expected output is:
Gene Recognition: (upper case represents the Gene/s)
gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacga
gcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatattta
ggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaaagacgcgaaaaaaaaagaacaa
cgcgtcatagaacttttggcaattcgcgtcacaaataaattttggcaacttatgtttcctcttcgagcagtactcgagccctgtctcaagaatgtaataatacccatcgta
ggtatggttaaagatagcatctccacaacctcaaagctccttgccgagagtcgccctcctttgtcgagtaattttcacttttcatatgagaacttattttcttattctttactct
cacatcctgtagtgattgacactgcaacagccaccatcactagaagaacagaacaattacttaatagaaaaattatatcttcctcgaaacgatttcctgcttccaacatc
tacgtatatcaagaagcattcacttaccATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACT
CCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAA
GAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAG
CTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTT
CTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTC
GAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGT
CCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAAC
GGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTC
ACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAAT
TGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATT
GCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAG
GTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATC
AACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGAT
CCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGAT
AATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTT
TCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGG
ATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTT
TTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCA
AGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATT
TCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACA
TCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGT
TCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCT
CCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATA
AAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTT
GCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATT
AGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAA
CCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAA
CACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAG
ATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAG
CAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGT
ATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCT
GATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGA
AGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACA
GCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCAT
CGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAAC
AATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCAT
AGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATG
TCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGAttatacgcaacgatattttgctt
aattttattttcctgttttattttttattagtggtttacagataccctatattttatttagtttttatacttagagacatttaattttaattccattcttcaaatttcatttttgcacttaaaa
caaagatccaaaaatgctctcgccctcttcatattgagaatacactccattcaaaattttgtcgtcaccgctgattaatttttcactaaactgatgaataatcaaaggcccc
acgtcagaaccgactaaagaagtgagttttattttaggaggttgaaaaccattattgtctggtaaattttcatcttcttgacatttaacccagtttgaatccctttcaatttctg
ctttttcctccaaactatcgaccctcctgtttctgtccaacttatgtcctagttccaattcgatcgcattaataactgcttcaaatgttattgtgtcatcgttgactttaggtaatt
tctccaaatgcataatcaaactatttaaggaagatcggaattcgtcgaacacttcagtttccgtaatgatctgatcgtctttatccacatgttgtaattcactaaaatctaaa
acgtatttttcaatgcataaatcgttctttttattaataatgcagatggaaaatctgtaaacgtgcgttaatttagaaagaacatccagtataagttcttctatatagtcaatta
aagcaggatgcctattaatgggaacgaactgcggcaagttgaatgactggtaagtagtgtagtcgaatgactgaggtgggtatacatttctataaaataaaatcaaat
taatgtagcattttaagtataccctcagccacttctctacccatctattcataaagctgacgcaacgattactattttttttttcttcttggatctcagtcgtcgcaaaaacgta
taccttctttttccgaccttttttttagctttctggaaaagtttatattagttaaacagggtctagtcttagtgtgaaagctagtggtttcgattgactgatattaagaaagtgg
aaattaaattagtagtgtagacgtatatgcatatgtatttctcgcctgtttatgtttctacgtacttttgatttatagcaaggggaaaagaaatacatactattttttggtaaag
gtgaaagcataatgtaaaagctagaataaaatggacgaaataaagagaggcttagttcatcttttttccaaaaagcacccaatgataataactaaaatgaaaaggattt
gccatctgtcagcaacatcagttgtgtgagcaataataaaatcatcacctccgttgcctttagcgcgtttgtcgtttgtatcttccgtaattttagtcttatcaatgggaatc
ataaattttccaatgaattagcaatttcgtccaattctttttgagcttcttcatatttgctttggaattcttcgcacttcttttcccattcatctctttcttcttccaaagcaacgatc
cttctacccatttgctcagagttcaaatcggcctctttcagtttatccattgcttccttcagtttggcttcactgtcttctagctgttgttctagatcctggtttttcttggtgtagt
tctcattattagatctcaagttattggagtcttcagccaattgctttgtatcagacaattgactctctaacttctccacttcactgtcgagttgctcgtttttagcggacaaag
atttaatctcgttttctttttcagtgttagattgctctaattctttgagctgttctctcagctcctcatatttttcttgccatgactcagattctaattttaagctattcaatttctctttg
atc
The gene should been correctly identified with the ATG and the stop codon TGA
appropriately marked.
6.2.1.4 Test Procedure
The input file containing the DNA sequence is entered in the space provided and the program
is run. The output that is obtained is matched with the GENBANK data to verify if the gene
has been correctly identified.
6.2.1.5 Test Results
The gene has been correctly identified by the program with the ATG and TGA correctly
marked as verified from the GENBANK record.
6.2.2 Case-2
6.2.2.1 Purpose
To test if the gene has been correctly identified with the start and stop codon appropriately
marked when the sequence ends with the gene.
6.2.2.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a
file of .fasta format. The sequence is:
>gi|296148533|ref|NM_001183937.1| Saccharomyces cerevisiae S288c Rny1p (RNY1) mRNA, complete cds
ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT
ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG
TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA
AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG
ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA
TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA
AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA
TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG
GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC
AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA
TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT
GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG
AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT
TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC
GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG
GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA
GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA
ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT
TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC
TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
6.2.2.3 Expected Output & Pass/ Fail Criteria
The expected output is:
Gene Recognition: (upper case represents the Gene/s)
ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT
ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG
TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA
AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG
ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA
TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA
AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA
TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG
GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC
AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA
TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT
GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG
AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT
TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC
GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG
GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA
GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA
ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT
TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC
TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
The entire sequence should be in capitals since it is the entire gene.
6.2.2.4 Test Procedure
The input file containing the DNA sequence is entered in the space provided and the program
is run. The output that is obtained is matched with the GENBANK data to verify if the gene
has been correctly identified.
6.2.2.5 Test Results
The gene has not been correctly identified by the program with the ATG and TAA marked as
verified from the GENBANK record. Hence, this is a fail case.
Chapter 7
CONCLUSION & FUTURE ENHANCEMENTS
In the previous chapters are described the model that we have used to find genes in the given
DNA sequences. Our algorithm has been tested on Saccharomyces cerevisiae, Drosophila
melanogaster, Zea Mays (maize) and Homo Sapiens sequences. It has been found to work
partially on the sequences of Homo Sapiens and achieve about 80-90% success for the other
sequences.
This field has a lot of future work to be done. The genes of many other organisms still need
to be identified and with greater efficiency. The partial genes also have not been able to be
identified using this program and a significant future enhancement would be to be able to
identify partial genes. Multiple genes have been taken care of to a certain extent. Also, the
recognition of promoter sequences is a daunting task and future enhancements can also take
care of this. The introns and exons also need to be identified with greater efficiency.
Chapter 8
BIBLIOGRAPHY & REFERENCES
[1] Rajeev K. Azad, Mark Borodovsky, Probabilistic methods of identifying genes in
prokaryotic genomes: Connections to the HMM theory, Henry Stewart Publications 1477-
4054. Briefings in bioinformatics. Vol 5. No 2. 118–130. June 2004
[2] Alexandre Lomsadze, Vardges Ter-Hovhannisyan, Yury O. Chernoff, Mark Borodovsky,
Gene identification in novel eukaryotic genomes by self-training algorithm, 6494–6506
Nucleic Acids Research, 2005, Vol. 33, No. 20
[3] David Sankoff, The Early Introduction Of Dynamic Programming Into Computational
Biology, Oxford University Press, 2000, Vol 16 no1, Pages 41-47
[4] P.S. Novichkov, M.S. Gelfand, A.A. Mironov, Gene Recognition in Eukaryotic DNA by
Comparison of Genomic Sequences, Oxford University Press, 2001, Vol 17 no11, Pages
1011-1018
[5] Chris Burge, Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic
DNA, J. Mol. Biol. (1997) 268, 78-94
[6] Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, Owen White, Microbial gene
identification using interpolated Markov models, Nucleic Acids Research, 1998, Vol. 26, No.
2, 544-548
[7] Sanja Rogic, B.F. Francis Ouellette, Alan K. Mackworth, Improving gene recognition
accuracy by combining predictions from two gene-finding programs, Oxford University
Press, 2002, Vol 18 no8, Pages 1034-1045
[8] Andrey A. Mironov, Pavel S. Novichkov, Mikhail S. Gelfand, Pro-Frame: similarity-
based gene recognition in eukaryotic DNA sequences with errors, Oxford University Press,
2001, Vol 17 no 1, Pages 13-15
[9] Y.V. Kondrakhin, A.E. Kel, N.A. Kolchanov, A.G. Romashchenko, L. Milanesi,
Eukaryotic Promoter Recognition by Binding Sites for Transcription Factors, Computer
Applications in Biosciences 11: 477-488, 1995
[10] Anton M.Shmatkov, Arik A.Melikyan, Felix L.Chernousko, Mark Borodovsky, Finding
Prokaryotic Genes by the ‘frame-by-frame’ Algorithm: Targeting Gene Starts and
Overlapping Genes, Bioinformatics, 15: 874-886, 1999
[11] Peter M. Hooper, Haiyan Zhang, David S. Wishart, Prediction of Genetic Structure in
Eukaryotic DNA using Reference Point Logistic Regression and Sequence Alignment,
Computer Applications in Biosciences, 16: 425 – 438, 2000
[12] Christopher Burge, Identification Of Genes In Human Genomic DNA, March 2007
[13] Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human
DNA, 308-761 Machine Learning Projec,t Kaleigh Smith, January 17, 2002
[14] Ion I. Mandoiu and Alexander Zelikovsky, Bioinformatics Algorithms-Techniques and
Applications, John Wiley & Sons, 2008
[15] Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms, The
MIT Press, 2004
[16] David W. Mount, Bioinformatics- Sequence and Genome Analysis, Cold Spring Harbor
Laboratory Press
[17] Valeria De Fonzo, Filippo Aluffi-Pentini, Valerio Parisi, Hidden Markov Models in
Bioinformatics, Current Bioinformatics, 2007, 2, 49-61
[18] Catherine Mathe, Marie-France Sagot, Thomas Schiex, Pierre Rouze, Current methods
of gene prediction, their strengths and weaknesses, Nucleic Acids Research, 2002, Vol. 30
No. 19 4103-4117