Gene Recognition

Gene Recognition

A project report submitted to

M S Ramaiah Institute of TechnologyAn Autonomous Institute, Affiliated to

Visvesvaraya Technological University, Belgaum

in partial fulfillment for the award of the degree of

Bachelor of Engineering in Computer Science & Engineering

Submitted by

Mudra Hegde 1MS07CS052Nakul G V 1MS07CS053

Under the guidance of

Veena G SAssistant Professor

Computer Science and EngineeringM S Ramaiah Institute of Technology

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

M.S.RAMAIAH INSTITUTE OF TECHNOLOGY(Autonomous Institute, Affiliated to VTU)

BANGALORE-560054www.msrit.edu

May 2011

http://www.msrit.edu/

Gene Recognition

A project report submitted to

M. S. Ramaiah Institute of TechnologyAn Autonomous Institute, Affiliated to

Visvesvaraya Technological University, Belgaum

in partial fulfillment for the award of the degree of

Bachelor of Engineering in Computer Science & Engineering

Submitted by

Mudra Hegde 1MS07CS052Nakul G V 1MS07CS053

Under the guidance of

Veena G SAssistant Professor

Computer Science and EngineeringM S Ramaiah Institute of Technology

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

M. S. RAMAIAH INSTITUTE OF TECHNOLOGY(Autonomous Institute, Affiliated to VTU)

BANGALORE-560054www.msrit.edu

May 2011

http://www.msrit.edu/

Department of Computer Science & Engineering

M. S. Ramaiah Institute of Technology(Autonomous Institute, Affiliated to VTU)

BANGALORE-560054

CERTIFICATE

This is to certify that the project work titled Gene Recognition is carried out by

1MS07CS052 Mudra Hegde and 1MS07CS053 Nakul G V in partial fulfillment for the

award of degree of Bachelor of Engineering in Computer Science and Engineering during the

year 2011. The Project report has been approved as it satisfies the academic requirements

with respect to the project work prescribed for Bachelor of Engineering Degree. To the best

of our understanding the work submitted in this report has not been submitted, in part or full,

for the award of any diploma or degree of this or any other University.

Veena G S (Dr.R. Selvarani) Guide Head, Dept of CSE

(External Examiner)

Department of Computer Science & Engineering

M. S. Ramaiah Institute of Technology(Autonomous Institute, Affiliated to VTU)

BANGALORE-560054

DECLARATION

We hereby declare that the entire work embodied in this report has been carried out by us at

M. S. Ramaiah Institute of Technology under the supervision of Veena G S. This report has

not been submitted in part or full for the award of any diploma or degree of this or any other

University.

1MS07CS052 MUDRA HEGDE1MS07CS053 NAKUL G V

Abstract

A gene consists of three regions, viz, a promoter region, a coding region and a terminator

region. Prokaryotic genes are relatively easy to find compared to Eukaryotic genes because

they lack introns. Genes that are expressed usually have introns that interrupt the coding

sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in

mature mRNA (exons) interrupted by introns. The recognition of the promoter regions in a

eukaryotic genome is a daunting task. There is a lot of sequencing data that has been

generated and they need to be annotated. It helps in having a better understanding of the

organism, in drug discovery and also for finding a cure for various genetic disorders.

Gene Finding or gene prediction in DNA has become one of the foremost computational

biology problems for two reasons. Firstly, because completely sequenced genomes have

become readily available and most importantly, because of the need to extract actual

biological knowledge from this data to explain the molecular interactions that occur in cells

and to define important cellular pathways. Discovering the location of genes on the genome

is the first step towards building such a body of knowledge.

In our work we try to find the coding and non-coding regions of an unlabeled string of DNA

nucleotides using Hidden Markov model. A Hidden Markov Model is a generalization of a

Markov chain, in which each (“internal”) state is not directly observable (hence the term

hidden) but produces (“emits”) an observable random output (“external”) state, also called

“emission”, according to a given stationary probability law. HMM employs dynamic

programming algorithms like Viterbi, Forward-Backward algorithms to aid in gene

recognition.

i

ACKNOWLEDGEMENT

We consider ourselves privileged to express gratitude and respect towards all those who

guided us through the completion of this project.

Firstly, we would like to thank Dr. R. Selva Rani, Head of Department, Department of

Computer Science and Engineering, MSRIT, Bangalore for giving us the opportunity to do

this project and also for the excellent lab facilities provided.

We would like to express our gratitude to our project guide, Mrs. Veena G.S; we are

privileged to experience a sustained enthusiastic and involved interest from her side. We

would also like to express our gratitude to Dr K.G Srinivasa, Professor, Department of

Computer Science and Engineering for his support.

We would also like to sincerely thank Mr. Shashidhara H S, Associate Professor, Information

Science, for all the additional inputs he has given and also helped us gather more information

on the various aspects involved in this project. We would like to thank the faculty of the

Department of Computer Science and Engineering and the institute for extending a helping

hand at every juncture of need and making this possible.

Mudra Hegde

Nakul G.V

ii

LIST OF FIGURES

Figure 3.1 Use-case Diagram........................................................................................ 14

Figure 3.2 Flowchart..................................................................................................... 15

Figure 4.1 System Architecture..................................................................................... 18

Figure 4.2 Input Component.......................................................................................... 20

Figure 4.3 Preprocessing.................................................................................................21

Figure 4.4 Output Component........................................................................................22

Figure 4.5 Input Screen Shot1.........................................................................................23

Figure 4.6 Input Screen Shot2.........................................................................................24

Figure 4.7 Output Screen Shot........................................................................................25

Figure 5.1 Gene Model................................................................................................... 30

iii

Contents

Abstract i

Acknowledgements ii

List of Figures iii

Contents iv

1 Introduction

1.1 General Introduction………………………………………………011.2 Statement of the Problem………………………………………….011.3 Objectives of the project…………………………………………....021.4 Current Scope……………………………………………………...021.5 Future Scope……………………………………………………....02

2 Literature Survey

2.1 Prokaryotic Gene Structure……………………………………….032.2 Eukaryotic Gene Structure………………………………………..052.3 Hidden Markov Models…………………………………………...072.4 GENSCAN Algorithm…………………………………………....09

3 Software Requirements Specification

3.1 Introduction………….3.1.1 Purpose…….3.1.2 Scope of the Project…..

3.2 General Description….3.2.1 Project Perspective…3.2.2 End User Expectation….3.2.3 General Constraints…3.2.4 Assumptions and Dependencies…

3.3 Specific Requirements3.3.1 Functional Requirements…3.3.2 Non Functional Requirements….3.3.3 Software Requirements…3.3.4 Hardware Requirements..

3.4 Interface Requirements…3.4.1 User Interface…

3.5 Performance…

4 System Design4.1 Introduction and Design Overview4.2 System Architectural Design

4.2.1 Chosen System Architecture4.2.2 Discussion of Alternative Designs

4.3 Detailed Description of Components4.3.1 Input Component4.3.2 Preprocessing4.3.3 Computational Component4.3.4 Output Component

4.4 User Interface Design4.4.1 Description of User Interface4.4.2 Screen Images4.4.3 Objects and Actions

4.5 Test Plan4.5.1 Features to be Tested

5 Implementation5.1 Hidden Markov Models in Gene Recognition

6 Testing6.1 Introduction

6.1.1 System Overview6.1.2 Test Approach

6.2 Test Cases6.2.1 Case I

6.2.1.1 Purpose6.2.1.2 Input6.2.1.3 Expected Output & Pass/ Fail Criteria6.2.1.4 Test Procedure6.2.1.5 Test Results

6.2.2 Case II6.2.2.1 Purpose6.2.2.2 Input6.2.2.3 Expected Output & Pass/ Fail Criteria6.2.2.4 Test Procedure6.2.2.5 Test Results

7 Conclusion & Scope for Future Work8 Bibliography & References

Chapter 1

INTRODUCTION

1.1 General introduction

Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have

a well-defined nucleus and they have a single chromosome which is contained within a

nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a

well-defined nucleus with many chromosomes which are large and linear.


region. A promoter is a region of DNA that facilitates the transcription of a particular gene.

Promoters are located near the genes they regulate, on the same strand and typically upstream

(towards the 5' region of the sense strand). In order for the transcription to take place, the

enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a

gene. Promoters contain specific DNA sequences and response elements which provide a

secure initial binding site for RNA polymerase and for proteins called transcription factors

that recruit RNA polymerase.

The coding region of a gene is that portion of a gene's DNA or RNA, composed of exons,

that codes for protein. The region is bounded nearer the 5' end by a start codon and nearer the

3' end with a stop codon. The coding region in mRNA is bounded by the five prime

untranslated region and the three prime untranslated region, which are also parts of the exons.

1.2 Statement of the Problem

This project aims at finding the annotations of a genomic sequence i.e the coding regions of a

gene given an input DNA sequence which is not annotated.

1.3 Objective of the Problem

To find the coding and non-coding regions of an unlabelled string of DNA nucleotides.

1.4 Current Scope

Gene Recognition is still in the stages of research and certain algorithms like GENSCAN and

GrailEXP implement it with certain constraints.

1.5 Future Scope

The biggest challenge that the field of bioinformatics is facing is the huge amount of

unannotated data that it has to deal with. It has huge databases of sequences with it but does

not know what they represent. So, if the genes are annotated it can lead to a lot of mind-

boggling inventions. Newer and better drugs can be developed for various genetic disorders.

Any hereditary diseases can be detected in the offspring at a very early stage. It will also lead

to the better understanding of various organisms.

Chapter 2

LITERATURE SURVEY

2.1 Prokaryotic Gene Structure

The organisms can be broadly classified into Prokaryotes and Eukaryotes. Prokaryotes are

organisms that lack nucleus and membrane-bound organelles. Prokaryotes have a single

chromosome, contained within a nucleoid region rather than a membrane-bound nucleus, but

may also have various small circular pieces of DNA called plasmids spread throughout the

cell. Their gene structure is much simpler than the gene structure of Eukaryotic DNA.

The gene is the functional unit of the DNA. A gene is a unit of heredity in a living organism.

It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain

that has a function in the organism. All living things depend on genes, as they specify all

proteins and functional RNA chains. Genes hold the information to build and maintain an

organism's cells and pass genetic traits to offspring. A modern working definition of a gene is

"a locatable region of genomic sequence, corresponding to a unit of inheritance, which is

associated with regulatory regions, transcribed regions, and or other functional sequence

regions ". The prokaryotic gene is made up of three regions viz.

Promoter Region

Coding Region

Terminator Region

A promoter region is the part of the gene that facilitates the transcription process. Promoters

are located near the genes that they regulate, on the same strand and upstream (5’ region of

the sense strand). For transcription to take place, the enzyme, RNA polymerase has to bind to

a location near the gene. The promoter region contains specific DNA sequences that form the

binding site for the RNA polymerase. The promoter region in the prokaryotes consists of two

sequences. The first one, known as the Pribnow box, is the sequence of six nucleotides

TATAAT. The other sequence consists of the seven nucleotides TTGACAT.

The coding region is the exons (prokaryotic genes are devoid of introns) which form the part

of the DNA that is translated. The coding region starts with the initiation codon (ATG) and

end with the termination codon (TAG or TAA or TGA).

The terminator region is the region that marks the end of the gene or the operon on genomic

DNA for transcription (An operon is a functioning unit of the genomic material containing a

cluster of genes under the control of a single promoter).

Prokaryotic genes generally overlap with each other which make the detection of translation

initiation sites and the predictions of prokaryotic genes difficult.

Many gene finding programs for the prokaryotic genes have been developed, the earlier ones

being ECOPARSE, ORPHEUS, GeneMark.hmm and the more recent ones such as

GeneMark, GeneMarkS, EasyGene and GLIMMER. The GeneMark uses the Bayesian

method. GeneMark.hmm uses the modification of the Viterbi algorithm of HMM (Hidden

Markov Model) with duration to identify the most likely global path through hidden

functional states given the DNA sequence. The extension of the GeneMark algorithm

GeneMark.fba uses the forward-backward algorithm for local posterior decoding used in the

HMM theory. Given the DNA sequence, this program determines an a posteriori probability

for each nucleotide to belong to coding or non-coding region. Also, for any open reading

frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden

states that traverse the ORF as a coding region. Markov models of several orders were

combined in the ‘interpolated’ model for gene prediction in the Glimmer algorithm.

2.2 Eukaryotic Gene Structure

The eukaryotic organism, as opposed to a prokaryotic organism, has a well-defined nucleus

and membrane-bound organelles. In contrast to the prokaryotes, the eukaryotes have many

chromosomes that are large and linear. The chromosomes in eukaryotes are also packaged by

proteins into a structure known as the chromatin due to which very long chromosomes fit

into the nucleus.

The Eukaryotic gene is much more complex than a prokaryotic gene. The eukaryotic gene

consists of

Promoter Region

Exons

Introns

Terminator Region

The Start Site contains a sequence of 7 bases (TATAAAA) called the TATA box. The basal

or core promoter is found in all protein-coding genes. This is in sharp contrast to the

upstream promoter whose structure and associated binding factors differ from gene to gene.

Many different genes and many different types of cells share the same transcription factors

— not only those that bind at the basal promoter but even some of those that bind upstream.

What turns on a particular gene in a particular cell is probably the unique combination of

promoter sites and the transcription factors that are chosen.

Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically

lie upstream of the gene and can have regulatory elements several kilobases away from the

transcriptional start site (enhancers). In eukaryotes, the transcriptional complex can cause the

DNA to bend back on itself, which allows for placement of regulatory sequences far from the

actual site of transcription. Many eukaryotic promoters, between 10 and 20% of all genes,

contain a TATA box (sequence TATAAA), which in turn binds a TATA binding protein

which assists in the formation of the RNA polymerase transcriptional complex. The TATA

box typically lies very close to the transcriptional start site (often within 50 bases).

Eukaryotic promoter regulatory sequences typically bind proteins called transcription

factors which are involved in the formation of the transcriptional complex.

An exon is a nucleic acid sequence that is represented in the mature form of

an RNA molecule after either portions of a precursor RNA (introns) have been removed

by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-

splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-

coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the

sequence in the DNA or its RNA transcript.

Genes that are expressed usually have introns that interrupt the coding sequences. A typical

eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called

exons) interrupted by introns. The regions between genes are likewise not expressed, but may

help with chromatin assembly, contain promoters, and so forth.

Intron sequences contain some common features. Most introns begin with the sequence GT

(GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among

them. Intron sequences may be large relative to coding sequences; in some genes, over 90

percent of the sequence between the 5′ and 3′ ends of the mRNA is introns. RNA polymerase

transcribes intron sequences. This means that eukaryotic mRNA precursors must be

processed to remove introns as well as to add the caps at the 5′ end and polyadenylic acid

(poly A) sequences at the 3′ end.

2.3 Hidden Markov Models

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties

of observed real world data. A good HMM accurately models the real world source of the

observed data and has the ability to simulate the source. Machine Learning techniques based

on HMMs have been successfully applied to problems including speech recognition, optical

character recognition, and problems in computational biology. The main computational

biology problems with HMM-based solutions are protein family profiling, protein binding

site recognition and the gene finding in DNA.

Gene Finding or gene prediction in DNA has become one of the foremost computational






A basic Markov model of a process is a model where each state corresponds to an observable

event and the state transition probabilities depend only on the current and predecessor state.

This model is extended to a Hidden Markov model for application to more complex

processes, including speech recognition and computational gene finding. A generalized

Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output

symbols, a set of state transition probabilities and a set of emission probabilities. The

emission probabilities specify the distribution of output symbols that may be emitted from

each state. Therefore in a hidden model, there are two stochastic processes; the process of

moving between states and the process of emitting an output sequence. The sequence of state

transitions is a hidden process and is observed through the sequence of emitted symbols.

The field of computational biology involves the application of computer science theories and

approaches to biological and medical problems. Computational biology is motivated by

newly available and abundant raw molecular datasets gathered from a variety of organisms.

Though the availability of this data marks a new era in biological research, it alone does not

provide any biologically significant knowledge. The goal of computational biology is then to

elucidate additional information regarding protein coding, protein function and many other

cellular mechanisms from the raw datasets. This new information is required for drug design,

medical diagnosis, medical treatment and countless fields of research.

Many effective tools based on HMMs have been created for the purpose of gene finding.

Among the most successful tools are Genie, GeneID and HMMGene. Though each tool has a

slightly different model, they each use the technique of combining several specialized

submodels into a larger framework. The submodels correspond directly to different regions

of DNA defined according to their function in the process of gene transcription. Most of the

gene finding tools is hybrid models that include neural network components. In these tools,

instead of an HMM, a neural network models certain regions, such as splice sites. The overall

framework of an HMM-based gene finder combines the submodels into a larger model

corresponding to the organization of a gene in DNA and its functional roles.

2.4 GENSCAN Algorithm

Identifying genes in DNA sequences by computational methods is a topic on which a lot of

research has been made in the past few years. This problem deals with the precise sequence

determinants of transcription, translation and RNA splicing. Softwares for exon prediction

have become common in genome sequencing laboratories to identify genes in newly

sequenced regions.

Early approaches to the gene recognition concentrated on the prediction of individual

functional elements such as promoters, coding regions, splice sites and so on. But, the recent

approaches to gene finding focus on integrating all these factors. Some examples of such

approaches are: FGENEH, GENMARK, Gene ID, Genie, GeneParser and GRAIL II. Two

important limitations of the currently existing algorithms are that they make an assumption

that the input sequence has exactly one gene and the accuracy that is measured by

independent control sets may be less than what was actually presumed. The accuracy is such

that only 50% of the exons are actually identified. GENSCAN uses a general probabilistic

model for the human genomic sequences. The overall architecture that the model uses is the

Generalized Hidden Markov Model. This algorithm differs from the other algorithms in the

followings aspects:

i) A double-stranded DNA sequence is considered with potential genes on both the sides of the

DNA which are analyzed simultaneously and in an integrated fashion.

ii) The assumption that other algorithms have made, that the input sequence has exactly one

complete gene is not made here. This model considers the fact that an input sequence may

have a partial gene, a complete gene, multiple complete genes or no gene at all.

iii) It introduces a new method, Maximum Dependence Decomposition, to model the functional

signals in DNA sequences

Chapter 3

SOFTWARE REQUIREMENTS

SPECIFICATION

3.1 Introduction

Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have

a well-defined nucleus and they have a single chromosome which is contained within a

nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a

well-defined nucleus with many chromosomes which are large and linear.


region. Gene Recognition is a particularly difficult problem in Bioinformatics. The

complexity that the DNA sequences involve makes the task even more daunting. The

solution to this problem can be found to a certain extent by using Hidden Markov Models.

A Hidden Markov Model is a generalization of a Markov chain, in which each (“internal”)

state is not directly observable (hence the term hidden) but produces (“emits”) an observable

random output (“external”) state, also called “emission”, according to a given stationary

probability law. In this case, the time evolution of the internal states can be induced only

through the sequence of the observed output states. If the number of internal states is N, the

transition probability law is described by a matrix with N times N values; if the number of

emissions is M, the emission probability law is described by a matrix with N times M values.

A model is considered defined once given these two matrices and the initial distribution of

the internal states.

The most used algorithms in Hidden Markov Models are:

1. The Forward Algorithm: To find the probability of emission distribution (given a model)

starting from the beginning of the sequence.

2. The Backward Algorithm: find the probability of emission distribution (given a model)

starting from the end of the sequence.

3. Viterbi algorithm: To find the sequence of internal states that has, as a whole, the highest

probability. The most used algorithm is the Viterbi algorithm.

3.1.1 Purpose of the Project

The purpose of this project is to annotate the unannotated DNA sequences of various

species such as Saccharomyces cerevisiae, Homo sapiens etc.

3.1.2 Scope of the Project

This project aims to recognize the genes thus helping in better drug discovery and a

better understanding of the organisms.

3.2 General Description

3.2.1 Project Perspective

In this project we aim recognize genes in an unannotated sequence of DNA. The gene will be

recognized with the start and the stop codon appropriately marked.

3.2.2 End User Expectation

The output to be expected will be the recognized genomic sequence of a particular species

with all the parts of the gene properly identified such as the promoter, the introns, exons and

the terminator region.

3.2.3 General Constraints

We come across a large number of constraints in gene recognition. Since the start codon,

ATG, codes for the amino acid methionine it may appear even in the coding region. This

poses a problem in order to identify the start of the gene. The promoters help in identifying

the start of the gene. But, in eukaryotic organisms the promoter sequences are many and

quite complex. The stop codons, TAA, TGA and TAG, also occur in the intergenic region.

This also is considered as a constraint. Another constraint that we may come across is that

the model cannot be made to work for sequences of all the species.

3.2.4 Assumptions and Dependencies

The assumptions that we are making in our project is that the input sequence is complete and

does not contain any partial genes.

3.3 Specific Requirements

3.3.1 Functional Requirements

The functional requirements are as follows:

i) Take an unannotated input sequence and annotate it.

ii) Visualize the annotated output sequence.

3.3.2 Non-Functional Requirements

The performance required from the Forward and Backward Algorithm must be O( N2L)

where L is the length of the sequence and N stands for the number of states in the

model.

3.3.3 Software System Requirements

This gene recognition tool requires that the operating system be a Linux operating system

with Apache server.

3.3.4 Hardware System Requirements

Processor: Intel Pentium (1.8 GHz)

RAM: 1GB

Hard Disk: 80 GB

Cache: 2 MB L2 Cache

3.4 Interface Requirements

3.4.1 User Interface

We need one GUI for accepting the input sequence and more GUI for visualizing the output

annotated sequence.

Fig. 3.1 Use-case Diagram

Fig 3.2 Flowchart

3.5 Performance

The efficiency expected of the forward and backward algorithm is of the order of

O(N2L). At run time, warnings are issued when iterators follow an inconsistent or out-

of- bounds path, and when negative probabilities are encountered. Also, to improve the

efficiency, constant transition probabilities are pre-computed outside the main loop

whenever possible. Finally, all loops over transitions and states are unrolled, reducing

the number of run-time decisions and lookups.

Chapter 4

SYSTEM DESIGN

4.1 Introduction and Design Overview

This project deals with one of the most challenging problems in Bioinformatics i.e. Gene

Recognition. Gene prediction in DNA has become one of the foremost computational






This document deals with the design of the architecture used for this project. The problem of

Gene Recognition has been solved by using numerous methods. But, by far, Hidden Markov

Models are most widely used. Hidden Markov Models do not present the problems that occur

while Gene Recognition is being performed by pattern recognition.

The input will be a DNA sequence in FASTA format that is not annotated. It then undergoes

preprocessing in order to be checked for the correct format and to be copied on to a

temporary file without the description of the sequence. After this is done, the sequence goes

through the computation phase where the Viterbi, forward-backward algorithms are used to

find the optimal path, hence finding the gene. This output is then displayed to the user.

4.2 System Architectural Design

4.2.1 Chosen System Architecture

Fig 4.1 System Architecture

The system has four components:1. Input Component

2. Preprocessing

3. Computational Component

4. Output Component

4.2.2 Discussion of Alternative Designs

One of the alternative designs that were proposed for the recognition of genes was that of

pattern matching using parallel programming. It involves the recognition of start codons,

promoter sequences, terminator codons and intrinsic terminator sequences in a given DNA

sequence, hence recognizing the gene. But this design failed due to several reasons and it was

also found to be an inefficient way of locating genes.

The first step involved identifying the terminator codons in parallel by dividing the input

sequence into a fixed number of nucleotides. But, the major drawback here was that certain

sequences in the intergenic region also matched the terminator codons.

Another major drawback is that, the start codon, ATG, also codes for the protein methionine.

Hence, the sequence ATG may be present in the coding region also. This makes it difficult to

recognize the start of the gene using pattern matching.

Another difficulty that was encountered was regarding the promoter sequences. Prokaryotic

genes have two fixed promoter sequences that mark the start of a gene. But, when eukaryotic

genes are considered the complexity of the promoter sequences increases. The sequences are

extremely diverse and difficult to characterize. They lie several kilobases away from the

transcriptional start site which makes it highly inefficient to search for it using pattern

matching. The nucleotides in the consensus sequence may also vary from gene to gene.

Hence, it becomes very difficult for the consensus sequence to be searched for in the given

DNA sequence.

The termination of a gene may also be identified using sequences known as intrinsic

terminators. The intrinsic terminators are a sequence of inverted repeat (5’ CAGTTA|

TAACTG 3’) followed by up to six thymine nucleotides (TTTTTT). But, the complexity

involved in this would be that the inverted repeat may be of any length.

Partial genes would also pose a problem to recognize a gene using pattern recognition.

Hence, we planned to use Hidden Markov Models for gene recognition.

4.3 Detailed Description of Components

4.3.1 Input Component

The input component takes a text file containing DNA sequences in FASTA format as input.

This acts as the sequence that has to be annotated.

Fig 4.2 Input Component

4.3.2 Preprocessing

The description in the input file is removed and the remaining input, which is the DNA

sequence, is copied onto another temporary file. If the user enters data is any format other

than FASTA then appropriate error messages are displayed.

Fig 4.3 Preprocessing

4.3.3 Computational Component

The computational component involved in this project is the application of Hidden Markov

Models. The preprocessed sequence is taken as input and the Viterbi algorithm and forward-

backward algorithm are applied to it. The Viterbi algorithm is a dynamic programming

algorithm for finding the most likely sequence of hidden states, called the Viterbi path, which

will generate a given output sequence given the model parameters. The forward-backward

algorithm is an inference algorithm for hidden Markov models which computes the posterior

marginals of all hidden state variables given a sequence of observations/emissions.

4.3.4 Output Component

This component takes the annotated sequence from the computational component and

displays the gene.

Fig 4.4 Output Component

4.4 User Interface Design

4.4.1 Description of the User Interface

The user interface for this project involves page where the user may input the DNA sequence

or may upload a file containing the sequence. The output is also displayed on the same page.

4.4.2 Screen Images

Fig 4.5 Input Screen Shot 1

The user can enter his DNA sequence in the box provided with the label “Enter the input

sequence in FASTA format”. If the user inputs a sequence that is not in FASTA format a

dialog box opens with an error message asking the user to input the sequence in the correct

format.

Fig 4.6 Input Screen Shot 2

If the user wants to upload a file containing the input sequence instead of directly inputting

the sequence he may do so by clicking the “Upload File” button. This opens a box wherein

the user may type the name of the file that contains the input sequence.

Fig 4.4 Output Screen Shot

Once the user inputs the sequence he may click “Submit” to obtain the annotated sequence in

the box provided below “The annotated sequence is”.

4.4.3 Objects and Actions

The objects present on the user interface are:

Text Area1: To enter the input sequence directly. The text area is preceded by a statement

“Enter the input sequence in FASTA format”.

“Upload File” button: If the user does not want to input the DNA sequence directly he may

use this button to upload a file that contains the input sequence. When this button is clicked

a text area appears where the user may enter the file name.

“Submit” button: This button may be clicked when the sequence or the file has to be

submitted for annotating the sequence that they contain. Once this button is clicked the

sequence gets annotated and the output is displayed.

Text Area2: This text area is used to display the output i.e. the annotated sequence. It is

preceded by “The annotated output sequence is”.

“Exit” button: This button is used to exit the screen.

4.5 Test Plan

4.4.1 Features to be Tested

The features to be tested are as follows:

Features to be

tested

Input Expected Output

Input in FASTA format Sequence entered by the user No- Error messages should

be displayed

Yes- Proceed

Preprocessing Sequence entered by the user No- Error

Yes- Temporary file

containing only the sequence

should be created

Start codon identification Temporary file with

sequence

ATG sequence should be

identified at the proper

position

Terminator codon

identification

Temporary file with

sequence

TAA/TAG/TGA should be

identified at the correct

position

Exons and Introns

identification

Temporary file with

sequence

Exons and introns should

follow the AG-GT rule and

be identified at the proper

position

Table 4.1 Features to be Tested

Chapter 5

IMPLEMENTATION

5.1 Hidden Markov Models in Gene Recognition

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties

of observed real world data. A good HMM accurately models the real world source of the

observed data and has the ability to simulate the source. Machine Learning techniques based

on HMMs have been successfully applied to problems including speech recognition, optical

character recognition, and problems in computational biology. The main computational

biology problems with HMM-based solutions are protein family profiling, protein binding

site recognition and the problem that is the topic of this paper, gene finding in DNA. Gene

finding or gene prediction in DNA has become one of the foremost computational biology

problems for two reasons. Firstly, because completely sequenced genomes have become

readily available and most importantly, because of the need to extract actual biological

knowledge from this data to explain the molecular interactions that occur in cells and to

define important cellular pathways. Discovering the location of genes on the genome is the

first step towards building such a body of knowledge.

A basic Markov model of a process is a model where each state corresponds to an observable

event and the state transition probabilities depend only on the current and predecessor state.

This model is extended to a Hidden Markov model for application to more complex

processes, including speech recognition and computational gene finding. A generalized

Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output

symbols, a set of state transition probabilities and a set of emission probabilities. The

emission probabilities specify the distribution of output symbols that may be emitted from

each state. Therefore in a hidden model, there are two stochastic processes; the process of

moving between states and the process of emitting an output sequence.

The problem of finding genes in DNA has been studied for many years. It was one of the first

problems tackled once sufficient genomic data became available. The problem is given a

sequence of DNA, determine the locations of genes, which are the regions containing

information that code for proteins. At a very general level, nucleotides can be classified as

belonging to coding regions in a gene, non-coding regions in a gene or intergenic regions.

The problem of gene finding can then be stated as follows:

Input: A sequence of DNA X = (x1….xn) € ∑*, where ∑= A,C, G, T.

Output: Correct labeling of each element in X as belonging to a coding region, non-coding

region or intergenic region.

Gene finding becomes complicated when the problem is approached in more biological

detail. A eukaryotic gene contains coding regions called exons which may be interrupted by

non-coding regions called introns. The exons and introns are separated by splice sites.

Regions outside genes are called intergenic. The goal of gene finding is then to annotate the

sets of genomic data with the location of genes and within these genes, specific areas such as

promoter regions, introns and exons.

The Hidden Markov Model that we have used in our project is a shown below:

Start Background Start Codon Gene Stop Endfull bgstart fullgenestop

extend

bgend

full

bgbg

emit background emit start emit codon emit stop empty

BLOCK 1 BLOCK 2 BLOCK 3

Fig 5.1 Gene Model

The Gene Model shown above consists of six states, viz, Start, Background, Start Codon,

Gene, Stop and End. These six states are divided into three blocks, viz, Block 1, Block 2 and

Block 3. Block 1 consists of just the Start state which is the start of scanning the sequence.

Block 2 consists of 4 states, Background, Start Codon, Gene and Stop. Block 3 consists of a

single state, End. These states constitute the finite set of states of the Hidden Markov Model.

The alphabet of output symbols in this HMM are the nucleotides A, T, G, and C.

The sequence moves from one state to another using state transition probabilities. The

transition from the Start state to the Background State has ‘full’ probability, i.e. 1.0. The

Background state has emission probability ‘emitbackground’. The Background state takes

care of the intergenic region. Hence the nucleotides encountered may be A, T, G or C.

Therefore, the emission probability for Background is emitbackground= 0.25. The model

remains in the Background state until a start codon is encountered. The transition probability

to remain in the Background state is ‘bgbg’ i.e. full-bgend-bgstart. Once the start codon ATG

is found, the model moves from Background state to Start Codon state.

The transition from Background state to Start Codon state has the probability ‘bgstart= Gene

Density’. The emission probability of Start Codon state is ‘emitstart’. If an ATG is

encountered then the emission probability is 1.0 otherwise it is considered as 0.0. Once a start

codon is encountered it leads to a gene, hence the model moves from the Start Codon state to

the Gene state. The transition probability for this transition is ‘full’ i.e. 1.0 since once the

start codon is found, the gene is found. There is more than a single codon that is encountered

when we are in the Gene state. Hence, the probability of staying in the Gene state is ‘extend=

1.0- (1.0/ Gene Length)’.

The emission probability for the Gene state is ‘emitcodon’. This state emits any of the 61

codons (except TAA, TGA and TAG) that code for the amino acids that form the protein.

Hence, if these codons are found the probability is 1/61 otherwise if stop codons are found

the emission probability is 0.0. Then, a transition to the next state is made.

The next state is the Stop Codon state which is entered if any of the stop codons (TAA, TAG

or TGA) are encountered. The transition probability from Gene state to the Stop state is

‘genestop= 1.0/ Gene Length’. Once the stop codons are encountered it marks the end of that

particular gene. But the DNA sequence may have more nucleotides. Hence, the transition

takes place from Stop state to Background state. This transition probability is ‘full’ i.e. 1.0.

The Stop state emits the stop codons, hence the emission probability of the state is 1/3 if a

stop codon is found otherwise it is 0.0.

Once the transition has taken place from Stop to Background the entire sequence is scanned

and finally a transition from Background to End state takes place with a transition probability

of ‘bgend=0.0001’. The End state belongs to Block 3 of the model and its emission is

‘empty’.

The algorithms used in the implementation of this project are Forward-Backward Algorithm

and the Viterbi Algorithm.

The forward-backward algorithm is an inference algorithm for hidden Markov models which

computes the posterior marginals of all hidden state variables given a sequence of

observations/emissions , i.e. it computes, for all hidden state variables

, the distribution . The algorithm makes use of the

principle of dynamic programming to efficiently compute the values that are required to

obtain the posterior marginal distributions in two passes. The first pass goes forward in time

while the second goes backward in time; hence the name forward-backward algorithm.

The forward algorithm is used to find the next likely state in the given finite set of states. In

our implementation the forward algorithm is implemented in a function called Forward

which returns the next likely state. In this function the 8 transitions are defined according to

our gene model. It scans the sequence in the forward direction and finds the likely states.

The backward algorithm is also used to find the next likely state in a given finite set of states

but it scans the sequence in the reverse direction. In our implementation the backward

algorithm has been implemented in the function called Backward. This function also returns

the next likely state.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely

sequence of hidden states, called the Viterbi path, that results in a sequence of observed

events, especially in the context of Markov information sources, and more generally, hidden

Markov models.

The algorithm makes a number of assumptions.

First, both the observed events and hidden events must be in a sequence. This

sequence often corresponds to time.

Second, these two sequences need to be aligned, and an instance of an observed event

needs to correspond to exactly one instance of a hidden event.

Third, computing the most likely hidden sequence up to a certain point t must depend

only on the observed event at point t, and the most likely sequence at point t − 1.

In the implementation we have implemented Viterbi algorithm in 2 functions, Viterbi_trace

and Viterbi_recurse. Viterbi_trace finds the most likely path in the forward direction whereas

Viterbi_recurse finds the path in the reverse direction. These two functions return the most

likely path followed by the sequence. Another function called addEdge is used to assign an

edge from the ‘from’ state to the ‘to’ state with transitions, probability and emissions as

parameters.

Chapter 6

TESTING

6.1 Introduction

6.1.1 System Overview

The implementation of our code has been done on the Linux operating system. The language

that has been used for coding is C++ . The front end has been developed using PHP.

6.1.2 Test Approach

The input sequences have been taken from the GENBANK database. The output is tested

manually in comparison to the details from the GENBANK database.

6.2 Test Cases

6.2.1 Case-1

6.2.1.1 Purpose

To test if the gene has been correctly identified with the start and stop codon appropriately

marked.

6.2.1.2 Input

The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a

file of .fasta format. The sequence is:

>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p

(AXL2) and Rev7p (REV7) genes, complete cds

GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACA

TGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAA

GCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACA

ACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATT

AGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCG

TCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGT

ACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCAC

AACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAA

CTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGA

AGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATC

TACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTAC

TATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCC

CCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGT

AGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAG

TTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTC

AATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTT

GTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTAT

GGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGAC

CGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCG

CCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATA

AACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTT

TCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAAT

AGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTAT

CTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGG

GTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAAT

CCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTG

TCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGT

TCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTAC

TAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAG

TGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAG

CTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACG

TCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAA

ATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTT

CATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAG

TAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTA

CCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTAC

ACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATT

GGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGA

TGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAG

AAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGG

TCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTG

TCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACT

TTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGG

ACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAAC

GTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAAT

CACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTT

TTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAA

ATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGA

TTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATAC

CCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTT

TGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAA

ATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAA

CCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCT

TCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCT

GTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTG

TCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCG

TCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTA

AAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGT

GCGTTAATTTAGAAAGAACATCCAGTATAAGTTCTTCTATATAGTCAATTAAAGCAGGATGCCTAT

TAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGG

TATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCT

ACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCG

CAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAAC

AGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAA

ATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTT

ATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAG

AATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAAT

AACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCAT

CACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAAT

CATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGA

ATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGC

TCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAG

CTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTT

CAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTT

TTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTT

CTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTG

ATC

6.2.1.3 Expected Output & Pass/ Fail Criteria

The expected output is:

Gene Recognition: (upper case represents the Gene/s)

gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacga

gcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatattta

ggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaaagacgcgaaaaaaaaagaacaa

cgcgtcatagaacttttggcaattcgcgtcacaaataaattttggcaacttatgtttcctcttcgagcagtactcgagccctgtctcaagaatgtaataatacccatcgta

ggtatggttaaagatagcatctccacaacctcaaagctccttgccgagagtcgccctcctttgtcgagtaattttcacttttcatatgagaacttattttcttattctttactct

cacatcctgtagtgattgacactgcaacagccaccatcactagaagaacagaacaattacttaatagaaaaattatatcttcctcgaaacgatttcctgcttccaacatc

tacgtatatcaagaagcattcacttaccATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACT

CCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAA

GAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAG

CTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTT

CTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTC

GAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGT

CCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAAC

GGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTC

ACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAAT

TGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATT

GCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAG

GTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATC

AACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGAT

CCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGAT

AATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTT

TCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGG

ATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTT

TTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCA

AGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATT

TCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACA

TCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGT

TCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCT

CCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATA

AAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTT

GCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATT

AGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAA

CCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAA

CACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAG

ATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAG

CAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGT

ATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCT

GATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGA

AGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACA

GCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCAT

CGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAAC

AATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCAT

AGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATG

TCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGAttatacgcaacgatattttgctt

aattttattttcctgttttattttttattagtggtttacagataccctatattttatttagtttttatacttagagacatttaattttaattccattcttcaaatttcatttttgcacttaaaa

caaagatccaaaaatgctctcgccctcttcatattgagaatacactccattcaaaattttgtcgtcaccgctgattaatttttcactaaactgatgaataatcaaaggcccc

acgtcagaaccgactaaagaagtgagttttattttaggaggttgaaaaccattattgtctggtaaattttcatcttcttgacatttaacccagtttgaatccctttcaatttctg

ctttttcctccaaactatcgaccctcctgtttctgtccaacttatgtcctagttccaattcgatcgcattaataactgcttcaaatgttattgtgtcatcgttgactttaggtaatt

tctccaaatgcataatcaaactatttaaggaagatcggaattcgtcgaacacttcagtttccgtaatgatctgatcgtctttatccacatgttgtaattcactaaaatctaaa

acgtatttttcaatgcataaatcgttctttttattaataatgcagatggaaaatctgtaaacgtgcgttaatttagaaagaacatccagtataagttcttctatatagtcaatta

aagcaggatgcctattaatgggaacgaactgcggcaagttgaatgactggtaagtagtgtagtcgaatgactgaggtgggtatacatttctataaaataaaatcaaat

taatgtagcattttaagtataccctcagccacttctctacccatctattcataaagctgacgcaacgattactattttttttttcttcttggatctcagtcgtcgcaaaaacgta

taccttctttttccgaccttttttttagctttctggaaaagtttatattagttaaacagggtctagtcttagtgtgaaagctagtggtttcgattgactgatattaagaaagtgg

aaattaaattagtagtgtagacgtatatgcatatgtatttctcgcctgtttatgtttctacgtacttttgatttatagcaaggggaaaagaaatacatactattttttggtaaag

gtgaaagcataatgtaaaagctagaataaaatggacgaaataaagagaggcttagttcatcttttttccaaaaagcacccaatgataataactaaaatgaaaaggattt

gccatctgtcagcaacatcagttgtgtgagcaataataaaatcatcacctccgttgcctttagcgcgtttgtcgtttgtatcttccgtaattttagtcttatcaatgggaatc

ataaattttccaatgaattagcaatttcgtccaattctttttgagcttcttcatatttgctttggaattcttcgcacttcttttcccattcatctctttcttcttccaaagcaacgatc

cttctacccatttgctcagagttcaaatcggcctctttcagtttatccattgcttccttcagtttggcttcactgtcttctagctgttgttctagatcctggtttttcttggtgtagt

tctcattattagatctcaagttattggagtcttcagccaattgctttgtatcagacaattgactctctaacttctccacttcactgtcgagttgctcgtttttagcggacaaag

atttaatctcgttttctttttcagtgttagattgctctaattctttgagctgttctctcagctcctcatatttttcttgccatgactcagattctaattttaagctattcaatttctctttg

atc

The gene should been correctly identified with the ATG and the stop codon TGA

appropriately marked.

6.2.1.4 Test Procedure

The input file containing the DNA sequence is entered in the space provided and the program

is run. The output that is obtained is matched with the GENBANK data to verify if the gene

has been correctly identified.

6.2.1.5 Test Results

The gene has been correctly identified by the program with the ATG and TGA correctly

marked as verified from the GENBANK record.

6.2.2 Case-2

6.2.2.1 Purpose

To test if the gene has been correctly identified with the start and stop codon appropriately

marked when the sequence ends with the gene.

6.2.2.2 Input

The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a

file of .fasta format. The sequence is:

>gi|296148533|ref|NM_001183937.1| Saccharomyces cerevisiae S288c Rny1p (RNY1) mRNA, complete cds

ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT

ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG

TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA

AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG

ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA

TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA

AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA

TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG

GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC

AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA

TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT

GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG

AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT

TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC

GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG

GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA

GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA

ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT

TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC

TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA

6.2.2.3 Expected Output & Pass/ Fail Criteria

The expected output is:

Gene Recognition: (upper case represents the Gene/s)

ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT

ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG

TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA

AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG

ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA

TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA

AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA

TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG

GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC

AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA

TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT

GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG

AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT

TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC

GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG

GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA

GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA

ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT

TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC

TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA

The entire sequence should be in capitals since it is the entire gene.

6.2.2.4 Test Procedure

The input file containing the DNA sequence is entered in the space provided and the program

is run. The output that is obtained is matched with the GENBANK data to verify if the gene

has been correctly identified.

6.2.2.5 Test Results

The gene has not been correctly identified by the program with the ATG and TAA marked as

verified from the GENBANK record. Hence, this is a fail case.

Chapter 7

CONCLUSION & FUTURE ENHANCEMENTS

In the previous chapters are described the model that we have used to find genes in the given

DNA sequences. Our algorithm has been tested on Saccharomyces cerevisiae, Drosophila

melanogaster, Zea Mays (maize) and Homo Sapiens sequences. It has been found to work

partially on the sequences of Homo Sapiens and achieve about 80-90% success for the other

sequences.

This field has a lot of future work to be done. The genes of many other organisms still need

to be identified and with greater efficiency. The partial genes also have not been able to be

identified using this program and a significant future enhancement would be to be able to

identify partial genes. Multiple genes have been taken care of to a certain extent. Also, the

recognition of promoter sequences is a daunting task and future enhancements can also take

care of this. The introns and exons also need to be identified with greater efficiency.

Chapter 8

BIBLIOGRAPHY & REFERENCES

[1] Rajeev K. Azad, Mark Borodovsky, Probabilistic methods of identifying genes in

prokaryotic genomes: Connections to the HMM theory, Henry Stewart Publications 1477-

4054. Briefings in bioinformatics. Vol 5. No 2. 118–130. June 2004

[2] Alexandre Lomsadze, Vardges Ter-Hovhannisyan, Yury O. Chernoff, Mark Borodovsky,

Gene identification in novel eukaryotic genomes by self-training algorithm, 6494–6506

Nucleic Acids Research, 2005, Vol. 33, No. 20

[3] David Sankoff, The Early Introduction Of Dynamic Programming Into Computational

Biology, Oxford University Press, 2000, Vol 16 no1, Pages 41-47

[4] P.S. Novichkov, M.S. Gelfand, A.A. Mironov, Gene Recognition in Eukaryotic DNA by

Comparison of Genomic Sequences, Oxford University Press, 2001, Vol 17 no11, Pages

1011-1018

[5] Chris Burge, Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic

DNA, J. Mol. Biol. (1997) 268, 78-94

[6] Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, Owen White, Microbial gene

identification using interpolated Markov models, Nucleic Acids Research, 1998, Vol. 26, No.

2, 544-548

[7] Sanja Rogic, B.F. Francis Ouellette, Alan K. Mackworth, Improving gene recognition

accuracy by combining predictions from two gene-finding programs, Oxford University

Press, 2002, Vol 18 no8, Pages 1034-1045

[8] Andrey A. Mironov, Pavel S. Novichkov, Mikhail S. Gelfand, Pro-Frame: similarity-

based gene recognition in eukaryotic DNA sequences with errors, Oxford University Press,

2001, Vol 17 no 1, Pages 13-15

[9] Y.V. Kondrakhin, A.E. Kel, N.A. Kolchanov, A.G. Romashchenko, L. Milanesi,

Eukaryotic Promoter Recognition by Binding Sites for Transcription Factors, Computer

Applications in Biosciences 11: 477-488, 1995

[10] Anton M.Shmatkov, Arik A.Melikyan, Felix L.Chernousko, Mark Borodovsky, Finding

Prokaryotic Genes by the ‘frame-by-frame’ Algorithm: Targeting Gene Starts and

Overlapping Genes, Bioinformatics, 15: 874-886, 1999

[11] Peter M. Hooper, Haiyan Zhang, David S. Wishart, Prediction of Genetic Structure in

Eukaryotic DNA using Reference Point Logistic Regression and Sequence Alignment,

Computer Applications in Biosciences, 16: 425 – 438, 2000

[12] Christopher Burge, Identification Of Genes In Human Genomic DNA, March 2007

[13] Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human

DNA, 308-761 Machine Learning Projec,t Kaleigh Smith, January 17, 2002

[14] Ion I. Mandoiu and Alexander Zelikovsky, Bioinformatics Algorithms-Techniques and

Applications, John Wiley & Sons, 2008

[15] Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms, The

MIT Press, 2004

[16] David W. Mount, Bioinformatics- Sequence and Genome Analysis, Cold Spring Harbor

Laboratory Press

[17] Valeria De Fonzo, Filippo Aluffi-Pentini, Valerio Parisi, Hidden Markov Models in

Bioinformatics, Current Bioinformatics, 2007, 2, 49-61

[18] Catherine Mathe, Marie-France Sagot, Thomas Schiex, Pierre Rouze, Current methods

of gene prediction, their strengths and weaknesses, Nucleic Acids Research, 2002, Vol. 30

No. 19 4103-4117

Documents

Gene Recognition