Gene Prediction
Increase in the recent 12 months
Motivation
• The genome sequencing projects are progressing very fast: human, mouse, rat, rice, Anopheles…
• The identification of genes is the next important step in the analysis of genomes.
• How can we identify genes in the sequence?
Gene Identification
• Identify genes coding for known proteins.
  - Only a few proteins are known.
• Identify genes based on homology with other genomes.
• Identify genes based on gene characteristics.
Gene Characteristics
• Differences between prokaryotic and eukaryotic gene characteristics:
  - Prokaryotes: the genome is more compact. Several genes may reside on the same mRNA in different reading frames.
  - Eukaryotes: a gene may contain introns. In the human genome, the average gene is ~27,800 b, an exon ~100 b, and an intron 100-30,000 b.
  - Promoter regions are different: in prokaryotes the signals are more conserved.
• Differences between different types of genes in the same genome. Each type has its own characteristics.
Prediction Approaches
• The problem of gene prediction is still very much open, even in well-studied genomes:
  - The number of genes in yeast keeps changing.
  - The identification of promoter regions in E. coli is considered a great challenge of bioinformatics.
• Next we consider prediction of the following:
  - Protein-coding genes (ORFs).
  - Functional RNA-coding genes.
ORF Finding
Open Reading Frames (ORFs): sequences that code for proteins.
• How can ORFs be detected?
  - All reading frames are checked.
  - Search for initiation and termination codons within a sequence.
• Are these codons totally conserved?
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
The genetic code is not totally conserved
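As a rough sketch of the search described above, the following scans all six reading frames for stretches that begin with an initiation codon and end at the first in-frame termination codon. It assumes the standard genetic code (ATG as the only start, TAA/TAG/TGA as stops), which, as noted, does not hold for every organism. The function names and the `min_codons` parameter are illustrative and are not taken from the NCBI ORF Finder.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

# Assumption: standard genetic code. Other codes use other start/stop sets.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples.

    Coordinates are 0-based, end-exclusive, and on the "-" strand they
    refer to the reverse-complemented sequence. The stop codon is
    included in the reported ORF and in the codon count.
    """
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return results

print(find_orfs("ATGAAATAA"))
```

Real ORF finders usually apply a much larger minimum length, since short ORFs occur by chance in any sequence.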
Prediction of Protein-Coding Genes
• Three types of post-transcriptional events make prediction difficult:
  - variations in the genetic code; alternative splicing; RNA editing.
• DNA is not a random choice of possible codons for each amino acid. It is an ordered list of codons that reflects evolutionary origin and constraints related to gene expression.
• Each species has its own coding preferences – codon usage.
The Genetic Code
• The genetic code - each amino acid is coded by 3 nucleotides, called a codon.
• Code redundancy - most amino acids are coded by several codons.
  - 64 triplets code for 20 amino acids and 3 stop codons.
General Codon Preferences
• Codon usage differs between highly and weakly expressed genes.
• In E. coli, genes were divided into 3 groups based on their codon usage:
  - regular genes (70%)
  - highly expressed genes (15%)
  - horizontally transferred genes (15%)
• There are strong preferences in ORFs for specific codon pairs and for specific codons near terminators.
• The base in the third position of each codon tends to repeat itself within the same ORF.
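Codon preferences of this kind are usually estimated by simple counting. The sketch below, with illustrative function names, computes relative codon frequencies over a set of in-frame ORFs and the base composition at the third codon position of a single ORF. It assumes each ORF string starts at codon position 1 and contains only A, C, G and T.

```python
from collections import Counter

def codon_usage(orfs):
    """Relative frequency of each codon over a list of in-frame ORFs."""
    counts = Counter()
    for orf in orfs:
        for i in range(0, len(orf) - 2, 3):
            counts[orf[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def third_base_composition(orf):
    """Fraction of each base at the third codon position of one ORF."""
    thirds = Counter(orf[2::3])
    total = sum(thirds.values())
    return {base: n / total for base, n in thirds.items()}

usage = codon_usage(["ATGAAAAAATAA"])
print(usage["AAA"])
print(third_base_composition("ATGAAAAAATAA"))
```

A strongly skewed third-position composition within a candidate ORF is one way to quantify the "third base repeat" tendency without any prior knowledge of the species.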
Signal Based Identification
• Prokaryotes: signals such as the RBS (Ribosome Binding Site, Shine-Dalgarno) are conserved. It is located at about -15, upstream of the AUG (in B. subtilis the RBS is AGGAGG).
• Eukaryotes:
  - Transcription signals: TATA box (~-30 relative to the TSS), cap signal, polyadenylation site. Any signal may be missing.
  - Translation signals: Kozak signal (immediately upstream of the ATG), termination codon.
  - Splicing signals: the spliceosome recognizes donor and acceptor sites (introns usually start with GT and end with AG) and the branch point inside the intron.
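For the prokaryotic case, a signal-based check can be as simple as looking for the RBS motif in a short window upstream of each candidate start codon. The sketch below assumes an exact match to AGGAGG and a 20-base window. Both the window size and the function name are illustrative choices, and real RBS sites often deviate from the consensus.

```python
import re

def atg_with_rbs(seq, rbs="AGGAGG", max_upstream=20):
    """Return (rbs_start, atg_start) pairs for ATGs preceded by an RBS.

    Only exact matches to the motif within `max_upstream` bases before
    the ATG are reported (assumption: no mismatches allowed).
    """
    hits = []
    for m in re.finditer("ATG", seq):
        atg = m.start()
        window_start = max(0, atg - max_upstream)
        pos = seq[window_start:atg].rfind(rbs)  # RBS closest to the ATG
        if pos != -1:
            hits.append((window_start + pos, atg))
    return hits

print(atg_with_rbs("CCAGGAGGTTTTTTTATGAAA"))
```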
Prediction Reliability Tests
Where no experimental verification is available, reliability of prediction can be measured by:
1. Third base repeat in an ORF - does not require any prior knowledge
2. Codon usage - requires prior knowledge per species.
3. Predicted-protein sequence comparison - if homologs are found, the prediction is more reliable. Homologs can be searched for in protein databases, EST databases, cDNA databases, etc. The quality of the results depends on the quality of the database (ESTs are error prone).
Computational Approaches to Prediction
• Gene prediction is carried out by various computational methods, including decision trees, neural nets, Markov models and hidden Markov models (HMMs).
• A model is trained on known genes and then applied to genomic sequences.
• Each genome defines its own model.
Markov Models - Probabilistic Approach
• Markov model: can be described by states and the probabilities of transition from one state to the next.
• Markov chain: progresses in steps; each step corresponds to a move between states. The probability of being at state X in step i depends only on the state reached at step i-1.
• It has been found that ORFs have a reading-frame-specific hexamer (6-mer) composition.
  => The probability of the 6th base can be computed using the previous 5.
  => The probability that a sequence is an ORF in a specific reading frame can be computed.
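The hexamer idea can be sketched as a frame-specific 5th-order Markov model: count how often each base follows each 5-base context at each codon position in known ORFs, then score a new sequence by summing log probabilities. This is a minimal illustration with add-one smoothing (the pseudocount is an assumption), not the model used by any particular gene finder, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_periodic_model(orfs, pseudocount=1.0):
    """Estimate P(base | codon position, previous 5 bases) from in-frame ORFs."""
    counts = defaultdict(Counter)
    for orf in orfs:
        for i in range(5, len(orf)):
            counts[(i % 3, orf[i - 5:i])][orf[i]] += 1

    def prob(codon_pos, context, base):
        row = counts.get((codon_pos, context), Counter())
        # Smoothing over the 4 bases keeps unseen hexamers at nonzero probability.
        return (row[base] + pseudocount) / (sum(row.values()) + 4 * pseudocount)

    return prob

def log_likelihood(seq, prob, frame=0):
    """Log probability of seq, where `frame` is the codon position of seq[0]."""
    return sum(
        math.log(prob((i + frame) % 3, seq[i - 5:i], seq[i]))
        for i in range(5, len(seq))
    )

prob = train_periodic_model(["ATGGCCGCCGCCGCCTAA"])
seq = "ATGGCCGCCGCCGCCTAA"
print(log_likelihood(seq, prob, frame=0) > log_likelihood(seq, prob, frame=1))
```

Comparing the score across the three possible frames, and against a model trained on non-coding sequence, gives a measure of how ORF-like a region is in each frame.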
Finding the genes in genomic DNA. Chris Burge and Samuel Karlin. Curr Opin Struct Biol. 1998 Jun;8(3):346-54. Review.
Markov Models – Figure Legend
Circles represent DNA bases or states. Numbers indicate codon positions. Arrows indicate dependency.
a) Three-periodic 5th-order Markov model. The next base is generated conditionally on the previous 5 bases and on the codon position.
b) Homogeneous 5th-order Markov model.
c) Hidden Markov model. Upper circles represent hidden states, corresponding to whether the position is coding or non-coding; upper arrows indicate that the states are generated according to a first-order Markov model. Lower circles correspond to DNA bases; lower arrows indicate that each base is generated conditionally on the identity of the hidden state.
d) As c), with variable lengths of the hidden states.
Prediction of Complete Gene Structures in Human Genomic DNA. Chris Burge and Samuel Karlin. J Mol Biol. 1997 Apr 25;268(1):78-94.
Gene Prediction Tools
• Glimmer at TIGR (The Institute for Genomic Research).
• GeneMark at Georgia Tech.
• Grail at Oak Ridge National Laboratory.
• Genefinder at Baylor College of Medicine.
• Genscan at MIT.
Prediction tools are compared using two criteria:
• Sensitivity - % of true predicted genes out of the true genes in the genome.
• Specificity - % of true predicted genes out of the total number of predicted genes.
Both need to be high (correlation tests ~0.7-0.9).
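The two criteria can be computed directly from a set of annotated genes and a set of predicted genes. The sketch below assumes each gene is represented by an exact (start, end) pair and that a prediction counts as true only if it matches an annotated gene exactly; real evaluations often score at the nucleotide or exon level instead. Note that specificity as defined here is what is elsewhere often called precision.

```python
def sensitivity_specificity(true_genes, predicted_genes):
    """Return (sensitivity, specificity) as fractions between 0 and 1.

    sensitivity = true predictions / true genes
    specificity = true predictions / all predictions
    """
    true_set, pred_set = set(true_genes), set(predicted_genes)
    true_predictions = len(true_set & pred_set)
    sensitivity = true_predictions / len(true_set) if true_set else 0.0
    specificity = true_predictions / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity

annotated = [(0, 9), (20, 50), (60, 90), (100, 130)]
predicted = [(0, 9), (20, 50), (200, 230)]
print(sensitivity_specificity(annotated, predicted))
```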
The General Scheme
1. Obtain new genomic DNA sequence.
2. a) Translate in all 6 reading frames and compare to protein databases.
   b) Perform a database similarity search against an expressed sequence tag (EST) database of the same organism, or cDNA sequences if available.
3. Use a gene prediction program to locate genes.
4. Analyze regulatory sequences in the gene (signals). This can help characterize putative genes.
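Step 2a of the scheme above, translation in all 6 reading frames, can be sketched as follows. The codon table is the standard genetic code built from the usual TCAG ordering; using it for every organism is an assumption. The frame labels ("+1" to "-3") and function names are illustrative, and the resulting protein strings would then be passed to a database search tool, which is not shown.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def translate(seq):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return a dict mapping frame labels '+1'..'+3', '-1'..'-3' to proteins."""
    frames = {}
    for sign, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(s[offset:])
    return frames

print(six_frame_translation("ATGAAATAA"))
```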
Functional RNA Genes
• RNA genes are transcribed but not translated, so no codon preference exists. How can rRNA, tRNA and small RNA genes be predicted?
• Promoter regions can be characterized, but remain a big challenge.
• RNA secondary structure is important. It can be predicted using RNA structure prediction tools (e.g., MFOLD).
Characteristics of E. coli promoters
[Figure: promoter layout, 5' to 3'] -35 hexamer (TTGACA), spacer of 15 to 19 bases, -10 hexamer (TATAAT), interval of 5 to 9 bases, transcription start site.
• Actual promoters exhibit large sequence variation.
• Upon predicting promoters, known ones are missed and many false ones emerge.
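A naive scan for this promoter layout looks for a -35 hexamer and a -10 hexamer separated by a spacer of 15 to 19 bases, tolerating a few mismatches to the consensus. The mismatch threshold and function names below are illustrative assumptions; published predictors typically use weight matrices rather than mismatch counts. The sketch also makes the problem noted above concrete: a loose threshold produces many false hits, a strict one misses real promoters.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_promoters(seq, max_mismatches=2, box35="TTGACA", box10="TATAAT",
                   spacers=range(15, 20)):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters."""
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], box35)
        if m35 > max_mismatches:
            continue
        for spacer in spacers:
            j = i + 6 + spacer  # start of the putative -10 hexamer
            if j + 6 > len(seq):
                break
            total = m35 + mismatches(seq[j:j + 6], box10)
            if total <= max_mismatches:
                hits.append((i, j, spacer, total))
    return hits

seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGG"
print(scan_promoters(seq, max_mismatches=0))
```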
Characteristics of rho-independent terminators
[Figure: stem-loop followed by a run of U residues, 5' to 3']
• Stem of 5-10 base pairs, at least 60% GC.
• Loop of 3-8 bases.
• At least 4 U residues.
• Stem-loop structure free energy below -7 kcal/mole.
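The sequence-level criteria above can be checked with a simple scan, sketched below on DNA (so the U run appears as T). It is a deliberately simplified assumption-laden version: only perfectly paired stems are accepted, and the free-energy criterion is not evaluated, since that requires an RNA folding tool. Function and parameter names are illustrative.

```python
def reverse_complement(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def find_terminators(seq, stem_range=range(5, 11), loop_range=range(3, 9),
                     min_gc=0.6, min_t=4):
    """Return (start, stem_length, loop_length) for candidate terminators.

    A candidate is a perfect inverted repeat (stem) around a short loop,
    with a GC-rich stem, immediately followed by a run of T residues.
    """
    hits = []
    for start in range(len(seq)):
        for stem in stem_range:
            left = seq[start:start + stem]
            if len(left) < stem:
                break
            if (left.count("G") + left.count("C")) / stem < min_gc:
                continue
            for loop in loop_range:
                r0 = start + stem + loop  # start of the right stem arm
                right = seq[r0:r0 + stem]
                tail = seq[r0 + stem:r0 + stem + min_t]
                if right == reverse_complement(left) and tail == "T" * min_t:
                    hits.append((start, stem, loop))
    return hits

print(find_terminators("CCGCCGCGAAAGCGGCTTTTT"))
```

Candidates that pass this filter would still need a folding-energy check before being treated as terminators.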
Identifying Small RNAs
• The role of small RNAs (sRNAs) is a hot topic in current biology. sRNA genes fill many regulatory functions, e.g., regulating translation of mRNA (antisense). They are hard to find experimentally.
• A group of researchers from the Hebrew University and from Sweden combined bioinformatic predictions with experimental verification (Argaman et al., Current Biology 2001).
• Based on 10 known sRNAs in E. coli they predicted 24 sRNAs, of which 14 were experimentally verified.
• Three subsequent studies identified ~20 more sRNA genes in E. coli.
Predictive scheme (Argaman et al., Current Biology 2001)
1. Locate “empty” regions in the E. coli genome, i.e., regions between ORFs (ORF A, ORF B, ORF C).
2. Search for promoter DNA sequences (-35 and -10) recognized by σ70 of RNA polymerase.
3. Identify rho-independent terminators (stem-loop followed by TTTT).
4. Extract sequences in which the distance between the promoter and the terminator is 50 to 400 bases.
5. Check sequences for conservation in other bacteria.
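The region and distance filters of this scheme can be sketched as interval arithmetic. The code below is an illustration only and not the published pipeline: it assumes ORFs are given as half-open (start, end) coordinates, that promoter and terminator positions have already been predicted by some other means, and it ignores strand and the conservation check. All names are hypothetical.

```python
def intergenic_regions(orfs, genome_length):
    """Return the gaps between ORFs as half-open (start, end) intervals."""
    regions, pos = [], 0
    for start, end in sorted(orfs):
        if start > pos:
            regions.append((pos, start))
        pos = max(pos, end)  # handles overlapping ORFs
    if pos < genome_length:
        regions.append((pos, genome_length))
    return regions

def candidate_srnas(regions, promoters, terminators, min_dist=50, max_dist=400):
    """Pair promoters with downstream terminators inside the same empty region."""
    candidates = []
    for r_start, r_end in regions:
        for p in promoters:
            for t in terminators:
                if r_start <= p < t <= r_end and min_dist <= t - p <= max_dist:
                    candidates.append((p, t))
    return candidates

regions = intergenic_regions([(100, 500), (1200, 2000)], genome_length=2500)
print(regions)
print(candidate_srnas(regions, promoters=[520, 900, 2100], terminators=[700, 1150, 2120]))
```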