Upload
myron-glenn
View
217
Download
0
Tags:
Embed Size (px)
Citation preview
Biological MotivationGene Finding in Eukaryotic Genomes
Anne R. HaakeRhys Price Jones
Recall from our previous discussion of gene finding in prokaryotes:
The major strategies in gene finding programs are to look for:
Signals/Features Content/Composition Similarity to known genes (BLAST!)
3 Major Categories of Information used in Gene Finding Programs Signals/features = a sequence pattern with
functional significance e.g. splice donor & acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites
Content/composition -statistical properties of coding vs. non-coding regions.
e.g. codon-bias; length of ORFs in prokaryotes; CpG islands GC content
Similarity-compare DNA sequence to known sequences in database
Not only known proteins but also ESTs, cDNAs
In Prokaryotic Genomes We usually start by looking for an ORF
A start codon, followed by (usually) at least 60 amino acid codons before a stop codon occurs
Or by searching for similarity to a known ORF Look for basal signals
Transcription (the promoter consensus and the termination consensus)
Translation (ribosome binding site: the Shine-Dalgarno sequence)
Look for differences in sequence content between coding and non-coding DNA GC content and codon bias
The Complicating factors in Eukaryotes
• Interrupted genes (split genes) introns and exons
• Large genomes• Most DNA is non-coding
introns, regulatory regions, “junk” DNA (unknown function)
About 3% coding• Complex regulation of gene expression • Regulatory sequences may be far away
from start codon
Some numbers to consider: Vertebrate genes average about 30Kb long
varies a lot Coding region is only about 1-2 Kb Exon sizes and numbers vary a lot
Average is 6 exons, each about 150 bp long An average 5’ UTR is about 750 bp An average 3’UTR is about 450 bp
(both can be much longer) There are huge deviations from all of these numbers
e.g. dystrophin is 2.4 Mb long ; factor VIII gene has 26 exons, introns are up to 32 Kb (one intron produces 2 transcripts unrelated to the gene!)
There are genes without introns: called single-exon or intronless genes
Eukaryotic Gene Structure
www.bio.purdue.edu/courses/biol516/eukgenestructure.gif
Given a long eukaryotic DNA sequence:
How would you determine if it had a gene?
How would you determine which substrings of the sequence contained protein-coding regions?
In prokaryotic genomes we usually start by looking for ORFs.
Is this a good approach for the eukaryotic genome? Why or why not?
So, what’s the problem with looking for ORFs?
“split” genes make it difficult to define ORFs
Where are the stops and stops? What problems do introns
introduce? What would you predict for the size
of ORFs? (you can’t with any certainty!)
Most Programs Concentrate on Finding Exons Exon: the region of DNA within a gene
that codes for a polypeptide chain or domain
Intron: non-coding sequences found in the structural genes
Splice Sites used to Define Exons
Splice donor (exon-intron boundary) and splice acceptor (intron-exon boundary)
are consensus sequences A statistical determination of the
pattern;approximates the pattern C(orA)AG/GTA(orG)AGT "donor" splice
site T(orC)nNC(orT)AG/G "acceptor" splice
site
Gene finding programs look for different types of exon
single exon genes: begin with start codon & end with stop codon
initial exons: begin with start codon & end with donor site
internal exons: begin with acceptor & end with donor
terminal exons: begin with acceptor & end with stop codon
How are correct splice sites identified?
There are many occurrences of GT or AG within introns that are not splice sites
Statistical profiles of splice sites are used
http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm
Other Biologically Important Signals Used in Gene Finding Programs
Transcriptional Signals Transcription Start: characterized by cap signal
A single purine (A/G) TATA box (promoter) at –25 relative to start Polyadenylation signal: AATAAA (3’ end)
Major Caveat: not all genes have these signals
Makes it difficult to define the beginning and end of a gene
Upstream Promoter Sites
Transcription Factor (TF) sites Transcription factors are sequence-specific DNA-
binding proteins Bind to consensus DNA sequences e.g. CAAT transcription factor and CAAT box
Many of these Vary in sequence, location, interaction with
other sites Further complicates the problem of delineating
a “gene”
Translation Signals
Kozak sequence The signal for initiation of translation
in vertebrates Consensus is GCCACCatgG
And of course.. Translation stop codons
Codon Biasin Eukaryotic Genomes
Yeast Genome: arg specified by AGA 48% of time (other five equivalent codons ~10% each)
Fruitfly Genome: arg specified by CGC 33% of time (other five ~13% each)
GC Content in Eukaryotes
Overall GC content does not vary between species as it does in prokaryotes
GC content is still important in gene finding algorithms CpG Islands
CG dinucleotides occur at low frequency overall in the genome
Exception: CpG islands near promoters CG dinucleotides occur at level predicted by chance -1,500 to +500 (relative to transcription start site)
CpG Islands
Occurrence related to methylation Methylation of C in CG dinucleotides Methylation of C makes CpG prone
to mutation (e.g. to TpG or CpA) Level of methylation is low in
actively transcribed genes Transcription requires a methyl-free
promoter
Gene Finding Strategies
Homology-based approach Find sequences that are similar to
known gene sequences ab initio-based approach is to
identify genes by: Signal sequences Composition
List of Gene Finding Programs
http://www.hku.hk/bruhk/sggene.html
Homology-Based Approaches in Eukaryotic Genomes
More complicated than prokaryotes due to split genes
Genome sequence -> first identify all candidate exons
Use a spliced alignment algorithm to explore all possible exon assemblies & compare to known e.g. Procrustes
Limitations: must have similar sequence in the database with
known exon structure Sensitive to frame shift errors
Procrustes Gene Recognition via spliced alignment Given a genomic sequence and a set of
candidate exons, the spliced alignment algorithm explores all possible exon assemblies and finds a chain of exons with the best fit to a related target protein
http://hto-13.usc.edu/software/procrustes/#salign
GenScan
Allows integration of multiple types of information
Earlier programs considered features of gene structure in isolation
Uses a generalized HMM (one state might use a weight matrix model, another an HMM)
http://genes.mit.edu/GENSCAN.html
Probabilistic Model of Genes Accounts for many of the known structural &
compositional properties of genes including: typical gene density typical number of exons per gene distribution of exon sizes for different types of exon compositional properties of coding vs. non-coding translation initiation (Kozak) termination signals TATA box, cap site and poly-adenylation signals donor and acceptor splice sites
GenScan
GenScan Uses as a training set 238 multi-exon
genes and 142 single-exon genes from GenBank to compute parameters
Initial state probabilities Transition probabilities State length distributions
GenScan Probabilistic models for the states
The states correspond to different functional units on a gene e.g promoter regions, exon
Transitions ensure that the order that the model marches through the states is biologically consistent
Length distributions take into account that different functional units have different lengths.
GenScan Signal models used by GenScan - WMM= weight matrix model
for transcriptional and translational signals (translation initiation, polyadenylation signals, TATA box etc.) e.g. polyadenylation signal is modeled as a 6 bp WMM with AATAAA as the consensus sequence
-WAM= weight array model; assumes some dependencies between adjacent positions in the sequence e.g. used for the pyrimidine-rich region and the
splice acceptor site-Maximal dependency decomposition
e.g. used for donor splice sites
GenScan does not use similarity search uses double stranded genomic sequence
model potential genes on both strands are
analysed simultaneously
Limitations: cannot handle overlapping transcription unit does not address alternative splicing
GRAIL GRAIL (Gene Recognition and Assembly
Internet Link) uses a number of sensor algorithms to
evaluate coding potential of a DNA sequence features include 6-mer composition, GC
composition and splice junction recognition the output of the sensor algorithms is input
to a neural network, which uses empirical data for training.
GRAIL-exp
http://compbio.ornl.gov/grailexp/gxpfaq1.html