Gene Prediction 10/21/05
D Dobbs ISU - BCB 444/544X 1
10/21/05 D Dobbs ISU - BCB 444/544X: Gene Prediction 1
10/21/05
Gene Prediction
(formerly Gene Prediction - 3)
Announcements
Exam 2 - next Friday
Posted online: Exam 2 Study Guide
544 Reading Assignment (2 papers)
Announcements
544 Semester Projects - Information needed:
Please send email to me (or David) [email protected]
Briefly describe:
• Your background & current grad research
• Is there a problem related to your research you would like to learn more about & develop as a project for this course?
or
• What would your ‘dream’ project be?
Announcements
2 Bioinformatics Seminars today (Fri Oct 21)
12:10 PM BCB Faculty Seminar in E164 Lagomarcino
“Protein Networks”
Bob Jernigan, BBMB & Director, Baker Center for Bioinformatics & Biological Statistics
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2021
4:10 PM GDCB Special Seminar in 1414 MBB
“Integrating the Unknown-eome with Abiotic Stress Response Networks in Arabidopsis”
Ron Mittler, Dept. of Biochem & Mol Biology, University of Nevada, Reno
Gene Prediction & Regulation
Mon - Gene structure review: Eukaryotes vs prokaryotes
Wed - Regulatory regions: Promoters & enhancers
Fri - Predicting genes - Predicting regulatory regions (?)
• Next week: Predicting RNA structure (miRNAs, too)
Optional Reading
Reviews:
1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Review last lecture: Gene Regulation (formerly Gene Prediction-2)
cDNAs & ESTs
UniGene
Regulatory regions
Eukaryotes vs prokaryotes
DNA → RNA → protein → Phenotype (cDNA is derived from RNA)
[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance
Pevsner p160
UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution
Pevsner p164
Today: Gene Prediction (formerly Gene Prediction - 3)
Predicting genes
Mon - Predicting regulatory regions
  Focus on promoters
  Introduction to RNA
Later: Genome browsers
Gene Prediction
• Overview of steps & strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
  • HMMs, discriminant functions, neural nets
• Gene prediction software
  • 3 major types
  • many, many programs!
Predicting Genes - Basic steps:
• Obtain genomic sequence
• Translate in all 6 reading frames
• Compare with protein sequence database
• Perform database similarity search with EST & cDNA databases, if available
• Use gene prediction program to locate genes
• Analyze gene regulatory sequences
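The "translate in all 6 reading frames" step can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an input limited to A, C, G, T; the function names (`reverse_complement`, `translate`, `six_frame_translation`) are made up for this example and do not come from any tool named in the lecture.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate the three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(name, protein)
```

Each translated frame could then be compared against a protein database, which is the next step in the list above.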
Overview of gene prediction strategies
What sequence signals can be used?
  Transcription: TF binding sites, promoter, initiation site, terminator
  Processing signals: splice donor/acceptors, polyA signal
  Translation: start (AUG = Met) & stop (UGA, UAA, UAG)
  ORFs, codon usage
What other types of information can be used?
  cDNAs & ESTs (experimental data, pairwise alignment)
  homology (sequence comparison, BLAST)
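As a small illustration of using translation signals, the sketch below scans the forward strand for open reading frames that run from an ATG start codon to the first in-frame stop codon (TAA, TAG, TGA in DNA). It is an assumption-laden toy: `find_orfs` and `min_codons` are hypothetical names, only the three forward frames are scanned, and real gene finders combine many more signals than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    and frame is 0, 1, or 2. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

To cover all six frames, the same function can be run on the reverse complement of the sequence.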
Automated gene prediction strategies
1) Similarity-based or Comparative
• BLAST - Do other organisms have similar sequence? (Is the sequence similar to a known gene or protein?)
2) Ab initio = “from the beginning”
• Predict without explicit comparison with cDNA or proteins via “rule-based” gene models - but rules are derived from statistical analysis of datasets
3) Combined "evidence-based"
• Combine gene models with alignment to known ESTs & protein sequences
BEST RESULTS? Combined
Examples of gene prediction software
1) Similarity-based or Comparative
• BLAST
• SGP2 (extension of GeneID)
2) Ab initio = “from the beginning”
• GeneID - (used in lab this week)
• GENSCAN - (used in lab this week)
• GeneMark.hmm - (should try this!)
3) Combined "evidence-based"
• GeneSeqer (Brendel et al., ISU)
BEST? GENSCAN, GeneMark.hmm, GeneSeqer, but depends on organism & specific task
Gene prediction: Eukaryotes vs prokaryotes
Gene prediction is easier in microbial genomes
Why?
  Smaller genomes
  Simpler gene structures
  More sequenced genomes! (for comparative approaches)
Methods?
  Previously, mostly HMM-based
  Now: similarity-based methods, because so many genomes are available
GeneSeqer - Brendel et al.
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Thanks to Volker Brendel, ISU for following Figs & Slides
Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel [email protected]
Signals: Pre-mRNA Splicing
[Figure: Genomic DNA (with start codon and stop codon) → Transcription → pre-mRNA (Cap- ... -Poly(A)) → Splicing → mRNA (Cap- ... -Poly(A)) → Translation → Protein. Splice sites flank each intron: donor site (GT) at the exon/intron boundary, acceptor site (AG) at the intron/exon boundary.]
Brendel 2005
Brendel - Spliced Alignment I: Compare with cDNA or EST probes
[Figure: genomic DNA (start codon, stop codon) aligned with mRNA (Cap-, 5’-UTR, start codon, stop codon, 3’-UTR, -Poly(A))]
Brendel 2005
Brendel - Spliced Alignment II: Compare with protein probes
[Figure: genomic DNA (start codon, stop codon) aligned with protein]
Brendel 2005
Brendel Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence (introns)
  • Align genomic DNA with cDNA, EST or protein
• Score semi-conserved sequences at splice junctions
• Score coding constraints in translated exons
Brendel 2005
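To make the idea of "alignment with large gaps (introns)" concrete, here is a deliberately simplified sketch. It is not the GeneSeqer algorithm: it only accepts exact matches between a cDNA and the genomic sequence, and it only allows gaps that start with GT and end with AG, instead of scoring splice junctions and coding constraints. The names `spliced_align` and `min_intron` are hypothetical.

```python
from functools import lru_cache


def spliced_align(genomic, cdna, min_intron=6):
    """Find exon blocks in genomic that spell out cdna exactly.

    Gaps (introns) must begin with GT and end with AG and be at least
    min_intron bases long. Returns a list of (start, end) exon intervals
    (end exclusive) or None if no such spliced match exists.
    """
    n, m = len(genomic), len(cdna)

    @lru_cache(maxsize=None)
    def extend(i, j):
        # Try to match cdna[j:] with an exon that starts at genomic[i].
        k = 0
        while i + k < n and j + k < m and genomic[i + k] == cdna[j + k]:
            k += 1
            if j + k == m:
                return ((i, i + k),)  # whole cDNA consumed
            donor = i + k
            if genomic[donor:donor + 2] == "GT":
                # a is the first exon base after a candidate intron
                for a in range(donor + min_intron, n + 1):
                    if genomic[a - 2:a] == "AG":
                        rest = extend(a, j + k)
                        if rest is not None:
                            return ((i, i + k),) + rest
        return None

    for start in range(n):
        blocks = extend(start, 0)
        if blocks is not None:
            return list(blocks)
    return None


if __name__ == "__main__":
    genomic = "CC" + "ATGGC" + "GTAAGTTTCAG" + "GATTAA" + "CC"
    print(spliced_align(genomic, "ATGGCGATTAA"))
```

A scoring version would replace the exact-match test with match/mismatch scores and weight each candidate donor and acceptor by a splice site score.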
Donor (GT) & Acceptor (AG) Sites Used for Model Training

Species             Type   Number of True Splice Sites / Phase
                                1        2        3
Homo sapiens        GT       6586     5277     3037
                    AG       6555     5194     2979
Mus musculus        GT       1212     1185      521
                    AG       1194     1139      504
Rattus norvegicus   GT        450      408      147
                    AG        442      386      140
Gallus gallus       GT        288      238      107
                    AG        284      228      103
Drosophila          GT        989      670      524
                    AG       1001      671      536
C. elegans          GT      37029    20500    20789
                    AG      36864    20325    20626
S. pombe            GT        170      118      119
                    AG        179      122      118
Aspergillus         GT        221      176      157
                    AG        217      172      163
Arabidopsis         GT      23019     9297     8653
thaliana            AG      22929     9247     8611
Zea mays            GT        316      107       88
                    AG        311      104       83

Brendel 2005
Splice Site Detection
• Information content $I_i$:
$$I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2 f_{iB}$$
• Extent of splice signal window:
$$I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$$
i : ith position in sequence
Ī : average information content over all positions i > 20 nt from splice site
σĪ : average standard deviation of Ī
Brendel 2005
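The information content formula can be tried out on a small set of aligned sequences. The sketch below assumes that f_iB is the observed frequency of base B at alignment position i, treats terms with f_iB = 0 as contributing zero, and uses DNA letters (T in place of U). The window test uses the sample standard deviation of the background values as a stand-in for σĪ, which is an assumption of this example; the function names are hypothetical.

```python
import math
import statistics


def information_content(aligned_seqs):
    """Per-position I_i = 2 + sum_B f_iB * log2(f_iB) for equal-length sequences."""
    length = len(aligned_seqs[0])
    n = len(aligned_seqs)
    values = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs]
        info = 2.0
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:  # 0 * log2(0) is taken as 0
                info += f * math.log2(f)
        values.append(info)
    return values


def signal_positions(info, background):
    """Indices where I_i >= mean(background) + 1.96 * stdev(background)."""
    threshold = statistics.mean(background) + 1.96 * statistics.stdev(background)
    return [i for i, value in enumerate(info) if value >= threshold]


if __name__ == "__main__":
    sites = ["AGGTAA", "CGGTGA", "TGGTAT", "GAGTAC"]
    info = information_content(sites)
    print([round(v, 2) for v in info])
```

A fully conserved column (such as the GT of a donor site) gives 2 bits, while a column with all four bases equally frequent gives 0 bits.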
Results?
[Figure: eight panels, each plotted over positions -50 to +50 with y-axis 0.0 to 0.8. Panels: Human T2_GT, Human Fi_AG, Human T2_AG, Human F1_AG, A. thaliana T2_GT, A. thaliana F1_AG, A. thaliana Fi_AG, A. thaliana T2_AG]
Brendel 2005
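Bayesian Splice Site Prediction
Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\, \mathrm{GT}\, s_1\, s_2\, s_3 \ldots s_r$
$$P\{H \mid S\} = \frac{P\{S \mid H\}\, P\{H\}}{\sum_{H} P\{S \mid H\}\, P\{H\}}$$
where H indexes the hypotheses of GT or AG at
- True site in reading phase 1, 2, or 0
- False within-exon site in reading phase 1, 2, or 0
- False within-intron site
$$P\{S\} = p_{-l}(s_{-l}) \prod_{i=-l+1}^{r} p_i(s_i \mid s_{i-1}) = f_{-l,\,s_{-l}} \prod_{i=-l+1}^{r} \frac{f_{i;\,s_{i-1},\,s_i}}{f_{i-1,\,s_{i-1}}}$$
(evaluated separately under each hypothesis H)
Brendel 2005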
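The two ingredients of this slide, a first-order Markov likelihood for the sequence window and Bayes' rule over hypotheses, can be sketched as follows. The probability tables here are invented toy numbers, not trained values, and the function names and data layout (a dict of start probabilities plus one transition dict per step) are assumptions made for the example.

```python
def markov_likelihood(seq, start_probs, transition_probs):
    """P{S} under a position-specific first-order Markov model.

    start_probs maps the first base to its probability;
    transition_probs[i - 1] maps (previous base, base) to the
    probability of seeing base at position i given the previous base.
    """
    p = start_probs[seq[0]]
    for i in range(1, len(seq)):
        p *= transition_probs[i - 1][(seq[i - 1], seq[i])]
    return p


def posterior(likelihoods, priors):
    """Bayes' rule: P{H|S} = P{S|H} P{H} / sum over H of P{S|H} P{H}."""
    joint = {h: likelihoods[h] * priors[h] for h in likelihoods}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


if __name__ == "__main__":
    # Toy two-hypothesis example: T = true site, F = false site.
    likelihoods = {"T": 0.02, "F": 0.001}
    priors = {"T": 0.1, "F": 0.9}
    print(posterior(likelihoods, priors))
```

In the 7-class setting described on the slide, `likelihoods` and `priors` would each have seven entries, one per hypothesis.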
Bayes Factor as Decision Criterion
H0: H = T:
$$BF = \frac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$$
- 2-class model:
$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$
- 7-class model:
$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\, p\{T_x\} \,\big/\, \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\, p\{F_x\} \,\big/\, \sum_{x=1,2,0,i} p\{F_x\}}$$
Brendel 2005
Interpretation of Bayes Factor
in terms of Critical Value c = 2 ln BF
• Positive evidence for H0 if 2 ≤ c ≤ 6
• Strong support for H0 if 6 ≤ c ≤ 10
• Very strong support for H0 if c > 10
Brendel 2005
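A short sketch of the Bayes factor as a decision rule, under stated assumptions: the likelihoods and priors are toy numbers, each class is passed as a (likelihood, prior) pair, and because the ranges on the slide share the endpoints 6 and 10, this example arbitrarily assigns c = 6 to "strong" and c = 10 to "strong" as well. All function names are hypothetical.

```python
import math


def bayes_factor_two_class(p_s_given_true, p_s_given_false):
    """2-class model: BF = p{S|T} / p{S|F}."""
    return p_s_given_true / p_s_given_false


def bayes_factor_multi_class(true_classes, false_classes):
    """Multi-class model: ratio of prior-weighted average likelihoods.

    Each argument is a list of (likelihood, prior) pairs, one per class.
    """
    def weighted(classes):
        return sum(l * p for l, p in classes) / sum(p for _, p in classes)

    return weighted(true_classes) / weighted(false_classes)


def critical_value(bf):
    """c = 2 ln BF."""
    return 2.0 * math.log(bf)


def interpret(c):
    """Map the critical value onto the evidence categories from the slide."""
    if c > 10:
        return "very strong"
    if c >= 6:
        return "strong"
    if c >= 2:
        return "positive"
    return "below threshold"


if __name__ == "__main__":
    bf = bayes_factor_two_class(0.02, 0.001)
    c = critical_value(bf)
    print(round(bf, 1), round(c, 2), interpret(c))
```

With BF = 20 the critical value is just under 6, so the toy site counts as positive but not strong evidence.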
Evaluation of Splice Site Prediction

                   Actual True     Actual False
Predicted True     TP              FP              PP = TP + FP
Predicted False    FN              TN              PN = FN + TN
                   AP = TP + FN    AN = FP + TN

• Sensitivity: $S_n = TP/AP = 1 - \alpha$ (= Coverage)
• Specificity: $S_p = TP/PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = AN/AP$
• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$
• Misclassification rates: $\alpha = FN/AP$, $\beta = FP/AN$
Brendel 2005
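These measures follow directly from the four confusion-matrix counts. The sketch below computes them and checks that the two expressions for specificity agree; the counts in the usage example are invented, and `splice_site_metrics` is a hypothetical helper name.

```python
def splice_site_metrics(tp, fp, fn, tn):
    """Compute Sn, Sp, normalized specificity, and misclassification rates."""
    ap = tp + fn  # actual positives
    an = fp + tn  # actual negatives
    pp = tp + fp  # predicted positives
    alpha = fn / ap  # fraction of true sites missed
    beta = fp / an   # fraction of false sites accepted
    r = an / ap
    return {
        "Sn": tp / ap,                                   # = 1 - alpha
        "Sp": tp / pp,                                   # direct definition
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        "sigma": (1 - alpha) / (1 - alpha + beta),       # normalized specificity
        "alpha": alpha,
        "beta": beta,
    }


if __name__ == "__main__":
    for name, value in splice_site_metrics(tp=90, fp=30, fn=10, tn=870).items():
        print(name, round(value, 4))
```

Because false sites usually far outnumber true sites (large r), Sp can be low even when β is small, which is why the normalized specificity σ is reported alongside it.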
Species        Model   Site   Test Site Set       Bayes Factor   Sn(%)   σ(%)   Sp(%)
                              True     False
Homo sapiens   2C      GT      921     44411      0              98.5    90.5   16.4
                                                  3              91.7    96.3   34.8
                                                  6              66.3    98.5   57.6
                       AG      920     65103      0              96.3    88.4    9.7
                                                  3              90.3    92.9   15.7
                                                  6              76.1    96.1   25.6
Drosophila     2C      GT      329     11501      0              95.4    94.8   34.1
                                                  3              90.0    97.6   53.6
                                                  6              83.9    99.1   75.0
                       AG      329     14920      0              95.7    94.8   28.7
                                                  3              92.1    97.0   41.4
                                                  6              85.1    98.5   59.4
C. elegans     7C      GT      400      7460      0              97.8    92.7   40.4
                                                  3              94.2    97.1   64.3
                                                  6              84.8    99.1   85.4
                       AG      400     10132      0              98.8    97.2   58.2
                                                  3              96.2    98.8   76.9
                                                  6              90.2    99.5   88.5
A. thaliana    7C      GT      613      9027      0              99.5    93.2   48.1
                                                  3              95.6    97.6   73.2
                                                  6              87.1    99.3   91.0
                       AG      614     10196      0              99.2    92.3   41.9
                                                  3              96.4    96.4   62.0
                                                  6              87.1    98.6   81.2

Brendel 2005
Performance?
[Figure: six panels plotting Sn and σ (y-axis 0.00 to 1.00) over x from -10 to 20. Panels: Human GT site, Human AG site, C. elegans GT site, C. elegans AG site, A. thaliana GT site, A. thaliana AG site]
Brendel 2005
Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_{n+1} and intron states i_n, i_{n+1}. Transition labels: P_ΔG; P_A(n) P_ΔG; (1 - P_ΔG) P_D(n+1); (1 - P_ΔG)(1 - P_D(n+1)); 1 - P_A(n)]
Brendel 2005
Performance vs other methods
• Comparison with ab initio gene prediction programs?
• Depends on:
  • Availability of ESTs
  • Availability of protein homologs
Brendel 2005
GeneSeqer vs NAP vs GENSCAN (Exon prediction)
[Figure: exon-level (Sn + Sp) / 2, from 0.00 to 1.00, vs target protein alignment score, from 0 to 100, for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT
Brendel 2005
GeneSeqer vs NAP vs GENSCAN (Intron prediction)
[Figure: intron-level (Sn + Sp) / 2, from 0.00 to 1.00, vs target protein alignment score, from 0 to 100, for GeneSeqer, NAP, and GENSCAN]
GENSCAN - Burge, MIT
Brendel 2005
GeneSeqer
[Figure: workflow. Inputs: genomic sequence; EST or protein database (Suffix Array/Suffix Tree). Steps: Fast Search → Spliced Alignment → Assembly → Output]
Brendel 2005
Gene Structure Annotation - Problems
False positive intergenic region:
• 2 annotated genes actually correspond to a single gene
False negative intergenic region:
• One annotated gene structure actually contains 2 genes
False negative gene prediction:
• Missing gene (no annotation)
Other:
• partially incorrect gene annotation
• missing annotation of alternative transcripts
Brendel 2005
Other Resources
Current Protocols in Bioinformatics
http://www.4ulr.com/products/currentprotocols/bioinformatics.html
Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences