
Promoter Prediction (really) 10/26/05

D Dobbs ISU - BCB 444/544X 1

10/26/05 D Dobbs ISU - BCB 444/544X: Promoter Prediction (really!) 1


Announcements

• BCB Link for Seminar Schedules (updated)
  http://www.bcb.iastate.edu/seminars/index.html

Seminar (Fri Oct 28)
12:10 PM BCB Faculty Seminar in E164 Lagomarcino
Assembly and Alignment of Genomic DNA Sequence
Xiaoqiu Huang, ComS
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2028

Mark your calendars:
1:10 PM Nov 14 Baker Seminar in Howe Hall Auditorium
"Discovering transcription factor binding sites"
Douglas Brutlag, Dept of Biochemistry & Medicine
Stanford University School of Medicine

Announcements

BCB 544 Projects - Important Dates:

Nov 2 Wed noon - Project proposals due to David/Drena

Nov 4 Fri 10A - Approvals/responses to students

Dec 2 Fri noon - Written project reports due

Dec 5,7,8,9 class/lab - Oral Presentations (20')

(Dec 15 Thurs = Final Exam)

Announcements

Lab 9 - due Wed noon (today)
Exam 2 - this Friday

Posted Online:
• Exam 2 Study Guide
• 544 Reading Assignment (2 papers)
• Lab Keys (today)

Thurs: No Lab - Extra Office Hrs instead:
David 1-3 PM in 209 Atanasoff
Drena 1-3 PM in 106 MBB

Promoter Prediction
RNA Structure/Function Prediction

Mon: Quite a few more words re: Gene prediction

Wed: Promoter prediction

next Mon: RNA structure & function
• RNA structure prediction
• 2° & 3° (secondary & tertiary) structure prediction
• miRNA & target prediction

Optional - but very helpful reading:

1) Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709

http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html

2) Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287

http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

03489059922

(that's a hint!)

Check this out: http://www.phylofoot.org/NRG_testcases/


Reading Assignment (for Mon)

Mount Bioinformatics
• Chp 8 Prediction of RNA Secondary Structure
• pp. 327-355
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html

Cates (Online) RNA Secondary Structure Prediction Module
• http://cnx.rice.edu/content/m11065/latest/

Review last lecture:

Flowchart for Gene Prediction

Performance Assessment Measures
Correction re: slide 10/24 # 27

Promoters

Gene prediction flowchart

Fig 5.15, Baxevanis & Ouellette 2005

Evaluation of Splice Site Prediction

Fig 5.11, Baxevanis & Ouellette 2005

What do measures really mean?

Sp =

Correction re: last lecture:

GeneSeqer Performance Graphs

Brendel et al (2004) Bioinformatics 20: 1157

[Figure: GeneSeqer performance graphs, four panels (Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site); curves labeled Sn and σ, vertical axis 0.00 to 1.00, horizontal axis -10 to 20. Brendel 2005]

Performance?

Note: these are not ROC curves (plots of Sn vs 1 - Sp, where Sp = TN/(TN+FP))
• But plots such as these (& ROCs) are much better than using a "single number" to compare different methods
• Both types of plots illustrate the trade-off: Sn vs Sp


Fig 2 - Brendel et al (2004) Bioinformatics 20: 1157

Bayes Factor as Decision Criterion

H0: H = T:

$$BF = \frac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$$

2-class model:

$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$

7-class model:

$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\, p\{T_x\} \,\big/\, \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\, p\{F_x\} \,\big/\, \sum_{x=1,2,0,i} p\{F_x\}}$$

Brendel 2005
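A minimal Python sketch of the Bayes factor as a decision criterion, assuming the 2-class model (T = true site, F = false site). The function names and the likelihood and prior values are hypothetical, chosen only to show that the odds-ratio form and the likelihood-ratio form give the same number; this is not GeneSeqer's implementation.

```python
def bayes_factor(posterior: float, prior: float) -> float:
    """Posterior odds of T given S, divided by the prior odds of T."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if not 0.0 <= posterior < 1.0:
        raise ValueError("posterior must lie in [0, 1)")
    return (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))


def posterior_two_class(lik_true: float, lik_false: float, prior: float) -> float:
    """Bayes' rule for p{T|S} when the only alternatives are T and F."""
    numerator = lik_true * prior
    return numerator / (numerator + lik_false * (1.0 - prior))


def bayes_factor_two_class(lik_true: float, lik_false: float) -> float:
    """In the 2-class model the Bayes factor reduces to p{S|T} / p{S|F}."""
    return lik_true / lik_false


# Hypothetical values: the sequence S is 20x more likely under T than under F,
# and true sites are rare (prior of 1%).
lik_t, lik_f, prior_t = 0.02, 0.001, 0.01
post = posterior_two_class(lik_t, lik_f, prior_t)
print(bayes_factor(post, prior_t), bayes_factor_two_class(lik_t, lik_f))
```

Note that the prior cancels out of the Bayes factor, which is why it can serve as a threshold that does not depend on how rare true sites are.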

Evaluation of Splice Site Prediction

                        Actual
                     True    False
Predicted  True       TP      FP      PP = TP + FP
           False      FN      TN      PN = FN + TN
                 AP = TP + FN    AN = FP + TN

• Sensitivity: $S_n = TP/AP = 1 - \alpha$ (= Coverage)

• Misclassification rates: $\alpha = FN/AP$, $\beta = FP/AN$

• Specificity: $S_p = TP/PP = 1 - \beta \frac{AN}{PP} = \frac{1-\alpha}{1-\alpha+r\beta}$, with $r = AN/AP$

• Normalized specificity: $\sigma = \frac{1-\alpha}{1-\alpha+\beta}$

Brendel 2005
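The Brendel measures can be checked numerically with a short Python sketch. The counts below are hypothetical, chosen so that false candidate sites outnumber true ones 1000 to 1 (r = 1000); the names `Counts` and `brendel_measures` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int
    fp: int
    fn: int
    tn: int


def brendel_measures(c: Counts) -> dict:
    ap = c.tp + c.fn          # actual positives
    an = c.fp + c.tn          # actual negatives
    pp = c.tp + c.fp          # predicted positives
    alpha = c.fn / ap         # fraction of true sites missed
    beta = c.fp / an          # fraction of false sites accepted
    r = an / ap               # class imbalance
    return {
        "Sn": c.tp / ap,                                    # = 1 - alpha
        "Sp": c.tp / pp,                                    # TP / PP
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        "sigma": (1 - alpha) / (1 - alpha + beta),          # Sp with r set to 1
    }


# Hypothetical counts: 100 true sites among 100,100 candidates.
for name, value in brendel_measures(Counts(tp=90, fp=900, fn=10, tn=99100)).items():
    print(f"{name}: {value:.4f}")
```

With these counts Sn is 0.9 and σ is about 0.99, yet Sp is only about 0.09: the raw specificity is dominated by the imbalance r, which is the motivation for reporting the normalized value.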

Careful: different definitions for "Specificity"

                        Actual
                     True    False
Predicted  True       TP      FP      PP = TP + FP
           False      FN      TN      PN = FN + TN
                 AP = TP + FN    AN = FP + TN

Brendel definitions
• Sensitivity: $S_n = TP/AP = 1 - \alpha$ (= Coverage)
• Specificity: $S_p = TP/PP = 1 - \beta \frac{AN}{PP} = \frac{1-\alpha}{1-\alpha+r\beta}$, with $r = AN/AP$

cf. Guigó definitions
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP) = Sp-
AC: Approximate Correlation = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1

Other measures? Predictive Values, Correlation Coefficient

Best measures for comparing different methods?

• ROC curves (Receiver Operating Characteristic?!!)
  http://www.anaesthetist.com/mnm/stats/roc/
  "The Magnificent ROC" - has fun applets & quotes:
  "There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"

• Correlation Coefficient (Matthews correlation coefficient, MCC)
  MCC =  1 for a perfect prediction
         0 for a completely random assignment
        -1 for a "perfectly incorrect" prediction

Do not memorize this!
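For reference rather than memorization, here is a hedged Python sketch of the standard MCC formula, alongside the AC formula quoted with the Guigó definitions. The counts are hypothetical; returning 0 when an MCC denominator term is zero is a common convention assumed here, not something stated in the slides, and the AC function assumes all four of its denominators are nonzero.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


def approximate_correlation(tp: int, fp: int, fn: int, tn: int) -> float:
    """AC = 0.5 * (sum of the four conditional proportions) - 1."""
    total = (tp / (tp + fn) + tp / (tp + fp)
             + tn / (tn + fp) + tn / (tn + fn))
    return 0.5 * total - 1


# Hypothetical counts illustrating the three reference points.
print(mcc(50, 0, 0, 50))      # perfect prediction
print(mcc(25, 25, 25, 25))    # no better than random assignment
print(mcc(0, 50, 50, 0))      # "perfectly incorrect" prediction
```

Both measures use all four cells of the table, so neither can be inflated by simply predicting everything as positive or everything as negative.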

Promoters
What signals are there?

Simple ones in prokaryotes

Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999

Prokaryotic promoters

• RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site

• RNA polymerase complex binds directly to these, with no requirement for “transcription factors”

• Prokaryotic promoter sequences are highly conserved
  • -10 region
  • -35 region
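Because the -10 and -35 regions are well conserved, even a naive consensus scan finds candidate prokaryotic promoters. The Python sketch below assumes the commonly cited E. coli σ70 consensus hexamers (TTGACA at -35, TATAAT at -10) and a 16-18 bp spacer; these values, the mismatch tolerance, and the function names are assumptions that are not given in the slides, and real tools use weight matrices rather than exact consensus matching.

```python
MINUS_35 = "TTGACA"   # assumed sigma-70 consensus, -35 region
MINUS_10 = "TATAAT"   # assumed sigma-70 consensus, -10 region


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_promoters(seq: str, max_mismatch: int = 1, spacers=(16, 17, 18)):
    """Return (pos_35, pos_10, spacer, total_mismatches) for each candidate."""
    seq = seq.upper()
    k = len(MINUS_35)
    hits = []
    for i in range(len(seq) - k + 1):
        m35 = mismatches(seq[i:i + k], MINUS_35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + k + gap                      # start of the -10 candidate
            if j + len(MINUS_10) > len(seq):
                continue
            m10 = mismatches(seq[j:j + len(MINUS_10)], MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, gap, m35 + m10))
    return hits


# Toy sequence built to contain one perfect promoter with a 17 bp spacer.
toy = "GGCC" + MINUS_35 + "C" * 17 + MINUS_10 + "GGCC"
print(scan_promoters(toy))
```

Positions are 0-based offsets on the given strand; scanning the reverse complement as well would be needed for a whole genome.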

What signals are there? Complex ones in eukaryotes!

Fig 9.13, Mount 2004

Simpler view of complex promoters in eukaryotes:

Fig 5.12, Baxevanis & Ouellette 2005

Eukaryotic genes are transcribed by 3 different RNA polymerases

Recognize different types of promoters & enhancers:

Brown Fig 9.18, BIOS Scientific Publishers Ltd, 1999

Eukaryotic promoters & enhancers

• Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!)

• Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)

• RNA polymerase complexes do not specifically recognize promoter sequences directly

• Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Eukaryotic transcription factors

• Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription

• TFs contain characteristic “DNA binding motifs”
  http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039

• TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”

• Several databases for these, e.g. TRANSFAC
  http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors

• Common in eukaryotic proteins
• Estimated 1% of mammalian genes encode zinc-finger proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Brown Fig 9.12, BIOS Scientific Publishers Ltd, 1999

New Today: Promoter Prediction

Predicting regulatory regions (focus on promoters)
• Brief review promoters & enhancers
• Predicting promoters: eukaryotes vs prokaryotes

Next week: RNA structure & function

Predicting Promoters

• Overview of strategies
  • What sequence signals can be used?
  • What other types of information can be used?

• Algorithms

• Promoter prediction software
  • 3 major types
  • many, many programs!

Promoter prediction: Eukaryotes vs prokaryotes

Promoter prediction is easier in microbial genomes

Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously, again mostly HMM-based
• Now: similarity-based, comparative methods, because so many genomes available

Predicting promoters: Steps & Strategies

Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)

Predicting promoters: Steps & Strategies

Identify TSS - if possible?
• One of biggest problems is determining exact TSS!
  Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes)
  Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server

Fig 5.10, Baxevanis & Ouellette 2005

Automated promoter prediction strategies

1) Pattern-driven algorithms

2) Sequence-driven algorithms

3) Combined "evidence-based"

BEST RESULTS? Combined, sequential


Promoter Prediction: Pattern-driven algorithms

• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)

• Tend to produce huge numbers of FPs

• Why?
  • Binding sites (BS) for specific TFs often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: promoters often specific to organism/cell/stage/environmental condition
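The point about short binding sites can be made with back-of-the-envelope arithmetic. The Python sketch below estimates how often one exact k-mer is expected by chance, assuming an idealized random sequence with equal base frequencies; the genome length of 3 x 10^9 bp is an assumed round figure for a human-sized genome, and the function name is illustrative.

```python
def expected_chance_hits(genome_len: int, site_len: int,
                         both_strands: bool = True) -> float:
    """Expected exact matches of one fixed k-mer in random DNA (p = 0.25 per base)."""
    positions = genome_len - site_len + 1      # possible start positions
    strands = 2 if both_strands else 1
    return positions * strands * 0.25 ** site_len


# Assumed genome size, for illustration only.
GENOME_LEN = 3_000_000_000
for k in (5, 8, 10, 15):
    print(f"{k:2d} bp site: ~{expected_chance_hits(GENOME_LEN, k):,.0f} chance matches")
```

Under these assumptions a 5 bp site is expected millions of times by chance and even a 15 bp site a handful of times. Allowing mismatches, as variable binding sites require, raises these numbers further, which is consistent with the flood of false positives from pattern-driven methods.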

Promoter Prediction: Pattern-driven algorithms

Solutions to problem of too many FP predictions?
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ factors helps

• Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
  • But: What about enhancers? (no TSS nearby!)
    & Only a small fraction of TSSs have been experimentally mapped

• Do the wet lab experiments!
  • But: Promoter-bashing is tedious

Promoter Prediction: Sequence-driven algorithms

• Assumption: common functionality can be deduced from sequence conservation
• Alignments of co-regulated genes should highlight elements involved in regulation

Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be co-regulated (using microarrays??)
• Comparative promoter prediction: "Phylogenetic footprinting" - more later...
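The conservation assumption behind sequence-driven methods can be illustrated with a sliding-window identity profile over a pairwise alignment. This Python sketch is a toy illustration only: the aligned sequences are invented, the window size is arbitrary, and real phylogenetic footprinting tools use more careful alignment and scoring.

```python
def window_identity(aln_a: str, aln_b: str, window: int = 10) -> list:
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    same = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    return [sum(same[i:i + window]) / window
            for i in range(len(same) - window + 1)]


# Invented alignment: a conserved TATAAA block flanked by diverged sequence.
seq_a = "AAAAA" + "TATAAA" + "AAAAA"
seq_b = "CCCCC" + "TATAAA" + "CCCCC"
profile = window_identity(seq_a, seq_b, window=6)
best = max(range(len(profile)), key=profile.__getitem__)
print(best, profile[best])   # window start and identity of the most conserved window
```

A peak in the profile marks a candidate "footprint". If the whole region is uniformly conserved the profile is flat and uninformative, so the choice of species matters.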

Promoter Prediction: Sequence-driven algorithms

Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods:
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations & inversions in the order of functional elements
  • If the entire region is highly conserved (high background conservation), comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)

• Biology is complex: many (most?) regulatory elements are not conserved across species!

Examples of promoter prediction/characterization software

Lab: used
• MATCH, MatInspector
• TRANSFAC
• MEME & MAST
• BLAST, etc.

Others?
• FirstEF
• Dragon Promoter Finder
  (these are links in PPTs)
  also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc)
• JASPAR

TRANSFAC matrix entry: for TATA box

Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info

Fig 5.13, Baxevanis & Ouellette 2005
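To show how a weight matrix like the one in a matrix entry is used, the Python sketch below converts per-position base counts into log-odds scores and scans a sequence for the best-scoring window. The count matrix is a made-up 4-column TATA-like example, not the TRANSFAC entry from the figure; the pseudocount of 1 and the uniform background of 0.25 are assumptions.

```python
import math

BASES = "ACGT"

# Hypothetical counts from 10 aligned sites (one dict per motif position).
EXAMPLE_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]


def log_odds_matrix(counts, pseudocount: float = 1.0, background: float = 0.25):
    """Convert base counts to log2(frequency / background) per position."""
    pwm = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2(((col[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm


def best_hit(seq: str, pwm):
    """Return (score, start) of the highest-scoring window; seq must be A/C/G/T only."""
    seq = seq.upper()
    width = len(pwm)
    scored = [(sum(pwm[k][seq[i + k]] for k in range(width)), i)
              for i in range(len(seq) - width + 1)]
    return max(scored)


pwm = log_odds_matrix(EXAMPLE_COUNTS)
score, start = best_hit("GGCGTATAGGC", pwm)
print(start, round(score, 2))
```

The number of sites used to build the matrix matters: with only a few sites, the pseudocount dominates and the scores become less discriminating.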

Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

Fig 5.14, Baxevanis & Ouellette 2005

Check out optional review & try associated tutorial:

Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html

Check this out: http://www.phylofoot.org/NRG_testcases/

Annotated lists of promoter databases & promoter prediction software

• URLs from Mount Chp 9, available online: Table 9.12
  http://www.bioinformaticsonline.org/links/ch_09_t_2.html

• Table in Wasserman & Sandelin Nat Rev Genet article
  http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm

• URLs for Baxevanis & Ouellette, Chp 5:
  http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links

More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/

Reading Assignment (for Mon)

Mount Bioinformatics
• Chp 8 Prediction of RNA Secondary Structure
• pp. 327-355
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html

Cates (Online) RNA Secondary Structure Prediction Module
• http://cnx.rice.edu/content/m11065/latest/