Many similarly expressed genes are coregulated by the same transcription factor(s) …

Therefore, we can search the promoters of coregulated genes for shared binding sites.

Genes induced by carbon starvation

[Figure: ORFs and their upstream regions for genes induced by carbon starvation]

A similar sequence is found in most upstream regions (here CCAAT, the Hap4p binding site).

Finding sequence motifs common to a group of ‘similar’ sequences

How do you identify motifs in sequence data?

How can you tell if the identified motif is ‘significant’?

How do you find genomic examples of the identified motif?

First, representation of motifs: Position-specific Weight Matrices (PWMs, aka Position-Specific Scoring Matrices, PSSMs)

Site 1   A G A T G G A T G G
Site 2   T G A T T G A T G T
Site 3   T G A T G G A T G G
Site 4   A G A T T G A T C G
Site 5   T G A T G G A T T G
Site 6   T G A T G G A T T G
Site 7   A G A T G G A T T G

IUPAC consensus: W G A T G G A T N G
(where W = A or T, and N = any base)


Pos  1    2    3    4    5    6    7    8    9    10
G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

A PWM represents the frequency of each base at each position in the motif. *

* These days, PWM/PSSM can refer to either the frequency matrix or a likelihood matrix.
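As a minimal sketch, a frequency matrix can be built by counting bases column by column across the aligned sites. The seven sites are the ones listed earlier; the names `SITES` and `frequency_matrix` are illustrative choices, not part of the original slides. Note that the exact frequencies (for example 3/7 ≈ 0.43 for A at position 1) differ slightly from the rounded values shown in the matrix.

```python
from collections import Counter

BASES = "GATC"

# The seven aligned binding sites from the example.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]


def frequency_matrix(sites):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in BASES}
    for column in zip(*sites):  # one tuple of bases per motif position
        counts = Counter(column)
        for base in BASES:
            matrix[base].append(counts[base] / len(sites))
    return matrix


pwm = frequency_matrix(SITES)
for base in BASES:
    print(base, " ".join(f"{p:.2f}" for p in pwm[base]))
```

Each column of the resulting matrix sums to 1, since every site contributes exactly one base per position.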


Web-logo: A graphical representation of PWMs

http://weblogo.berkeley.edu/

The height of each base is proportional to the frequency of that base at that position; more specifically, heights are measured in “bits” of “information content”, which is derived from the “entropy” at that position.

Information content (IC)

The least variable positions are likely important for specifying the protein-DNA interaction. Therefore, high information content = low sequence variation at that position.

Information content at position i:

$$IC_i = 2 + \sum_{b \in \{G,A,T,C\}} P_b(i) \log_2 P_b(i)$$

where $P_b(i)$ is the probability of base b at position i. If using log2, the information content is in ‘bits’.

Maximum IC, if P of some base is 1.0: IC = 2 + [ (1.0 * 0) + 0 + 0 + 0 ] = 2

Minimum IC, if P is 0.25 for all bases: IC = 2 + [ 0.25 * (-2) ] * 4 = 0

Pos  1    2    3    4    5    6    7    8    9    10
G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

IC   1.0  2.0  2.0  2.0  1.1  2.0  2.0  2.0  0.5  1.3   (sum = bit score of 15.9)
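The per-position values can be checked with a short sketch that applies the information content formula to each column of the matrix. The names `PWM`, `information_content` and `ic_profile` are illustrative, and the code assumes the usual convention that a base with probability 0 contributes nothing to the sum.

```python
import math

# Rounded frequency matrix from the slide (rows: base, columns: positions 1-10).
PWM = {
    "G": [0, 1.0, 0, 0, 0.7, 1.0, 0, 0, 0.4, 0.8],
    "A": [0.4, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 0],
    "T": [0.6, 0, 0, 1.0, 0.3, 0, 0, 1.0, 0.4, 0.2],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0],
}


def information_content(probs):
    """IC = 2 + sum_b p_b * log2(p_b); zero-probability bases are skipped."""
    return 2 + sum(p * math.log2(p) for p in probs if p > 0)


def ic_profile(pwm):
    """Information content (in bits) at each position of the matrix."""
    length = len(next(iter(pwm.values())))
    return [information_content([pwm[b][i] for b in pwm]) for i in range(length)]


profile = ic_profile(PWM)
print([round(ic, 1) for ic in profile])
print(round(sum(profile), 1))
```

Rounded to one decimal place, the printed profile should agree with the IC row, and the total with the bit score of 15.9.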

Information profile:

[Figure: information content (bits) plotted against motif position]

Often for protein-DNA interactions, the IC profile is smooth.

[Figure: IC profiles (bits vs. position) for a real motif and for randomized data]

One limitation of PWMs: each position is considered independently (a PWM does not represent inter-dependencies across motif positions).

Gary Stormo, Nat Biotech 2011

Morris et al., Nat Biotech 2011

Finding matches to (instances of) a PWM

Pos  1    2    3    4    5    6    7    8    9    10
G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

Is the sequence A G A T T G A T C T a match to this matrix?

Joint probability: assuming each position is independent,

$$P(\text{motif}) = \prod_{i} P_b(i), \qquad b \in \{G,A,T,C\}$$

where b is the base observed at position i of the sequence.

P(sequence | matrix model) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048
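The joint probability can be sketched as a product of one matrix lookup per position. The names `PWM` and `sequence_probability` are illustrative; the matrix holds the rounded values from the slide.

```python
# Rounded frequency matrix from the slide (rows: base, columns: positions 1-10).
PWM = {
    "G": [0, 1.0, 0, 0, 0.7, 1.0, 0, 0, 0.4, 0.8],
    "A": [0.4, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 0],
    "T": [0.6, 0, 0, 1.0, 0.3, 0, 0, 1.0, 0.4, 0.2],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0],
}


def sequence_probability(pwm, seq):
    """P(sequence | matrix model), assuming positions are independent."""
    if len(seq) != len(pwm["G"]):
        raise ValueError("sequence length must equal motif width")
    prob = 1.0
    for i, base in enumerate(seq):
        prob *= pwm[base][i]  # probability of the observed base at position i
    return prob


p = sequence_probability(PWM, "AGATTGATCT")
print(f"{p:.4f}")
```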


Background model: P(G) = P(A) = P(T) = P(C) = 0.25

P(sequence | matrix model) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048

P(sequence | background model) = (0.25)^10 ≈ 9.5e-7

Log-likelihood ratio (LLR)

LLR = log ( P(sequence | matrix model) / P(sequence | background model) )

A measure of how different the likelihood of the sequence is given the motif model vs. the background model.

In our example (using log10):

LLR = log10 ( 0.0048 / 9.5e-7 ) ≈ 3.7 (≈ 12.3 bits if using log2)

The larger the LLR, the more likely the motif model is the right one. To select motifs in real life, one can define an LLR cutoff (often defined by sampling).
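A minimal sketch of LLR scoring, assuming a uniform background and the rounded matrix from the slide. The log ratio is summed position by position, which is equivalent to taking the log of the ratio of the two joint probabilities. The `scan` helper, the test sequence and the cutoff of 3.0 are hypothetical, chosen only to show how a cutoff would be applied when sliding the matrix along a longer sequence.

```python
import math

# Rounded frequency matrix from the slide (rows: base, columns: positions 1-10).
PWM = {
    "G": [0, 1.0, 0, 0, 0.7, 1.0, 0, 0, 0.4, 0.8],
    "A": [0.4, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 0],
    "T": [0.6, 0, 0, 1.0, 0.3, 0, 0, 1.0, 0.4, 0.2],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0],
}
BACKGROUND = {"G": 0.25, "A": 0.25, "T": 0.25, "C": 0.25}


def log_likelihood_ratio(pwm, seq, background=BACKGROUND, base=10):
    """log( P(seq | matrix model) / P(seq | background model) )."""
    llr = 0.0
    for i, b in enumerate(seq):
        p_motif = pwm[b][i]
        if p_motif == 0:
            return float("-inf")  # sequence is impossible under the motif model
        llr += math.log(p_motif / background[b], base)
    return llr


def scan(pwm, sequence, cutoff, base=10):
    """Yield (start, window, LLR) for each window scoring at least `cutoff`."""
    width = len(pwm["G"])
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_likelihood_ratio(pwm, window, base=base)
        if score >= cutoff:
            yield start, window, score


llr = log_likelihood_ratio(PWM, "AGATTGATCT")
print(round(llr, 2))

# Hypothetical test sequence and cutoff, for illustration only.
hits = list(scan(PWM, "CCCAGATTGATCTCCC", cutoff=3.0))
for hit in hits:
    print(hit)
```

Returning negative infinity for a zero-probability base avoids taking the log of zero; it is one possible design choice, and it also shows why zeros in the matrix are a problem for sequences the training set happened to miss.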


Is the sequence A A A T T G A T C T a match to this matrix?

P(sequence | matrix model) = (0.4)(0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0

** If your PWM was trained on a small sample set, you might have missed some valid examples, i.e. the matrix is overfit (too specific).

Pseudo-counts: protecting against overfitting due to small sample sizes


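Without pseudo-counts:

Pos  1    2    3    4    5    6    7    8    9    10
G    0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A    0.4  0    1.0  0    0    0    1.0  0    0    0
T    0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C    0    0    0    0    0    0    0    0    0.2  0

With pseudo-counts: add 1 count to each base at each position, then divide by n + 4 (where n is the number of sites used to build the matrix).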
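A minimal sketch of the pseudo-count rule, applied to the seven aligned sites from the PWM example (so n = 7 and the denominator is 11). The names `pwm_with_pseudocounts` and `sequence_probability` are illustrative. After smoothing, a sequence with a previously unseen base, such as A A A T T G A T C T, no longer has probability 0.

```python
from collections import Counter

BASES = "GATC"

# The seven aligned binding sites from the PWM example.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]


def pwm_with_pseudocounts(sites, pseudocount=1):
    """Add `pseudocount` to every base count, then divide by n + 4 * pseudocount."""
    n = len(sites)
    denominator = n + len(BASES) * pseudocount
    matrix = {base: [] for base in BASES}
    for column in zip(*sites):  # one tuple of bases per motif position
        counts = Counter(column)
        for base in BASES:
            matrix[base].append((counts[base] + pseudocount) / denominator)
    return matrix


def sequence_probability(pwm, seq):
    """P(sequence | matrix model), assuming positions are independent."""
    prob = 1.0
    for i, base in enumerate(seq):
        prob *= pwm[base][i]
    return prob


smoothed = pwm_with_pseudocounts(SITES)
print(smoothed["A"][1])  # A at position 2 is no longer zero
print(sequence_probability(smoothed, "AAATTGATCT") > 0)
```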


Motif finding methods and algorithms

Given a set of n promoters from n coregulated genes, find a motif common to the promoters. Both the PWM and the motif sequences are unknown.

Common methods:

1. Enumeration: in the simplest case, look at the frequency of all n-mers.
* Finds the global optimum, since it can search the entire space.

2. EM algorithms (MEME): iteratively home in on the most likely motif model; can simultaneously identify the motif and find examples of the motif.

3. Gibbs sampling methods (AlignACE, BioProspector): iteratively replace (‘sample’) sites to retrain the matrix.
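The enumeration approach in its simplest form can be sketched as counting, for every short word of a fixed length, how many promoters contain it. The toy promoters below are invented for illustration (they are not real yeast sequences), and `kmer_support` is a hypothetical helper name. This is only the exact-match case, not a sketch of MEME or of a Gibbs sampler.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # Use a set so that each sequence counts a given k-mer only once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support


# Hypothetical toy 'promoters', invented for illustration only.
promoters = [
    "ACGTCCAATGA",
    "TTCCAATCGGA",
    "GGACCAATTTC",
    "CATGCATGCAT",
]

support = kmer_support(promoters, k=5)
for kmer, count in support.most_common(3):
    print(kmer, count)
```

A real analysis would also need to compare these counts against what is expected in background sequence before calling a word significant.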
