Many similarly expressed genes are coregulated by the same transcription factor(s) …
Therefore, can search promoters of coregulated genes for binding sites
Genes induced by carbon starvation
Similar sequence found in most upstream regions (here = CCAAT, which = Hap4p binding site)
Finding sequence motifs common to a group of ‘similar’ sequences
How do you identify motifs in sequence data?
How can you tell if the identified motif is ‘significant’?
How do you find genomic examples of the identified motif?
First, representation of motifs: Position-specific Weight Matrices (PWMs, aka Position-Specific Scoring Matrices, PSSMs)
Site 1: A G A T G G A T G G
Site 2: T G A T T G A T G T
Site 3: T G A T G G A T G G
Site 4: A G A T T G A T C G
Site 5: T G A T G G A T T G
Site 6: T G A T G G A T T G
Site 7: A G A T G G A T T G
IUPAC consensus: W G A T G G A T N G (where W = A or T)
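To make the consensus representation concrete, here is a minimal sketch that turns an IUPAC consensus into a regular expression and checks each of the seven sites against it. The helper name `consensus_to_regex` and the `SITES` list are illustrative assumptions, not part of the slides; the code table is the standard IUPAC nucleotide alphabet.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex (hypothetical helper)."""
    return re.compile("".join(IUPAC[code] for code in consensus))

# The seven example sites from the slide, written without spaces.
SITES = [
    "AGATGGATGG", "TGATTGATGT", "TGATGGATGG", "AGATTGATCG",
    "TGATGGATTG", "TGATGGATTG", "AGATGGATTG",
]

pattern = consensus_to_regex("WGATGGATNG")
matches = [bool(pattern.fullmatch(site)) for site in SITES]
for number, (site, ok) in enumerate(zip(SITES, matches), start=1):
    print(f"Site {number}: {site} matches consensus: {ok}")
```

Read this way, sites 2 and 4 do not match the consensus (both have T at position 5), which illustrates how a consensus string can discard variation that a matrix keeps.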
PWM represents frequencies of each base at each position in the motif *
Position  1    2    3    4    5    6    7    8    9    10
G         0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A         0.4  0    1.0  0    0    0    1.0  0    0    0
T         0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C         0    0    0    0    0    0    0    0    0.2  0
* These days, PWM/PSSM can correspond to the frequency matrix or a likelihood matrix
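A frequency matrix of this kind can be computed directly from aligned sites. The following is a minimal sketch, assuming the seven sites are given as equal-length strings; the function name `frequency_matrix` and the dictionary layout are illustrative choices. It reports unrounded frequencies (for example 3/7 ≈ 0.43), so values can differ slightly from the one-decimal entries shown in the matrix.

```python
from collections import Counter

BASES = "GATC"

# The seven example sites, written without spaces.
SITES = [
    "AGATGGATGG", "TGATTGATGT", "TGATGGATGG", "AGATTGATCG",
    "TGATGGATTG", "TGATGGATTG", "AGATGGATTG",
]

def frequency_matrix(sites):
    """Return {base: [frequency at position 1, 2, ...]} for aligned sites."""
    n = len(sites)
    length = len(sites[0])
    pwm = {base: [] for base in BASES}
    for i in range(length):
        # Count each base in column i across all sites.
        column = Counter(site[i] for site in sites)
        for base in BASES:
            pwm[base].append(column[base] / n)
    return pwm

pwm = frequency_matrix(SITES)
for base in BASES:
    print(base, " ".join(f"{p:.2f}" for p in pwm[base]))
```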
Web-logo: A graphical representation of PWMs
http://weblogo.berkeley.edu/
Height of each base is proportional to the frequency of that base at that position; the total height of the stack is the information content of the position, measured in "bits" (derived from the entropy at that position)
Information content (IC)
The least variable positions are likely important for specifying the protein-DNA interaction. Therefore high information content = low sequence variation at that position.
Information content at position i:
$$IC_i = 2 + \sum_{b \in \{G,A,T,C\}} P_b(i) \log_2 P_b(i)$$
where P_b(i) is the probability of base b at position i. If using log2, the information content is in 'bits'.
Maximum IC, if P of some base is 1.0 (taking 0 × log2(0) = 0): IC = 2 + [(1.0 × 0) + 0 + 0 + 0] = 2
Minimum IC, if P is 0.25 for all bases: IC = 2 + 4 × [0.25 × (-2)] = 0
Position   1    2    3    4    5    6    7    8    9    10
IC (bits)  1.0  2.0  2.0  2.0  1.1  2.0  2.0  2.0  0.5  1.3
Sum = bit score of 15.9
[Figure: Information Profile (bits vs. position)]
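The per-position values can be reproduced from the frequency matrix with the IC formula. This is a minimal sketch, assuming the one-decimal matrix from the slides and the convention that a base with probability 0 contributes nothing to the sum; the names `PWM` and `information_content` are illustrative.

```python
import math

# Frequency matrix as shown on the slides (rounded to one decimal).
PWM = {
    "G": [0.0, 1.0, 0.0, 0.0, 0.7, 1.0, 0.0, 0.0, 0.4, 0.8],
    "A": [0.4, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "T": [0.6, 0.0, 0.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.4, 0.2],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
}

def information_content(pwm, i):
    """IC_i = 2 + sum over bases of P_b(i) * log2(P_b(i)), in bits."""
    total = 0.0
    for probs in pwm.values():
        p = probs[i]
        if p > 0:  # treat 0 * log2(0) as 0
            total += p * math.log2(p)
    return 2.0 + total

profile = [information_content(PWM, i) for i in range(10)]
print(" ".join(f"{ic:.1f}" for ic in profile))
print(f"bit score: {sum(profile):.1f}")
```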
Often for protein-DNA interactions, the IC profile is smooth
[Figure: IC profiles (bits vs. position) for a real motif and for randomized data]
One limitation of PWMs: each position is considered independently (does not represent inter-dependencies across motif positions)
Gary Stormo, Nat Biotech 2011
Morris et al., Nat Biotech 2011
Finding matches to (instances of) a PWM
Position  1    2    3    4    5    6    7    8    9    10
G         0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A         0.4  0    1.0  0    0    0    1.0  0    0    0
T         0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C         0    0    0    0    0    0    0    0    0.2  0
Is the sequence A G A T T G A T C T a match to this matrix?
Joint probability: assuming each position is independent,
$$P(\text{motif}) = \prod_i P_b(i), \quad b \in \{G,A,T,C\}$$
P(sequence | matrix model) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048
Background model: P(G,A,T,C) = 0.25
P(sequence | background model) = (0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25)(0.25) = 0.25^10 ≈ 9.5e-7
Log-likelihood ratio (LLR)
= log ( P(sequence | matrix model) / P(sequence | background model) )
A measure of how different the likelihood of the sequence is, given the motif model vs. the background model.
In our example, using log2 (as for information content):
LLR = log2 ( 0.0048 / 9.5e-7 ) ≈ 12.3 (≈ 8.5 with the natural log)
The larger the LLR, the more likely the motif model is the right one. To select motifs in real life, can define an LLR cutoff (often defined by sampling).
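The scoring steps can be sketched in a few lines. This is a minimal illustration, assuming the one-decimal matrix from the slides, a uniform background of 0.25 per base, and a base-2 logarithm; the slides do not fix the log base, and a different base only rescales the score. The function names are illustrative.

```python
import math

# Frequency matrix as shown on the slides (rounded to one decimal).
PWM = {
    "G": [0.0, 1.0, 0.0, 0.0, 0.7, 1.0, 0.0, 0.0, 0.4, 0.8],
    "A": [0.4, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "T": [0.6, 0.0, 0.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.4, 0.2],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
}

def likelihood(seq, pwm):
    """P(sequence | matrix model), assuming independent positions."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= pwm[base][i]
    return p

def log_likelihood_ratio(seq, pwm, background=0.25):
    """log2 of P(seq | matrix) / P(seq | uniform background)."""
    p_motif = likelihood(seq, pwm)
    p_background = background ** len(seq)
    if p_motif == 0:
        return float("-inf")  # the matrix model rules this sequence out
    return math.log2(p_motif / p_background)

seq = "AGATTGATCT"
p_matrix = likelihood(seq, PWM)
llr = log_likelihood_ratio(seq, PWM)
print(f"P(seq | matrix) = {p_matrix:.4f}, LLR (log2) = {llr:.1f}")
```

Scanning a promoter would apply the same function to every window of motif length and keep windows above a chosen cutoff.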
Is the sequence A A A T T G A T C T a match to this matrix?
P(sequence | matrix model) = (0.4)(0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0
** If your PWM was trained on a small sample set, you might have missed some examples = overfitting of the matrix (i.e., too specific)
Pseudo-counts: protecting against overfitting due to small sample sizes
Without pseudo-counts: frequency = count / n (here n = 7 sites), so a base never observed at a position gets probability 0
With pseudo-counts: add 1 count to each base at each position, then divide by n + 4
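The pseudo-count rule can be sketched as follows, assuming the seven example sites and one pseudo-count per base (so the denominator is n + 4). The function name `pwm_with_pseudocounts` and its `pseudo` parameter are illustrative assumptions.

```python
from collections import Counter

BASES = "GATC"

# The seven example sites, written without spaces.
SITES = [
    "AGATGGATGG", "TGATTGATGT", "TGATGGATGG", "AGATTGATCG",
    "TGATGGATTG", "TGATGGATTG", "AGATGGATTG",
]

def pwm_with_pseudocounts(sites, pseudo=1):
    """Frequency matrix with `pseudo` extra counts per base at each position."""
    n = len(sites)
    denominator = n + pseudo * len(BASES)  # n + 4 when pseudo == 1
    pwm = {base: [] for base in BASES}
    for i in range(len(sites[0])):
        column = Counter(site[i] for site in sites)
        for base in BASES:
            pwm[base].append((column[base] + pseudo) / denominator)
    return pwm

smoothed = pwm_with_pseudocounts(SITES)

# A sequence with an unseen base (A at position 2) no longer scores exactly 0.
p = 1.0
for i, base in enumerate("AAATTGATCT"):
    p *= smoothed[base][i]
print(f"P(A at position 2) = {smoothed['A'][1]:.3f}, P(sequence) = {p:.2e}")
```

With 7 sites, an unobserved base gets probability 1/11 instead of 0, and a base seen in all 7 sites gets 8/11 instead of 1.0.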
Motif finding methods and algorithms
Given a set of n promoters of n coregulated genes, find a motif common to the promoters. Both the PWM and the motif sequences are unknown.
Common methods:
1. Enumeration: Simplest case: look at the frequency of all n-mers (all sequences of a fixed length)
   * Finds the global optimum, since it can search the entire space
2. EM algorithms (MEME): Iteratively home in on the most likely motif model; can simultaneously identify the motif and find examples of the motif
3. Gibbs sampling methods (AlignACE, BioProspector): Iteratively replace ('sample') sites to retrain the matrix
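The enumeration approach in its simplest form can be sketched as counting, for every fixed-length word, how many promoters contain it. The promoter strings below are made-up toy sequences, and the function name `count_kmers` is an illustrative assumption; the word length is called k here to avoid confusion with the n promoters. This sketch only finds exact words on one strand: it ignores mismatches and reverse complements, and does not compare counts against a background expectation.

```python
from collections import Counter

def count_kmers(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set so that repeats within one sequence are counted only once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

# Hypothetical toy promoters, each containing CCAAT.
promoters = ["TTGACCAATCG", "ACCAATGGTTA", "GGTCCAATAAC"]

counts = count_kmers(promoters, 5)
best_word, best_count = counts.most_common(1)[0]
print(f"{best_word} occurs in {best_count} of {len(promoters)} promoters")
```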