1
Central Dogma of Molecular Biology
• DNA is merely the blueprint
• Shared spatially (eyes, ears, heart etc.) and temporally (from cradle
to tomb)
• What changes is what exactly, and how much, each cell “produces”
at any given moment
• To “first order” the products are proteins and the two-stage process
involves
• transcription: an imprint of the DNA is taken by mRNA
• translation: the mRNA is used to guide the assembly of proteins
• Protein production can be regulated by transcription factors which
• bind to specific DNA sites (Transcription Factor Binding Sites)
• regulate transcription rate of proximal genes
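
The two-stage process above can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the input is taken to be the coding strand (so transcription reduces to replacing T with U), the codon table is deliberately partial and covers only the codons in the example, and the function names are made up for illustration.

```python
# Partial codon table: only the codons used in the example below.
# "*" marks a stop codon.
CODON_TABLE = {"AUG": "M", "GGC": "G", "UUU": "F", "UAA": "*"}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("atgggctttTAA")
protein = translate(mrna)
print(mrna, protein)
```

What this toy omits is exactly what the slide is about: which genes are transcribed, and at what rate, is what differs between cells and over time.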
2
Transcription initiation of the glnA gene in E. coli
3
Motif finding
• Do these sequences share a common TFBS?
• tagcttcatcgttgacttctgcagaaagcaagctcctgagtagctggccaagcgagctgcttgtgcccggctgcggcggttgtatcctgaatacgccatgcgccctgcagctgctagaccctgcagccagctgcgcctgatgaaggcgcaacacgaaggaaagacgggaccagggcgacgtcctattaaaagataatcccccgaacttcatagtgtaatctgcagctgctcccctacaggtgcaggcacttttcggatgctgcagcggccgtccggggtcagttgcagcagtgttacgcgaggttctgcagtgctggctagctcgacccggattttgacggactgcagccgattgatggaccattctattcgtgacacccgacgagaggcgtccccccggcaccaggccgttcctgcaggggccaccctttgagttaggtgacatcattcctatgtacatgcctcaaagagatctagtctaaatactacctgcagaacttatggatctgagggagaggggtactctgaaaagcgggaacctcgtgtttatctgcagtgtccaaatcctat
4
If only life could be that simple
• The binding sites are almost never exactly the same
• A more likely sample is:
tagcttcatcgttgactttTGaAGaaagcaagctcctgagtagctggccaagcgagctgcttgtgcccggctgcggcggttgtatcctgaatacgccatgcgccCTGgAGctgctagaccCTGCAGccagctgcgcctgatgaaggcgcaacacgaaggaaagacgggaccagggcgacgtcctattaaaagataatcccccgaacttcatagtgtaatCTGCAGctgctcccctacaggtgcaggcacttttcggatgCTGCttcggccgtccggggtcagttgcagcagtgttacgcgaggttCTaCAGtgctggctagctcgacccggattttgacggaCTGCAGccgattgatggaccattctattcgtgacacccgacgagaggcgtccccccggcaccaggccgttcCTaCAGgggccaccctttgagttaggtgacatcattcctatgtacatgcctcaaagagatctagtctaaatactacCTaCAGaacttatggatctgagggagaggggtactctgaaaagcgggaacctcgtgtttattTGCAttgtccaaatcctat
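
To make the difficulty concrete, here is a minimal sketch of searching for inexact occurrences of a known consensus. It assumes the consensus is CTGCAG (which the capitalization in the sample suggests) and that "inexact" means a bounded number of mismatches. The function names are illustrative, and this is a brute-force scan, not a motif finder: a real motif finder does not know the consensus in advance.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_occurrences(text: str, pattern: str, max_mismatches: int):
    """Return (position, window) pairs within max_mismatches of pattern."""
    text, pattern = text.lower(), pattern.lower()
    width = len(pattern)
    return [
        (i, text[i:i + width])
        for i in range(len(text) - width + 1)
        if hamming(text[i:i + width], pattern) <= max_mismatches
    ]


# A short made-up sequence with one exact and one inexact site.
sample = "ttCTGCAGaaCTaCAGgg"
print(approximate_occurrences(sample, "CTGCAG", 0))
print(approximate_occurrences(sample, "CTGCAG", 1))
```

Raising the mismatch budget recovers more of the planted sites but also admits more chance matches, which is why the significance question matters.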
5
The dual face of motif finding
• Motif finding really consists of two problems:
• finding the most pronounced motifs in the text
• statistical significance: are they merely artifacts of the size of the
data?
• In the remaining few minutes I will touch on the second problem
6
Assessing the significance of a putative motif
• Begin by aligning the motif occurrences:
tTGaAG
CTGgAG
CTGCAG
CTGCAG
CTGCtt
CTaCAG
CTGCAG
CTaCAG
CTaCAG
tTGCAt
then create the alignment matrix:
     1   2   3   4   5   6
A    0   0   3   1   9   0
C    8   0   0   8   0   0
G    0   0   7   1   0   8
T    2  10   0   0   1   2
which you then summarize with the entropy score:
• $I := \sum_{\text{column } i} \sum_{\text{letter } j} n_{ij} \log \frac{n_{ij}/n}{q_j}$, with $n = 10$
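
The alignment matrix and the entropy score can be computed directly from the ten occurrences. The sketch below assumes a uniform background (q_j = 1/4 for every letter), which the slides do not specify, and uses the natural logarithm. The function names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# The ten motif occurrences from the slide, one per row of the alignment.
OCCURRENCES = [
    "tTGaAG", "CTGgAG", "CTGCAG", "CTGCAG", "CTGCtt",
    "CTaCAG", "CTGCAG", "CTaCAG", "CTaCAG", "tTGCAt",
]


def alignment_matrix(occurrences):
    """Count each letter in each column; returns {letter: [count per column]}."""
    width = len(occurrences[0])
    columns = [Counter(occ[i].upper() for occ in occurrences) for i in range(width)]
    return {letter: [col[letter] for col in columns] for letter in ALPHABET}


def entropy_score(matrix, background):
    """I = sum over columns i and letters j of n_ij * log((n_ij / n) / q_j)."""
    n = sum(matrix[letter][0] for letter in ALPHABET)  # alignment depth
    score = 0.0
    for letter in ALPHABET:
        for count in matrix[letter]:
            if count > 0:  # a zero count contributes nothing to the sum
                score += count * math.log((count / n) / background[letter])
    return score


matrix = alignment_matrix(OCCURRENCES)
uniform = {letter: 0.25 for letter in ALPHABET}  # assumed background
for letter in ALPHABET:
    print(letter, matrix[letter])
print(entropy_score(matrix, uniform))
```

Each column's contribution is n times the relative entropy between the column's empirical letter distribution and the background, so a perfectly conserved column such as column 2 contributes the most.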
7
What’s in a score?
• By itself, the entropy score s of a particular motif has limited use
• we cannot compare scores of alignments with varying depth or
width
• The solution is to assess the statistical significance of s
• This is often accomplished by computing the p-value of the
observed score:
  • assuming the observed columns are randomly drawn from
    the background frequencies {qa, qc, qg, qt},
  • what is P0(I ≥ s)?
• This is not as simple as it might look at first sight
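
One naive way to see what P0(I ≥ s) means is to estimate it by simulation: draw random alignments from the background and count how often the score reaches s. The sketch below is not the method of this talk, nor what MEME or Consensus do. It assumes independent columns and independent letters within a column, and the function names and default parameters are illustrative.

```python
import math
import random


def column_score(counts, background, n):
    """Contribution of one column: sum_j n_j * log((n_j / n) / q_j)."""
    return sum(c * math.log((c / n) / q)
               for c, q in zip(counts, background) if c > 0)


def random_alignment_score(n, width, background, rng):
    """Score of one alignment whose letters are drawn from the background."""
    total = 0.0
    for _ in range(width):
        letters = rng.choices(range(len(background)), weights=background, k=n)
        counts = [letters.count(j) for j in range(len(background))]
        total += column_score(counts, background, n)
    return total


def monte_carlo_p_value(s, n, width, background, trials=2000, seed=0):
    """Fraction of random alignments whose score is at least s."""
    rng = random.Random(seed)
    hits = sum(random_alignment_score(n, width, background, rng) >= s
               for _ in range(trials))
    return hits / trials


# Depth 10, width 6, uniform background (an assumption), threshold s = 10.
print(monte_carlo_p_value(10.0, n=10, width=6, background=[0.25] * 4))
```

The limitation is immediate: with T trials the estimate cannot resolve p-values much below 1/T, while interesting motifs have p-values many orders of magnitude smaller, so plain sampling is not a practical answer.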
8
Can I submit two different answers?
MEME E-values compared with Consensus E-values (log10 scale)
• MEME is consistently pessimistic when compared with Consensus (by
a factor of over 500 at times)
• Who’s right?
9
Our work
• We developed a method that borrows ideas from large-deviation theory
to compute a reliable answer reasonably fast
• The same underlying idea can be used for other fundamental statistical
problems
• Joint work with: Neil Jones (Ph.D. student at UCSD), Niranjan
Nagarajan (Ph.D. student here)