Algorithms in Bioinformatics: A Practical Introduction
Project: Motif finding using ChIP-seq peak data
Transcriptional Control (I)
Transcriptional Control (II)
TATAAT is the motif!
Motif model
A motif can be described in two ways, based on the binding sites discovered:
Consensus pattern
Positional weight matrix (PWM)

Binding sites:
TTGACA
TCGACA
TTGACA
TTGAAA
ATGACA
TTGACA
GTGACA
TTGACT
TTGACC
TTGACA

Consensus pattern: TTGACA

PWM (rows: nucleotide; columns: alignment position):
     1    2    3    4    5    6
A    0.1  0    0    1    0.1  0.8
C    0    0.1  0    0    0.9  0.1
G    0.1  0    1    0    0    0
T    0.8  0.9  0    0    0    0.1
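As a minimal sketch of how the PWM relates to the binding sites, the frequencies can be obtained by counting each nucleotide at each alignment position and dividing by the number of sites. The function names `build_pwm` and `consensus` are illustrative, not part of the slides, and no pseudocounts are applied here.

```python
from collections import Counter

def build_pwm(sites, alphabet="ACGT"):
    """Return {nucleotide: [frequency at each alignment position]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    pwm = {nt: [] for nt in alphabet}
    for pos in range(length):
        # Count how often each nucleotide occurs in this column.
        counts = Counter(site[pos] for site in sites)
        for nt in alphabet:
            pwm[nt].append(counts[nt] / len(sites))
    return pwm

def consensus(pwm):
    """Pick the most frequent nucleotide at every position."""
    length = len(next(iter(pwm.values())))
    return "".join(max(pwm, key=lambda nt: pwm[nt][i]) for i in range(length))

# The ten binding sites from the slide.
sites = ["TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
         "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA"]
pwm = build_pwm(sites)
for nt in "ACGT":
    print(nt, pwm[nt])
print("consensus:", consensus(pwm))
```

In practice a small pseudocount is often added to each cell so that unseen nucleotides do not get probability zero; that choice is left open here.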
ChIP experiment
Chromatin immunoprecipitation experiment
Detects the interaction between a protein (transcription factor) and DNA.
Peak data
Peak data represents the locations where a particular TF binds. The data tells us the locations and intensities. (Note that, due to experimental error, peaks of low intensity may be noise.)
chr1:883,686-958,485
ChIP-seq data for Human (MCF7), E2 treatment at 45 min
Our aim
Given the DNA sequences of those peaks, find motifs that occur in those peak regions. For the example below, we have two motifs: TTGACA and GCATC. Note that each instance has at most 1 mutation.
GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGTGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATGGTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGTGAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
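One way to read "at most 1 mutation" is as a Hamming distance of at most 1 between a window of the sequence and the consensus. The sketch below makes that assumption (substitutions only, no insertions or deletions, forward strand only); `find_instances` is an illustrative name.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_instances(sequence, motif, max_mismatches=1):
    """Return (start, window) for every window within max_mismatches of motif."""
    k = len(motif)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        if hamming(window, motif) <= max_mismatches:
            hits.append((start, window))
    return hits

# Start of the example sequence from the slide.
prefix = "GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCG"
for motif in ("TTGACA", "GCATC"):
    print(motif, find_instances(prefix, motif))
```

A fuller implementation would probably also scan the reverse complement, since a transcription factor can bind either strand.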
Input (I)
From every peak, we extract the DNA sequence spanning approximately ±200 bp around the peak.
>cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCT
AAAACTGCTCAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAAAATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCTAAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTCCTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCATCTTTATTTACGAG
>cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCA
GTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCG
>cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCC
AAGTCCCGCCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGACAACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGTAGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTGGAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAGCTGGAGTTCGACCTGGGCAACC
……………
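The FASTA headers above appear to encode the peak location, the extracted range, and the peak intensity. The sketch below parses them under the assumption that every header follows the pattern `<name>_<chrom>_<peak start>_<peak end>_range_<chrom>_<range start>_<range end>_intensity_<value>`, which is inferred from the three examples shown and is not a documented format. `read_fasta` and `parse_header` are illustrative names.

```python
import re

# Assumed header layout, inferred from the example records only.
HEADER_RE = re.compile(
    r"^>?(?P<name>.+)_(?P<chrom>chr[^_]+)_(?P<peak_start>\d+)_(?P<peak_end>\d+)"
    r"_range_(?P<range_chrom>chr[^_]+)_(?P<range_start>\d+)_(?P<range_end>\d+)"
    r"_intensity_(?P<intensity>\d+)$"
)

def parse_header(header):
    """Split a peak header into named fields; numeric fields become ints."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    fields = match.groupdict()
    for key in ("peak_start", "peak_end", "range_start", "range_end", "intensity"):
        fields[key] = int(fields[key])
    return fields

def read_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

info = parse_header(
    ">cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20"
)
print(info)
```

Keeping the intensity alongside each sequence makes it possible to down-weight or filter low-intensity peaks later, which the slides note may be noise.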
Input (II)
A set of background sequences that are likely to contain no motif.
>SEQ_1
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTC
AGGACTGAGAAGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGAACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTTCACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGATGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGTTTTAAGGAAAA
>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGT
TTTTAATTGTCAAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGAGTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACCCTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTCCCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGGGC
>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACA
GTGCTTTCAATACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAATATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGAATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTCCAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA
……………
Output
You need to output a ranked list of candidate motifs. You can model each motif as a PWM or a consensus sequence.
If you model the motif as a PWM, one of the answers for the previous dataset is
[figure: PWM for the previous dataset]
You may also return other significant motifs.
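As one possible baseline for producing a ranked list of consensus candidates, every k-mer can be scored by how much more often it occurs in the peak sequences than in the background sequences. The sketch below is a simple enrichment heuristic chosen for illustration, not the method prescribed by the project; `rank_kmers`, the pseudocount, and the log-ratio score are all assumptions.

```python
import math
from collections import Counter

def kmers_present(sequence, k):
    """Set of distinct k-mers in a sequence (each counted once per sequence)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def rank_kmers(sample, background, k=6, pseudocount=1.0, top=10):
    """Rank k-mers by log2 of (fraction of sample sequences containing the
    k-mer) over (fraction of background sequences containing it)."""
    sample_counts, background_counts = Counter(), Counter()
    for seq in sample:
        sample_counts.update(kmers_present(seq, k))
    for seq in background:
        background_counts.update(kmers_present(seq, k))
    scores = {}
    for kmer, count in sample_counts.items():
        # Pseudocounts avoid division by zero for k-mers absent from background.
        fg = (count + pseudocount) / (len(sample) + 2 * pseudocount)
        bg = (background_counts[kmer] + pseudocount) / (len(background) + 2 * pseudocount)
        scores[kmer] = math.log2(fg / bg)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]

# Toy data for illustration only (not taken from the project files).
sample = ["AATTGACAGG", "CCTTGACATT", "GGTTGACACC"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
ranked = rank_kmers(sample, background, k=6)
for kmer, score in ranked[:3]:
    print(kmer, round(score, 3))
```

Exact k-mer counting ignores mutated instances, so a real solution would likely group similar k-mers or refine the top candidates into PWMs.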
Aim of the project
Given a sample file and a background file, you need to implement a method that outputs a list of motifs.
You need to take advantage of the fact that this is a ChIP-seq dataset.
Hint: Read papers on ChIP-seq and understand its properties.
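One ChIP-seq property that is often exploited is that true binding sites tend to lie close to the centre of the peak, so a genuine motif should occur nearer the middle of the extracted sequences than a random k-mer does. The sketch below measures this for a candidate k-mer; it assumes each input sequence is centred on its peak, and the function names are illustrative.

```python
def center_distances(sequences, kmer):
    """Distance from the midpoint of each exact k-mer occurrence to the
    midpoint of the sequence it occurs in."""
    distances = []
    for seq in sequences:
        mid = len(seq) / 2
        start = seq.find(kmer)
        while start != -1:
            distances.append(abs(start + len(kmer) / 2 - mid))
            # Continue one position later so overlapping occurrences are found.
            start = seq.find(kmer, start + 1)
    return distances

def mean_center_distance(sequences, kmer):
    """Average distance to the centre, or None if the k-mer never occurs."""
    distances = center_distances(sequences, kmer)
    return sum(distances) / len(distances) if distances else None

# Toy sequences for illustration only.
peaks = ["AATTGACAGG", "TTGACAAAAA"]
print(mean_center_distance(peaks, "TTGACA"))
print(mean_center_distance(peaks, "CCCCCC"))
```

A candidate whose mean distance is clearly smaller than that of typical k-mers could be ranked higher; peak intensity could be used in a similar way as a per-sequence weight.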