Algorithms in Bioinformatics: A Practical Introduction
Project: Motif finding using ChIP-seq peak data

Transcriptional Control (I)

Transcriptional Control (II)

TATAAT is the motif!

Motif model

A motif can be described in two ways, based on the binding sites discovered:

Consensus pattern: TTGACA

Positional weight matrix (PWM), computed from the aligned binding sites:

TTGACA
TCGACA
TTGACA
TTGAAA
ATGACA
TTGACA
GTGACA
TTGACT
TTGACC
TTGACA

            alignment position
nucleotide  1    2    3    4    5    6
A           0.1  0    0    1    0.1  0.8
C           0    0.1  0    0    0.9  0.1
G           0.1  0    1    0    0    0
T           0.8  0.9  0    0    0    0.1
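
As an illustration of how the two models relate, the sketch below recomputes the PWM and the consensus pattern from the ten binding sites on this slide. The function names `build_pwm` and `consensus` are illustrative choices, not part of the project specification, and the sketch assumes all sites have the same length and contain only A, C, G, T.

```python
from collections import Counter

# The ten aligned binding sites shown on the slide.
SITES = [
    "TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
    "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA",
]


def build_pwm(sites):
    """Return {base: [frequency at each alignment position]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    pwm = {base: [] for base in "ACGT"}
    # zip(*sites) yields one column of the alignment at a time.
    for column in zip(*sites):
        counts = Counter(column)
        for base in "ACGT":
            pwm[base].append(counts[base] / len(sites))
    return pwm


def consensus(pwm):
    """Pick the most frequent base at each alignment position."""
    length = len(pwm["A"])
    return "".join(
        max("ACGT", key=lambda base: pwm[base][i]) for i in range(length)
    )


if __name__ == "__main__":
    pwm = build_pwm(SITES)
    for base in "ACGT":
        print(base, pwm[base])
    print("consensus:", consensus(pwm))
```

The consensus pattern keeps only the most frequent base per position, while the PWM also records how often each position deviates from it.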

ChIP experiment

Chromatin immunoprecipitation experiment

Detects the interaction between a protein (transcription factor) and DNA.

Peak data

Peak data represents the locations where a particular TF binds. The data tells us the locations and intensities. (Note that, due to experimental error, peaks of low intensity may be noise.)

Figure: ChIP-seq data for human (MCF7), E2 treatment at 45 min; region chr1:883,686-958,485.

Our aim

Given the DNA sequences of those peaks, find motifs that occur in the peak regions. In the example below, there are two motifs: TTGACA and GCATC. Note that each instance has at most 1 mutation.

GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGTGCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCGCCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATGGTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAGCAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGCGTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGTGAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
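
To make "at most 1 mutation" concrete, the sketch below scans a sequence for windows that differ from a motif in at most one position (Hamming distance at most 1). This is only a way to locate instances of an already known motif, not a motif finder; the names `hamming` and `find_instances` are illustrative, and the example string is the first 30 bases of the sequence above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, max_mismatches=1):
    """Return (start, window, mismatches) for every window of the
    sequence that matches the motif with at most max_mismatches
    substitutions. Start positions are 0-based."""
    k = len(motif)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            hits.append((start, window, distance))
    return hits


if __name__ == "__main__":
    # First 30 bases of the example sequence on this slide.
    prefix = "GCACGCGGTATCGTTAGCTTGACAATGAAG"
    for motif in ("TTGACA", "GCATC"):
        print(motif, find_instances(prefix, motif))
```

The scan costs one comparison per window per motif position, which is fine for locating a given motif but says nothing about how to discover the motif in the first place.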

Input (I)

For every peak, we extract the DNA sequence spanning approximately ±200 bp around the peak.

>cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAACACAGCCTTTATATTTTGATATGCCT
AAAACTGCTCAATGGCTGGGCCACTTCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAGTCATTACTTACTAAAATCTGTAATTACATCTTTGCTGCCCTAGGCCCAATCTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCTAAGCTCTCTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCAAACATGGTGGCTCTCCTTCCTCCTTCTTCCTGTCTGTCCCCAGCCTGGGAATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGCATCTTTATTTACGAG

>cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCA
GTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCG

>cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCGGGCCAACATCCGGTGACGAATCC
AAGTCCCGCCTCTAAGCCCATCTGCTGTCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCGCTTCCAGACAACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGGACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGTAGTGGCGATACGTCCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGAGGGCCAGCGCGTGGAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGGAGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAGCTGGAGTTCGACCTGGGCAACC

……………
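
The record headers above appear to encode the peak coordinates, the extracted range, and the peak intensity. The sketch below parses one such header under that assumption; the field interpretation is inferred from the header text alone, and the names `Peak`, `HEADER_RE`, and `parse_header` are hypothetical helpers, not part of the project files.

```python
import re
from typing import NamedTuple


class Peak(NamedTuple):
    name: str          # e.g. "cmyc_1"
    chrom: str         # e.g. "chr1"
    peak_start: int
    peak_end: int
    range_start: int   # start of the extracted sequence
    range_end: int     # end of the extracted sequence
    intensity: int


# Assumed layout, inferred from the sample headers:
# >NAME_INDEX_CHR_PEAKSTART_PEAKEND_range_CHR_RANGESTART_RANGEEND_intensity_VALUE
HEADER_RE = re.compile(
    r"^>?(?P<name>[A-Za-z0-9]+_\d+)"
    r"_(?P<chrom>chr[A-Za-z0-9]+)_(?P<peak_start>\d+)_(?P<peak_end>\d+)"
    r"_range_chr[A-Za-z0-9]+_(?P<range_start>\d+)_(?P<range_end>\d+)"
    r"_intensity_(?P<intensity>\d+)$"
)


def parse_header(header):
    """Parse one FASTA header line into a Peak, or raise ValueError."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    fields = match.groupdict()
    return Peak(
        name=fields["name"],
        chrom=fields["chrom"],
        peak_start=int(fields["peak_start"]),
        peak_end=int(fields["peak_end"]),
        range_start=int(fields["range_start"]),
        range_end=int(fields["range_end"]),
        intensity=int(fields["intensity"]),
    )


if __name__ == "__main__":
    peak = parse_header(
        ">cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20"
    )
    print(peak)
    # Flanks on each side of the peak, consistent with roughly +/-200 bp.
    print(peak.peak_start - peak.range_start, peak.range_end - peak.peak_end)
```

Keeping the intensity and the peak position alongside each sequence makes it possible to use them later, for example to treat low-intensity peaks with more caution.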

Input (II)

A set of background sequences that likely contain no motif.

>SEQ_1
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAGTTGCTGTTAGCTAAGACAGTC
AGGACTGAGAAGGGGGGGGGGGGTTTAACTCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAGCCGAGAACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCCCGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTTCACACAAGGCGTGAACACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAGAGCCAAGATCGATGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGAGGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGTTTTAAGGAAAA

>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGGAGATTAGGGCTGGATGACAAGT
TTTTAATTGTCAAGGACTCAATTCTGTTTATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGAGCAGTGTAGAGAGTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAGAACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACCCTGGGCTAGCATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAAGAAGGACCTGCGTTCCCTCCCTCCTTGTGATCAGCTGATTGCTAGAGCATAACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGGGC

>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGTACCCTTTCAGGAGCCTAAAACA
GTGCTTTCAATACTTGTGTCTATGTCTGTTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGTATTTGCTCAAATATATATTCATATGGTGAGGTGCACATTTTTTATATTATATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGAATAAAGCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCAGAGGCAGACAAATCTCTGCGTTCCAGGACAGCCTGGTCTATAGAGCAAGGTCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA

……………

Output

You need to output a ranked list of candidate motifs. You can model each motif as a PWM or a consensus sequence.

If you model the motif as a PWM, one of the answers for the previous dataset is:

[Figure: PWM for the previous dataset]

You may also return other significant motifs.
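
One simple way to produce a ranked list of consensus-style candidates is to compare how often each k-mer occurs in the sample sequences versus the background sequences. The sketch below is a naive baseline under stated assumptions, not the expected solution: exact k-mer matching only (no mutations), uppercase A/C/G/T input, a fixed k, and a log-ratio score with a pseudocount. The names `count_kmer_presence` and `rank_kmers` are illustrative.

```python
import math
from collections import Counter


def count_kmer_presence(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def rank_kmers(sample, background, k=6, pseudocount=1.0, top=10):
    """Rank k-mers by log2 enrichment of sample over background.

    The pseudocount avoids division by zero for k-mers that never
    occur in the background (an assumption of this sketch).
    """
    sample_counts = count_kmer_presence(sample, k)
    background_counts = count_kmer_presence(background, k)
    scored = []
    for kmer, n_sample in sample_counts.items():
        f_sample = (n_sample + pseudocount) / (len(sample) + 2 * pseudocount)
        f_background = (background_counts[kmer] + pseudocount) / (
            len(background) + 2 * pseudocount
        )
        scored.append((kmer, math.log2(f_sample / f_background)))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    sample = ["AATTGACAGG", "CCTTGACATT", "GGTTGACACC"]
    background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
    for kmer, score in rank_kmers(sample, background, top=3):
        print(kmer, round(score, 3))
```

A real submission would still need to handle mutated instances and to exploit the ChIP-seq specific information in the sample file, which this baseline ignores.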

Aim of the project

Given a sample file and a background file, you need to implement a method that outputs a list of motifs.

You need to take advantage of the fact that this is a ChIP-seq dataset. Hint: Read papers on ChIP-seq and understand its properties.