Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Sequence Motif Analysis
Lecture in M.Sc. Biomedizin,Module: ”Proteinbiochemie und Bioinformatik”
Jonas Ibn-Salem
Andrade groupJohannes Gutenberg University Mainz
Institute of Molecular Biology
March 7, 2016
Morgane Thomas-Chollier (ENS, Paris) kindly shared some of her slides
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 1 / 46
Overview
1 Sequence Motif ModelsConsensus sequenceMatrix representation
2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based
3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 2 / 46
Overview
1 Sequence Motif ModelsConsensus sequenceMatrix representation
2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based
3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 3 / 46
Motif analysis of transcription factor binding sites
Question
How do transcription factors (TF) know where to bind DNA?
� TFs recognize binding sites with specific DNA sequences.
� However, bound sequences are often slightly different.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46
Motif analysis of transcription factor binding sites
Question
How do transcription factors (TF) know where to bind DNA?
� TFs recognize binding sites with specific DNA sequences.
� However, bound sequences are often slightly different.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46
Motif analysis of transcription factor binding sites
Question
How do transcription factors (TF) know where to bind DNA?
� TFs recognize binding sites with specific DNA sequences.
� However, bound sequences are often slightly different.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46
Modelling the sequence specificity of TFs
Question
How can we describethe sequence specificityof a given TF?
Multiple alignment:
Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46
Modelling the sequence specificity of TFs
Question
How can we describethe sequence specificityof a given TF?
Multiple alignment:
Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46
Modelling the sequence specificity of TFs
Question
How can we describethe sequence specificityof a given TF?
Multiple alignment:
Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46
Motif Representations
Question
How is a motif represented / described?
� String-based� Strict consensus� Degenerate consensus
� Regular expressions
� Matrix based� Position frequency matrix (PFM)� Position probability matrix (PPM)� Position specific scoring matrix (PSSM)
� Sequence Logos
� Hidden Markov Models (HMM)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 6 / 46
Motif Representations
Question
How is a motif represented / described?
� String-based� Strict consensus� Degenerate consensus
� Regular expressions
� Matrix based� Position frequency matrix (PFM)� Position probability matrix (PPM)� Position specific scoring matrix (PSSM)
� Sequence Logos
� Hidden Markov Models (HMM)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 7 / 46
Consensus sequence
� A consensus sequence is a motifdescription as string(=sequence).
� A strict consensus sequence isderived from the collection ofbinding sites by taking thepredominant letter at eachposition (column) in thealignment.
� A degenerate consensussequence is defined by selectinga degeneracy nucleotide symbolfor each position.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 8 / 46
Matrix model of sequence motifs
Problem
TF binding sites (TFBS) are degenerated. A given TF is able to bind DNAon TFBSs with different sequences.
� A Position Frequency Matrix (PFM) is createdby counting the the occurrences of each base(rows) in each position (columns) of the alignment.
� This is a more quantitative description of theknown binding sites.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 9 / 46
Matrix model of sequence motifs
Problem
TF binding sites (TFBS) are degenerated. A given TF is able to bind DNAon TFBSs with different sequences.
� A Position Frequency Matrix (PFM) is createdby counting the the occurrences of each base(rows) in each position (columns) of the alignment.
� This is a more quantitative description of theknown binding sites.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 9 / 46
Conversion to Position Probability matrix (PPM)Position Frequency Matrix (PFM):
Position Probability Matrix (PPM):
Reference: [Hertz and Stormo, 1999] Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 10 / 46
Position specific scoring matrix (PSSM) or weight matrix
� To model the enrichment of an observed frequency over somebackground the PFM can be converted to a position specific scoringmatrix (PSSM) or weight matrix W .
Wk,j = log(Mk,j/bk)
where bk is a background probability for each base (under a Bernoullimodel).
� Usually a pseudo-count of +1 is added to the PFM before thetransformation to avoid the logarithm of zero.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 11 / 46
Sequence Logo
� A Sequence logo is a graphicalrepresentation of a motif.
� The hight of each letter is proportional tothe frequency of each residue at eachposition.
� The total height of each column isproportional to the sequence conservationand indicate the amount of informationcontained in each position (informationcontent measured in bits).
� Allows easy identification of the mostimportant positions in the motif.Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 12 / 46
Overview
1 Sequence Motif ModelsConsensus sequenceMatrix representation
2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based
3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 13 / 46
De novo Motif Discovery
Motif discovery problem
If there is a common regulating factor, can we discover its motif by using aset of functionally related sequences?
The input sequences might be
� Upstream sequences from co-expressedgenes
� Sequences of ChIP-seq peaks
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 14 / 46
Motif discovery using word counting
Idea
Since motifs corresponding to binding sites are generally repeated in thedataset, we can capture this signal of over-representation statistically.
� Binding sites are represented as ”words” == ”string” == ”k-mer”(e.g. acgtga is a 6-mer).
� Signal is likely to be more frequent in the region of interest than in arandom selection of regions.
� We can thus detect over-represented words.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 15 / 46
Algorithm for motif discovery using word counting
Algorithm
1 Count occurrences of all k-mers in a set of related sequences.
2 Estimate the expected number of occurrences from a backgroundmodel
� Empirical based on observed k-mer frequencies� Theoretical background model (Markov Models)
3 Statistical evaluation of the deviation observed (P-value/E-value).
4 Select all words above a defined threshold.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 16 / 46
Example of yeast under low nitrogen condition
� Expression of 7 yeast (Saccharomycescerevisiae) genes under low nitrogen conditions(NIT)
� Count all possible 6-mers in upstreamsequences.
Question:
� Does this mean that the word aaaaaa isenriched in upstream sequences of NIT genes?
Sequence freq.
aaaaaa|tttttt 80cttatc|gataag 26tatata|tatata 22ataaga|tcttat 20aagaaa|tttctt 20gaaaaa|tttttc 19atatat|atatat 19agataa|ttatct 17agaaaa|ttttct 17aaagaa|ttcttt 16aaaaca|tgtttt 16aaaaag|cttttt 15agaaga|tcttct 14tgataa|ttatca 14atataa|ttatat 14
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 17 / 46
Example of yeast under low nitrogen condition (part 2)
Problem
Upstream sequences have a compositional bias and are enriched forAT-rich sequences.
We need a background model to calculate the enrichment.Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 18 / 46
Empirical background based on observed k-mer frequences
Idea
Count the occurrences of all k-mers in upstream sequences of all genes
Sequence freq.
aaaaaa|tttttt 80cttatc|gataag 26tatata|tatata 22ataaga|tcttat 20aagaaa|tttctt 20gaaaaa|tttttc 19atatat|atatat 19agataa|ttatct 17agaaaa|ttttct 17aaagaa|ttcttt 16aaaaca|tgtttt 16aaaaag|cttttt 15agaaga|tcttct 14tgataa|ttatca 14atataa|ttatat 14
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 19 / 46
Estimation of background by Markov Models
Idea
Estimate the sequence composition using a statistical model.
� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).
� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244
� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216
� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.
� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )
� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46
Estimation of background by Markov Models
Idea
Estimate the sequence composition using a statistical model.
� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).
� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244
� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216
� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.
� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )
� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46
Estimation of background by Markov Models
Idea
Estimate the sequence composition using a statistical model.
� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).
� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244
� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216
� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.
� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )
� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46
Statistical evaluation of over representation
� Assume we found 18occurrences of the word ACGTGA
and estimated the expectednumber to be 2.95 using thebackground model.
Question:
� How big is the surprise to observe 18 occurrences when we expect2.95?
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 21 / 46
Statistical evaluation of over representation� At each position in the sequences, there is a probability p that the
word starting at this position is ACGTGA.� We consider n = 9000 positions.� From the background model we know the expected probabilityp = 2.95
9000 = 0.00032.� What is the probability that k = 18 of these n positions correspond toACGTGA?
� Binomial distribution to measure the significance:
P(X ≥ k) = 1−k∑
i=0
(n
i
)pi (1− p)n−i
In R:
> binom.test(x=18, n=9000, p=0.00033) ⇒ p-value = 3.029× 10−9
� Multiple testing correction by multiplying the p-value by the numberof tested words.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 22 / 46
Statistical evaluation of over representation� At each position in the sequences, there is a probability p that the
word starting at this position is ACGTGA.� We consider n = 9000 positions.� From the background model we know the expected probabilityp = 2.95
9000 = 0.00032.� What is the probability that k = 18 of these n positions correspond toACGTGA?
� Binomial distribution to measure the significance:
P(X ≥ k) = 1−k∑
i=0
(n
i
)pi (1− p)n−i
In R:
> binom.test(x=18, n=9000, p=0.00033) ⇒ p-value = 3.029× 10−9
� Multiple testing correction by multiplying the p-value by the numberof tested words.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 22 / 46
Assembling overlapping words
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 23 / 46
Motif discovery using matrices
Idea:
Find a matrix model that optimally describes a motif that is enriched inthe sequences.
� Comprehensive approach� Test all possible motif matrices that can build from the sequences.� Compute a score (e.g. information content, P-value) associated to
each motif matrix.� Report the highest scoring matrix.
Problem: The number of possible matrices is too large to becomputed in a reasonable time.
� Heuristics� Expectation-maximization (MEME)� Gibbs sampling (MotifSampler, info-gibbs,...)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 24 / 46
MEME (Multiple EM for Motif Elicitation)Expectation-maximization (EM)
� Instantiate a seed motif
� Iterate N times to enhance matrix:� Maximization: select the x highest scoring sites� Expectation: build a new matrix from the collected sites
Multiple EM
� Iterate over each k-mer fund in the input set
� Ranking / selection of the best matrices according to their scoreSource: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 25 / 46
Gibbs Sampling for stochastic optimisation
� Deterministic selection of the highest scoring site might result in localoptimum.
� Therefore, Gibbs sampling uses a random sampling step to betterexplore the ”search space” of possible solutions.
Idea:
� The idea in Gibbs sampling is to choose a site stochastically accordingto the score distribution to find the global optimum.
Source: https://c2.staticflickr.com/6/5289/5201851938_99424ccac6_b.jpg
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 26 / 46
Gibbs Sampling
� Step 0: Initialisation of the matrix and background model
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 27 / 46
Gibbs Sampling
� Step 1: Updating
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 28 / 46
Gibbs Sampling
� Step 2: Random sampling
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 29 / 46
Gibbs Sampling
� Step 2: random sampling
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 30 / 46
Gibbs Sampling
� Step 1: updating (2nd iteration)
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 31 / 46
Gibbs Sampling: Overview
0 Initialization� Select a random set of sites in the
sequence set� Create a matrix with these sites
1 Sampling (Stochastic Expectation)� Isolate one sequence from the set,
and score each position (site) ofthis sequence.
� Select one ”random” site, with aprobability proportional to thescore.
2 Predictive update (Maximization)� Replace the old site with a new
site, and update the matrix.
3 Iterate Step 1 and 2 for a fixednumber of cycles.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 32 / 46
De novo motif discovery with RSAT peak-motifs
source: [Thomas-Chollier et al., 2012]
� The RSAT tool suiteimplements a completepipeline for ChIP-seqbased motif analysis.
� For motif discovery ituses four complementaryalgorithms:
� Globalover-representation
� oligo-analysis� dyad-analysis
(spaced motifs)
� Positional bias� position-analysis� local-words
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 33 / 46
Overview
1 Sequence Motif ModelsConsensus sequenceMatrix representation
2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based
3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 34 / 46
Collections of Known Motifs
Idea
After de novo motif discovery, we might want to know to which TF itbelongs. We can compare a discovered motif to databases of known TFmotifs.
Many database of known TF motifs exist:
� Public (free accessible) and commercial databases
� General, organism, or experimental specific databases
� Often incomplete, redundant, and of heterogeneous qualitySource: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 35 / 46
TRANSFAC
� Commercial (BioBase�)
� Eukaryotic TF matrixmodels from bindingsites with evidence andpublications
� Pattern matching tools
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 36 / 46
JASPAR
� Public (http://jaspar.genereg.net/)
� Contain matrix modelsand access to sites fromwhich they wereconstructed
� Tools for patternmatching and matrixrandomization
Reference: [Mathelier et al., 2015]
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 37 / 46
Scanning sequences for motif hits
Pattern Matching Problem:
Given a known TF recognition motif and a sequence, we want to know allpositions where the motif matches the sequence.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 38 / 46
Motif discovery vs. Pattern Matching
Motif Discovery Problem:
Given a set of related sequences,identify common sequence motifs infrom these sequences
Input: Set of sequences
Output: Motif model
Example
� What is the sequence motif ofthe TF in a ChIP-seqexperiment?
� Are there other co-factorsinvolved?
� Which TF bind promoters ofsome co-expressed genes?
Pattern Matching Problem:
Given a known TF recognitionmotif and a sequence, we want toknow all positions where the motifmatches the sequence.
Input: Motif, Sequence
Output: Positions of motif hits
Example
� Where does a TF bind exactlyin a ChIP-seq peak region?
� Does a TF bind direct orindirect at a ChIP-seq peakregion?
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 39 / 46
Searching with Known Motif: String based
Problem:
Given a motif as consensus sequence, find all matches in an inputsequence.
� Compare each position along the sequence with the motif (scanning)and report matches.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 40 / 46
Searching with Known Motif: Matrix based
Problem:
Given a matrix motif model, find all significant hits (motif instances) in asequence.
� The similarity of a given sequence segment S with the motif can becomputed as weight score WS
WS = logP(S |M)
P(S |B)
� P(S |M) is the probability that the sequence segment S occursaccording to the motif model M. This can be simply computed fromthe matrix.
� P(S |B) is the probability of S under a background model B. Thiscan be a first order Markov model.
� The higher the score WS the more similar is the sequence S to thematrix model M.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 41 / 46
Probability of a sequence under the matrix model P(S |M)
� Given a matrix and sequence we can calculate
P(S |M) =w∏j=1
Mrj ,j
where M is the frequency matrix of width w , S = {r1, r2, ..., r2} is thesequence segment and rj the residue found at position j in thesequence S .
� Note, that usually a corrected matrix is used by adding pseudo-counts(e.g. +1) to avoid zeros in the calculation.
Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 42 / 46
Probability of a seq. under the background model P(S |B)
� The probability under the background model can be calculated as
P(S |B) =w∏j=1
prj
where prj is the probability of observing the residue rj according tothe background model.
� A background model (B) should estimate the probability of asequence motif in non-biding sites.
� Various possible background models. E.g.� Bernoulli model with residue-specific probabilities (pr )� Markov models to take dinucleotide frequencies into account (e.g
GC-bias)Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 43 / 46
Assigning a score to each segment and threshold usage
� The sequence is scanned with the matrix and a score is assigned toeach position.
� Resulting hits depend highly on the threshold on the weight score:� Stringent threshold ⇒ high confidence in predicted sites, but many
sites missed.� Loose threshold ⇒ real sides hidden in many false positives.
� Some programs compute theoretical p-values from the weight score.� A threshold on a p-value can be easier interpreted: For p = 10−4 we
expect 1 false prediction every 10.000 base pairs (on one strand).Source: Morgane Thomas-Chollier (ENS, Paris)
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 44 / 46
Summary
� Transcription factors (TFs) recognize specific DNA sequence motifs.
� Sequence motifs can be represented as consensus sequence (string) ormore quantitatively as matrix model to capture the specificity of eachresidue at each position.
� A sequence logo plot is a graphical representation of a motif matrix.
� An enriched sequence motif can be detected in a set of functionallyrelated sequences using de novo motif discovery approaches.
� Known TF recognition motifs can be retrieved as matrix model fromonline databases.
� Pattern matching aims at finding putative TF binding sites (for whichthe binding motif is known) in DNA sequences.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 45 / 46
References
Hertz, G. Z. and Stormo, G. D. (1999).
Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics (Oxford, England), 15(7-8):563–77.
Mathelier, A., Fornes, O., Arenillas, D. J., Chen, C.-y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R.,
Zhang, A. W., Parcy, F., Lenhard, B., Sandelin, A., and Wasserman, W. W. (2015).JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles.Nucleic acids research, 44(November 2015):gkv1176.
Stormo, G. D. (2013).
Modeling the specificity of protein-DNA interactions.Quantitative biology, 1(2):115–130.
Thomas-Chollier, M., Darbo, E., Herrmann, C., Defrance, M., Thieffry, D., and van Helden, J. (2012).
A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs.Nature protocols, 7(8):1551–68.
Wasserman, W. W. and Sandelin, A. (2004).
Applied bioinformatics for the identification of regulatory elements.Nature reviews. Genetics, 5(4):276–87.
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 46 / 46