54
Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: ”Proteinbiochemie und Bioinformatik” Jonas Ibn-Salem Andrade group Johannes Gutenberg University Mainz Institute of Molecular Biology March 7, 2016 Morgane Thomas-Chollier (ENS, Paris) kindly shared some of her slides Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 1 / 46

Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Sequence Motif Analysis

Lecture in M.Sc. Biomedizin,Module: ”Proteinbiochemie und Bioinformatik”

Jonas Ibn-Salem

Andrade groupJohannes Gutenberg University Mainz

Institute of Molecular Biology

March 7, 2016

Morgane Thomas-Chollier (ENS, Paris) kindly shared some of her slides

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 1 / 46

Page 2: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Overview

1 Sequence Motif ModelsConsensus sequenceMatrix representation

2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based

3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 2 / 46

Page 3: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Overview

1 Sequence Motif ModelsConsensus sequenceMatrix representation

2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based

3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 3 / 46

Page 4: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif analysis of transcription factor binding sites

Question

How do transcription factors (TF) know where to bind DNA?

� TFs recognize binding sites with specific DNA sequences.

� However, bound sequences are often slightly different.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46

Page 5: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif analysis of transcription factor binding sites

Question

How do transcription factors (TF) know where to bind DNA?

� TFs recognize binding sites with specific DNA sequences.

� However, bound sequences are often slightly different.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46

Page 6: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif analysis of transcription factor binding sites

Question

How do transcription factors (TF) know where to bind DNA?

� TFs recognize binding sites with specific DNA sequences.

� However, bound sequences are often slightly different.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 4 / 46

Page 7: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Modelling the sequence specificity of TFs

Question

How can we describethe sequence specificityof a given TF?

Multiple alignment:

Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46

Page 8: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Modelling the sequence specificity of TFs

Question

How can we describethe sequence specificityof a given TF?

Multiple alignment:

Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46

Page 9: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Modelling the sequence specificity of TFs

Question

How can we describethe sequence specificityof a given TF?

Multiple alignment:

Sequence motifs areshort, recurring patternsin DNA that arepresumed to have abiological function.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 5 / 46

Page 10: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif Representations

Question

How is a motif represented / described?

� String-based� Strict consensus� Degenerate consensus

� Regular expressions

� Matrix based� Position frequency matrix (PFM)� Position probability matrix (PPM)� Position specific scoring matrix (PSSM)

� Sequence Logos

� Hidden Markov Models (HMM)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 6 / 46

Page 11: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif Representations

Question

How is a motif represented / described?

� String-based� Strict consensus� Degenerate consensus

� Regular expressions

� Matrix based� Position frequency matrix (PFM)� Position probability matrix (PPM)� Position specific scoring matrix (PSSM)

� Sequence Logos

� Hidden Markov Models (HMM)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 7 / 46

Page 12: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Consensus sequence

� A consensus sequence is a motifdescription as string(=sequence).

� A strict consensus sequence isderived from the collection ofbinding sites by taking thepredominant letter at eachposition (column) in thealignment.

� A degenerate consensussequence is defined by selectinga degeneracy nucleotide symbolfor each position.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 8 / 46

Page 13: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Matrix model of sequence motifs

Problem

TF binding sites (TFBS) are degenerated. A given TF is able to bind DNAon TFBSs with different sequences.

� A Position Frequency Matrix (PFM) is createdby counting the the occurrences of each base(rows) in each position (columns) of the alignment.

� This is a more quantitative description of theknown binding sites.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 9 / 46

Page 14: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Matrix model of sequence motifs

Problem

TF binding sites (TFBS) are degenerated. A given TF is able to bind DNAon TFBSs with different sequences.

� A Position Frequency Matrix (PFM) is createdby counting the the occurrences of each base(rows) in each position (columns) of the alignment.

� This is a more quantitative description of theknown binding sites.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 9 / 46

Page 15: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Conversion to Position Probability matrix (PPM)Position Frequency Matrix (PFM):

Position Probability Matrix (PPM):

Reference: [Hertz and Stormo, 1999] Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 10 / 46

Page 16: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Position specific scoring matrix (PSSM) or weight matrix

� To model the enrichment of an observed frequency over somebackground the PFM can be converted to a position specific scoringmatrix (PSSM) or weight matrix W .

Wk,j = log(Mk,j/bk)

where bk is a background probability for each base (under a Bernoullimodel).

� Usually a pseudo-count of +1 is added to the PFM before thetransformation to avoid the logarithm of zero.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 11 / 46

Page 17: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Sequence Logo

� A Sequence logo is a graphicalrepresentation of a motif.

� The hight of each letter is proportional tothe frequency of each residue at eachposition.

� The total height of each column isproportional to the sequence conservationand indicate the amount of informationcontained in each position (informationcontent measured in bits).

� Allows easy identification of the mostimportant positions in the motif.Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 12 / 46

Page 18: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Overview

1 Sequence Motif ModelsConsensus sequenceMatrix representation

2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based

3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 13 / 46

Page 19: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

De novo Motif Discovery

Motif discovery problem

If there is a common regulating factor, can we discover its motif by using aset of functionally related sequences?

The input sequences might be

� Upstream sequences from co-expressedgenes

� Sequences of ChIP-seq peaks

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 14 / 46

Page 20: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif discovery using word counting

Idea

Since motifs corresponding to binding sites are generally repeated in thedataset, we can capture this signal of over-representation statistically.

� Binding sites are represented as ”words” == ”string” == ”k-mer”(e.g. acgtga is a 6-mer).

� Signal is likely to be more frequent in the region of interest than in arandom selection of regions.

� We can thus detect over-represented words.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 15 / 46

Page 21: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Algorithm for motif discovery using word counting

Algorithm

1 Count occurrences of all k-mers in a set of related sequences.

2 Estimate the expected number of occurrences from a backgroundmodel

� Empirical based on observed k-mer frequencies� Theoretical background model (Markov Models)

3 Statistical evaluation of the deviation observed (P-value/E-value).

4 Select all words above a defined threshold.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 16 / 46

Page 22: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Example of yeast under low nitrogen condition

� Expression of 7 yeast (Saccharomycescerevisiae) genes under low nitrogen conditions(NIT)

� Count all possible 6-mers in upstreamsequences.

Question:

� Does this mean that the word aaaaaa isenriched in upstream sequences of NIT genes?

Sequence freq.

aaaaaa|tttttt 80cttatc|gataag 26tatata|tatata 22ataaga|tcttat 20aagaaa|tttctt 20gaaaaa|tttttc 19atatat|atatat 19agataa|ttatct 17agaaaa|ttttct 17aaagaa|ttcttt 16aaaaca|tgtttt 16aaaaag|cttttt 15agaaga|tcttct 14tgataa|ttatca 14atataa|ttatat 14

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 17 / 46

Page 23: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Example of yeast under low nitrogen condition (part 2)

Problem

Upstream sequences have a compositional bias and are enriched forAT-rich sequences.

We need a background model to calculate the enrichment.Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 18 / 46

Page 24: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Empirical background based on observed k-mer frequences

Idea

Count the occurrences of all k-mers in upstream sequences of all genes

Sequence freq.

aaaaaa|tttttt 80cttatc|gataag 26tatata|tatata 22ataaga|tcttat 20aagaaa|tttctt 20gaaaaa|tttttc 19atatat|atatat 19agataa|ttatct 17agaaaa|ttttct 17aaagaa|ttcttt 16aaaaca|tgtttt 16aaaaag|cttttt 15agaaga|tcttct 14tgataa|ttatca 14atataa|ttatat 14

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 19 / 46

Page 25: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Estimation of background by Markov Models

Idea

Estimate the sequence composition using a statistical model.

� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).

� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244

� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216

� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.

� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )

� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46

Page 26: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Estimation of background by Markov Models

Idea

Estimate the sequence composition using a statistical model.

� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).

� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244

� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216

� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.

� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )

� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46

Page 27: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Estimation of background by Markov Models

Idea

Estimate the sequence composition using a statistical model.

� Bernoulli model models only the frequency of each single base:p(A), p(C ), p(G ), p(T ).

� Simplest model: p(A) = p(C ) = p(G ) = p(T ) = 0.25Example: p(ACGTGA) = 0.256 = 0.000244

� Frequencies in all upstream regions of yeast:p(A) = 0.3, p(C ) = 0.2, p(G ) = 0.2, p(T ) = 0.3Example: p(ACGTGA) = p(A)2 ∗ p(C ) ∗ p(G )2 ∗ p(T ) = 0.000216

� Markov model: The probability of each residue depends on the mpreceding residues. The parameter m is the order of the Markovmodel.

� Example Markov model order 1:p(ACGTGA) = p(A) ∗ p(C |A) ∗ p(G |C ) ∗ p(T |G ) ∗ p(G |T ) ∗ p(A|G )

� Example Markov model order 2:p(ACGTGA) = p(AC ) ∗ p(G |AC ) ∗ p(T |CG ) ∗ p(G |GT ) ∗ p(A|TG )

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 20 / 46

Page 28: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Statistical evaluation of over representation

� Assume we found 18occurrences of the word ACGTGA

and estimated the expectednumber to be 2.95 using thebackground model.

Question:

� How big is the surprise to observe 18 occurrences when we expect2.95?

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 21 / 46

Page 29: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Statistical evaluation of over representation� At each position in the sequences, there is a probability p that the

word starting at this position is ACGTGA.� We consider n = 9000 positions.� From the background model we know the expected probabilityp = 2.95

9000 = 0.00032.� What is the probability that k = 18 of these n positions correspond toACGTGA?

� Binomial distribution to measure the significance:

P(X ≥ k) = 1−k∑

i=0

(n

i

)pi (1− p)n−i

In R:

> binom.test(x=18, n=9000, p=0.00033) ⇒ p-value = 3.029× 10−9

� Multiple testing correction by multiplying the p-value by the numberof tested words.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 22 / 46

Page 30: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Statistical evaluation of over representation� At each position in the sequences, there is a probability p that the

word starting at this position is ACGTGA.� We consider n = 9000 positions.� From the background model we know the expected probabilityp = 2.95

9000 = 0.00032.� What is the probability that k = 18 of these n positions correspond toACGTGA?

� Binomial distribution to measure the significance:

P(X ≥ k) = 1−k∑

i=0

(n

i

)pi (1− p)n−i

In R:

> binom.test(x=18, n=9000, p=0.00033) ⇒ p-value = 3.029× 10−9

� Multiple testing correction by multiplying the p-value by the numberof tested words.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 22 / 46

Page 31: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Assembling overlapping words

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 23 / 46

Page 32: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif discovery using matrices

Idea:

Find a matrix model that optimally describes a motif that is enriched inthe sequences.

� Comprehensive approach� Test all possible motif matrices that can build from the sequences.� Compute a score (e.g. information content, P-value) associated to

each motif matrix.� Report the highest scoring matrix.

Problem: The number of possible matrices is too large to becomputed in a reasonable time.

� Heuristics� Expectation-maximization (MEME)� Gibbs sampling (MotifSampler, info-gibbs,...)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 24 / 46

Page 33: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

MEME (Multiple EM for Motif Elicitation)Expectation-maximization (EM)

� Instantiate a seed motif

� Iterate N times to enhance matrix:� Maximization: select the x highest scoring sites� Expectation: build a new matrix from the collected sites

Multiple EM

� Iterate over each k-mer fund in the input set

� Ranking / selection of the best matrices according to their scoreSource: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 25 / 46

Page 34: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling for stochastic optimisation

� Deterministic selection of the highest scoring site might result in localoptimum.

� Therefore, Gibbs sampling uses a random sampling step to betterexplore the ”search space” of possible solutions.

Idea:

� The idea in Gibbs sampling is to choose a site stochastically accordingto the score distribution to find the global optimum.

Source: https://c2.staticflickr.com/6/5289/5201851938_99424ccac6_b.jpg

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 26 / 46

Page 35: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling

� Step 0: Initialisation of the matrix and background model

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 27 / 46

Page 36: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling

� Step 1: Updating

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 28 / 46

Page 37: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling

� Step 2: Random sampling

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 29 / 46

Page 38: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling

� Step 2: random sampling

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 30 / 46

Page 39: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling

� Step 1: updating (2nd iteration)

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 31 / 46

Page 40: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Gibbs Sampling: Overview

0 Initialization� Select a random set of sites in the

sequence set� Create a matrix with these sites

1 Sampling (Stochastic Expectation)� Isolate one sequence from the set,

and score each position (site) ofthis sequence.

� Select one ”random” site, with aprobability proportional to thescore.

2 Predictive update (Maximization)� Replace the old site with a new

site, and update the matrix.

3 Iterate Step 1 and 2 for a fixednumber of cycles.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 32 / 46

Page 41: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

De novo motif discovery with RSAT peak-motifs

source: [Thomas-Chollier et al., 2012]

� The RSAT tool suiteimplements a completepipeline for ChIP-seqbased motif analysis.

� For motif discovery ituses four complementaryalgorithms:

� Globalover-representation

� oligo-analysis� dyad-analysis

(spaced motifs)

� Positional bias� position-analysis� local-words

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 33 / 46

Page 42: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Overview

1 Sequence Motif ModelsConsensus sequenceMatrix representation

2 De novo Motif DiscoveryMotif discovery problemWord countingMatrix-based

3 Known Motifs and Pattern MatchingCollections of Known MotifsScanning sequences for motif hits

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 34 / 46

Page 43: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Collections of Known Motifs

Idea

After de novo motif discovery, we might want to know to which TF itbelongs. We can compare a discovered motif to databases of known TFmotifs.

Many database of known TF motifs exist:

� Public (free accessible) and commercial databases

� General, organism, or experimental specific databases

� Often incomplete, redundant, and of heterogeneous qualitySource: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 35 / 46

Page 44: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

TRANSFAC

� Commercial (BioBase�)

� Eukaryotic TF matrixmodels from bindingsites with evidence andpublications

� Pattern matching tools

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 36 / 46

Page 45: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

JASPAR

� Public (http://jaspar.genereg.net/)

� Contain matrix modelsand access to sites fromwhich they wereconstructed

� Tools for patternmatching and matrixrandomization

Reference: [Mathelier et al., 2015]

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 37 / 46

Page 46: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Scanning sequences for motif hits

Pattern Matching Problem:

Given a known TF recognition motif and a sequence, we want to know allpositions where the motif matches the sequence.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 38 / 46

Page 47: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Motif discovery vs. Pattern Matching

Motif Discovery Problem:

Given a set of related sequences,identify common sequence motifs infrom these sequences

Input: Set of sequences

Output: Motif model

Example

� What is the sequence motif ofthe TF in a ChIP-seqexperiment?

� Are there other co-factorsinvolved?

� Which TF bind promoters ofsome co-expressed genes?

Pattern Matching Problem:

Given a known TF recognitionmotif and a sequence, we want toknow all positions where the motifmatches the sequence.

Input: Motif, Sequence

Output: Positions of motif hits

Example

� Where does a TF bind exactlyin a ChIP-seq peak region?

� Does a TF bind direct orindirect at a ChIP-seq peakregion?

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 39 / 46

Page 48: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Searching with Known Motif: String based

Problem:

Given a motif as consensus sequence, find all matches in an inputsequence.

� Compare each position along the sequence with the motif (scanning)and report matches.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 40 / 46

Page 49: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Searching with Known Motif: Matrix based

Problem:

Given a matrix motif model, find all significant hits (motif instances) in asequence.

� The similarity of a given sequence segment S with the motif can becomputed as weight score WS

WS = logP(S |M)

P(S |B)

� P(S |M) is the probability that the sequence segment S occursaccording to the motif model M. This can be simply computed fromthe matrix.

� P(S |B) is the probability of S under a background model B. Thiscan be a first order Markov model.

� The higher the score WS the more similar is the sequence S to thematrix model M.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 41 / 46

Page 50: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Probability of a sequence under the matrix model P(S |M)

� Given a matrix and sequence we can calculate

P(S |M) =w∏j=1

Mrj ,j

where M is the frequency matrix of width w , S = {r1, r2, ..., r2} is thesequence segment and rj the residue found at position j in thesequence S .

� Note, that usually a corrected matrix is used by adding pseudo-counts(e.g. +1) to avoid zeros in the calculation.

Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 42 / 46

Page 51: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Probability of a seq. under the background model P(S |B)

� The probability under the background model can be calculated as

P(S |B) =w∏j=1

prj

where prj is the probability of observing the residue rj according tothe background model.

� A background model (B) should estimate the probability of asequence motif in non-biding sites.

� Various possible background models. E.g.� Bernoulli model with residue-specific probabilities (pr )� Markov models to take dinucleotide frequencies into account (e.g

GC-bias)Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 43 / 46

Page 52: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Assigning a score to each segment and threshold usage

� The sequence is scanned with the matrix and a score is assigned toeach position.

� Resulting hits depend highly on the threshold on the weight score:� Stringent threshold ⇒ high confidence in predicted sites, but many

sites missed.� Loose threshold ⇒ real sides hidden in many false positives.

� Some programs compute theoretical p-values from the weight score.� A threshold on a p-value can be easier interpreted: For p = 10−4 we

expect 1 false prediction every 10.000 base pairs (on one strand).Source: Morgane Thomas-Chollier (ENS, Paris)

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 44 / 46

Page 53: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

Summary

� Transcription factors (TFs) recognize specific DNA sequence motifs.

� Sequence motifs can be represented as consensus sequence (string) ormore quantitatively as matrix model to capture the specificity of eachresidue at each position.

� A sequence logo plot is a graphical representation of a motif matrix.

� An enriched sequence motif can be detected in a set of functionallyrelated sequences using de novo motif discovery approaches.

� Known TF recognition motifs can be retrieved as matrix model fromonline databases.

� Pattern matching aims at finding putative TF binding sites (for whichthe binding motif is known) in DNA sequences.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 45 / 46

Page 54: Sequence Motif Analysis - uni-mainz.decbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/SS16_Motif...Sequence Motif Analysis Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"

References

Hertz, G. Z. and Stormo, G. D. (1999).

Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics (Oxford, England), 15(7-8):563–77.

Mathelier, A., Fornes, O., Arenillas, D. J., Chen, C.-y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R.,

Zhang, A. W., Parcy, F., Lenhard, B., Sandelin, A., and Wasserman, W. W. (2015).JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles.Nucleic acids research, 44(November 2015):gkv1176.

Stormo, G. D. (2013).

Modeling the specificity of protein-DNA interactions.Quantitative biology, 1(2):115–130.

Thomas-Chollier, M., Darbo, E., Herrmann, C., Defrance, M., Thieffry, D., and van Helden, J. (2012).

A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs.Nature protocols, 7(8):1551–68.

Wasserman, W. W. and Sandelin, A. (2004).

Applied bioinformatics for the identification of regulatory elements.Nature reviews. Genetics, 5(4):276–87.

Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 46 / 46