Regulatory Motifs

Page 2: Regulatory Motifs

Contents

• Biology of regulatory motifs

• Experimental discovery

• Computational discovery

• PSSM

• MEME

Page 3: Regulatory Motifs

What Makes Regulatory Motifs Important?

• The DNA sequence in every cell of an individual is identical.

• The regulation mechanism makes the difference: it determines which genes are transcribed and under which conditions.

• One of the most heavily regulated processes in the cell is gene transcription.

• The major control point within the transcription process is transcription initiation, which is regulated by Transcription Factors (TFs).

Page 4: Regulatory Motifs

Transcription Factors & Regulatory Motifs

• TFs are proteins that bind to short DNA sequences, named regulatory motifs.

• TFs may act as:
  activators - upon binding, enable the transcription of the neighboring gene.
  repressors - upon binding, prevent transcription.
  A TF may have a different effect on different genes.

• Regulatory motifs are typically 6-20 nucleotides long.

• They are usually found in the vicinity of the gene they regulate, mostly upstream.

[Figure: SBF and MCM1 binding sites upstream of the Transcription Start Site of Gene X]

Page 5: Regulatory Motifs

Facts and Challenges

Facts

• There are many types of TFs.

• Each TF can affect many genes.

• Each gene may be regulated by several TFs.

• TFs may act in combinations. Example: Two TFs must bind the upstream region of a gene in order to activate its transcription.

• The regulatory motif that binds a TF is not exact; a few mismatches are very common.

Challenges

1. How to represent a regulatory motif?

2. Can we identify new sites of known motifs in genome sequences?

3. Can we discover new motifs within upstream sequences of genes?

Page 6: Regulatory Motifs

1. Motif Representation

• Exact motif: AACTTG

• Consensus: represents only the deterministic nucleotides. Example: HAP1 binding sites in 5 sequences.

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG <- HAP1 consensus motif

N – stands for any nucleotide.

Representing the consensus only loses information. How can this be avoided?
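The consensus rule used on this slide (keep a nucleotide only where all sites agree, write N elsewhere) can be sketched in a few lines of Python. The function name `consensus` is hypothetical, and the sketch assumes the sites are already aligned and of equal length.

```python
def consensus(sites):
    """Consensus of aligned, equal-length sites: the shared base where all
    sites agree, 'N' wherever at least two sites differ."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks the alignment column by column.
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*sites))


# The five HAP1 binding sites listed on the slide.
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Every N hides how the nucleotides are actually distributed at that position, which is the information loss the slide refers to.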

Page 7: Regulatory Motifs

PSPM – Position Specific Probability Matrix

• Represents a motif of length k as a family of k-mers.

• Defines Pi(A,C,G,T) for i={1,..,k} based on the frequency of each nucleotide in each position.

• Each k-mer is assigned a probability. Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014

• What is the consensus?

     1     2     3     4     5
A    0.1   0.25  0.05  0.7   0.6
C    0.3   0.25  0.8   0.1   0.15
T    0.5   0.25  0.05  0.1   0.05
G    0.1   0.25  0.1   0.1   0.2
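As a minimal sketch, the PSPM above can be stored as a dictionary of rows and used to compute the probability of any k-mer. The names `PSPM`, `kmer_probability` and `most_probable_kmer` are illustrative, and writing N for a tied column is an assumption made here, not a rule from the slides.

```python
import math

# The PSPM from the slide: one row per nucleotide, one column per position.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def kmer_probability(kmer, pspm=PSPM):
    """Product over positions of the probability of the observed base."""
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))


def most_probable_kmer(pspm=PSPM):
    """Most probable base per column; 'N' when the maximum is tied."""
    k = len(next(iter(pspm.values())))
    out = []
    for i in range(k):
        best = max(pspm[b][i] for b in pspm)
        winners = [b for b in pspm if pspm[b][i] == best]
        out.append(winners[0] if len(winners) == 1 else "N")
    return "".join(out)


print(kmer_probability("TCCAG"))  # approximately 0.014
print(most_probable_kmer())       # TNCAA
```

Position 2 is uniform (0.25 for every base), so it contributes nothing to the consensus.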

Page 8: Regulatory Motifs

Graphical Representation – Sequence Logo

Horizontal axis: position of the base in the sequence.

Vertical axis: amount of information.

Letter stack: order indicates importance.

Letter height: indicates frequency.

Consensus can be read across the top of the letter columns.
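The "amount of information" on the vertical axis is commonly measured in bits as 2 minus the Shannon entropy of the column. The sketch below assumes that convention, without the small-sample correction that logo tools usually apply, and uses the PSPM from the previous slide. Scaling each letter by frequency times column information is likewise an assumption about the usual logo convention, not something stated on the slide.

```python
import math

# Columns of the PSPM from the previous slide, in base order A, C, T, G.
COLUMNS = [
    [0.1, 0.3, 0.5, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.8, 0.05, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.15, 0.05, 0.2],
]


def information_content(column):
    """Bits of information in one column: log2(4) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy


def letter_heights(column, bases="ACTG"):
    """Height of each letter: its frequency times the column's information."""
    ic = information_content(column)
    return {b: p * ic for b, p in zip(bases, column)}


ic = [information_content(col) for col in COLUMNS]
for position, bits in enumerate(ic, start=1):
    print(position, round(bits, 3))
```

The uniform column (position 2) carries 0 bits, so it would appear as an empty stack in the logo.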

Page 9: Regulatory Motifs

2. Identification of Known Motifs within Genomic Sequences

• The known motif binds a known TF.

• Searching for new binding sites will enable the identification of new genes controlled by the same TF.

• This can hint at the function of these genes and enable a better understanding of the regulation mechanism.

• Can be achieved experimentally or computationally.

Page 10: Regulatory Motifs

Experimental Identification

• Experimental methods include location analysis, mutations in the motif region, and more.

• These methods require a priori knowledge of the motif, of its location in the DNA sequence, or of the regulatory protein that binds to it.

• Experimental identification of an unknown regulatory motif without such prior knowledge is currently not possible.

Page 11: Regulatory Motifs

Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence, one position at a time. At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA   0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
Position 2: TGCAA   0.5*0.25*0.8*0.7*0.6 = 0.042

     1     2     3     4     5
A    0.1   0.25  0.05  0.7   0.6
C    0.3   0.25  0.8   0.1   0.15
T    0.5   0.25  0.05  0.1   0.05
G    0.1   0.25  0.1   0.1   0.2
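The sliding-window scan can be sketched as below. The function names are hypothetical, and only the ten nucleotides shown on the slide are scanned, since the rest of the sequence is elided.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def window_probability(window, pspm):
    """Probability of one k-long window under the PSPM."""
    return math.prod(pspm[base][i] for i, base in enumerate(window))


def scan(sequence, pspm):
    """Score every window; returns (1-based position, window, probability)."""
    k = len(pspm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        hits.append((start + 1, window, window_probability(window, pspm)))
    return hits


hits = scan("ATGCAAGTCT", PSPM)  # only the visible prefix of the sequence
for position, window, prob in hits:
    print(position, window, f"{prob:.3g}")

best = max(hits, key=lambda hit: hit[2])
print("best match:", best[0], best[1])
```

Within this prefix, position 2 (TGCAA) scores highest, matching the worked example.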

Page 12: Regulatory Motifs

Detecting a Known Motif within a Sequence using PSSM

• The position that gave the maximal score represents the best match for the motif.

• Is it a random match, or is it indeed an occurrence of the motif?

• The PSPM is turned into a PSSM, an odds score matrix: Oi(A,C,G,T) for i={1,..,k} is the ratio between Pi(A,C,G,T) and the background frequency of each nucleotide.

• As Oi(N) increases, the odds that N (at position i) is part of a real motif increase.

Page 13: Regulatory Motifs

PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25.

Original PSPM (Pi)

     1       2     3       4      5
A    0.1     0.25  0.05    0.7    0.6

Odds Matrix (Oi)

     1       2     3       4      5
A    0.4     1     0.2     2.8    2.4

Going to log scale we get an additive score.

Log odds Matrix (log2 Oi)

     1       2     3       4      5
A   -1.322   0    -2.322   1.485  1.263
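The conversion from PSPM to odds and log-odds matrices can be sketched as follows, assuming the uniform 0.25 background stated on the slide. The function names are illustrative.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def odds_matrix(pspm, background=0.25):
    """Oi(N) = Pi(N) / background frequency of N (uniform here)."""
    return {b: [p / background for p in row] for b, row in pspm.items()}


def log_odds_matrix(pspm, background=0.25):
    """log2 of the odds; a zero probability would need special handling."""
    return {b: [math.log2(p / background) for p in row] for b, row in pspm.items()}


odds = odds_matrix(PSPM)
pssm = log_odds_matrix(PSPM)
print([round(x, 3) for x in odds["A"]])  # the A row of the odds matrix
print([round(x, 3) for x in pssm["A"]])  # the A row of the log-odds matrix
```

Taking logarithms turns the product of odds into a sum, which is why the log-odds score is additive across positions.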

Page 14: Regulatory Motifs

Calculating using Log Odds Matrix

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA   -1.32+0-1.32-1.32+1.26 = -2.7    odds = 2^-2.7 = 0.15

Position 2: TGCAA   1+0+1.68+1.48+1.26 = 5.42        odds = 2^5.42 = 42.8

     1      2     3      4      5
A   -1.32   0    -2.32   1.48   1.26
C    0.26   0     1.68  -1.32  -0.74
T    1      0    -2.32  -1.32  -2.32
G   -1.32   0    -1.32  -1.32  -0.32
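A short sketch of the additive scoring, using the rounded log-odds values exactly as printed on the slide. `LOG_ODDS` and `log_odds_score` are hypothetical names.

```python
# Rounded log2-odds matrix as printed on the slide.
LOG_ODDS = {
    "A": [-1.32, 0, -2.32, 1.48, 1.26],
    "C": [0.26, 0, 1.68, -1.32, -0.74],
    "T": [1, 0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0, -1.32, -1.32, -0.32],
}


def log_odds_score(window, matrix=LOG_ODDS):
    """Additive score: sum of the log-odds entries for the observed bases."""
    return sum(matrix[base][i] for i, base in enumerate(window))


for window in ("ATGCA", "TGCAA"):
    score = log_odds_score(window)
    # Raising 2 to the score converts it back to an odds ratio.
    print(window, round(score, 2), round(2 ** score, 2))
```

A positive score means the window is more likely under the motif model than under the background; a negative score means the opposite.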

Page 15: Regulatory Motifs

Building a PSSM

• Collect all known sequences that bind a certain TF.

• Align all sequences (using multiple sequence alignment).

• Compute the frequency of each nucleotide in each position (PSPM).

• Incorporate background frequency for each nucleotide (PSSM).
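The last two steps can be sketched as below, using the HAP1 sites from the motif representation slide as input. The sketch assumes the sites are already aligned, so the multiple alignment step is skipped. The pseudocount is an addition not mentioned in the slides: without it, a nucleotide never observed at a position gets probability 0 and its log-odds is undefined.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pspm(aligned_sites, pseudocount=1.0):
    """Per-column nucleotide frequencies; pseudocount is an assumed smoothing."""
    pspm = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(BASES)
        pspm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pspm


def build_pssm(pspm, background=None):
    """log2 odds of each column frequency against the background frequency."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # uniform background, as on the slides
    return [{b: math.log2(col[b] / background[b]) for b in BASES} for col in pspm]


hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

pspm = build_pspm(hap1_sites)
pssm = build_pssm(pspm)
print({b: round(v, 2) for b, v in pssm[0].items()})  # column 1: C strongly favoured
```

For a genome with a skewed composition, a non-uniform `background` dictionary can be passed instead of the default.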

Page 16: Regulatory Motifs

Current Results:

• When searching for a motif in a genome using PSSM or other methods – the motif is usually found all over the place! The motif is considered real if found in the vicinity of a gene.

• Checking experimentally for the binding sites of a specific TF (location analysis) – the sites that bind the TF are in some cases similar to the PSSM and sometimes not!

• Current thinking – TFs work in combination with other TFs.

Page 17: Regulatory Motifs

3. Finding new Motifs

We are given a group of genes, which presumably contain a common regulatory motif.

We know nothing of the TF that binds to the putative motif.

The problem: discover the motif.

Page 18: Regulatory Motifs

Defining Co-regulated Sequences

There are several methods to discover groups of genes that have a putative common regulator.

Page 19: Regulatory Motifs

Defining Co-regulated Sequences

1. Genes that are co-expressed – clustered together in gene expression data.

2. Genes coding for proteins that participate in a common pathway.

3. Genes related by comparative genomics methods such as conserved operons, protein fusion, and correlated evolution.

4. Orthologous genes from multiple species (homologous sequences belonging to different species).

Page 20: Regulatory Motifs

Computational Identification

We have n DNA sequences, each of length m, and look for a regulatory motif of length k.

Simple solution: Exhaustive search. We search for all possible motifs of length k.

There are 4^k possible motifs.

Initial problems:

• How shall we treat inexact motifs?

• Assume the genome contains a lot of As (e.g., yeast). Is a k-mer that is A-rich a regulatory motif?
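A minimal sketch of the exhaustive search for exact motifs: enumerate all 4^k k-mers and keep those present in the largest number of sequences. The three toy sequences are made up for illustration, and the sketch deliberately ignores both initial problems (inexact matches and background composition).

```python
from itertools import product


def exhaustive_exact_search(sequences, k):
    """Try all 4**k k-mers; return those found in the most sequences."""
    best_motifs, best_support = [], 0
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        # Support = number of sequences containing an exact copy of the motif.
        support = sum(motif in seq for seq in sequences)
        if support > best_support:
            best_motifs, best_support = [motif], support
        elif support == best_support and support > 0:
            best_motifs.append(motif)
    return best_motifs, best_support


# Hypothetical upstream sequences sharing the planted 4-mer TGCA.
toy_sequences = ["AATGCATT", "CCTGCAGG", "GTGCACCA"]

print(4 ** 4)                                     # 256 candidate motifs for k = 4
print(exhaustive_exact_search(toy_sequences, 4))  # (['TGCA'], 3)
```

The number of candidates grows as 4^k, so this brute-force loop is only practical for short motifs.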

Page 21: Regulatory Motifs

Difficulties in Computational Identification

• Each motif can start at any of m-k+1 positions in each sequence; there are (m-k+1)^n possible combinations.

• Noise:
  Mismatches are allowed; the motif is not exact.
  Not all sequences contain the motif.

• Statistical significance:
  k is short (6-20 nucleotides).
  m ranges from 10s (prokaryotes) to 1000s (eukaryotes) of nucleotides.
  => a random motif can appear by chance in the sequences.
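To put rough numbers on these difficulties, the sketch below counts m-k+1 start positions per sequence and estimates how often one fixed k-mer is expected to occur by chance. It assumes independent, uniformly distributed nucleotides and a single strand, which is a simplification; the function names and the example values of n, m and k are illustrative.

```python
def placement_count(n, m, k):
    """Ways to choose one motif start in each of n sequences of length m."""
    return (m - k + 1) ** n


def expected_chance_hits(m, k, base_probability=0.25):
    """Expected exact occurrences of one fixed k-mer in a random sequence of
    length m, assuming independent bases with equal probability."""
    return (m - k + 1) * base_probability ** k


# Illustrative values: 10 eukaryotic-scale upstream regions and a 6-mer.
n, m, k = 10, 1000, 6
print(f"{placement_count(n, m, k):.3e} possible placements")
print(f"{expected_chance_hits(m, k):.3f} chance hits per sequence")
```

With these values a given 6-mer is expected about once in every four sequences by chance alone, which is why a statistical significance measure is needed.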

Page 22: Regulatory Motifs

Computational Methods

• This problem has received a lot of attention from computer scientists.

• Methods include:
  Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
  Enumeration methods – problematic for inexact motifs of length k>10.
  …

• Current status:
  The problem is still open.
  Detecting the real motifs within the best 20 putative motifs is considered a success.
  Many tests are done on synthetic data.

Page 23: Regulatory Motifs

Tools on the Web

1. AlignACE – Aligns Nucleic Acids Conserved Elements.
   http://atlas.med.harvard.edu/download/

2. MEME – Multiple EM for Motif Elicitation.
   http://meme.sdsc.edu/meme/website/

3. eMotif – lets you scan, make, and search for motifs.
   http://motif.stanford.edu/emotif

4. TRANSFAC – a database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors, covering the whole range from yeast to human.
   http://transfac.gbf.de/TRANSFAC/

Page 24: Regulatory Motifs

sample MEME output: http://meme.sdsc.edu/meme/website/meme-output-example.html