March 03 Identification of Transcription Factor Binding Sites Presenting: Mira & Tali


Goal

[Figure: the sequence AGCCA recurring in several regulatory regions]

Motif – Binding site???

Why Bother?

Gene expression regulation

Co-regulation

UNDERSTAND

Difficulties

Multiple factors for a single gene

Variability in binding sites

  The nature of variability is NOT well understood

  Usually transitions

  Insertions and deletions are uncommon

Location, location, location…

Experimental methods

EMSA – Electrophoretic mobility shift assay

Nuclease protection assay

NOT ENOUGH!!!!!

So, what can we do?

Find conserved sequences in regulatory regions

1. Define what you want to find

2. Define what is a good result

3. Decide how to find it…

Principal Methods:

Global optimum: Enumerative methods

Going over ALL possibilities

Taking the best one

Disadvantage: Limited to small search spaces

Advantage: Certainty
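
A minimal sketch of the enumerative idea, not the method of any of the articles below: go over all 4^k possible k-mers, score each one, and take the best. The function name, the scoring rule (number of sequences that contain the k-mer) and the toy sequences are assumptions made for illustration.

```python
from itertools import product

def best_kmer(sequences, k):
    """Exhaustively score every possible DNA k-mer and return (kmer, score).

    Score = number of input sequences containing the k-mer (illustrative choice).
    Ties go to the alphabetically first k-mer.
    """
    best, best_score = None, -1
    for letters in product("ACGT", repeat=k):   # all 4**k candidates, alphabetical order
        kmer = "".join(letters)
        score = sum(kmer in seq for seq in sequences)
        if score > best_score:                  # strict '>' keeps the first winner
            best, best_score = kmer, score
    return best, best_score

# Hypothetical toy "regulatory regions"
toy_regions = ["TTAGCCATG", "GGAGCCAAC", "CAGCCATTT"]
print(best_kmer(toy_regions, 5))  # ('AGCCA', 3)
```

The loop visits 4^k candidates, which is why this approach gives certainty but only for small search spaces.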

Principal Methods:

Local optimum: Gibbs sampling, AlignACE

Start somewhere (arbitrary)

Next step direction: proportional to what we “gain” from it

We can get anywhere with some probability

Disadvantage: You can never know whether the best solution was reached

Advantage: Generally good results, obtained faster
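
A toy Gibbs site sampler may make the three steps above concrete. This is a sketch under simplifying assumptions (exactly one site of fixed width k per sequence, a uniform background, pseudocounts of 1); it is not AlignACE or the sampler used in the articles, and all names are hypothetical.

```python
import random

def gibbs_motif_sampler(seqs, k, iterations=200, seed=0):
    """Toy Gibbs site sampler: one motif occurrence of width k per sequence."""
    rng = random.Random(seed)
    # Start somewhere (arbitrary): a random start position in every sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Profile with pseudocounts, built from all sequences except i.
            counts = [{b: 1.0 for b in "ACGT"} for _ in range(k)]
            for j, other in enumerate(seqs):
                if j != i:
                    for pos in range(k):
                        counts[pos][other[starts[j] + pos]] += 1.0
            total = (len(seqs) - 1) + 4.0
            # Weight of each candidate start = profile probability of its k-mer.
            weights = []
            for start in range(len(seq) - k + 1):
                w = 1.0
                for pos in range(k):
                    w *= counts[pos][seq[start + pos]] / total
                weights.append(w)
            # Sample the next position in proportion to what we "gain" from it;
            # every position keeps a non-zero probability of being chosen.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy sequences
toy = ["TTAGCCATG", "GGAGCCAAC", "CAGCCATTT"]
starts = gibbs_motif_sampler(toy, 5)
print([s[p:p + 5] for s, p in zip(toy, starts)])
```

Because each step is sampled rather than chosen greedily, different seeds can end in different places, which is the "you can never know" side of the trade-off.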

Articles Overview

Identifying motifs

  Expression patterns

  Phylogenetic footprinting

Identifying networks

  Common motifs in expression clusters

  Combinatorial analysis

Discovery of novel transcription factor binding sites by statistical overrepresentation

S. Sinha, M. Tompa

Goal:

Identify binding sites in yeast

Use sets of co-regulated genes

Identify over-represented upstream sequences

Enumeration: YMF algorithm

What constitutes a motif? (tailored for S.cerevisiae)

In S.cerevisiae typically 6-10 conserved bases – The motif

Spacers varying in length (1-11bp)

  Usually located in the middle

Taken from SCPD – S.cerevisiae promoter database

ACCNNNNNNGTT
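
A pattern such as ACCNNNNNNGTT (conserved bases with a spacer in the middle) can be matched by treating each N as "any base". The sketch below uses Python's standard `re` module; the helper names and the test sequence are hypothetical.

```python
import re

def motif_to_regex(motif):
    """Translate a motif such as 'ACCNNNNNNGTT' into a regex; N matches any base."""
    return "".join("[ACGT]" if ch == "N" else ch for ch in motif)

def find_sites(motif, sequence):
    """Return the 0-based start positions of all matches, overlapping ones included."""
    # A lookahead makes the match zero-width, so overlapping sites are not skipped.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

print(find_sites("ACCNNNNNNGTT", "GGACCTTGACAGTTCC"))  # [2]
```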

How do we measure motifs?

Z-score – Motif over-representation

P_max(X) – Probability of Z-score >= X
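
As a rough illustration of a Z-score for over-representation: the number of standard deviations by which the observed count of a motif exceeds its expected count. The binomial background used here is an assumption chosen for simplicity, not the background model of the article, and the numbers are made up.

```python
import math

def z_score(observed, expected, std_dev):
    """Standard deviations by which the observed count exceeds the expected count."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (observed - expected) / std_dev

def binomial_background(n_positions, p_motif):
    """Expected count and standard deviation under a simple binomial background.

    Assumption for illustration only: every position is an independent trial
    with probability p_motif of starting a motif occurrence.
    """
    expected = n_positions * p_motif
    std_dev = math.sqrt(n_positions * p_motif * (1.0 - p_motif))
    return expected, std_dev

# Hypothetical numbers: a fully specified 6-mer in 10,000 upstream positions
expected, std_dev = binomial_background(n_positions=10_000, p_motif=0.25 ** 6)
print(z_score(observed=12, expected=expected, std_dev=std_dev))
```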

YMF algorithm: Yeast Motif Finder

INPUT:

A set of promoter regions

Motif length - l (modest values, e.g. 6)

Maximum number of spacers allowed - w (e.g. 11)

Transition matrix

YMF algorithm

Post Processing:

FindExplanators:

artificial overrepresentation

W-score

Co-expression score

TCACGCT (motif)

CACGCTA (artifact)
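
The example above shows why artificial overrepresentation arises: CACGCTA is TCACGCT slid by one position, so it is counted almost every time the real motif occurs. The check below only illustrates that overlap; it is not the FindExplanators procedure, and the function name and `max_shift` parameter are hypothetical.

```python
def is_shift_of(candidate, motif, max_shift=2):
    """True if candidate equals motif slid by 1..max_shift positions (on the overlap)."""
    k = len(motif)
    if len(candidate) != k:
        return False
    for shift in range(1, max_shift + 1):
        # Candidate lies to the right: its prefix equals the motif's suffix.
        if candidate[:k - shift] == motif[shift:]:
            return True
        # Candidate lies to the left: its suffix equals the motif's prefix.
        if candidate[shift:] == motif[:k - shift]:
            return True
    return False

print(is_shift_of("CACGCTA", "TCACGCT"))  # True
```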

Experiments

Validate YMF results

  Running YMF on regulons with known binding sites (SCPD)

Run YMF on MIPS catalogs (MIPS - Munich Information center for Protein Sequences)

  Functional

  Mutant phenotype

Validation

New binding sites or false positives?

A novel site candidate

Further research

Validation of novel binding sites and transcription factors

Modification of the algorithm to be applicable for other organisms

Systematic determination of genetic network architecture

Saeed Tavazoie, Jason D. Hughes, Michael J. Campbell, Raymond J. Cho, George M. Church

Goal:

Identify co-regulated networks of genes in yeast

Cluster by expression patterns

Identify upstream sequence patterns

AlignACE: Aligns Nucleic Acid Conserved Elements

Clusters

Cluster – a group of genes with a similar expression pattern

Cluster’s members

  Tend to participate in common processes

  Tend to be co-regulated

Clusters 10-54

Identifying motifs using AlignACE

18 motifs from 12 clusters were found.

7 of the motifs found were identified experimentally

And what about the others????

Scanning for more binding sites

Once a significant motif was found the whole genome was scanned for it

Most motifs were cluster specific

Why so few motifs?

Too stringent rules for defining a “significant” motif

Post transcriptional regulation (mRNA stability)

Some clusters represent “noise”

“Tightness”

“Tightness” of a cluster: how close the members of a particular cluster are to its mean

A strong correlation exists between the presence of significant motifs and the “tightness” of a cluster
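
One simple way to put a number on "tightness" is the mean distance of the members from the cluster mean. The Euclidean distance and the function name are assumptions for illustration; the article may use a different measure. `math.dist` requires Python 3.8 or later.

```python
import math

def cluster_tightness(profiles):
    """Mean Euclidean distance of cluster members from the cluster mean profile.

    Smaller values mean a tighter cluster.
    """
    n, dims = len(profiles), len(profiles[0])
    mean = [sum(p[d] for p in profiles) / n for d in range(dims)]
    return sum(math.dist(p, mean) for p in profiles) / n

# Hypothetical expression profiles (two time points per gene)
print(cluster_tightness([[0.0, 0.0], [2.0, 0.0]]))  # 1.0
```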

Things to remember

Discovering regulons and motifs using expression based clustering

  Minimal biases

  Validation as a methodology for new organisms

Identifying expected cis-regulatory motif EACH TIME!!

Identifying regulatory networks by combinatorial analysis of promoter elements

by Yitzhak Pilpel, Priya Sudarsanam & George M. Church

Goals:

Understand transcriptional network

Identify motif combinations affecting expression patterns in yeast

Basic definitions

Expression coherence (EC) score

Synergistic motifs – EC(a&b) > EC(a\b), EC(b\a)
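
The slide does not spell out the EC formula, so the definition used below is an assumption: the fraction of gene pairs in a set whose expression profiles lie within a distance threshold of each other. With that assumption, the synergy condition EC(a&b) > EC(a\b), EC(b\a) can be sketched as follows; gene names, profiles and the threshold are hypothetical.

```python
import math
from itertools import combinations

def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose profiles are within `threshold` (assumed definition)."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    close = sum(math.dist(p, q) <= threshold for p, q in pairs)
    return close / len(pairs)

def is_synergistic(genes_with_a, genes_with_b, profiles, threshold):
    """Check that EC(a and b) exceeds both EC(a without b) and EC(b without a)."""
    def ec(names):
        return expression_coherence([profiles[g] for g in sorted(names)], threshold)
    both = genes_with_a & genes_with_b
    only_a = genes_with_a - genes_with_b
    only_b = genes_with_b - genes_with_a
    return ec(both) > ec(only_a) and ec(both) > ec(only_b)

# Hypothetical data: g1 and g2 carry both motifs and are expressed alike
profiles = {"g1": [0, 0], "g2": [0, 1], "g3": [10, 0],
            "g4": [0, 10], "g5": [5, 5], "g6": [-5, 9]}
print(is_synergistic({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g5", "g6"}, profiles, 2.0))
```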

Methods:

A database of motifs

Gene sets

Calculating EC score

Significant synergistic combinations

Visualizing the transcriptional network

Understanding the effect of individual motifs and combinations of motifs

GMC

GMC – Gene Motif Combination.

Motif numbers: (m1, m2, m3, m4, m5) = (1,0,1,1,0)

Synergistic motif combination - EC(n motifs) > max(EC(n-1 motifs))
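
A GMC can be represented as a binary tuple per gene, one entry per motif, and genes can then be grouped by identical tuples. In this sketch motifs are plain strings matched exactly, which is a simplification (real motifs are usually weight matrices); gene names and sequences are hypothetical.

```python
from collections import defaultdict

def gene_motif_combination(upstream, motifs):
    """Binary tuple: 1 if the motif occurs in the upstream sequence, else 0."""
    return tuple(int(m in upstream) for m in motifs)

def group_by_gmc(upstream_by_gene, motifs):
    """Map each GMC tuple to the sorted list of genes that share it."""
    groups = defaultdict(list)
    for gene, upstream in sorted(upstream_by_gene.items()):
        groups[gene_motif_combination(upstream, motifs)].append(gene)
    return dict(groups)

motifs = ["AGCCA", "TCACGCT", "GGTT"]
upstream_by_gene = {
    "geneA": "TTAGCCAGGTT",
    "geneB": "AGCCATTGGTTA",
    "geneC": "CCCCTCACGCTCC",
}
print(group_by_gmc(upstream_by_gene, motifs))
```

Each group of genes sharing a GMC can then be scored with the EC score to look for synergistic combinations.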

GMC – what is it good for?

Combinograms

Clustering

GMCs

Combinograms – what are they good for?

They help visualize the “single motif - specific expression pattern” connection

They also show which motif is more critical in determining the expression pattern.

Motif synergy map: visualizing transcription networks

Conclusion

The combinogram importance

The motif synergy map importance

Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes

Lee Ann McCue, William Thompson, C. Steven Carmack, Michael P. Ryan, Jun S. Liu, Victoria Derbyshire and Charles E. Lawrence

Goals:

Identifying novel TF binding sites in E.coli

Describing transcription regulatory network

Finding orthologs

Identify upstream sequence patterns

Local optimum: Gibbs sampling algorithm

Methods:

Data set: One E.coli gene and orthologs

Gibbs sampling algorithm

Motif

MAP score – a measure of overrepresentation of motif

Applying the method on a small scale – Validation

Choosing 190 E.coli genes.

Creating 184 data sets.

Running Gibbs sampling algorithm.

More than 67% success in the prediction for the most probable motif.

Motif Model

Identification of the YijC binding sites

A strongly predicted site was upstream of the fabA, fabB and yqfA genes.

Chromatography – identifying the factor.

Identifying the YijC binding sites and predicting gene function

Mass spectrometry identification – YijC

Predicting a function for yqfA.

[Figure: weight; fabA, fabB, yqfA, fadB]

Applying the method genome wide

Choosing 2113 E.coli ORFs.

For 2097 a TF-binding site was predicted.

MAP scores - ortholog distribution

Study set

Full set

Adding binding sites for known TFs

Building a TF binding site model for known TFs.

Scanning E.coli upstream regions.

187 new probable sites.
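
A common way to build a binding site model from known sites and scan upstream regions with it is a log-odds position weight matrix. The sketch below assumes a uniform background, pseudocounts of 1 and an arbitrary score cutoff; it is not the model used in the article, and the sites and sequence are made up.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from aligned sites of equal length."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({b: math.log2((column.count(b) + pseudocount) / total / background)
                       for b in "ACGT"})
    return matrix

def scan(sequence, matrix, min_score):
    """Return (start, score) for every window scoring at least min_score."""
    k = len(matrix)
    hits = []
    for start in range(len(sequence) - k + 1):
        score = sum(matrix[pos][sequence[start + pos]] for pos in range(k))
        if score >= min_score:
            hits.append((start, score))
    return hits

# Hypothetical known sites and upstream region
pwm = build_log_odds(["AGCCA", "AGCCA", "AGCCT"])
print(scan("TTAGCCATT", pwm, min_score=3.0))
```

The cutoff controls the trade-off between missed sites and false positives, which is the specificity problem that comes up when the models are used to build a network.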

Building a regulatory network

Required steps:

  Identifying motif models

  Clustering the models

Problem: Specificity

Conclusion

What have we gained so far?

A better prediction of gene function.

New possibilities for identification of TF binding sites and the TFs which bind them!!!
