March 03
Identification of Transcription Factor Binding Sites
Presenting: Mira & Tali
Goal
[Figure: the motif AGCCA marked in each of six regulatory regions]
Regulatory regions
Motif - Binding site???
Why Bother?
Gene expression regulation
Co-regulation
UNDERSTAND
Difficulties
Multiple factors for a single gene
Variability in binding sites
  The nature of the variability is NOT well understood
  Usually transitions
  Insertions and deletions are uncommon
Location, location, location…
Experimental methods
EMSA (electrophoretic mobility shift assay)
Nuclease protection assay
NOT ENOUGH!!!!!
So, what can we do?
Find conserved sequences in regulatory regions
1. Define what you want to find
2. Define what is a good result
3. Decide how to find it…
Principal Methods:
Global optimum: enumerative methods
Going over ALL possibilities
Taking the best one
Disadvantage: limited to small search spaces
Advantage: certainty
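The enumerative idea can be sketched in a few lines. This is an illustration only, not the method of any of the articles: the function name `enumerate_best_motif`, the toy sequences, and the scoring rule (number of sequences containing the exact k-mer) are all assumptions made for the example.

```python
from itertools import product

def enumerate_best_motif(sequences, k):
    """Score every possible k-mer and return the one found in the most sequences."""
    best_motif, best_count = None, -1
    for letters in product("ACGT", repeat=k):  # all 4**k candidates
        motif = "".join(letters)
        # Assumed score: how many sequences contain an exact copy.
        count = sum(motif in seq for seq in sequences)
        if count > best_count:
            best_motif, best_count = motif, count
    return best_motif, best_count

# Hypothetical toy regulatory regions.
seqs = ["TTAGCCAGG", "CAGCCATTT", "GGGAGCCAC"]
best = enumerate_best_motif(seqs, 5)
print(best)
```

The loop visits 4**k candidates, which is why this approach gives certainty but is limited to small search spaces.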
Principal Methods:
Local optimum: Gibbs sampling, AlignACE
Start somewhere (arbitrary)
Next step direction: proportional to what we “gain” from it
We can get anywhere with some probability
Disadvantage: you can never know…
Advantage: basically good results, faster
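A minimal sketch of a Gibbs-style site sampler may make the three steps above concrete. It is not the AlignACE implementation: it assumes exactly one site per sequence, a fixed width `k`, a pseudocount of 1, and no background model, and the function name `gibbs_motif_search` is hypothetical.

```python
import random

def gibbs_motif_search(seqs, k, iterations=200, seed=0):
    """Return one sampled motif start per sequence (needs at least 2 sequences)."""
    rng = random.Random(seed)
    # Start somewhere (arbitrary): one random site per sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Column counts from the sites of all OTHER sequences (pseudocount 1).
            counts = [dict.fromkeys("ACGT", 1.0) for _ in range(k)]
            for j, other in enumerate(seqs):
                if j != i:
                    for col, base in enumerate(other[starts[j]:starts[j] + k]):
                        counts[col][base] += 1
            total = len(seqs) - 1 + 4  # sites counted plus four pseudocounts
            # Weight of each candidate start: probability of that site under the profile.
            weights = []
            for pos in range(len(seq) - k + 1):
                w = 1.0
                for col in range(k):
                    w *= counts[col][seq[pos + col]] / total
                weights.append(w)
            # Next step is drawn in proportion to what we "gain" from it,
            # so every position can be reached with some probability.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy regulatory regions.
seqs = ["TTAGCCAGG", "CAGCCATTT", "GGGAGCCAC"]
starts = gibbs_motif_search(seqs, 5)
print(starts)
```

Because each step is sampled rather than chosen greedily, different seeds can end at different sites, which is the "you can never know" disadvantage in practice.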
Articles Overview
Identifying motifs
  Expression patterns
  Phylogenetic footprinting
Identifying networks
  Common motifs in expression clusters
  Combinatorial analysis
Discovery of novel transcription factor binding sites by statistical overrepresentation
S. Sinha, M. Tompa
Goal:
Identify binding sites in yeast
Use sets of co-regulated genes
Identify over-represented upstream sequences
Enumeration: YMF algorithm
What constitutes a motif? (tailored for S. cerevisiae)
In S. cerevisiae, typically 6-10 conserved bases: the motif
Spacers varying in length (1-11 bp), usually located in the middle
Taken from SCPD, the S. cerevisiae promoter database
ACCNNNNNNGTT
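A motif with a spacer, such as the example above, can be matched by treating each N as "any base". The sketch below uses a regular expression; the helper name `spaced_motif_to_regex` and the upstream sequence are made up for illustration.

```python
import re

def spaced_motif_to_regex(motif):
    """Translate a motif such as ACCNNNNNNGTT into a regex; N matches any base."""
    body = "".join("[ACGT]" if c == "N" else c for c in motif)
    # The lookahead lets overlapping occurrences be reported too.
    return re.compile("(?=" + body + ")")

pattern = spaced_motif_to_regex("ACCNNNNNNGTT")
upstream = "TTGACCAGCTAGGTTCA"  # hypothetical upstream sequence
hits = [m.start() for m in pattern.finditer(upstream)]
print(hits)
```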
How do we measure motifs?
Z-score: motif over-representation
P_max(X): probability of Z-score >= X
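As a rough illustration of a z-score for over-representation, the sketch below compares an observed count with the count expected by chance. It is a simplification, not the YMF computation: it assumes independent positions with a fixed per-position match probability (a binomial model), whereas the slides list a transition matrix as input. All numbers are invented.

```python
import math

def z_score(observed, n_positions, p_motif):
    """Standard deviations by which the observed count exceeds the expected count."""
    expected = n_positions * p_motif
    std_dev = math.sqrt(n_positions * p_motif * (1 - p_motif))
    return (observed - expected) / std_dev

# Hypothetical case: a fully specified 6-mer under a uniform background,
# seen 12 times in 10,000 candidate positions.
z = z_score(observed=12, n_positions=10_000, p_motif=0.25 ** 6)
print(round(z, 2))
```

A large positive value means the motif occurs far more often than the background model predicts.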
YMF algorithm (Yeast Motif Finder)
INPUT:
A set of promoter regions
Motif length - l (modest values)
Maximum number of spacers allowed - w
Transition matrix
6
11
YMF algorithm
Post Processing:
FindExplanators:
artificial overrepresentation
W-score
Co-expression score
TCACGCT (motif)
CACGCTA (artifact)
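The motif/artifact pair above differs only by a one-position shift, so an over-represented motif drags its shifted neighbours along with it. A minimal check for that relationship might look as follows; this is not the FindExplanators procedure itself, and the function name `is_shifted_copy` is hypothetical.

```python
def is_shifted_copy(motif, candidate, max_shift=1):
    """True if candidate equals motif slid left or right by 1..max_shift positions."""
    limit = min(max_shift, len(motif) - 1, len(candidate) - 1)
    for shift in range(1, limit + 1):
        # Compare the overlapping parts after sliding in either direction.
        if motif[shift:] == candidate[:-shift] or candidate[shift:] == motif[:-shift]:
            return True
    return False

print(is_shifted_copy("TCACGCT", "CACGCTA"))
```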
Experiments
Validate YMF results
  Running YMF on regulons with known binding sites (SCPD)
Run YMF on MIPS catalogs (MIPS: Munich Information Center for Protein Sequences)
  Functional
  Mutant phenotype
Validation
New binding sites or false positives?
A novel site candidate
Further research
Validation of novel binding sites and transcription factors
Modification of the algorithm to be applicable to other organisms
Systematic determination of genetic network architecture
Saeed Tavazoie, Jason D. Hughes, Michael J. Campbell, Raymond J. Cho, George M. Church
Goal:
Identify co-regulated networks of genes in yeast
Cluster by expression patterns
Identify upstream sequence patterns
AlignACE
Aligns Nucleic Acid Conserved Elements
Clusters
Cluster: a group of genes with a similar expression pattern
A cluster's members
  Tend to participate in common processes
  Tend to be co-regulated
Clusters 10-54
Identifying motifs using AlignACE
18 motifs from 12 clusters were found.
7 of the found motifs were identified experimentally
And what about the others????
Scanning for more binding sites
Once a significant motif was found, the whole genome was scanned for it
Most motifs were cluster specific
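One simple way to picture the genome scan and the "cluster specific" observation is to look for the motif on both strands of every upstream region and ask what fraction of the hits fall inside the cluster. The sketch below assumes exact consensus matching and this particular specificity ratio; neither is taken from the article, and the gene names and sequences are invented.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def genes_with_motif(upstream, motif):
    """Genes whose upstream region contains the motif on either strand."""
    rc = reverse_complement(motif)
    return {gene for gene, seq in upstream.items() if motif in seq or rc in seq}

def cluster_specificity(upstream, motif, cluster):
    """Fraction of genome-wide motif hits that belong to the cluster."""
    hits = genes_with_motif(upstream, motif)
    return len(hits & cluster) / len(hits) if hits else 0.0

# Hypothetical upstream regions for four genes.
upstream = {"g1": "TTAGCCATT", "g2": "CCTGGCTAA", "g3": "GGGGGGGGG", "g4": "AAAGCCAGG"}
hits = genes_with_motif(upstream, "AGCCA")
spec = cluster_specificity(upstream, "AGCCA", {"g1", "g2"})
print(sorted(hits), spec)
```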
Why so few motifs?
Too stringent rules for defining a “significant” motif
Post-transcriptional regulation (mRNA stability)
Some clusters represent “noise”
“Tightness”
“Tightness” of a cluster: how close the members of a particular cluster are to its mean
A strong correlation between the presence of significant motifs and the “tightness” of a cluster
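The slides describe tightness only in words. One possible way to quantify it, assumed here for illustration, is the mean Euclidean distance of the members' expression profiles from the cluster mean, so that a smaller value means a tighter cluster. The profiles are invented.

```python
import math

def cluster_tightness(profiles):
    """Mean Euclidean distance of the profiles from their mean (smaller is tighter)."""
    n_points = len(profiles[0])
    mean = [sum(p[t] for p in profiles) / len(profiles) for t in range(n_points)]
    return sum(math.dist(p, mean) for p in profiles) / len(profiles)

tight = cluster_tightness([[1, 2, 3], [1, 2, 3]])  # identical profiles
loose = cluster_tightness([[0, 0], [2, 0]])        # members far from the mean
print(tight, loose)
```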
Things to remember
Discovering regulons and motifs using expression-based clustering
  Minimal biases
  Validation as a methodology for new organisms
Identifying expected cis-regulatory motif EACH TIME!!
Identifying regulatory networks by combinatorial analysis of promoter elements
by Yitzhak Pilpel, Priya Sudarsanam & George M. Church
Goals:
Understand the transcriptional network
Identify motif combinations affecting expression patterns in yeast
Basic definitions
Expression coherence (EC) score -
Synergistic motifs -
EC(a&b) > EC(a\b), EC(b\a)
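The slide leaves the EC definition blank. The sketch below assumes the definition usually given for this score: the fraction of gene pairs in a set whose normalized expression profiles lie within a distance threshold of each other. Treat that definition, the threshold value, and the function names as assumptions; the synergy test simply restates the inequality above.

```python
import math
from itertools import combinations

def normalize(profile):
    """Shift to mean 0 and scale to variance 1 (profile must not be constant)."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / len(profile))
    return [(x - mean) / sd for x in profile]

def expression_coherence(profiles, threshold):
    """Assumed EC: fraction of gene pairs closer than the threshold."""
    pairs = list(combinations([normalize(p) for p in profiles], 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) < threshold for a, b in pairs) / len(pairs)

def is_synergistic(ec_both, ec_a_only, ec_b_only):
    """EC(a&b) > EC(a\\b) and EC(a&b) > EC(b\\a)."""
    return ec_both > ec_a_only and ec_both > ec_b_only

# Hypothetical profiles: the first two rise together, the third falls.
ec = expression_coherence([[1, 2, 3], [2, 4, 6], [3, 2, 1]], threshold=1.0)
print(ec, is_synergistic(0.4, 0.1, 0.2))
```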
Methods:
A database of motifs
Gene sets
Calculating EC score
Significant synergistic combinations
Visualizing the transcriptional network
Understanding the effect of individual motifs and combinations of motifs
GMC
GMC: Gene Motif Combination.
Motif numbers: (m1, m2, m3, m4, m5) = (1,0,1,1,0)
Synergistic motif combination: EC(n motifs) > max(EC(n-1 motifs))
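The GMC notation can be sketched as a binary vector per gene, with genes grouped by identical vectors. The gene names and motif assignments below are invented, and `is_synergistic_combination` only restates the inequality above for EC values that are assumed to be computed elsewhere.

```python
from collections import defaultdict

def gmc_vector(gene_motifs, all_motifs):
    """1 if the gene's promoter has the motif, else 0, in the order of all_motifs."""
    return tuple(1 if m in gene_motifs else 0 for m in all_motifs)

def group_by_gmc(motifs_by_gene, all_motifs):
    """Group genes that share exactly the same motif combination."""
    groups = defaultdict(list)
    for gene, motifs in motifs_by_gene.items():
        groups[gmc_vector(motifs, all_motifs)].append(gene)
    return dict(groups)

def is_synergistic_combination(ec_all, ec_subsets):
    """EC(n motifs) > max(EC of each (n-1)-motif subset)."""
    return ec_all > max(ec_subsets)

all_motifs = ["m1", "m2", "m3", "m4", "m5"]
motifs_by_gene = {  # hypothetical assignments
    "geneA": {"m1", "m3", "m4"},
    "geneB": {"m1", "m3", "m4"},
    "geneC": {"m2"},
}
groups = group_by_gmc(motifs_by_gene, all_motifs)
print(groups)
```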
GMC - what is it good for?
Combinograms
Clustering
GMCs
Combinograms - what are they good for?
They help visualize the “single motif - specific expression pattern” connection
They also show which motif is more critical in determining the expression pattern.
Motif synergy map: visualizing transcription networks
Conclusion
The importance of the combinogram
The importance of the motif synergy map
Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes
Lee Ann McCue, William Thompson, C. Steven Carmack, Michael P. Ryan, Jun S. Liu, Victoria Derbyshire and Charles E. Lawrence
Goals:
Identifying novel TF binding sites in E. coli
Describing the transcription regulatory network
Finding orthologs
Identify upstream sequence patterns
Local optimum: Gibbs sampling algorithm
Methods:
Data set: one E. coli gene and orthologs
Gibbs sampling algorithm
Motif
MAP score: a measure of overrepresentation of the motif
Applying the method on a small scale: validation
Choosing 190 E. coli genes.
Creating 184 data sets.
Running the Gibbs sampling algorithm.
More than 67% success in the prediction for the most probable motif.
Motif Model
Identification of the YijC binding sites
A strongly predicted site was upstream of the fabA, fabB and yqfA genes.
Chromatography: identifying the factor.
Identifying the YijC binding sites and predicting gene function
Mass spectrometry identification: YijC
Predicting a function for yqfA.
[Figure labels: weight; fabA, fabB, yqfA, fadB]
Applying the method genome-wide
Choosing 2113 E. coli ORFs.
For 2097, a TF-binding site was predicted.
MAP scores: ortholog distribution
Study set
Full set
Adding binding sites for known TFs
Building a TF binding site model for known TFs.
Scanning E. coli upstream regions.
187 new probable sites.
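The two steps above, building a site model and scanning with it, can be sketched with a log-odds weight matrix. This is a generic illustration, not the model used in the article: the known sites, the pseudocount of 1, the uniform background, and the threshold are all assumptions.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Per-column log2 odds of each base versus a uniform background."""
    matrix = []
    for col in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return matrix

def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least the threshold."""
    k = len(matrix)
    hits = []
    for pos in range(len(seq) - k + 1):
        score = sum(matrix[i][seq[pos + i]] for i in range(k))
        if score >= threshold:
            hits.append((pos, score))
    return hits

# Hypothetical known sites for one TF and one upstream region to scan.
model = build_log_odds(["AGCCA", "AGCCA", "AGCCT", "AGCCA"])
hits = scan("TTAGCCAGG", model, threshold=5.0)
print(hits)
```

Raising the threshold trades missed sites for fewer false positives, which is the specificity problem for any scan of this kind.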
Building a regulatory network
Required steps:
  Identifying motif models
  Clustering the models
Problem: specificity
Conclusion
What have we gained so far?
A better prediction of gene function.
New possibilities for identifying TF binding sites and the TFs which bind them!!!