Deciphering Gene Regulatory Networks by in silico approaches
Sridhar Hannenhalli, Penn Center for Bioinformatics
Department of Genetics, University of Pennsylvania
Transcriptional Regulation
TF-DNA binding
Interactions and
Modules
Transcription Start Site
Core promoter prediction
TF-DNA binding
TF-TF interactions
Transcriptional Modules
Applications
Overview
Identification
Representation
Discovery (motif-discovery)
Search
Ambiguity/Redundancy
Binding site identification
SELEX
Deletion/Mutation
ChIP-chip
ATACGGT
ATACCGT
ATCGGCA
AAAGGCT
CONSENSUS
A T A S G S T
WEIGHT MATRIX
+1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
-1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
-1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
-1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96
Specificity
Binding site search
TFs often bind to short and degenerate DNA sequences, leading to false positives
Evolutionary conservation (phylogenetic footprinting/shadowing) can help reduce the false positives
About half of the functional binding sites are not conserved
A combination of evolutionary conservation and binding site score can detect ~70% of the experimentally verified binding sites at a "False Positive" rate of 1/50kb per PWM (Levy and Hannenhalli, Mammalian Genome, 2002)
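The search step described above can be sketched as a sliding-window scan. This is a minimal illustration, not the tool used in the cited work: it reuses the example weight matrix from the earlier slide, assumes its four rows correspond to A, C, G, T (an inference from the consensus A T A S G S T), and the function names, test sequence and threshold are hypothetical. Conservation filtering is not modelled.

```python
# Example log-odds weight matrix from the slide.
# ASSUMPTION: the four rows are A, C, G, T (inferred from the consensus ATASGST).
PWM = {
    "A": [1.2, 0.0, 0.96, -1.6, -1.6, -1.6, 0.0],
    "C": [-1.6, -1.6, 0.0, 0.59, 0.0, 0.59, -1.6],
    "G": [-1.6, -1.6, -1.6, 0.59, 0.96, 0.59, -1.6],
    "T": [-1.6, 0.96, -1.6, -1.6, -1.6, -1.6, 0.96],
}
WIDTH = len(PWM["A"])


def pwm_score(site):
    """Sum the per-position weights of a site as wide as the matrix."""
    if len(site) != WIDTH:
        raise ValueError("site length must equal the matrix width")
    return sum(PWM[base][pos] for pos, base in enumerate(site))


def scan(sequence, threshold):
    """Return (offset, site, score) for every window scoring at or above threshold."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        site = sequence[start:start + WIDTH]
        score = pwm_score(site)
        if score >= threshold:
            hits.append((start, site, score))
    return hits


# Hypothetical sequence and threshold, chosen only for illustration.
hits = scan("GGATACGGTCC", threshold=5.0)
```

Lowering the threshold admits more windows, which is the false-positive problem the slide refers to: short, degenerate matrices match often by chance.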
TRANSFAC/JASPAR PWM
Multi-species conservation
Human genome
Non-Independence of binding site positions
Bacteriophage Mnt prefers binding to C, instead of wild-type A, at position 16 when wild-type C at position 17 is changed to other bases. (Man and Stormo, 2001, NAR)
Barash, Elidan, Friedman, Kaplan, 2003, RECOMB
Osada, Zaslavsky and Singh, 2004, Bioinformatics
Binding site representation
ATACGGT
ATACCGT
CGCGGCA
CGAGCCT
WEIGHT MATRIX
+1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
-1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
-1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
-1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96
Assumption of positional independence
ATACGGT
ATACCGT
CGCGGCA
CGAGCCT
A PSPA or Variable length Markov Model of binding sites is superior to the PWM model
For 95 JASPAR PWMs, PSPAM is better in 48 cases and worse in 6 cases at a significance level of 0.05.
Conservation patterns in cis-elements reveal inter-position dependence
Human ……….ACCGTGT……….ACCTTCT…………..
Chimp ……….AGCGTGT……….ACCTTGT…………..
Mouse ……….TCGGTGA……….TGCTTCT…………..
Rat   ……….CCCGTGA……….AGCTTGT…………..
Dog   ……….TCGGTCT……….ACCCTCT…………..
C C G C G G G G G C
X Y
1 2 3 N (binding sites)
X Y X Y X Y
Compensatory Mutation SXY = fraction of sites for which Pr(X | Y) > Pr(X)
Pr(X) = probability of X using standard tree Markov process
Pr(X|Y) = probability of X dependent on corresponding Y branches
Scope = |X – Y|
Control-1: Randomly select i, j pairs.
Control-2: Randomly select i and then select j = i+s.
Control-3: Construct PWM Mr with the same width as M by randomly sampling columns from the 79 vertebrate PWMs in JASPAR.
Control-4: Construct PWM Mr from M by randomly shuffling the compositions at each column (position).
SX,X+1 for 79 vertebrate PWMs from JASPAR
SX,X+s decreases with increasing scope s.
However it remains significantly greater than the respective control-4 up to scope = 6
Functional relevance of positions with compensatory mutation
Evans, Donahue, Hannenhalli, RECOMB-Comparative Genomics 2006
Binding site Ambiguity/Redundancy
Several transcription factors have distinct PWMs
Several distinct transcription factors have very similar PWMs
ACCGTGTTT
ACCGACTTT
ACCGTGAAT
ACCGTGTTT
TCCGTGTTT
TCAGTGTTT
TCTGTGTTT
TCGGTGTTT
PWM
PWM1
PWM2
A mixture model allowing an arbitrary number of base PWMs
Given mixture $\{(\lambda_1, M_1), \ldots, (\lambda_k, M_k)\}$, the probability of observing sequence $X_i = (X_{i1}, \ldots, X_{in})$ is
$$\Pr(X_i \mid \lambda_1, \ldots, \lambda_k, M_1, \ldots, M_k) = \sum_{j=1}^{k} \lambda_j \prod_{u=1}^{n} M_j[u, X_{iu}]$$
Use EM algorithm to estimate subclasses
We use k=2 base class PWMs (due to lack of data and lack of knowledge of appropriate number of classes)
Enhancing Positional Weight Matrices using Mixture models
Hannenhalli and Wang, Bioinformatics, 2005
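A minimal sketch of the mixture probability: each base PWM contributes its mixing weight times the product of its per-position base probabilities. The data layout (a PWM as a list of per-position dictionaries), the function names and the toy numbers are all assumptions made for illustration. The `responsibilities` function shows a generic EM E-step; it is not claimed to match the published implementation.

```python
def site_probability(site, pwm):
    """Probability of a site under one PWM (list of per-position base->prob dicts)."""
    prob = 1.0
    for position, base in enumerate(site):
        prob *= pwm[position][base]
    return prob


def mixture_probability(site, mixture):
    """Pr(site | mixture) = sum_j lambda_j * prod_u M_j[u, site_u]."""
    return sum(weight * site_probability(site, pwm) for weight, pwm in mixture)


def responsibilities(site, mixture):
    """Posterior probability of each base PWM given the site (generic EM E-step).

    Raises ZeroDivisionError if no base PWM can generate the site.
    """
    joint = [weight * site_probability(site, pwm) for weight, pwm in mixture]
    total = sum(joint)
    return [value / total for value in joint]


# Toy two-position PWMs with k=2 base classes; all numbers are made up.
M1 = [{"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0},
      {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
M2 = [{"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
      {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}]
mixture = [(0.5, M1), (0.5, M2)]

p = mixture_probability("AG", mixture)
posterior = responsibilities("AG", mixture)
```

In a full EM loop the responsibilities would be used to re-estimate each base PWM and its mixing weight until the likelihood stops improving.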
Based on 64 Vertebrate TF entries in JASPAR database
Sequence conservation of binding sites using Mixture model
[Bar chart, y-axis 0% to 80%. Categories: at least one PWM more conserved; mixture more conserved; both PWMs more conserved. Bar labels as extracted: 48, 39, 23.]
Subclass Dissimilarity vs Prediction Improvement
[Bar chart: share of cases (0% to 100%) marked Better vs Worse, by minimum relative entropy between the two base PWMs, running from less dissimilar to more dissimilar.]
Relative entropy   Cases   Better
>= 0               64      39
>= 0.8             57      36
>= 1               44      30
>= 1.2             32      23
>= 1.4             20      15
>= 1.6             16      13
Expression Coherence of target genes using mixture model
PWM1 PWM2
EC of a set of genes is the fraction of gene-pairs whose expressions across several tissues/conditions are “very” similar
Is the intra-class EC higher than inter-class EC?
In 44 of the 55 (80%) cases, the average expression coherence within subclass-PWM targets was higher than the expression coherence across subclass targets.
In all but one case (98%), at least one of the two subclass PWMs had a coherence score higher than the cross coherence score.
Hannenhalli and Wang, Bioinformatics, 2005
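Expression coherence, as defined above, is a fraction of gene pairs, so it can be sketched in a few lines. The slide does not say how "very" similar is measured; the Pearson correlation and the 0.9 cutoff below are assumptions chosen for illustration, as are the function names and toy profiles.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def expression_coherence(profiles, threshold=0.9):
    """Fraction of gene pairs whose profiles correlate at or above the threshold.

    ASSUMPTION: similarity = Pearson correlation; the slide leaves this unspecified.
    """
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    similar = sum(1 for x, y in pairs if pearson(x, y) >= threshold)
    return similar / len(pairs)


# Toy expression profiles across three conditions (made-up values).
profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
ec = expression_coherence(profiles)
```

Intra-class EC would be computed on the targets of one subclass PWM, and inter-class EC on pairs that mix targets of the two subclasses.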
LEU3 Dataset [Liu et al., 2002]
Free energy of binding available for 46 observed binding sites of LEU3 [Liu et al., 2002]
The two clusters from the EM algorithm have significantly different binding energies.
ACCGTCTCAAACCGTGTGAAAGCGTGCCCTACGGTGCCCATGGCCGCCGATCGCACTCTTTGCCCCTGCTTGGCCCTCTT
I
II
III
IV
V
Horizontal Partitioning
Vertical Partitioning
Bi-clustering based modeling
ATACGGT
ATACCGT
CGCGGCA
CGAGCCT
ACCGTGTTT
ACCGACTTT
ACCGTGAAT
ACCGTGTTT
TCCGTGTTT
TCAGTGTTT
TCTGTGTTT
TCGGTGTTT
Vertical partitioning
Horizontal partitioning
X
YX
Z X
Context-dependent binding specificity
Binding site Ambiguity/Redundancy
Several transcription factors have distinct PWMs
Several distinct transcription factors have very similar PWMs
TESS
+1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
-1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
-1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
-1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96

+1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
-1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
-1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
-1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96
32 Class
80 Family
117 Subfamily
1034 factors
DNA Binding Domain
Interaction Domain
Conserved DBD
Redundant paralogs
Divergent
Promoter
Divergent nDBD
Once upon a time a transcription factor gene was duplicated
Promoter
Divergent Expression
Hypothesis: Homologous TF-pairs with similar DBD have diverged in expression.
Control: Homologous non-TF pairs
Homologous TF-pairs with dissimilar DBD
T1
D(X,Y) = |EX – EY|
T158
TF X
TF Y
Ti
Homologous TFs with Similar vs Non-Similar Binding in a Human Thyroid
[Plot: x-axis expression divergence (0 to 2); y-axis 0 to 0.9; series: homologous TFs with similar binding, homologous TFs with non-similar binding.]
416 homologous TF-pairs (BLAST E-value <= E-10)
125 with similar binding (p-value <= 0.02)
In thyroid tissue the hypothesis holds (Mann-Whitney p-value = 0.00156)
TFs with similar binding are more similar overall. Thus a greater expression divergence is surprising.
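The per-tissue comparison can be sketched as follows: compute D(X,Y) = |EX – EY| for each homologous pair, then compare the two groups of divergences with a Mann-Whitney test. This is a simplified stand-in, not the analysis code behind the slide: the p-value uses the normal approximation with no tie or continuity correction, and the function names and expression values are hypothetical.

```python
from math import erfc, sqrt


def expression_divergence(expr_x, expr_y):
    """D(X, Y) = |E_X - E_Y| for one TF pair in one tissue."""
    return abs(expr_x - expr_y)


def mann_whitney(sample_a, sample_b):
    """Return (U for sample_a, two-sided p) via the normal approximation.

    Ties get average ranks; no tie or continuity correction is applied.
    """
    pooled = sorted((value, idx) for idx, value in enumerate(sample_a + sample_b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = average_rank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, erfc(abs(z) / sqrt(2))


# Made-up expression levels (E_X, E_Y) for TF pairs in a single tissue.
similar_binding = [expression_divergence(x, y)
                   for x, y in [(5.0, 9.1), (2.0, 6.5), (7.0, 3.2)]]
other_binding = [expression_divergence(x, y)
                 for x, y in [(4.0, 4.6), (6.0, 5.1), (3.0, 3.3)]]
u_stat, p_value = mann_whitney(similar_binding, other_binding)
```

Repeating the test once per tissue or sample gives the kind of per-threshold counts tabulated for yeast and human.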
In Yeast, 219 homologous TFs, 35 with similar binding
In a total of 57 samples (Spellman)
In Human, 416 homologous TFs, 125 with similar binding
In a total of 158 samples (Novartis)
p-value Number of Yeast Samples
0.1 49.1% (28)
0.05 33.3% (19)
0.01 1.8% (1)
p-value Number of Human Tissues – MW test
0.1 91.7% (145)
0.05 87.3% (138)
0.01 74.7% (118)
Transcription Factor cooperation/interaction
Expression Coherence
Pilpel et al. (2001). Nat Genet.
Banerjee and Zhang (2003) NAR
Positional Coherence
Hannenhalli and Levy (2002). NAR.
Interaction-dependent binding
Interaction-dependent binding
Transcription Factor F
ChIP-chip: set of gene promoters bound by F
DNA binding motif M of F
Bound promoters (P)
Unbound promoters (B)
Can M discriminate between
P and B?
The answer is NO for a large fraction of transcription factors
Perhaps binding of F depends (synergistic or antagonistic) on other motifs
$$Y_{ij} = \mu_j + a_j x_{ij} + \sum_{k \in R_j} b_{jk} x_{ik} + \varepsilon_{ij}$$
Wang, Jensen, Hannenhalli RECOMB-Regulation 2005
The ChIP-chip data for a majority of TFs is better explained using interaction-dependent binding.
Almost all of the Yeast cell cycle interactions were detected at 10% prediction rate
When applied to genome-wide CREB binding in rat, 15 of the 18 detected interactions have varying degrees of support.
PWM based occupancy probability
Binding probability (ChIP)
PWM based occupancy probability
Interaction coefficient
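Read together with the labels above, the interaction-dependent model is a linear regression: the ChIP binding probability of a promoter is a baseline plus a term for the factor's own PWM-based occupancy plus one term per interacting motif. The sketch below evaluates that linear predictor for a single promoter with the noise term omitted. The function name, argument names and all numbers are hypothetical, and fitting the coefficients is not shown.

```python
def predicted_binding(mu, a, x_self, interactions, x_other):
    """Evaluate mu + a * x_self + sum_k b_k * x_k (noise term omitted).

    mu           -- baseline for the factor
    a            -- coefficient on the factor's own PWM-based occupancy
    x_self       -- PWM-based occupancy probability of the factor's motif
    interactions -- {motif name: interaction coefficient b_k}
    x_other      -- {motif name: PWM-based occupancy probability x_k}
    """
    return mu + a * x_self + sum(b * x_other[k] for k, b in interactions.items())


# Made-up coefficients and occupancies for one promoter.
y = predicted_binding(
    mu=0.1,
    a=0.5,
    x_self=0.8,
    interactions={"motif_K": 0.2},
    x_other={"motif_K": 0.5},
)
```

A positive interaction coefficient corresponds to a synergistic partner motif and a negative one to an antagonistic partner.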
Co-regulated genes have common binding sites in their promoters
BCL2-antagonist(BAD)
B-cell CLL/lymphoma 2(BCL2)
Apoptosis Pathway
AP-2, CREB, E2F, cMyc, NF-Kappa-b, c-ETS, Egr-1 etc.
68 TFs
89 TFs
37 TFs in common
Hypergeometric p-val = E-11
374
89 3768
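The significance of an overlap like 37 shared TFs between sets of 68 and 89 can be checked with a hypergeometric tail probability. This sketch assumes the universe is the 374 shown in the figure residue, which the extracted slide does not confirm, so the result need not reproduce the reported p-value. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(universe, set_a, set_b, overlap):
    """P(overlap >= observed) when set_b items are drawn at random from the universe."""
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    favourable = sum(
        comb(set_a, i) * comb(universe - set_a, set_b - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# ASSUMPTION: 374 is the total number of TFs considered.
p_value = hypergeom_tail(universe=374, set_a=68, set_b=89, overlap=37)
```

Exact integer arithmetic with `math.comb` avoids the underflow that a naive product of small probabilities would hit for tails this far from the mean.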
Interacting proteins have greater similarity in their promoter regions
Hannenhalli and Levy (2003). Mamm Genome
Transcriptional module discovery
Singular Value Decomposition
1 1 1 0 0 1
1 1 0 0 1 0
1 1 1 0 0 0
0 0 0 1 0 1
1 0 0 0 1 1
1 0 1 1 0 1
0 1 0 1 0 0
1 0 1 0 0 0
Distance Matrix K-means Clustering
Genes
TFs
Cluster of genes and discriminating TF
Clique enumeration in bipartite graphs
Genes
TFs
Gene
Tissue
Tissue-Specific Transcriptional Module
TF
Tissue
Binding prediction
Tissue specificity by expression level [Schug et al 2005]
Transcriptional module specific to a tissue type
Everett, Wang, Hannenhalli, ISMB 2006
Transcriptional Regulation in Cardiac Myocytes
Frey N, Olson EN. Annu Rev Physiol. 2003;65:45-79.
Large tissue bank from Temple and Penn
Failing explanted hearts (n=173)
Non-failing hearts from unused donors (n=16)
Each hybridized with an HU133A (n=189)
Conservative analysis: RMA (Bioconductor), SAM
Expression profiling in advanced heart failure
~3000 dysregulated genes in advanced human HF with FDR < 5%.
Is there any evidence that specific transcription factors are directing these changes?
Set of transcripts represented on array (~20-40K)
Genomic sequences -5kb promoter regions
TF binding site annotation for all transcripts
TF targets altered in disease
Refseq
Transfac / Human-Mouse conservation
TF binding sites over-represented in disease
Expression Data (diseased and control)
Annotation
Analysis
Transcriptional Genomics
Differentially expressed Genes (G)
Background Set (B)
Statistical significance is computed using 1000 random samplings of genes from the background set
Score(x) = freq(x) in G / freq(x) in B
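The enrichment score and its sampling-based significance can be sketched as below. The slide does not give the sampling details, so drawing samples of the same size as G without replacement and counting samples that score at least as high are assumptions, as are the function names and the toy annotation.

```python
import random


def frequency(motif, genes, annotation):
    """Fraction of genes whose promoter is annotated with the motif."""
    return sum(1 for g in genes if motif in annotation[g]) / len(genes)


def enrichment(motif, gene_set, background, annotation):
    """Score(x) = freq(x) in G / freq(x) in B (undefined if x is absent from B)."""
    return (frequency(motif, gene_set, annotation)
            / frequency(motif, background, annotation))


def sampling_p_value(motif, gene_set, background, annotation,
                     n_samples=1000, seed=0):
    """Fraction of random background samples scoring at least as high as G."""
    rng = random.Random(seed)
    observed = enrichment(motif, gene_set, background, annotation)
    at_least = 0
    for _ in range(n_samples):
        sample = rng.sample(background, len(gene_set))
        if enrichment(motif, sample, background, annotation) >= observed:
            at_least += 1
    return at_least / n_samples


# Toy data: 20 background genes, 5 of which carry a hypothetical motif hit.
background = ["g%d" % i for i in range(20)]
annotation = {g: ({"MEF-2"} if i < 5 else set()) for i, g in enumerate(background)}
gene_set = ["g0", "g1", "g2", "g3"]

score = enrichment("MEF-2", gene_set, background, annotation)
p_value = sampling_p_value("MEF-2", gene_set, background, annotation)
```

With 1000 samples the smallest non-zero p-value is 0.001, which matches the resolution of the p-values listed in the table of enriched factors.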
TRANSFAC ID   Fold enrichment   p-value   Factor
M00471        1.70              0.000     TBP
M00318        1.63              0.001     Lentiviral_Poly_A
M00062        1.52              0.000     IRF-1
M00138        1.50              0.004     Octamer
M00291        1.48              0.000     Freac-3
M00403        1.48              0.001     aMEF-2
M00103        1.48              0.000     Clox
M00216        1.47              0.000     TATA
M01000        1.46              0.001     AIRE
M00109        1.46              0.000     C/EBPbeta
M00405        1.45              0.001     MEF-2
M00451        1.45              0.004     NKX3A
M00972        1.44              0.001     IRF
M00249        1.43              0.002     CHOP:C/EBPalpha
M00102        1.43              0.002     CDP
M00302        1.43              0.000     NF-AT
M00729        1.42              0.003     Cdx-2
M00622        1.41              0.001     C/EBPgamma
M00078        1.41              0.005     Evi-1
M00407        1.40              0.003     RSRFC4
M00616        1.39              0.004     AFP1
M00310        1.35              0.000     APOLYA
M00770        1.35              0.002     C/EBP
M00485        1.34              0.002     Nkx2-2
M00432        1.34              0.004     TTF1
M00346        1.34              0.002     GATA-1
M00478        1.34              0.003     Cdc5
M00724        1.33              0.005     HNF-3alpha
M00699        1.32              0.002     ICSBP
M00394        1.31              0.002     Msx-1
M00088        1.28              0.005     Ik-3
M00238        1.27              0.005     Barbie_Box
Transcription Factors enriched in differentially up-regulated genes
The differentially upregulated genes have a greater number (32) of enriched TFs compared to downregulated genes (6).
The ischemic and idiopathic cases are consistent
Validation of GATA, MEF2, NKx, NFAT transcription factors in human heart failure
Potential role for FOX factors and IRF
What about early events?
Mice with infarcts and sham operated controls sacrificed at varying times after surgery (1, 4, 8, 24 hrs, 8 wks)
Analysis of differentially co-regulated gene clusters reveals a consistent set of transcription factors.
FOX factor Summary
FOX targets change substantially in advanced human HF and in early HF in mice.
FOX factors are present in human heart at physiologic levels: FOXP1, P4, C1, C2, J2
FOXP1 is localized to nuclei of human cardiac myocytes.
Do FOX factors mediate cardiac hypertrophy?
Hannenhalli et al. Circulation, 2006
Naïve (N)
Conditioned Stimulus only (CS)
Fear Conditioned (FC)
Gene Regulation in Learning and Memory
Hippocampus
Amygdala
Keeley et al. Memory and Learning, 2006
Immediate Early Gene Expression is Regulated by Many Transcription Factors
http://web1.tch.harvard.edu/research/greenberg/oldsite/Pathways.html
50 Most Significantly Regulated Genes were Used for Further Analysis
rank  Symbol    Molecular Role                          N (log2)  CS vs N (%)  FC vs N (%)
1     Fos       DNA-binding transcription factor        5.4       134          153
2     Ssty1     unknown                                 3.9       -22          -24
3     Ssty2     unknown                                 4.8       -27          -31
4     Dusp1     Phosphatase                             7.4       33           36
5     Cd84      cell adhesion                           7.6       -30          -33
6     Pura      DNA-binding transcription factor        7.7       41           33
7     Nr4a1     DNA-binding transcription factor        7.2       33           40
8     Egr1      DNA-binding transcription factor        8.4       27           30
9     Cacna2d1  voltage dependent calcium channel       6.7       31           34
10    Junb      DNA-binding transcription factor        7.6       20           24

rank  Symbol    Molecular Role                          N (log2)  CS vs N (%)  FC vs N (%)
1     Junb      DNA-binding transcription factor        7.11      32           55
2     Fos       DNA-binding transcription factor        5.26      123          200
3     Nr4a1     DNA-binding transcription factor        6.44      28           43
4     Ier2      unknown                                 3.97      30           38
5     ly6e      unknown                                 6.47      17           21
6     Stk19     serine/threonine kinase                 6.61      14           14
7     Gadd45g   upstream activator of p38 and JNK MAPKs 6.18      18           26
8     Egr1      DNA-binding transcription factor        7.98      32           45
9     Aaas      nuclear pore/adapter                    5.42      21           25
10    Mlf2      unknown                                 9.54      13           18
Hippocampus
Amygdala
Hippocampus- and Amygdala-specific promoter modeling
Hippocampus:
CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Nrf-1
Amygdala:
CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Ets1, Elk1, Myc/Max, USF
Promoter models were able to predict regulation of less significant genes with some system specificity
[Figure, two bar charts. Panel A: Genes Predicted by Hippocampus Promoter Model. Panel B: Genes Predicted by Amygdala Promoter Model. x-axis: Tissue Examined (Hippocampus, Amygdala); y-axis: Average Change FC vs N, 0% to 8%.]
Core Promoter : Minimal DNA sequence required for the assembly of the Pre-initiation complex (~100 bps flanking the TSS)
Goal : Determine sequence properties responsible for precise Pol-II localization
1990 1995 2000 2006
TATA
PromoterScan
Promoter1.0
Autogene
PromFind
TSSG
Claverie
NNPP
CorePromoter
PromoterInspector
Hannenhalli
FirstEF
Dragon
PSPA
CpG island line
CpG Islands
Unmethylated GC-rich regions (experimental)
GC-rich regions (≥ 200 bp) on the genome with high CG di-nucleotide frequency (computational)
$$f_G + f_C \ge 0.5 \quad \text{AND} \quad \frac{f_{CG}}{f_C\, f_G} \ge 0.6$$
Gardiner-Garden and Frommer, 1987
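The computational definition can be sketched as a single predicate on a candidate region: length of at least 200 bp, G+C fraction of at least 0.5, and an observed-to-expected CpG ratio of at least 0.6. The function name and default parameter names are hypothetical, and this checks one fixed region rather than sliding a window along a genome.

```python
def is_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply length, GC-content and CpG observed/expected criteria to one region."""
    seq = seq.upper()
    n = len(seq)
    if n < min_length:
        return False
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        return False  # expected CpG frequency is zero, ratio undefined
    # "CG" cannot overlap itself, so str.count gives the dinucleotide count.
    f_cg = seq.count("CG") / n
    return (f_c + f_g) >= min_gc and f_cg / (f_c * f_g) >= min_obs_exp


# Toy regions (made up): CpG-rich, GC-rich but CpG-depleted, and AT-only.
cpg_rich = "CG" * 100
cpg_depleted = "CCCCGGGG" * 25
at_only = "AT" * 100
```

The second toy region shows why GC content alone is not enough: it is 100% G+C but has an observed/expected CpG ratio of 0.5, so it fails the dinucleotide criterion.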
About half of all genes have a CpG island overlapping the first exon.
Antequera and Bird, 1993
Long range sequence characteristics (10kb)
Short genomic sub-regional signal, e.g. CpG island (0.5~2kb)
Specific cis elements (e.g. TATA)
Categories of DNA sequence “signals” used in promoter prediction
TSS
Generalization of Markov Models
Wang and Hannenhalli, BMC BI, 2005
Position Specific Propensity Analysis (PSPA)
PSPA based Model
Use ±100 bp around the TSS as training data
Wang and Hannenhalli, BBRC, 2006
Overlap between prediction tools
Carninci et al. (2006). "Genome-wide analysis of mammalian promoter architecture and evolution." Nat Genet 38(6): 626-635.
CpG-poor promoters have greater conservation and fewer aTSS, and are mostly involved in extra-cellular and stress-response activities.
By including position-specific motifs and their co-occurrence, PSPA improves transcription start site localization.
Many position-specific elements are associated with target gene function.
There is little overlap among various state-of-the-art prediction tools.
Alternative promoters have tissue-specific usage
Acknowledgement
Junwen Wang  PCBI, UPenn
Larry Singh  PCBI, UPenn
Li-San Wang  Biology, UPenn
Shane Jensen  Statistics, Wharton, UPenn
Perry Evans, Greg Donahue  Genomics and Comp Bio, UPenn
Tom Cappola  Cardiology, UPenn
Mike Keeley  Biology, UPenn
Ted Abel  Biology, UPenn