Deciphering Gene Regulatory Networks by in silico approaches

Sridhar Hannenhalli Penn Center for Bioinformatics

Department of Genetics, University of Pennsylvania

Transcriptional Regulation

TF-DNA binding

Interactions and

Modules

Transcription Start Site

Core promoter prediction

TF-DNA binding

TF-TF interactions

Transcriptional Modules

Applications

Overview

Identification

Representation

Discovery (motif-discovery)

Search

Ambiguity/Redundancy

Binding site identification

SELEX

Deletion/Mutation

ChIP-chip

ATACGGT

ATACCGT

ATCGGCA

AAAGGCT

CONSENSUS

A T A S G S T

WEIGHT MATRIX (rows A, C, G, T; columns are positions 1 to 7)

A   +1.2    0.0    0.96  -1.6   -1.6   -1.6    0.0
C   -1.6   -1.6    0.0    0.59   0.0    0.59  -1.6
G   -1.6   -1.6   -1.6    0.59   0.96   0.59  -1.6
T   -1.6    0.96  -1.6   -1.6   -1.6   -1.6    0.96
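The weight-matrix values on this slide can be reproduced from the four listed sites. The sketch below assumes natural-log odds against a uniform background of 0.25 and a pseudocount of 0.25 per base. Both are inferred from the numbers (for example ln(3.4) ≈ 1.2 and ln(0.2) ≈ -1.6), not stated on the slide, and the function names are illustrative.

```python
import math

BASES = "ACGT"
SITES = ["ATACGGT", "ATACCGT", "ATCGGCA", "AAAGGCT"]  # sites listed on the slide


def weight_matrix(sites, pseudocount=0.25, background=0.25):
    """Return {base: [log-odds weight per position]} for aligned sites.

    pseudocount and background are assumptions chosen to match the slide.
    """
    n_sites = len(sites)
    width = len(sites[0])
    matrix = {base: [] for base in BASES}
    for pos in range(width):
        column = [site[pos] for site in sites]
        for base in BASES:
            freq = (column.count(base) + pseudocount) / (n_sites + 4 * pseudocount)
            matrix[base].append(math.log(freq / background))
    return matrix


def score(matrix, seq):
    """Sum per-position weights; this assumes positional independence."""
    return sum(matrix[base][pos] for pos, base in enumerate(seq))


matrix = weight_matrix(SITES)
for base in BASES:
    print(base, " ".join(f"{w:6.2f}" for w in matrix[base]))
```

The additive `score` function is exactly the positional-independence assumption that later slides question.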

Specificity

Binding site search

TFs often bind to short and degenerate DNA sequences, leading to false positives

Evolutionary conservation (phylogenetic footprinting/shadowing) can help reduce the false positives

About half of the functional binding sites are not conserved

A combination of evolutionary conservation and binding site score can detect ~70% of the experimentally verified binding sites at a “False Positive” rate of 1/50kb per PWM (Levy and Hannenhalli, Mammalian Genome, 2002)
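A minimal sketch of a PWM scan with an optional conservation filter, to make the idea concrete. It is not the Levy and Hannenhalli procedure: the threshold, the per-base boolean `conserved` mask, and the forward-strand-only scan are simplifying assumptions, and the matrix is the rounded one from the earlier slide.

```python
# Rounded weight matrix from the slide (rows A, C, G, T; 7 positions).
PWM = {
    "A": [1.2, 0.0, 0.96, -1.6, -1.6, -1.6, 0.0],
    "C": [-1.6, -1.6, 0.0, 0.59, 0.0, 0.59, -1.6],
    "G": [-1.6, -1.6, -1.6, 0.59, 0.96, 0.59, -1.6],
    "T": [-1.6, 0.96, -1.6, -1.6, -1.6, -1.6, 0.96],
}


def scan(seq, pwm, threshold, conserved=None):
    """Return (start, window, score) for forward-strand windows scoring >= threshold.

    conserved is a hypothetical per-base list of booleans (e.g. derived from a
    multi-species alignment); if given, every base of a hit must be conserved.
    """
    width = len(pwm["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in pwm for base in window):
            continue  # skip windows containing N or other symbols
        s = sum(pwm[base][pos] for pos, base in enumerate(window))
        if s < threshold:
            continue
        if conserved is not None and not all(conserved[start:start + width]):
            continue
        hits.append((start, window, round(s, 2)))
    return hits


sequence = "GGATACGGTCCATAGGCTAA"
all_hits = scan(sequence, PWM, threshold=6.0)
conserved_hits = scan(sequence, PWM, threshold=6.0,
                      conserved=[True] * 10 + [False] * 10)
print(all_hits)
print(conserved_hits)
```

Requiring conservation removes the second hit here, which illustrates both the gain in specificity and the cost: a functional but non-conserved site would be lost the same way.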

TRANSFAC/JASPAR PWM

Multi-species conservation

Human genome

Non-Independence of binding site positions

Bacteriophage Mnt prefers binding to C, instead of wild-type A, at position 16 when wild-type C at position 17 is changed to other bases. (Man and Stormo, 2001, NAR)

Barash, Elidan, Friedman, Kaplan, 2003, RECOMB

Osada, Zaslavsky and Singh, 2004, Bioinformatics

Binding site representation

ATACGGT

ATACCGT

CGCGGCA

CGAGCCT

WEIGHT MATRIX (rows A, C, G, T; columns are positions 1 to 7)

A   +1.2    0.0    0.96  -1.6   -1.6   -1.6    0.0
C   -1.6   -1.6    0.0    0.59   0.0    0.59  -1.6
G   -1.6   -1.6   -1.6    0.59   0.96   0.59  -1.6
T   -1.6    0.96  -1.6   -1.6   -1.6   -1.6    0.96

Assumption of positional independence

ATACGGT

ATACCGT

CGCGGCA

CGAGCCT

A PSPA or Variable length Markov Model of binding sites is superior to the PWM model

For 95 JASPAR PWMs, PSPAM is better in 48 cases and worse in 6 cases at a significance level of 0.05.

Conservation patterns in cis-elements reveal inter-position dependence

Human  ……….ACCGTGT……….ACCTTCT…………..
Chimp  ……….AGCGTGT……….ACCTTGT…………..
Mouse  ……….TCGGTGA……….TGCTTCT…………..
Rat    ……….CCCGTGA……….AGCTTGT…………..
Dog    ……….TCGGTCT……….ACCCTCT…………..

[Figure: phylogenetic trees with leaf bases at positions X and Y, shown for binding sites 1, 2, 3, ..., N]

Compensatory mutation: S_XY = fraction of sites for which Pr(X | Y) > Pr(X)

Pr(X) = probability of X using standard tree Markov process

Pr(X|Y) = probability of X dependent on corresponding Y branches

Scope = |X – Y|

Control-1: Randomly select i, j pairs.

Control-2: Randomly select i and then select j=i+s.

Control-3: Construct PWM Mr with the same width as M by randomly sampling columns from the 79 vertebrate PWMs in JASPAR.

Control-4: Construct PWM Mr from M by randomly shuffling the compositions at each column (position).

S_{X,X+1} for 79 vertebrate PWMs from JASPAR

S_{X,X+s} decreases with increasing scope s.

However it remains significantly greater than the respective control-4 up to scope = 6

Functional relevance of positions with compensatory mutation

Evans, Donahue, Hannenhalli, RECOMB-Comparative Genomics 2006

Binding site Ambiguity/Redundancy

Several transcription factors have distinct PWMs

Several distinct transcription factors have very similar PWMs

ACCGTGTTT
ACCGACTTT
ACCGTGAAT
ACCGTGTTT
TCCGTGTTT
TCAGTGTTT
TCTGTGTTT
TCGGTGTTT

PWM

PWM1

PWM2

A mixture model allowing an arbitrary number of base PWMs

Given mixture $\{(M_1, \lambda_1), \ldots, (M_k, \lambda_k)\}$, the probability of observing sequence $X_i = (X_{i1}, \ldots, X_{in})$ is

$$\Pr(X_i \mid M_1, \ldots, M_k, \lambda_1, \ldots, \lambda_k) = \sum_{j=1}^{k} \lambda_j \prod_{u=1}^{n} M_j[X_{iu}, u]$$

Use EM algorithm to estimate subclasses

We use k=2 base class PWMs (due to lack of data and lack of knowledge of appropriate number of classes)

Enhancing Positional Weight Matrices using Mixture models

Hannenhalli and Wang, Bioinformatics, 2005
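A minimal sketch of the mixture probability and a plain EM fit with k=2, using the eight example sites from the earlier slide. This is not the published implementation: the random soft initialization, the pseudocount of 0.5, the fixed iteration count, and all function names are assumptions. The matrices here hold per-column base probabilities rather than log-odds weights.

```python
import math
import random

BASES = "ACGT"
SITES = ["ACCGTGTTT", "ACCGACTTT", "ACCGTGAAT", "ACCGTGTTT",
         "TCCGTGTTT", "TCAGTGTTT", "TCTGTGTTT", "TCGGTGTTT"]


def mixture_prob(seq, pwms, lambdas):
    """Pr(X | M_1..M_k, lambda_1..lambda_k) = sum_j lambda_j * prod_u M_j[X_u, u]."""
    total = 0.0
    for pwm, lam in zip(pwms, lambdas):
        total += lam * math.prod(pwm[u][base] for u, base in enumerate(seq))
    return total


def fit_mixture(seqs, k=2, iters=50, pseudo=0.5, seed=0):
    """EM for a k-component mixture of probability matrices (a sketch)."""
    rng = random.Random(seed)
    width = len(seqs[0])
    # Random soft assignment of each site to the k subclasses.
    resp = []
    for _ in seqs:
        row = [rng.random() + 0.1 for _ in range(k)]
        resp.append([r / sum(row) for r in row])
    for _ in range(iters):
        # M-step: mixing weights and per-column base probabilities.
        lambdas = [sum(r[j] for r in resp) / len(seqs) for j in range(k)]
        pwms = []
        for j in range(k):
            columns = []
            for u in range(width):
                counts = {b: pseudo for b in BASES}
                for seq, r in zip(seqs, resp):
                    counts[seq[u]] += r[j]
                total = sum(counts.values())
                columns.append({b: c / total for b, c in counts.items()})
            pwms.append(columns)
        # E-step: posterior probability of each subclass for each site.
        resp = []
        for seq in seqs:
            joint = [lam * math.prod(pwm[u][b] for u, b in enumerate(seq))
                     for pwm, lam in zip(pwms, lambdas)]
            resp.append([x / sum(joint) for x in joint])
    return pwms, lambdas, resp


pwms, lambdas, resp = fit_mixture(SITES)
for site, r in zip(SITES, resp):
    print(site, [round(x, 2) for x in r])
```

EM only finds a local optimum, so in practice several seeds would be tried and the best likelihood kept.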

Based on 64 vertebrate TF entries in the JASPAR database

Sequence conservation of binding sites using Mixture model

[Figure: bar chart, Y-axis 0% to 80%. Bars: At least one PWM more conserved; Mixture more conserved; Both PWMs more conserved. Bar labels as extracted: 48, 39, 23.]

Subclass Dissimilarity vs Prediction Improvement

Relative entropy between two base PWMs    Cases    Better
>= 0                                      64       39
>= 0.8                                    57       36
>= 1                                      44       30
>= 1.2                                    32       23
>= 1.4                                    20       15
>= 1.6                                    16       13

[Figure: stacked bars (Better vs Worse, 0% to 100%) per threshold, ordered from less dissimilar to more dissimilar.]
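A sketch of one way to compute the relative entropy between two base PWMs. The slide does not give the exact definition, so summing column-wise Kullback-Leibler divergences, using log base 2, and symmetrizing by averaging both directions are all assumptions.

```python
import math


def column_relative_entropy(p, q):
    """D(p || q) in bits for one column; requires q[b] > 0 wherever p[b] > 0."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)


def pwm_relative_entropy(pwm_p, pwm_q):
    """Sum of column-wise relative entropies for two equal-width matrices."""
    return sum(column_relative_entropy(p, q) for p, q in zip(pwm_p, pwm_q))


def symmetric_relative_entropy(pwm_p, pwm_q):
    """Average of both directions, so the order of the two PWMs does not matter."""
    return 0.5 * (pwm_relative_entropy(pwm_p, pwm_q)
                  + pwm_relative_entropy(pwm_q, pwm_p))


uniform = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
skewed = [{"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}]
print(pwm_relative_entropy(skewed, uniform))
```

Matrices estimated with pseudocounts have no zero entries, which keeps both directions finite.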

Expression Coherence of target genes using mixture model

PWM1 PWM2

EC of a set of genes is the fraction of gene-pairs whose expressions across several tissues/conditions are “very” similar

Is the intra-class EC higher than inter-class EC?

In 44 of the 55 (80%) cases, the average expression coherence within subclass-PWM targets was higher than the expression coherence across subclass targets.

In all but one case (98%), at least one of the two subclass PWMs had a coherence score higher than the cross coherence score.

Hannenhalli and Wang, Bioinformatics, 2005
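A sketch of expression coherence as defined above: the fraction of gene pairs whose expression profiles are "very" similar. The slide does not say how similarity is measured, so Pearson correlation with a cutoff of 0.9 is an assumption, as are the toy profiles.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def expression_coherence(profiles, threshold=0.9):
    """Fraction of gene pairs with correlation >= threshold.

    profiles maps gene name -> expression values across tissues/conditions.
    """
    pairs = list(combinations(profiles, 2))
    similar = sum(1 for a, b in pairs
                  if pearson(profiles[a], profiles[b]) >= threshold)
    return similar / len(pairs)


toy_profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
}
ec = expression_coherence(toy_profiles)
print(ec)
```

Intra-class EC would apply this to the targets of one subclass PWM; an inter-class version would count only pairs that span the two target sets.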

LEU3 Dataset [Liu et al., 2002]

Free energy of binding available for 46 observed binding sites of LEU3 [Liu et al., 2002]

The two clusters from the EM algorithm have significantly different binding energies.

ACCGTCTCAA
ACCGTGTGAA
AGCGTGCCCT
ACGGTGCCCA
TGGCCGCCGA
TCGCACTCTT
TGCCCCTGCT
TGGCCCTCTT

I

II

III

IV

V

Horizontal Partitioning

Vertical Partitioning

Bi-clustering based modeling

ATACGGT

ATACCGT

CGCGGCA

CGAGCCT

ACCGTGTTT
ACCGACTTT
ACCGTGAAT
ACCGTGTTT
TCCGTGTTT
TCAGTGTTT
TCTGTGTTT
TCGGTGTTT

Vertical partitioning

Horizontal partitioning

X

YX

Z X

Context-dependent binding specificity

Binding site Ambiguity/Redundancy

Several transcription factors have distinct PWMs

Several distinct transcription factors have very similar PWMs

TESS

+1.2    0.0    0.96  -1.6   -1.6   -1.6    0.0
-1.6   -1.6    0.0    0.59   0.0    0.59  -1.6
-1.6   -1.6   -1.6    0.59   0.96   0.59  -1.6
-1.6    0.96  -1.6   -1.6   -1.6   -1.6    0.96

+1.2    0.0    0.96  -1.6   -1.6   -1.6    0.0
-1.6   -1.6    0.0    0.59   0.0    0.59  -1.6
-1.6   -1.6   -1.6    0.59   0.96   0.59  -1.6
-1.6    0.96  -1.6   -1.6   -1.6   -1.6    0.96

32 Class

80 Family

117 Subfamily

1034 factors

DNA Binding Domain

Interaction Domain

Conserved DBD

Redundant paralogs

Divergent

Promoter

Divergent nDBD

Once upon a time a transcription factor gene was duplicated

Promoter

Divergent Expression

Hypothesis: Homologous TF-pairs with similar DBD have diverged in expression.

Controls: homologous non-TF pairs; homologous TF-pairs with dissimilar DBD

For each tissue T_i (T1, ..., T158), the expression divergence between TF X and TF Y is D(X,Y) = |E_X – E_Y|

Homologous TFs with Similar vs Non-Similar Binding in a Human Thyroid

[Figure: distributions of expression divergence (X-axis, 0 to 2; Y-axis, 0 to 0.9) for two series: Homologous TFs with Similar Binding; Homologous TFs with Non-Similar Binding.]

416 homologous TF-pairs (BLAST E-value <= E-10)

125 with similar binding (p-value <= 0.02)

In thyroid tissue the hypothesis holds (Mann-Whitney p-value = 0.00156)

TFs with similar binding are more similar overall. Thus a greater expression divergence is surprising.
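A sketch of the per-tissue comparison: compute D(X,Y) = |E_X – E_Y| for each homologous pair, then compare the two groups of divergences with a Mann-Whitney test. The test is written out in plain Python using the normal approximation without tie correction, which is a simplification, and the divergence values are made up for illustration.

```python
import math


def expression_divergence(e_x, e_y):
    """D(X, Y) = |E_X - E_Y| for one tissue."""
    return abs(e_x - e_y)


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for _, group in pooled[i:j] if group == 0)
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical divergences in one tissue for the two groups of TF pairs.
similar_binding = [1.1, 1.4, 1.6]
non_similar_binding = [0.2, 0.3, 0.5, 0.9]
u_stat, p_value = mann_whitney_u(similar_binding, non_similar_binding)
print(u_stat, round(p_value, 4))
```

Repeating this for every tissue or sample and counting how many fall below a p-value cutoff is one way to build summary tables like the ones that follow.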

In Yeast, 219 homologous TFs, 35 with similar binding, in a total of 57 samples (Spellman)

In Human, 416 homologous TFs, 125 with similar binding, in a total of 158 samples (Novartis)

p-value    Number of Yeast Samples
0.1        49.1% (28)
0.05       33.3% (19)
0.01       1.8% (1)

p-value    Number of Human Tissues (MW test)
0.1        91.7% (145)
0.05       87.3% (138)
0.01       74.7% (118)

Transcription Factor cooperation/interaction

Expression Coherence

Pilpel et al. (2001). Nat Genet,

Banerjee and Zhang (2003) NAR

Positional Coherence

Hannenhalli and Levy (2002). NAR.

Interaction-dependent binding

Transcription Factor F

ChIP-chip: set of gene promoters bound by F

DNA binding motif M of F

Bound promoters (P)

Unbound promoters (B)

Can M discriminate between

P and B?

The answer is NO for a large fraction of transcription factors

Perhaps binding of F depends (synergistic or antagonistic) on other motifs

$$Y_{ij} = \mu_j + a_j x_{ij} + \sum_{k \in R_j} b_{jk} x_{ik} + \varepsilon_{ij}$$

Wang, Jensen, Hannenhalli RECOMB-Regulation 2005
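A small sketch of evaluating the interaction-dependent binding model for one promoter i and factor j, assuming the linear form Y_ij = mu_j + a_j x_ij + sum over k in R_j of b_jk x_ik (noise term omitted). The coefficient values and partner names are made up, and fitting the coefficients is not shown.

```python
def predicted_binding(mu_j, a_j, x_ij, b_j, x_i):
    """Model prediction for factor j at promoter i, without the noise term.

    mu_j : baseline for factor j
    a_j  : weight on the factor's own PWM based occupancy x_ij
    b_j  : {partner k: interaction coefficient b_jk} for partners in R_j
    x_i  : {partner k: PWM based occupancy x_ik at promoter i}
    """
    return mu_j + a_j * x_ij + sum(b_jk * x_i[k] for k, b_jk in b_j.items())


# Hypothetical numbers: one synergistic and one antagonistic partner.
y_hat = predicted_binding(
    mu_j=0.1, a_j=0.5, x_ij=0.8,
    b_j={"TF2": 0.3, "TF3": -0.2},
    x_i={"TF2": 0.5, "TF3": 1.0},
)
print(round(y_hat, 2))
```

A positive b_jk corresponds to synergistic binding and a negative one to antagonistic binding.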

The ChIP-chip data for a majority of TFs is better explained using interaction-dependent binding.

Almost all of the Yeast cell cycle interactions were detected at 10% prediction rate

When applied to genome-wide CREB binding in rat, 15 of the 18 detected interactions have varying degrees of support.

Model terms: Y = binding probability (ChIP); x = PWM based occupancy probability; b = interaction coefficient

Co-regulated genes have common binding sites in their promoters

BCL2-antagonist(BAD)

B-cell CLL/lymphoma 2(BCL2)

Apoptosis Pathway

AP-2, CREB, E2F, cMyc, NF-Kappa-b, c-ETS, Egr-1 etc.

68 TFs

89 TFs

37 TFs in common

Hypergeometric p-val = E-11

[Venn diagram labels: 374, 89, 37, 68]
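A sketch of the hypergeometric overlap test behind this comparison. It assumes 374 is the total number of TFs considered, with 68 and 89 predicted for the two promoters and 37 shared. That reading of the Venn labels is an assumption, and the function name is illustrative.

```python
from math import comb


def hypergeom_tail(overlap, total, size_a, size_b):
    """P(X >= overlap) when size_b items are drawn without replacement
    from `total` items, of which size_a are marked."""
    upper = min(size_a, size_b)
    numerator = sum(comb(size_a, k) * comb(total - size_a, size_b - k)
                    for k in range(overlap, upper + 1))
    return numerator / comb(total, size_b)


p_value = hypergeom_tail(overlap=37, total=374, size_a=89, size_b=68)
print(f"{p_value:.2e}")
```

The expected overlap by chance is about 68 × 89 / 374 ≈ 16, so 37 shared TFs sits far in the tail.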

Interacting proteins have greater similarity in their promoter regions

Hannenhalli and Levy (2003). Mamm Genome

Transcriptional module discovery

Singular Value Decomposition

1 1 1 0 0 1
1 1 0 0 1 0
1 1 1 0 0 0
0 0 0 1 0 1

1 0 0 0 1 1
1 0 1 1 0 1
0 1 0 1 0 0
1 0 1 0 0 0

Distance Matrix

K-means Clustering

Genes

TFs

Cluster of genes and discriminating TF

Clique enumeration in bipartite graphs

Genes

TFs
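A brute-force sketch of clique enumeration in a bipartite gene-TF graph, run on the first binary matrix from the previous slide. Treating rows as genes and columns as TFs is an assumption, and enumerating all gene subsets is only practical for toy inputs. It is not the method used in the talk.

```python
from itertools import combinations

# First binary matrix from the slide; rows = genes, columns = TFs (assumed).
MATRIX = [
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
]


def maximal_bicliques(matrix, min_genes=2, min_tfs=2):
    """Return a set of (genes, tfs) frozenset pairs where every gene has every
    TF and neither side can be extended."""
    n_rows = len(matrix)
    tf_sets = [frozenset(c for c, v in enumerate(row) if v) for row in matrix]
    found = set()
    for size in range(min_genes, n_rows + 1):
        for genes in combinations(range(n_rows), size):
            shared = frozenset.intersection(*(tf_sets[g] for g in genes))
            if len(shared) < min_tfs:
                continue
            # Closure: all genes carrying every shared TF. Keeping only closed
            # gene sets guarantees maximality on both sides.
            closure = frozenset(g for g in range(n_rows) if shared <= tf_sets[g])
            if closure == frozenset(genes):
                found.add((closure, shared))
    return found


modules = maximal_bicliques(MATRIX)
for genes, tfs in sorted(modules, key=lambda m: sorted(m[0])):
    print("genes", sorted(genes), "TFs", sorted(tfs))
```

Each biclique is a candidate module: a set of genes that all carry sites for the same set of TFs.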

Gene

Tissue

Tissue-Specific Transcriptional Module

TF

Tissue

Binding prediction

Tissue specificity by expression level [Schug et al 2005]

Transcriptional-Module specific to a tissue type

Everett, Wang, Hannenhalli, ISMB 2006

Transcriptional Regulation in Cardiac Myocytes

Frey N, Olson EN. Annu Rev Physiol. 2003;65:45-79.

Expression profiling in advanced heart failure

Large tissue bank from Temple and Penn

Failing explanted hearts (n=173)

Non-failing hearts from unused donors (n=16)

Each hybridized with an HU133A (n=189)

Conservative analysis: RMA (Bioconductor), SAM

~3000 dysregulated genes in advanced human HF with FDR < 5%.

Is there any evidence that specific transcription factors are directing these changes?

Set of transcripts represented on array (~20-40K)

Genomic sequences -5kb promoter regions

TF binding site annotation for all transcripts

TF targets altered in disease

Refseq

Transfac / Human -Mouse conservation

TF binding sites over-represented in disease

Expression Data (diseased and control)

Annotation

Analysis

Transcriptional Genomics

Differentially expressed Genes (G)

Background Set (B) Statistical Significance is computed using 1000 random sampling of genes from background set

Score(x) = freq(x) in G / freq(x) in B
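A sketch of the enrichment score and its sampling-based significance. It assumes freq(x) is the fraction of genes in a set whose promoter has a predicted site for factor x, and that the p-value is the fraction of 1000 random gene samples from the background whose frequency is at least the observed one. The toy annotation and gene names are made up.

```python
import random


def frequency(tf, genes, annotation):
    """Fraction of genes annotated with a binding site for tf."""
    return sum(1 for g in genes if tf in annotation.get(g, ())) / len(genes)


def enrichment(tf, gene_set, background, annotation, n_samples=1000, seed=0):
    """Return (fold enrichment, empirical p-value) for tf in gene_set.

    Requires tf to occur at least once in the background.
    """
    rng = random.Random(seed)
    freq_g = frequency(tf, gene_set, annotation)
    fold = freq_g / frequency(tf, background, annotation)
    at_least = 0
    for _ in range(n_samples):
        sample = rng.sample(background, len(gene_set))
        if frequency(tf, sample, annotation) >= freq_g:
            at_least += 1
    return fold, at_least / n_samples


# Toy data: 100 background genes, 5 differentially expressed, all 5 with a site.
background = [f"g{i}" for i in range(100)]
diff_genes = background[:5]
annotation = {g: {"MEF-2"} for g in diff_genes}
fold, p_value = enrichment("MEF-2", diff_genes, background, annotation)
print(round(fold, 2), p_value)
```

With 1000 samples the smallest nonzero p-value is 0.001, so a reported 0.000 means no random sample reached the observed frequency.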

TRANSFAC ID   Fold enrichment   p-value   Factor
M00471        1.70              0.000     TBP
M00318        1.63              0.001     Lentiviral_Poly_A
M00062        1.52              0.000     IRF-1
M00138        1.50              0.004     Octamer
M00291        1.48              0.000     Freac-3
M00403        1.48              0.001     aMEF-2
M00103        1.48              0.000     Clox
M00216        1.47              0.000     TATA
M01000        1.46              0.001     AIRE
M00109        1.46              0.000     C/EBPbeta
M00405        1.45              0.001     MEF-2
M00451        1.45              0.004     NKX3A
M00972        1.44              0.001     IRF
M00249        1.43              0.002     CHOP:C/EBPalpha
M00102        1.43              0.002     CDP
M00302        1.43              0.000     NF-AT
M00729        1.42              0.003     Cdx-2
M00622        1.41              0.001     C/EBPgamma
M00078        1.41              0.005     Evi-1
M00407        1.40              0.003     RSRFC4
M00616        1.39              0.004     AFP1
M00310        1.35              0.000     APOLYA
M00770        1.35              0.002     C/EBP
M00485        1.34              0.002     Nkx2-2
M00432        1.34              0.004     TTF1
M00346        1.34              0.002     GATA-1
M00478        1.34              0.003     Cdc5
M00724        1.33              0.005     HNF-3alpha
M00699        1.32              0.002     ICSBP
M00394        1.31              0.002     Msx-1
M00088        1.28              0.005     Ik-3
M00238        1.27              0.005     Barbie_Box

Transcription Factors enriched in differentially up-regulated genes

The differentially upregulated genes have a greater number (32) of enriched TFs compared to downregulated genes (6).

The ischemic and idiopathic cases are consistent

Validation of GATA, MEF2, NKx, NFAT transcription factors in human heart failure

Potential role for FOX factors and IRF

What about early events?

Mice with infarcts and sham operated controls sacrificed at varying times after surgery (1, 4, 8, 24 hrs, 8 wks)

Analysis of differentially co-regulated gene clusters reveals a consistent set of transcription factors.

FOX factor Summary

FOX targets change substantially in advanced human HF and in early HF in mice.

FOX factors are present in human heart at physiologic levels: FOXP1, P4, C1, C2, J2

FOXP1 is localized to nuclei of human cardiac myocytes.

Do FOX factors mediate cardiac hypertrophy?

Hannenhalli et al. Circulation, 2006

Naïve (N)

Conditioned Stimulus only (CS)

Fear Conditioned (FC)

Gene Regulation in Learning and Memory

Hippocampus

Amygdala

Keeley et al. Memory and Learning, 2006

Immediate Early Gene Expression is Regulated by Many Transcription Factors

http://web1.tch.harvard.edu/research/greenberg/oldsite/Pathways.html

50 Most Significantly Regulated Genes were Used for Further Analysis

rank  Symbol    Molecular Role                           N (log2)  CS vs N (%)  FC vs N (%)
1     Fos       DNA-binding transcription factor         5.4       134          153
2     Ssty1     unknown                                  3.9       -22          -24
3     Ssty2     unknown                                  4.8       -27          -31
4     Dusp1     Phosphatase                              7.4       33           36
5     Cd84      cell adhesion                            7.6       -30          -33
6     Pura      DNA-binding transcription factor         7.7       41           33
7     Nr4a1     DNA-binding transcription factor         7.2       33           40
8     Egr1      DNA-binding transcription factor         8.4       27           30
9     Cacna2d1  voltage dependent calcium channel        6.7       31           34
10    Junb      DNA-binding transcription factor         7.6       20           24

rank  Symbol    Molecular Role                           N (log2)  CS vs N (%)  FC vs N (%)
1     Junb      DNA-binding transcription factor         7.11      32           55
2     Fos       DNA-binding transcription factor         5.26      123          200
3     Nr4a1     DNA-binding transcription factor         6.44      28           43
4     Ier2      unknown                                  3.97      30           38
5     ly6e      unknown                                  6.47      17           21
6     Stk19     serine/threonine kinase                  6.61      14           14
7     Gadd45g   upstream activator of p38 and JNK MAPKs  6.18      18           26
8     Egr1      DNA-binding transcription factor         7.98      32           45
9     Aaas      nuclear pore/adapter                     5.42      21           25
10    Mlf2      unknown                                  9.54      13           18

Hippocampus

Amygdala

Hippocampus- and Amygdala-specific promoter modeling

Hippocampus: CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Nrf-1

Amygdala: CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Ets1, Elk1, Myc/Max, USF

Promoter models were able to predict regulation of less significant genes with some system specificity

[Figure, panel A: Genes Predicted by Hippocampus Promoter Model. X-axis: Tissue Examined (Hippocampus, Amygdala). Y-axis: Average Change FC vs N (0% to 8%).]

[Figure, panel B: Genes Predicted by Amygdala Promoter Model. X-axis: Tissue Examined (Hippocampus, Amygdala). Y-axis: Average Change FC vs N (0% to 8%).]

Core Promoter : Minimal DNA sequence required for the assembly of the Pre-initiation complex (~100 bps flanking the TSS)

Goal: Determine sequence properties responsible for precise Pol-II localization

1990 1995 2000 2006

TATA

PromoterScan

Promoter1.0

Autogene

PromFind

TSSG

Calverie

NNPP

CorePromoter

PromoterInspector

Hannenhalli

FirstEF

Dragon

PSPA

CpG island line

CpG Islands

Unmethylated GC-rich regions (experimental)

GC-rich regions (≥ 200 bp) on the genome with high CG di-nucleotide frequency (computational)

$$f_{G+C} \ge 0.5 \quad \text{AND} \quad \frac{f_{CG}}{f_C \, f_G} \ge 0.6$$

Gardiner-Garden and Frommer, 1987
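A sketch of the computational CpG island criteria applied to a single candidate region: length at least 200 bp, G+C content at least 0.5, and observed/expected CpG ratio at least 0.6. The function name and the choice to test one fixed region (rather than sliding a window along the genome) are illustrative assumptions.

```python
def is_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply length, G+C content, and observed/expected CpG thresholds."""
    seq = seq.upper()
    n = len(seq)
    if n < min_length:
        return False
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return False  # observed/expected ratio is undefined
    gc_content = (n_c + n_g) / n
    # f_CG / (f_C * f_G) written with counts: (#CG * n) / (#C * #G)
    obs_exp = seq.count("CG") * n / (n_c * n_g)
    return gc_content >= min_gc and obs_exp >= min_obs_exp


print(is_cpg_island("CG" * 100))        # CpG-dense region
print(is_cpg_island("CCCCGGGG" * 25))   # GC-rich but CpG-depleted
print(is_cpg_island("AT" * 100))        # AT-rich region
```

The second example shows why G+C content alone is not enough: the region is entirely G and C but has too few CG dinucleotides relative to expectation.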

About half of all genes have a CpG island overlapping the first exon.

Antequera and Bird, 1993

Categories of DNA sequence “signals” used in promoter prediction

Long range sequence characteristics (10kb)

Short genomic sub-regional signal, e.g. CpG island (0.5~2kb)

Specific cis elements (e.g. TATA)

TSS

Generalization of Markov Models

Wang and Hannenhalli, BMC BI, 2005

Position Specific Propensity Analysis (PSPA)

PSPA based Model

Use ±100 bp around the TSS as training data

Wang and Hannenhalli, BBRC, 2006

Overlap between prediction tools

Carninci et al. (2006). "Genome-wide analysis of mammalian promoter architecture and evolution." Nat Genet 38(6): 626-635.

CpG poor promoters have greater conservation and fewer aTSS, and are mostly involved in extra-cellular and stress-response activities.

By including position specific motifs and their co-occurrence, PSPA improves the Transcription Start site localization.

Many Position Specific elements are associated with target gene function.

There is little overlap among various state-of-the-art prediction tools.

Alternative promoters have tissue specific usage

Acknowledgement

Junwen Wang: PCBI, UPenn

Larry Singh: PCBI, UPenn

Li-San Wang: Biology, UPenn

Shane Jensen: Statistics, Wharton, UPenn

Perry Evans, Greg Donahue: Genomics and Comp Bio, UPenn

Tom Cappola: Cardiology, UPenn

Mike Keeley: Biology, UPenn

Ted Abel: Biology, UPenn