1
Predicting causes and effects in regulatory networks Problem description Connecting regulators with the genes that they regulate is difficult. While various large scale data sources are nowadays available (expression, literature, chip-chip, protein interactions, etc.), each of these is in itself an incomplete view of the regulatory network. Our goal is to integrate them. We build a network that is able to predict how a perturbation in a gene S will affect other genes T. Perturbation (S) Targets (T) Regulatory network (protein-protein interactions, transcription factors, phosphorylation) Using this predictor, we want to be able to: - given measured effects (T i ), determine most likely causal perturbation (S) - identify possible unmeasured effects (G i ) in the network (e.g. protein state changes) - find genes which form a robust group of markers (T i ) for a certain perturbation (S) S Gi Ti S Gi Ti S Gi Ti Approach Marc Hulsman, Marcel J.T. Reinders Delft Bioinformatics Lab, Faculty of Electical Engineering, Mathematics and Computer Science Delft University of Technology Perturbation G 1 G 2 G 4 G 3 S T Various (high-througput) genome-scale data sources, containing information on different types of cellular networks, are used to determine the connectivity between source and target gene: score of P(S,T) Classifier We train a predictive rule which takes the connectivity data between S and T into account to determine a score for P(S,T), describing how likely it is that a perturbation of S will affect T. This rule is trained by using knockout microarray expression data, for which we both know the perturbed gene as well the genes that show an effect. protein-protein interaction transcription factors phosphorylation chip-chip literature coexpression (growth) chip-chip tandem-affinity purification 2-hybrid coexpression (time) 2-hybrid 2-hybrid chip-chip literature Proteome chip chip-chip Method Results 1) Network construction Integrate evidence Integrate evidence Integrate evidence Protein-protein interactions Protein-DNA interactions Phosphorylation interactions Knockout microarray experiments Positive example Negative example Knockout - calculate probability of a gene to be affected by a defined knockout - uses message passing based Monte Carlo method 2) Network simulation - optimize network construction based on training set of knockouts - evolutionary strategy (CMA-ES) optimizing ROC-based score 3) Network optimization Network construction and data integration parameters 4) Validation on test set of knockouts - using cross validation Validation Discussion Building genome-wide, simulation models of cellular networks is difficult, as learning the activity of each of the millions of possible interactions easily leads to large numbers of false positives. We simplify the problem, by learning a more general rule on how to integrate various data sources, using them to determine the regulatory activity of the interactions, thereby enabling predictions of causes/effects in regulatory networks. We hope to transfer this knowledge to other species, enabling us to make maximally use of the available information in inferring regulatory networks. We perform in silico-validatioin. We leave a set of gene perturbation microarray experiments out of the data that is used for making predictions. Next, we check if the predictions that we make are validated by this left-out data The prediction method can handle path constraints, e.g. only considering effect propagations that end in a protein-DNA interaction (i.e. expression-changing). This allows for training/validation using microarray results. Affected? Not affected Affected { Effect prediction within the yeast MAPK pathway 0 50 100 150 200 250 False positives 0.0 0.2 0.4 0.6 0.8 1.0 True positives Perturbation cause prediction performance All (638 exp.) Transcription factors (258 exp.) Kinases (150 exp.) Other (238 exp.) TF knockouts (Chua et al.) (49 exp.) TF overexpression (Chua et al.) (53 exp.) Random guessing 0 50 100 150 200 250 False positives 0.0 0.1 0.2 0.3 0.4 0.5 0.6 True positives Perturbation effect prediction performance All (779 exp.) TF knockouts (Chua et al.) (50 exp.) TF overexpression (Chua et al.) (53 exp.) Topology-based predictor Permuted genes predictor Random guessing all chromosome 100 nearest genes 50 nearest genes The subset causal genes are selected from 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 AUC50 score causal gene prediction Causal gene prediction when genomic location is approximately known Cause optimized Effect optimized b) SWI4 neighborhood in MAPK (max path length 2) STE20 neighborhood in MAPK (max path length 2) Perturbation cause prediction a) A significant fraction of the measured effects can be predicted, from among all genes in the yeast network. Surprisingly, the network simulator is a much better predictor of effects, compared to replicate perturbations (i.e perturbations of the same gene obtained from different datasets) being used to predict the effects of each other. b) Similarly, a predictor based just on high quality network topology (shortest paths) also performs worse compared to the network simulation. That the network topology still plays a significant role is however shown by the low performance of the permuted network simulation, for which the genes were permuted across the network. Part of the non-predicted effects are due to indirect effects such as stress response. For instance, the most often affected gene in the knockout measurements is HSP30 (a stress response gene) . This is corroborated by gene overexpression effects being easier to predict then gene knockout effects. Perturbation effect prediction a) Based on observed effects, the method can predict the most likely causal genes. We find that regulatory causal genes are much easier to predict compared to non-regulatory genes, likely due to the fact that only regulatory effects are encoded in the network topology. b) A common usage pattern for these type of predictions will be one in which causal genes have to be determined from among a certain subselection of genes (e.g. genes affected by mutations). In these cases, performance can become significantly higher. Also, in this plot we show the difference in performance for a network optimized for causal prediction, and a network optimized for effect prediction. a) a) Effect predictions in the KEGG MAPK pathway. Blue lines indicate pathway edges, and their thickness indicates regulatory influence (inferred by the algorithm). Arrow lines show predictions (i.e. ranking in top 250) that were validated (p-value < 0.001), with the color indicating the number of false positives ranking higher within the MAPK pathway (ranging from 0 to 8, 69% <= 3). Note that the algorithm predicts feed-forward, feedback and even cross-talk effects accurately. This is possible by simulating the pathway not on its own, but within its complete cellular context. Dense neighborhoods While pathway maps often suggest relatively simple interaction topologies, in reality, the number of interactions that have been measured among even a relatively limited number of genes is often rather large, complicating accurate perturbation effect predictions. This underlines the need for the use of multiple information sources to determine the regulatory activity of these interactions. In the figures to the left, we show the neighborhood of two genes within the MAPK pathway. One is a transcription factor (SWI4), the other (STE20) is a kinase. The node color indicates if an effect has been observed in microarray studies, after perturbation of respectively SWI4 or STE20 (green: p-value < 0.001, orange: p-value < 0.05). The thickness of node borders indicate if the network simulation predicts an pertubation effect or not. Only the highest quality interactions are shown (STRING score > 900), with their predicted regulatory activity being indicated by their color. Using 10-fold cross-validation, the algorithm was validated on unseen gene perturbation microarray experiments. b) K T1 T2

Predicting causes and effects in regulatory networks

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Predicting causes and effects in regulatory networks

Predicting causes and effects in regulatory networks

Problem descriptionConnecting regulators with the genes that they regulate is difficult. While various large scale data sources are nowadays available (expression, literature, chip-chip, protein interactions, etc.), each of these is in itself an incomplete view of the regulatory network.

Our goal is to integrate them. We build a network that is able to predict how a perturbation in a gene S will affect other genes T.

Perturbation(S)

Targets(T)

Regulatory network(protein-protein interactions, transcription factors, phosphorylation)

Using this predictor, we want to be able to:- given measured effects (Ti), determine most likely causal perturbation (S)

- identify possible unmeasured effects (Gi) in the network (e.g. protein state changes)

- find genes which form a robust group of markers (Ti) for a certain perturbation (S)

S Gi Ti

S Gi Ti

S Gi Ti

Approach

Marc Hulsman, Marcel J.T. ReindersDelft Bioinformatics Lab, Faculty of Electical Engineering, Mathematics and Computer Science

Delft University of Technology

Perturbation

G1

G2

G4

G3

S T

Various (high-througput) genome-scale data sources, containing information on different types of cellular networks, are used to determine the connectivity between source and target gene:

score of P(S,T)

Classifier

We train a predictive rule which takes the connectivity data between S and T into account to determine a score for P(S,T), describing how likely it is that a perturbation of S will affect T.

This rule is trained by using knockout microarray expression data, for which we both know the perturbed gene as well the genes that show an effect.

protein-protein interaction

transcription factors

phosphorylation

chip-chipliteraturecoexpression (growth)

chip-chip

tandem-affinity purification2-hybridcoexpression (time)

2-hybrid

2-hybrid

chip-chipliterature

Proteome chip

chip-chip

Method

Results

1) N

etw

ork

co

nst

ruct

ion

Integrate evidence

Integrate evidence

Integrate evidence

Protein-protein interactions

Protein-DNA interactions

Phosphorylation interactions

Knockout microarray experiments

Positive example

Negative example

Knockout

- calculate probability of a gene to be affected by a defined knockout- uses message passing based Monte Carlo method

2) Network simulation

- optimize network construction based on training set of knockouts- evolutionary strategy (CMA-ES) optimizing ROC-based score

3) Network optimizationNetwork construction and

data integration parameters

4) Validation on test set of knockouts- using cross validation

Validation

DiscussionBuilding genome-wide, simulation models of cellular networks is difficult, as learning the activity of each of the millions of possible interactions easily leads to large numbers of false positives.

We simplify the problem, by learning a more general rule on how to integrate various data sources, using them to determine the regulatory activity of the interactions, thereby enabling predictions of causes/effects in regulatory networks.

We hope to transfer this knowledge to other species, enabling us to make maximally use of the available information in inferring regulatory networks.

We perform in silico-validatioin. We leavea set of gene perturbation microarray experiments out of the data that is used formaking predictions. Next, we check if thepredictions that we make are validated by this left-out data

The prediction method can handle path constraints, e.g. only considering effect propagations that end in a protein-DNA interaction (i.e. expression-changing). This allows for training/validation using microarray results.

Affected?

Not affected

Affected

{Effect prediction within the yeast MAPK pathway

0 50 100 150 200 250False positives

0.0

0.2

0.4

0.6

0.8

1.0

Truepositives

Perturbation cause prediction performance

All (638 exp.)Transcription factors (258 exp.)Kinases (150 exp.)Other (238 exp.)TF knockouts (Chua et al.) (49 exp.)TF overexpression (Chua et al.) (53 exp.)Random guessing

0 50 100 150 200 250False positives

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Truepositives

Perturbation effect prediction performance

All (779 exp.)TF knockouts (Chua et al.) (50 exp.)TF overexpression (Chua et al.) (53 exp.)Topology-based predictorPermuted genes predictorRandom guessing

all chromosome 100 nearest genes 50 nearest genesThe subset causal genes are selected from

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

AUC50score

causalgeneprediction

Causal gene prediction when genomic location is approximately known

Cause optimizedEffect optimized

b)

SWI4 neighborhood in MAPK (max path length 2) STE20 neighborhood in MAPK (max path length 2)

Perturbation cause predictiona) A significant fraction of the measured effects can be predicted, from among all genes in the yeast network. Surprisingly, the network simulator is a much better predictor of effects, compared to replicate perturbations (i.e perturbations of the same gene obtained from different datasets) being used to predict the effects of each other.

b) Similarly, a predictor based just on high quality network topology (shortest paths) also performs worse compared to the network simulation. That the network topology still plays a significant role is however shown by the low performance of the permuted network simulation, for which the genes were permuted acrossthe network.

Part of the non-predicted effects are due to indirect effects such as stress response. For instance, the mostoften affected gene in the knockout measurements is HSP30 (a stress response gene) . This is corroboratedby gene overexpression effects being easier to predict then gene knockout effects.

Perturbation effect predictiona) Based on observed effects, the method can predict the most likely causal genes. We find that regulatory causal genes are much easier to predict compared to non-regulatory genes, likely due to the fact that only regulatory effects are encoded in the network topology.

b) A common usage pattern for these type of predictions will be one in which causal genes have to be determined from among a certain subselection of genes (e.g. genes affected by mutations). In these cases, performance can become significantly higher. Also, in this plot we show the difference in performance for a network optimized for causal prediction, and a network optimized for effect prediction.

a)a)

Effect predictions in the KEGG MAPK pathway. Blue lines indicate pathway edges, and their thickness indicates regulatory influence (inferred by the algorithm). Arrow lines show predictions (i.e. ranking in top 250) that were validated (p-value < 0.001), with the color indicating the number of false positives ranking higher within the MAPK pathway (ranging from 0 to 8, 69% <= 3).

Note that the algorithm predicts feed-forward, feedback and even cross-talk effects accurately. This is possible by simulating the pathway not on its own, but within its complete cellular context.

Dense neighborhoodsWhile pathway maps often suggest relatively simple interaction topologies, in reality, the numberof interactions that have been measured among even a relatively limited number of genes is oftenrather large, complicating accurate perturbation effect predictions. This underlines the need for the use of multiple information sources to determine the regulatory activity of these interactions.

In the figures to the left, we show the neighborhood of two genes within the MAPK pathway. One is a transcription factor (SWI4), the other (STE20) is a kinase. The node color indicates if an effect has been observed in microarray studies, after perturbation of respectively SWI4 or STE20(green: p-value < 0.001, orange: p-value < 0.05). The thickness of node borders indicate if the network simulation predicts an pertubation effect or not.

Only the highest quality interactions are shown (STRING score > 900), with their predicted regulatory activity being indicated by their color.

Using 10-fold cross-validation, the algorithm was validated on unseen gene perturbation microarray experiments.

b)

K

T1

T2