Challenges in the Computational Modeling of Gene Regulation
Pat Langley
Institute for the Study of Learning and Expertise, Palo Alto, California
and
Center for the Study of Language and Information, Stanford University, Stanford, California
http://cll.stanford.edu/~langley
langley@csli.stanford.edu
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, A. Pohorille, J. Shrager, and H. Spencer.
Themes in Computational Biology
Three distinctive themes in computational biology have been:
manual development of knowledge bases about biological systems (e.g., E-Cell, EcoCyc, KEGG);
automated analysis of available genomic and expression data (e.g., clustering, inferring regulatory networks);
interactive tools for visualizing genomic and expression data (e.g., SpotFire).
However, each approach by itself is incomplete, and a complete solution must combine knowledge, data, and user interaction.
Knowledge, Data, and the Biologist
[Figure: diagram with boxes labeled Domain knowledge, Experimental data, Discovery, Updated model, and Biologist. The Domain knowledge and Updated model boxes each show a signed (+/-) regulatory network over Light, DFR, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB; × marks appear on the diagram.]
Domain knowledge: Initial model of gene regulatory processes, gene ontologies, and biological constraints
Experimental data: Observed expression levels from cDNA microarrays or from other sources
Updated model: Revised model of gene regulatory processes that explains observed expression data
Reasons for Studying Cyanobacteria
Cyanobacteria are 3.5 billion years old and created Earth’s early oxygen atmosphere.
Algae and Cyanobacteria produce most of the oxygen we breathe and fix most greenhouse carbon dioxide.
Thus, together they form the base of the marine ecosystem.
Collecting Data on Photosynthetic Processes
[Figure: experimental pipeline with stages labeled Continuous Culture (Chemostat), Equilibrium Period, Stress (e.g., High Light), Adaptation Period, Sampling mRNA/cDNA, and Microarray Trace, plus a plot of Health of Culture against Time. Image sources: /wwwscience.murdoch.edu.au/teach, www.affymetrix.com/]
A Biologist’s Depiction of Photosynthesis
[Figure: image from http://www.bio.ic.ac.uk/research/barber/photosystemII.html]
Challenge 1: Representing Biological Models
To assist biologists in their modeling efforts, we must first encode candidate models; however, most biological models are:
qualitative rather than quantitative;
abstract in that they ignore many details;
causal in that they describe chains of effects;
involve processes that involve biological mechanisms.
We need some formal way to represent such models that can be interpreted computationally.
Some Representations of Biological Knowledge
[Figure panels: taxonomies, differential equations, Boolean networks, Bayesian networks]
An Abstract Qualitative Causal Model
How do plants modify their photosynthetic apparatus in high light?
[Figure: signed (+/-) causal network over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB.]
This model is qualitative but relates continuous variables, much as do formalisms from qualitative physics (e.g., Forbus, 1984).
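One simple way to encode a qualitative causal model of this kind is as a mapping from directed links to signs. The sketch below is only an illustration of that idea: the variable names come from the slide, but the specific links and signs in `MODEL` are hypothetical placeholders, since the figure's edges are not recoverable from the text, and the helper names `parents` and `describe` are assumptions as well.

```python
# A qualitative causal model as a dict: (cause, effect) -> sign (+1 or -1).
# NOTE: these links and signs are hypothetical, chosen only for illustration.
MODEL = {
    ("Light", "dspA"): +1,
    ("dspA", "NBLR"): +1,
    ("NBLR", "NBLA"): +1,
    ("NBLA", "PBS"): -1,
    ("PBS", "Health"): -1,
}


def parents(model, node):
    """Return the direct causes of `node` together with the sign of each link."""
    return {src: sign for (src, dst), sign in model.items() if dst == node}


def describe(model):
    """Render every link as a short qualitative statement."""
    return [
        f"{src} {'increases' if sign > 0 else 'decreases'} {dst}"
        for (src, dst), sign in model.items()
    ]


if __name__ == "__main__":
    for line in describe(MODEL):
        print(line)
```

A dictionary keeps the representation close to the diagram: each entry is one arrow, and adding, deleting, or re-signing a link is a single update.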
Challenge 2: Making Predictions from Models
To evaluate a regulatory model, it must make predictions about quantitative measures of gene expression.
A qualitative model cannot predict numeric values but can predict:
that some partial correlations will be zero;
that some partial correlation products will be equal; and
the signs of correlations between variables.
These predictions assume that each variable is a linear function of its causal parents, as in Glymour et al.’s (1987) Tetrad.
Some models must also include statements that certain regulatory pathways dominate others.
Implications of Three Causal Models
[Figure: three causal models over X, Y, and Z, with implied partial correlations $\rho_{XZ.Y} = 0$, $\rho_{XZ.Y} \neq 0$, and $\rho_{XZ.Y} \neq 0$, respectively.]
Note that these implications do not depend on the effect’s sign.
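To make the vanishing-partial-correlation idea concrete, the following sketch generates synthetic linear data and computes the partial correlation of X and Z given Y with the standard first-order formula. The two structures used here (a chain X → Y → Z and a collider X → Y ← Z) are assumed examples and need not match the three models in the figure; the coefficients, sample size, and seed are arbitrary.

```python
import numpy as np


def partial_corr(x, z, y):
    """First-order partial correlation of x and z, controlling for y."""
    r_xz = np.corrcoef(x, z)[0, 1]
    r_xy = np.corrcoef(x, y)[0, 1]
    r_zy = np.corrcoef(z, y)[0, 1]
    return (r_xz - r_xy * r_zy) / np.sqrt((1 - r_xy**2) * (1 - r_zy**2))


rng = np.random.default_rng(0)
n = 20000

# Assumed chain X -> Y -> Z: Y screens X off from Z, so the partial
# correlation should be near zero whatever the signs of the two effects.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = -0.7 * y + rng.normal(size=n)
chain = partial_corr(x, z, y)

# Assumed collider X -> Y <- Z: conditioning on the common effect Y
# induces a dependence between X and Z, so the value is clearly nonzero.
x2 = rng.normal(size=n)
z2 = rng.normal(size=n)
y2 = 0.8 * x2 + 0.8 * z2 + rng.normal(size=n)
collider = partial_corr(x2, z2, y2)

if __name__ == "__main__":
    print(f"chain:    {chain:+.3f}")
    print(f"collider: {collider:+.3f}")
```

Flipping the sign of either coefficient in the chain leaves its partial correlation near zero, which matches the point that these implications depend on structure rather than on the sign of each effect.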
Challenge 3: Encoding Background Knowledge
To constrain candidate models, we must encode knowledge about biological entities and processes.
This background knowledge can take the form of:
an initial qualitative model of gene regulation;
genes that may be involved in the phenomena;
a taxonomy of these relevant genes; and
constraints on links between types of genes.
Analysis of biological data should take into account knowledge about the organism under study.
Some Constraints on Biological Models
We can start with an initial causal model proposed by biologists.
[Figure: signed (+/-) causal network over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with one candidate link crossed out (×).]
We can also forbid causal links between certain pairs of variables.
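Constraints of this sort can be encoded as a small lookup that a revision procedure consults before proposing a link. The sketch below is a hedged illustration only: the type labels in `GENE_TYPE`, the forbidden type pairs, and the forbidden variable pair are all hypothetical and are not taken from the slides.

```python
# Hypothetical taxonomy: variable -> type. Labels are illustrative assumptions.
GENE_TYPE = {
    "Light": "environment",
    "dspA": "sensor",
    "NBLR": "regulator",
    "NBLA": "regulator",
    "cpcB": "structural",
    "Health": "phenotype",
}

# Hypothetical type-level constraint: nothing inside the cell causes Light.
FORBIDDEN_TYPE_LINKS = {
    ("sensor", "environment"),
    ("regulator", "environment"),
    ("structural", "environment"),
    ("phenotype", "environment"),
}

# Hypothetical pair-level constraint: a specific link ruled out by the biologist.
FORBIDDEN_LINKS = {("Light", "Health")}


def link_allowed(src, dst):
    """Return True if a causal link src -> dst passes every constraint."""
    if src == dst or (src, dst) in FORBIDDEN_LINKS:
        return False
    return (GENE_TYPE[src], GENE_TYPE[dst]) not in FORBIDDEN_TYPE_LINKS


if __name__ == "__main__":
    print(link_allowed("Light", "dspA"))    # allowed in this toy setup
    print(link_allowed("NBLR", "Light"))    # blocked by the type constraint
    print(link_allowed("Light", "Health"))  # blocked by the pair constraint
```

Keeping type-level and pair-level constraints separate lets a taxonomy rule out whole families of links while still allowing individual exceptions to be listed explicitly.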
Challenge 4: Revising Models Given Expression Data
To revise a regulatory model, we must develop an algorithm that searches through the space of models.
This requires us to make design decisions about:
the initial state from which to start search;
the operators that generate new states;
the evaluation function that selects among states;
the overall control regime for the search; and
the halting criterion for ending the search.
We have implemented a two-stage method to search the space of qualitative causal models of gene regulation.
Stage 1: Determining Model Structure
Our system carries out heuristic search through the space of causal model structures.
Initial state: A preliminary model proposed by a biologist.
Operators: Add a new link (constrained by variable types); delete an existing link.
Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad.
Control: Greedy search to select best structure on each round.
Halting: Stop when there is no further improvement in the evaluation metric.
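The five design choices on this slide map onto a short hill-climbing loop. The sketch below shows that loop under stated assumptions: the function names, the toy variables, and especially `toy_score` are hypothetical. The toy score simply counts disagreements with a known target and stands in for the partial-correlation agreement metric, which is not reproduced here.

```python
from itertools import permutations


def neighbors(links, variables, allowed):
    """Operators: delete one existing link, or add one allowed new link."""
    for link in links:
        yield links - {link}
    for link in permutations(variables, 2):
        if link not in links and allowed(link):
            yield links | {link}


def greedy_revise(initial, variables, score, allowed=lambda link: True):
    """Greedy control: take the best one-step revision until none improves."""
    current = frozenset(initial)
    current_score = score(current)
    while True:
        candidates = list(neighbors(current, variables, allowed))
        if not candidates:
            return current
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:  # halting: no further improvement
            return current
        current, current_score = best, best_score


# Toy usage with hypothetical variables and a stand-in evaluation function.
variables = ["Light", "A", "B", "C"]
target = frozenset({("Light", "A"), ("A", "C")})


def toy_score(model):
    """Stand-in metric: fewer disagreements with the target is better."""
    return -len(model ^ target)


def no_links_into_light(link):
    """Example constraint: nothing may cause the environmental variable."""
    return link[1] != "Light"


revised = greedy_revise(
    {("Light", "A"), ("A", "B")}, variables, toy_score, no_links_into_light
)

if __name__ == "__main__":
    print(sorted(revised))
```

Passing the evaluation function and the link constraint as arguments keeps the search loop independent of how models are scored, so a data-driven metric could be swapped in without changing the control regime.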
Stage 2: Adding Signs to the Model
Our system carries out a second search through the space of signed qualitative models.
Initial state: The unsigned model structure generated in Stage 1.
Operators: Associate a sign (+ or –) with a given link; label some pathways as dominant over others.
Evaluation: Agreement with the signs of correlations computed from the data.
Control: Exhaustive search for small models; greedy search for more complex models.
Halting: Stop when each link has an associated sign.
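For a small model, the exhaustive variant of this second search can be sketched as follows. This is a simplified illustration, not the system's actual procedure: it assumes an acyclic structure, predicts a correlation sign only between a variable and its descendants (as the product of link signs along a directed path), treats conflicting paths as making no prediction instead of modeling pathway dominance, and uses hypothetical variable names.

```python
from itertools import product


def path_signs(signed, src, dst):
    """Signs of all directed paths from src to dst (assumes an acyclic graph)."""
    if src == dst:
        return [1]
    signs = []
    for (a, b), s in signed.items():
        if a == src:
            signs += [s * t for t in path_signs(signed, b, dst)]
    return signs


def predicted_sign(signed, src, dst):
    """+1 or -1 if every path agrees; 0 if there is no path or paths conflict."""
    signs = set(path_signs(signed, src, dst))
    return signs.pop() if len(signs) == 1 else 0


def assign_signs(links, observed):
    """Try every +/- labeling; keep the one matching the most observed signs."""
    best, best_score = None, -1
    for combo in product((1, -1), repeat=len(links)):
        signed = dict(zip(links, combo))
        score = sum(
            predicted_sign(signed, a, b) == s for (a, b), s in observed.items()
        )
        if score > best_score:
            best, best_score = signed, score
    return best, best_score


# Toy usage: hypothetical structure and observed correlation signs.
links = [("Light", "A"), ("A", "B")]
observed = {("Light", "A"): 1, ("A", "B"): -1, ("Light", "B"): -1}
best_model, agreement = assign_signs(links, observed)

if __name__ == "__main__":
    print(best_model, agreement)
```

The enumeration grows as 2 to the number of links, which is why an exhaustive pass is only practical for small models and a greedy alternative is needed for larger ones.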
Expression Data on Photosynthetic Regulation
[Figure: line plot with series NBLR, NBLA, cpcB, psbA2, psbA1, dspA, and PBS; vertical axis 0 to 4, horizontal axis 1 to 6.]
Initial study produced four replications at each of five time steps.
A Revised Model of Photosynthesis Regulation
[Figure: revised signed (+/-) causal network over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with × marks on the diagram.]
Changes to the model improve its match to the expression data.
Similar changes adapt the model to expression data from mutants.
Challenge 5: Dealing with Small Data Sets
Microarray technology provides many measurements but it often gives very few samples.
To reduce variance and avoid overfitting these data, our method:
starts from an initial model rather than from scratch;
incorporates biological constraints on model revisions;
uses bootstrap sampling to generate 20 data sets, then runs the revision method 20 times and retains only changes that occur in at least 75% of the runs.
Experimental studies suggest that these strategies reduce variance and produce more robust models.
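The bootstrap filter described on this slide can be sketched in a few lines. The numbers 20 and 75% come from the slide; everything else is an assumption for illustration, including the function name `robust_changes`, the representation of a change as a tuple, and the `toy_revise` stub, which ignores its data and stands in for a real revision method.

```python
import random
from collections import Counter


def robust_changes(samples, revise, n_boot=20, threshold=0.75, seed=0):
    """Run `revise` on bootstrap resamples; keep changes seen often enough."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample with replacement, same size as the original data set.
        resample = [rng.choice(samples) for _ in samples]
        counts.update(set(revise(resample)))
    return {change for change, c in counts.items() if c / n_boot >= threshold}


# Stub revision method (hypothetical): proposes one change on every run
# and a second change on every other run.
_calls = {"n": 0}


def toy_revise(data):
    _calls["n"] += 1
    changes = {("add", "A", "B")}
    if _calls["n"] % 2 == 0:
        changes.add(("delete", "C", "D"))
    return changes


robust = robust_changes(list(range(10)), toy_revise)

if __name__ == "__main__":
    print(robust)
```

In this toy run the change proposed in all 20 runs survives, while the one proposed in half of them falls below the 75% threshold and is dropped.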
Experimental Studies with Synthetic Data
To evaluate our revision method, we used a target model to create synthetic data and systematically varied the initial model’s distance from that target.
[Figure: plot of Structure Changes (0 to 3) against Model Errors (0, 1, 2, 4, 6), with series Corrections and Errors.]
The number of incorrect revisions seems unaffected by distance.
Challenge 5: Dealing with Temporal Phenomena
Many biological processes occur over extended periods of time; to deal with such phenomena, we need methods that:
represent biological models with time-delayed effects;
utilize these time-delayed models to make predictions;
evaluate alternative models in terms of their fit to data;
carry out search through the space of alternative models.
We have extended our framework to handle qualitative causal models with time delays and we have done initial evaluations.
A Regulatory Model with Time Delays
We can handle temporal phenomena by adding time delays to links.
[Figure: causal network over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with each link labeled by a time delay (values shown: 13, 7, 20, 3, 17, 17, 17, 8, 9, 6, 16, 6, 15).]
This model predicts the system’s qualitative behavior over time.
Synthetic Data from Time-Delay Model
[Figure: time series for Light, NBLA, and Health; horizontal axis 0 to 100, vertical axis 5 to 30.]
A Method for Revising Time-Delay Models
Generalize correlation and partial correlation to the frequency domain.
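As a loosely related illustration, and not the method on this slide, the sketch below estimates the delay and sign of a single link by computing a cross-correlation through the FFT and picking the lag with the largest magnitude. The function name `estimate_delay`, the synthetic series, the delay of 7 samples, and the noise level are all assumptions made for the example.

```python
import numpy as np


def estimate_delay(cause, effect, max_lag):
    """Lag (in samples) and sign at which |cross-correlation| peaks."""
    c = cause - cause.mean()
    e = effect - effect.mean()
    size = 2 * len(c)  # zero-pad so the circular correlation acts as a linear one
    spectrum = np.fft.rfft(e, size) * np.conj(np.fft.rfft(c, size))
    xcorr = np.fft.irfft(spectrum, size)[: max_lag + 1]  # lags 0..max_lag
    lag = int(np.argmax(np.abs(xcorr)))
    return lag, float(np.sign(xcorr[lag]))


# Synthetic check: the effect is a negated copy of the cause, delayed by 7 samples.
rng = np.random.default_rng(1)
n = 500
cause = rng.normal(size=n)
effect = np.zeros(n)
effect[7:] = -0.9 * cause[:-7]
effect += 0.1 * rng.normal(size=n)

lag, sign = estimate_delay(cause, effect, max_lag=30)

if __name__ == "__main__":
    print(lag, sign)
```

The delay can only be resolved to the nearest sample, so the sampling interval sets the finest delay such an estimate can recover.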
A Reconstructed Model with Time Delays
Our method reconstructs most of this model from synthetic data.
[Figure: reconstructed causal network over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with link delays (values shown: 13, 7, 20, 3, 17, 17, 17, 8, 9, 6, 16, 6) and one × mark.]
Determining the link delays from time series seems tractable, but this requires a high sampling rate.
Challenge 6: Interfacing with Biologists
We are developing an environment that lets its biologist users:
specify qualitative causal models of biological systems;
display and edit a model’s structure and details graphically;
incorporate knowledge and results from previous studies;
evaluate the evidence in favor of specific hypotheses;
propose revisions to the model in response to observations.
The environment will offer computational assistance in forming and evaluating models but let the biologist retain control.
An Interactive Environment for Biological Modeling
Additional Work on Biological Modeling
Our ongoing research on biological model revision has involved:
developing other approaches to revising regulatory models, including Bayesian scoring and neural networks;
introducing taxonomic knowledge about genes and biological processes to constrain the search process; and
expanding the modeling formalism to represent biological mechanisms in addition to abstract processes.
Thus, we continue to explore ways to combine knowledge with data to aid the creation of biological models.
Additional Models and Data
We are also applying our biological modeling framework to:
naturalistic data on photosynthesis regulation in Cyanobacteria in a setting that mimics the day/night cycle;
testing if certain genes are targets of unobserved transcription factors, using time-series data on the yeast cell cycle;
testing whether the transcription factor c-Jun is activated by anything other than Jnk2, using data on healthy lung tissue.
These efforts should further test the robustness of our approach and provide evidence of its generality.
Intellectual Influences
Our approach to computational biological discovery borrows ideas from many traditions:
qualitative physics and simulation (e.g., Forbus, 1984);
linear causal models and their inference (Glymour et al., 1987);
computational scientific discovery (e.g., Langley et al., 1987);
theory revision in machine learning (e.g., Towell, 1991);
interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, knowledge representation, and human-computer interaction.
Contributions of the Research
In summary, our work on computational biological modeling and discovery responds to six major challenges:
representing biological models that are qualitative and abstract;
making testable predictions from such qualitative causal models;
encoding knowledge about biological entities and processes;
utilizing knowledge and data to revise initial process models;
making revision methods robust despite small amounts of data;
developing interactive tools that let biologists remain in control.
Taken together, our six responses constitute a novel and promising approach to elucidating biological models.
Revising Qualitative Models of Gene Regulation
Pat Langley
Jeff Shrager
Institute for the Study of Learning and Expertise, Palo Alto, California
and
Andrew Pohorille
Center for Computational Astrobiology, NASA Ames Research Center, Moffett Field, California
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, and H. Spencer.
Greedy Search Through a Space of Models
Initial model
Revision 1.1 Revision 1.2 Revision 1.3 Revision 1.4
Revision 2.1 Revision 2.2 Revision 2.3 Revision 2.4
Revision 3.1 Revision 3.2 Revision 3.3 Revision 3.4