
Challenges in the Computational Modeling of Gene Regulation

Pat Langley
Institute for the Study of Learning and Expertise
Palo Alto, California
and
Center for the Study of Language and Information
Stanford University, Stanford, California

http://cll.stanford.edu/~langley
[email protected]

Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, A. Pohorille, J. Shrager, and H. Spencer.

Themes in Computational Biology

Three distinctive themes in computational biology have been:

  • manual development of knowledge bases about biological systems (e.g., E-Cell, EcoCyc, KEGG);
  • automated analysis of available genomic and expression data (e.g., clustering, inferring regulatory networks);
  • interactive tools for visualizing genomic and expression data (e.g., SpotFire).

However, each approach by itself is incomplete, and a complete solution must combine knowledge, data, and user interaction.

Knowledge, Data, and the Biologist

[Figure: a Discovery box connected to Domain knowledge, Experimental data, an Updated model, and the Biologist. The example model links Light, DFR, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB through connections labeled + or -; × marks appear on links in the updated model.]

  • Domain knowledge: initial model of gene regulatory processes, gene ontologies, and biological constraints.
  • Experimental data: observed expression levels from cDNA microarrays or from other sources.
  • Updated model: revised model of gene regulatory processes that explains the observed expression data.

Reasons for Studying Cyanobacteria

  • Cyanobacteria are 3.5 billion years old and created Earth’s early oxygen atmosphere.
  • Algae and Cyanobacteria produce most of the oxygen we breathe and fix most greenhouse carbon dioxide.

Thus, together they form the base of the marine ecosystem.

Collecting Data on Photosynthetic Processes

[Figure: experimental setup and plot. Labels: Continuous Culture (Chemostat); Equilibrium Period; Stress (e.g., High Light); Adaptation Period; Sampling mRNA/cDNA; Microarray Trace. Plot axes: Health of Culture versus Time. Image sources: /wwwscience.murdoch.edu.au/teach and www.affymetrix.com/]

A Biologist’s Depiction of Photosynthesis

[Figure: diagram of photosystem II. Source: http://www.bio.ic.ac.uk/research/barber/photosystemII.html]

Challenge 1: Representing Biological Models

To assist biologists in their modeling efforts, we must first encode candidate models; however, most biological models are:

  • qualitative rather than quantitative;
  • abstract in that they ignore many details;
  • causal in that they describe chains of effects;
  • built from processes that involve biological mechanisms.

We need some formal way to represent such models that can be interpreted computationally.

Some Representations of Biological Knowledge

  • taxonomies
  • differential equations
  • Boolean networks
  • Bayesian networks

An Abstract Qualitative Causal Model

How do plants modify their photosynthetic apparatus in high light?

[Figure: causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with each link labeled + or -.]

This model is qualitative but relates continuous variables, much as formalisms from qualitative physics do (e.g., Forbus, 1984).
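One way to make such a model computationally interpretable is to store it as a list of signed links. The following is a minimal sketch, not the authors' implementation: the class names are invented for illustration, and the links listed are hypothetical placeholders, since the actual connections in the diagram cannot be recovered from the slide text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Link:
    """A directed causal link; sign is +1 (increases) or -1 (decreases)."""
    cause: str
    effect: str
    sign: int


@dataclass
class QualitativeModel:
    """A qualitative causal model stored as a list of signed links."""
    links: List[Link] = field(default_factory=list)

    def parents(self, variable: str) -> List[Link]:
        """Return the links that point into the given variable."""
        return [link for link in self.links if link.effect == variable]

    def variables(self) -> List[str]:
        """Return every variable mentioned by some link, sorted by name."""
        names = {link.cause for link in self.links}
        names |= {link.effect for link in self.links}
        return sorted(names)


# Hypothetical links for illustration only; not read off the slide's diagram.
model = QualitativeModel([
    Link("Light", "dspA", +1),
    Link("dspA", "NBLR", +1),
    Link("NBLR", "NBLA", +1),
    Link("NBLA", "PBS", -1),
])
```

Here `model.parents("PBS")` returns the single link from NBLA, which is the information a prediction or revision step would need about each variable.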

Challenge 2: Making Predictions from Models

To evaluate a regulatory model, it must make predictions about quantitative measures of gene expression.

A qualitative model cannot predict numeric values but can predict:

  • that some partial correlations will be zero;
  • that some partial correlation products will be equal; and
  • the signs of correlations between variables.

These predictions assume that each variable is a linear function of its causal parents, as in Glymour et al.’s (1987) Tetrad.

Some models must also include statements that certain regulatory pathways dominate others.
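The sign predictions can be illustrated with a short sketch. It assumes, as a simplification not stated on the slide, that the predicted sign of one variable's effect on a downstream variable is the product of the link signs along a directed path. When two paths disagree, the sketch returns `None`, which is the situation in which a dominance statement would be needed. The function names and the three-variable example are hypothetical.

```python
def path_signs(links, source, target):
    """Collect the sign product of every directed path from source to target.

    links is a list of (cause, effect, sign) triples with sign in {+1, -1}.
    """
    signs = []

    def walk(node, sign, visited):
        if node == target:
            signs.append(sign)
            return
        for cause, effect, link_sign in links:
            if cause == node and effect not in visited:
                walk(effect, sign * link_sign, visited | {effect})

    walk(source, 1, {source})
    return signs


def predicted_sign(links, source, target):
    """Return +1 or -1 if all paths agree; None if no path or paths conflict."""
    signs = set(path_signs(links, source, target))
    if len(signs) == 1:
        return signs.pop()
    return None


# Hypothetical example: a direct positive path and an indirect negative one.
links = [("Light", "A", +1), ("A", "B", -1), ("Light", "B", +1)]
direct = predicted_sign(links, "Light", "A")     # single path
conflict = predicted_sign(links, "Light", "B")   # two paths that disagree
```

In this example the effect of Light on B is ambiguous until one pathway is declared dominant.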

Implications of Three Causal Models

[Figure: three causal models over the variables X, Y, and Z; the arrow directions are not recoverable from the extracted text. The implied partial correlations are:]

  • Model 1: $\rho_{XZ.Y} = 0$
  • Model 2: $\rho_{XZ.Y} \neq 0$
  • Model 3: $\rho_{XZ.Y} \neq 0$

Note that these implications do not depend on the effect’s sign.
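A small simulation can make these implications concrete. The sketch below is an illustration under stated assumptions: the two structures used (a chain X -> Y -> Z and a collider X -> Y <- Z) are standard textbook cases chosen here, not necessarily the three models on the slide, and the coefficients and sample size are arbitrary. It uses the usual first-order partial correlation formula on linear Gaussian data.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, z, y):
    """Partial correlation of x and z, controlling for y."""
    r_xz, r_xy, r_yz = pearson(x, z), pearson(x, y), pearson(y, z)
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)
n = 20000

# Chain X -> Y -> Z with one negative link: X and Z are independent given Y.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 1) for xi in x]
z = [-0.7 * yi + rng.gauss(0, 1) for yi in y]
chain = partial_corr(x, z, y)

# Collider X -> Y <- Z: X and Z become dependent once Y is controlled for.
x2 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [a + b + rng.gauss(0, 1) for a, b in zip(x2, z2)]
collider = partial_corr(x2, z2, y2)
```

The chain's partial correlation comes out near zero even though one of its links is negative, which matches the note that these implications do not depend on the sign of the effect.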

Challenge 3: Encoding Background Knowledge

Analysis of biological data should take into account knowledge about the organism under study.

To constrain candidate models, we must encode knowledge about biological entities and processes.

This background knowledge can take the form of:

  • an initial qualitative model of gene regulation;
  • genes that may be involved in the phenomena;
  • a taxonomy of these relevant genes; and
  • constraints on links between types of genes.

Some Constraints on Biological Models

We can start with an initial causal model proposed by biologists.

We can also forbid causal links between certain pairs of variables.

[Figure: the causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links labeled + or - and × marks indicating forbidden links.]
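Constraints of this kind can be encoded as a table of forbidden type pairs. The sketch below is hypothetical throughout: the three-way taxonomy, the assignment of variables to types, and the forbidden pairs are all invented for illustration and are not taken from the talk.

```python
# Hypothetical taxonomy: these type assignments are illustrative only.
VARIABLE_TYPE = {
    "Light": "environment",
    "dspA": "gene",
    "NBLR": "gene",
    "NBLA": "gene",
    "Health": "phenotype",
}

# Hypothetical constraints: (cause type, effect type) pairs that are ruled out.
FORBIDDEN = {
    ("gene", "environment"),
    ("phenotype", "environment"),
    ("phenotype", "gene"),
}


def link_allowed(cause, effect):
    """Return True if a link from cause to effect violates no constraint."""
    if cause == effect:
        return False
    return (VARIABLE_TYPE[cause], VARIABLE_TYPE[effect]) not in FORBIDDEN


def candidate_links(variables):
    """Enumerate every ordered pair of variables that may be linked."""
    return [(a, b) for a in variables for b in variables if link_allowed(a, b)]


allowed = candidate_links(list(VARIABLE_TYPE))
```

A revision method would then draw its "add a link" proposals only from `allowed`, which shrinks the search space before any data are examined.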

Challenge 4: Revising Models Given Expression Data

To revise a regulatory model, we must develop an algorithm that searches through the space of models.

This requires us to make design decisions about:

  • the initial state from which to start search;
  • the operators that generate new states;
  • the evaluation function that selects among states;
  • the overall control regime for the search; and
  • the halting criterion for ending the search.

We have implemented a two-stage method to search the space of qualitative causal models of gene regulation.

Stage 1: Determining Model Structure

Our system carries out heuristic search through the space of causal model structures.

  • Initial state: A preliminary model proposed by a biologist.
  • Operators: Add a new link (constrained by variable types); delete an existing link.
  • Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad.
  • Control: Greedy search to select the best structure on each round.
  • Halting: Stop when there is no further improvement in the evaluation metric.
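The control regime of Stage 1 can be sketched as a generic hill climber over sets of links. This is a sketch under stated assumptions rather than the system itself: the function names are invented, and the toy `score` below simply counts disagreements with a known target structure, standing in for the partial-correlation agreement measure that the slide describes.

```python
from itertools import permutations


def neighbors(model, variables, allowed):
    """Yield every model reachable by adding or deleting one link."""
    for link in permutations(variables, 2):
        if link in model:
            yield model - {link}          # operator: delete an existing link
        elif allowed(*link):
            yield model | {link}          # operator: add a permitted link


def greedy_revise(initial, variables, score, allowed=lambda a, b: True):
    """Hill-climb from the initial model until no neighbor scores higher."""
    current = frozenset(initial)
    best = score(current)
    while True:
        scored = [(score(m), m) for m in neighbors(current, variables, allowed)]
        if not scored:
            return current
        top_score, top_model = max(scored, key=lambda pair: pair[0])
        if top_score <= best:             # halting: no further improvement
            return current
        current, best = top_model, top_score


# Toy evaluation (an assumption for the demo): distance to a known target.
target = {("X", "Y"), ("Y", "Z")}
toy_score = lambda m: -len(m ^ target)

revised = greedy_revise({("X", "Z")}, ["X", "Y", "Z"], toy_score)
```

Starting from a model with one wrong link, the search deletes it and adds the two missing links, one change per round.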

Stage 2: Adding Signs to the Model

Our system carries out a second search through the space of signed qualitative models.

  • Initial state: The unsigned model structure generated in Stage 1.
  • Operators: Associate a sign (+ or –) with a given link; label some pathways as dominant over others.
  • Evaluation: Agreement with the signs of correlations computed from the data.
  • Control: Exhaustive search for small models; greedy search for more complex models.
  • Halting: Stop when each link has an associated sign.
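For a small model, the exhaustive variant of Stage 2 can be sketched by enumerating every sign assignment and counting how many observed correlation signs it reproduces. The helper names, the path-product rule for predicted signs, and the three-variable example are assumptions made for illustration; dominance labels are left out of this sketch.

```python
from itertools import product


def effect_sign(signed, source, target):
    """Sign product along paths from source to target; 0 if none or conflicting."""
    found = set()

    def walk(node, sign, seen):
        if node == target:
            found.add(sign)
            return
        for (cause, effect), link_sign in signed.items():
            if cause == node and effect not in seen:
                walk(effect, sign * link_sign, seen | {effect})

    walk(source, 1, {source})
    return found.pop() if len(found) == 1 else 0


def best_signs(links, observed):
    """Try every +/- assignment; keep the one matching the most observed signs."""
    best, best_score = None, -1
    for combo in product((+1, -1), repeat=len(links)):
        signed = dict(zip(links, combo))
        score = sum(effect_sign(signed, a, b) == s
                    for (a, b), s in observed.items())
        if score > best_score:
            best, best_score = signed, score
    return best, best_score


# Hypothetical unsigned structure and observed correlation signs.
links = [("Light", "A"), ("A", "B")]
observed = {("Light", "A"): +1, ("A", "B"): -1, ("Light", "B"): -1}
signs, agreement = best_signs(links, observed)
```

The enumeration grows as 2^n in the number of links, which is why a greedy control regime is preferable for more complex models.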

Expression Data on Photosynthetic Regulation

[Figure: plot of expression levels for NBLR, NBLA, cpcB, psbA2, psbA1, dspA, and PBS; horizontal axis 1 to 6, vertical axis 0 to 4.]

The initial study produced four replications at each of five time steps.

A Revised Model of Photosynthesis Regulation

Changes to the model improve its match to the expression data.

[Figure: revised causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links labeled + or - and × marks on two links.]

Similar changes adapt the model to expression data from mutants.

Challenge 5: Dealing with Small Data Sets

Microarray technology provides many measurements, but it often gives very few samples.

To reduce variance and avoid overfitting these data, our method:

  • starts from an initial model rather than from scratch;
  • incorporates biological constraints on model revisions;
  • uses bootstrap sampling to generate 20 data sets, then runs the revision method 20 times and retains only changes that occur in at least 75% of the runs.

Experimental studies suggest that these strategies reduce variance and produce more robust models.
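The bootstrap filter described above can be sketched in a few lines. The 20 resamples and the 75% threshold come from the slide; everything else is an assumption, in particular the `revise` callback, which stands in for the revision method and is expected to return the set of changes it proposes for one data set.

```python
import random
from collections import Counter


def robust_changes(samples, revise, n_boot=20, threshold=0.75, seed=0):
    """Keep only the revisions proposed in at least `threshold` of the runs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Bootstrap: draw a data set of the same size, with replacement.
        resample = [rng.choice(samples) for _ in samples]
        counts.update(set(revise(resample)))
    return {change for change, k in counts.items() if k / n_boot >= threshold}


# Toy stand-in for the revision method (hypothetical): one change is always
# proposed, another only when the resample happens to start with sample 0.
def toy_revise(data):
    changes = {"add X->Y"}
    if data[0] == 0:
        changes.add("delete Y->Z")
    return changes


samples = list(range(20))
kept = robust_changes(samples, toy_revise)
```

The change that depends on an accident of one resample is filtered out, while the change that is proposed on every run is retained.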

Experimental Studies with Synthetic Data

To evaluate our revision method, we used a target model to create synthetic data and systematically varied distance from that model.

[Figure: Structure Changes (vertical axis, 0 to 3) versus Model Errors (horizontal axis: 0, 1, 2, 4, 6), with two series, Corrections and Errors.]

The number of incorrect revisions seems unaffected by distance.

Challenge 5: Dealing with Temporal Phenomena

Many biological processes occur over extended periods of time; to deal with such phenomena, we need methods that:

  • represent biological models with time-delayed effects;
  • utilize these time-delayed models to make predictions;
  • evaluate alternative models in terms of their fit to data;
  • carry out search through the space of alternative models.

We have extended our framework to handle qualitative causal models with time delays and we have done initial evaluations.

A Regulatory Model with Time Delays

We can handle temporal phenomena by adding time delays to links.

[Figure: the causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with each link labeled by a time delay; values shown: 13, 7, 20, 3, 17, 17, 17, 8, 9, 6, 16, 6, 15.]

This model predicts the system’s qualitative behavior over time.
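To show how delays on links translate into behavior over time, here is a minimal simulation sketch. It assumes a deliberately simple update rule that is not given in the talk: each variable at time t is the signed sum of its parents' values at time t minus the link delay. The two-link model, its delays, and the step input are hypothetical.

```python
def simulate(links, inputs, steps):
    """Simulate a signed time-delay model for a number of discrete steps.

    links:  {(cause, effect): (sign, delay)} with every delay >= 1
    inputs: {variable: function of t} for externally driven variables
    """
    assert all(delay >= 1 for _, delay in links.values())
    names = set(inputs)
    for cause, effect in links:
        names |= {cause, effect}
    series = {name: [0.0] * steps for name in names}

    for t in range(steps):
        for name, drive in inputs.items():
            series[name][t] = drive(t)
        for name in names - set(inputs):
            # Each parent contributes its past value, shifted by the delay.
            series[name][t] = sum(
                sign * series[cause][t - delay]
                for (cause, effect), (sign, delay) in links.items()
                if effect == name and t - delay >= 0
            )
    return series


# Hypothetical model: Light -(+, delay 6)-> A -(-, delay 7)-> B.
links = {("Light", "A"): (+1, 6), ("A", "B"): (-1, 7)}
step_input = {"Light": lambda t: 1.0 if t >= 10 else 0.0}
series = simulate(links, step_input, steps=40)
```

With Light switching on at t = 10, A responds at t = 16 and B drops at t = 23, so the delays accumulate along the pathway.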

Synthetic Data from Time-Delay Model

[Figure: time series for Light, NBLA, and Health; horizontal axis 0 to 100, vertical axis 5 to 30.]

A Method for Revising Time-Delay Models

Generalize correlation and partial correlation to the frequency domain.

A Reconstructed Model with Time Delays

Our method reconstructs most of this model from synthetic data.

[Figure: the reconstructed causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links labeled by time delays; values shown: 13, 7, 20, 3, 17, 17, 17, 8, 9, 6, 16, 6; one link is marked ×.]

Determining the link delays from time series seems tractable, but this requires a high sampling rate.
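The dependence on sampling rate can be illustrated with a simple lagged-correlation search. Note that this is a swapped-in, time-domain technique chosen for brevity, not the frequency-domain generalization mentioned in the talk. The signal, the true delay of 7, and the subsampling factor are all assumptions for the demo.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def lag_score(cause, effect, lag):
    """Absolute correlation between cause[t] and effect[t + lag]."""
    return abs(pearson(cause[:len(cause) - lag], effect[lag:]))


def best_lag(cause, effect, max_lag):
    """Return the lag in 1..max_lag with the strongest lagged correlation."""
    return max(range(1, max_lag + 1), key=lambda lag: lag_score(cause, effect, lag))


rng = random.Random(1)
true_delay = 7
x = [rng.gauss(0, 1) for _ in range(500)]
y = [(x[t - true_delay] if t >= true_delay else 0.0) + 0.1 * rng.gauss(0, 1)
     for t in range(500)]

estimated = best_lag(x, y, max_lag=20)

# Sampling every 5th point: a delay of 7 no longer falls on a sampled step.
x_coarse, y_coarse = x[::5], y[::5]
coarse_peak = max(lag_score(x_coarse, y_coarse, lag) for lag in range(1, 5))
```

At the full sampling rate the delay is recovered, whereas after subsampling no lag shows a strong correlation for this rapidly varying signal.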

Challenge 6: Interfacing with Biologists

We are developing an environment that lets its biologist users:

  • specify qualitative causal models of biological systems;
  • display and edit a model’s structure and details graphically;
  • incorporate knowledge and results from previous studies;
  • evaluate the evidence in favor of specific hypotheses;
  • propose revisions to the model in response to observations.

The environment will offer computational assistance in forming and evaluating models but let the biologist retain control.

An Interactive Environment for Biological Modeling

Additional Work on Biological Modeling

Our ongoing research on biological model revision has involved:

  • developing other approaches to revising regulatory models, including Bayesian scoring and neural networks;
  • introducing taxonomic knowledge about genes and biological processes to constrain the search process; and
  • expanding the modeling formalism to represent biological mechanisms in addition to abstract processes.

Thus, we continue to explore ways to combine knowledge with data to aid the creation of biological models.

Additional Models and Data

We are also applying our biological modeling framework to:

  • naturalistic data on photosynthesis regulation in Cyanobacteria in a setting that mimics the day/night cycle;
  • testing if certain genes are targets of unobserved transcription factors, using time-series data on the yeast cell cycle;
  • testing whether the transcription factor c-Jun is activated by anything other than Jnk2, using data on healthy lung tissue.

These efforts should further test the robustness of our approach and provide evidence of its generality.

Intellectual Influences

Our approach to computational biological discovery borrows ideas from many traditions:

  • qualitative physics and simulation (e.g., Forbus, 1984);
  • linear causal models and their inference (Glymour et al., 1987);
  • computational scientific discovery (e.g., Langley et al., 1987);
  • theory revision in machine learning (e.g., Towell, 1991);
  • interactive tools for data analysis (e.g., Shneiderman, 2001).

Our work combines, in novel ways, insights from machine learning, knowledge representation, and human-computer interaction.

Contributions of the Research

In summary, our work on computational biological modeling and discovery responds to six major challenges:

  • representing biological models that are qualitative and abstract;
  • making testable predictions from such qualitative causal models;
  • encoding knowledge about biological entities and processes;
  • utilizing knowledge and data to revise initial process models;
  • making revision methods robust despite small amounts of data;
  • developing interactive tools that let biologists remain in control.

Taken together, our six responses constitute a novel and promising approach to elucidating biological models.

Revising Qualitative Models of Gene Regulation

Pat Langley
Jeff Shrager
Institute for the Study of Learning and Expertise
Palo Alto, California
and
Andrew Pohorille
Center for Computational Astrobiology
NASA Ames Research Center
Moffett Field, California

Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, and H. Spencer.

Greedy Search Through a Space of Models

[Figure: search tree. The initial model branches into Revisions 1.1 to 1.4, followed by a second level of Revisions 2.1 to 2.4 and a third level of Revisions 3.1 to 3.4.]

Synthetic Data from Time-Delay Model

[Figure: time series for Light, NBLA, and Health; horizontal axis 0 to 100, vertical axis -2 to 2.]