Page 1: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Pat Langley

Institute for the Study of Learning and Expertise, Palo Alto, California
and
Center for the Study of Language and Information, Stanford University, Stanford, California

http://www.isle.org/~langley

[email protected]

Knowledge and Data in Computational Biological Discovery

Thanks to V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, M. Schwabacher, J. Shrager, and A. Torregrosa.

Page 2: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Motivations for Computational Discovery

Humans strive to discover new knowledge from experience so that they can:

• better predict and control future events
• understand both previous and future events
• communicate that understanding to others

Computational techniques should let us automate and/or assist this discovery process.

Recent research on computational discovery has made progress on some of these issues but downplayed others.

Page 3: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Three Revolutions

The discovery process has been aided by three major advances:

• The scientific revolution (~1700) introduced formalisms to describe and explain natural phenomena.
• The heuristic search revolution (~1957) introduced computer algorithms to automate problem solving.
• The data revolution (~1995) introduced collection of large data repositories for many domains.

Different paradigms for computer-aided discovery focus on some developments more than others.

Page 4: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The Data Mining Paradigm

One paradigm, often known as data mining or KDD, can be best characterized as:

• emphasizing the availability of vast amounts of data;
• drawing on computational heuristic search to find regularities in these data;
• using formalisms like decision trees, association rules, and Bayes nets to describe those regularities.

Thus, most KDD researchers favor their own formalisms over those used by scientists and engineers.

As a result, their discoveries are seldom very communicable to members of those communities.

Page 5: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The Scientific Discovery Paradigm

A second paradigm, computational scientific discovery, can be characterized as:

• drawing on computational heuristic search to find regularities in scientific data, either historical or novel;
• using formalisms like numeric laws, structural models, and reaction pathways to describe regularities.

Thus, researchers in this framework favor representations used by scientists and engineers.

As a result, their system’s discoveries are more communicable to members of those communities.

Page 6: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Time Line for Research on Computational Scientific Discovery

[Figure: timeline running from 1979 to 2000 that places discovery systems by year, with a legend distinguishing numeric laws, qualitative laws, structural models, and process models. Systems shown: Bacon.1–Bacon.5, Abacus, Coper, Fahrenheit, E*, Tetrad, IDSN, Hume, ARC, DST, GPN, LaGrange, SDS, SSF, RF5, LaGramge, Dalton, Stahl, RL, Progol, Gell-Mann, BR-3, Mendel, Pauli, Stahlp, Revolver, Dendral, AM, Glauber, NGlauber, IDSQ, Live, IE, Coast, Phineas, AbE, Kekada, Mechem, CDP, Astra, GPM, HR, BR-4.]

Page 7: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Successes of Computational Scientific Discovery

Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:

• stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
• qualitative chemical factors in mutagenesis (King et al., 1996)
• quantitative laws of metallic behavior (Sleeman et al., 1997)
• quantitative conjectures in graph theory (Fajtlowicz et al., 1988)
• temporal laws of ecological behavior (Todorovski et al., 2000)
• reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)

Each of these has led to publications in the refereed literature of the relevant scientific field.

Page 8: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Research Themes

We aim to extend previous approaches to computational discovery of communicable knowledge by:

• focusing on domains that involve temporal and spatial data
• generating explanations that involve hidden objects/variables
• drawing on domain knowledge to constrain the search process
• developing interactive discovery tools for use by scientists

Within these guidelines, we are open to any search algorithm that can produce such communicable knowledge.

As in earlier work, our notation for discovered knowledge will be the same as that used by experts in the domain.

Page 9: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Some Interesting Ecological Questions

What environmental variables determine the production of carbon and the generation of various gases?

What functional forms relate these predictive variables to the ones they influence?

How do extreme values of these variables affect behavior of the ecosystem?

Are the Earth ecosystem parameters constant or have values changed in recent years?

Page 10: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The Task of Ecological Modeling

Given: A model of Earth’s ecosystem (CASA) stated as difference equations that involve observable and hidden variables.

Given: Values of observable variables (rainfall, sunlight, NPP) as they change over both time and space.

Given: Inferred values for global parameters and intrinsic properties associated with discrete variables (e.g., ground cover).

Find: A revised ecosystem model with altered equations and/or parametric values that better fits the data.

[Figure: CASA diagram with nodes NPP, S_LEAF, S_ROOT, M_LEAF, M_ROOT, LEAF_MIC, SOIL_MIC, and MIN_N.]

Page 11: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The NPPc Portion of CASA

$\mathrm{NPPc} = \sum_{\mathrm{month}} \max(E \cdot \mathrm{IPAR},\ 0)$

$E = 0.56 \cdot T1 \cdot T2 \cdot W$

$T1 = 0.8 + 0.02 \cdot \mathrm{Topt} - 0.0005 \cdot \mathrm{Topt}^2$

$T2 = 1.18 / [(1 + e^{0.2 \cdot (\mathrm{Topt} - \mathrm{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\mathrm{Tempc} - \mathrm{Topt} - 10)})]$

$W = 0.5 + 0.5 \cdot \mathrm{EET} / \mathrm{PET}$

$\mathrm{PET} = 1.6 \cdot (10 \cdot \mathrm{Tempc} / \mathrm{AHI})^{A} \cdot \mathrm{PET\text{-}TW\text{-}M}$ if $\mathrm{Tempc} > 0$

$\mathrm{PET} = 0$ if $\mathrm{Tempc} < 0$

$A = 0.00000068 \cdot \mathrm{AHI}^3 - 0.000077 \cdot \mathrm{AHI}^2 + 0.018 \cdot \mathrm{AHI} + 0.49$

$\mathrm{IPAR} = 0.5 \cdot \mathrm{FPAR\text{-}FAS} \cdot \mathrm{Monthly\text{-}Solar} \cdot \mathrm{Sol\text{-}Conver}$

$\mathrm{FPAR\text{-}FAS} = \min[(\mathrm{SR\text{-}FAS} - 1.08) / \mathrm{SR}(\mathrm{UMD\text{-}VEG}),\ 0.95]$

$\mathrm{SR\text{-}FAS} = (\mathrm{Mon\text{-}FAS\text{-}NDVI} + 1000) / (\mathrm{Mon\text{-}FAS\text{-}NDVI} - 1000)$
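The efficiency part of these equations can be written out as a small Python sketch. It is only an illustration of the formulas on this slide, not the CASA implementation: the function names are hypothetical, IPAR, EET, and PET are taken as given inputs, and PET is assumed to be positive so that W is defined.

```python
import math

def t1(topt):
    # T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    # T2 = 1.18 / [(1 + e^(0.2 (Topt - Tempc - 10))) (1 + e^(0.3 (Tempc - Topt - 10)))]
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def water_stress(eet, pet):
    # W = 0.5 + 0.5 * EET / PET (assumes pet > 0)
    return 0.5 + 0.5 * eet / pet

def efficiency(topt, tempc, eet, pet):
    # E = 0.56 * T1 * T2 * W
    return 0.56 * t1(topt) * t2(topt, tempc) * water_stress(eet, pet)

def nppc(monthly_inputs):
    """Sum max(E * IPAR, 0) over months.

    Each item is a tuple (topt, tempc, eet, pet, ipar) for one month.
    """
    return sum(max(efficiency(topt, tempc, eet, pet) * ipar, 0.0)
               for topt, tempc, eet, pet, ipar in monthly_inputs)
```

Calling `nppc` with twelve monthly tuples for one grid cell gives that cell's yearly value under these assumptions.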

Page 12: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The NPPc Portion of CASA

[Figure: dependency diagram of the NPPc submodel with nodes NPPc, IPAR, E, e_max, T1, T2, W, PET, EET, Tempc, Topt, NDVI, SOLAR, AHI, A, PETTWM, SR, FPAR, and VEG.]

Page 13: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The RF6 Discovery Algorithm

Saito and Nakano (2000) describe RF6, a discovery system that:

1. Creates a multilayer neural network that links predictive with predicted variables using additive and product units.

2. Invokes the BPQ algorithm to search through the weight space defined by this network.

3. Transforms the resulting network into a polynomial equation of the form $y = \sum_i c_i \prod_j x_j^{d_{ij}}$.

They have shown this approach can discover an impressive class of numeric equations from noisy data.
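To make the target equation form concrete, the sketch below evaluates $y = \sum_i c_i \prod_j x_j^{d_{ij}}$ for given coefficients and exponents. Each product term is computed as the exponential of a weighted sum of logarithms, which is one common way to realize a product unit; whether RF6 does exactly this is an assumption here, and the function names are hypothetical. Inputs must be positive for the logarithm to be defined.

```python
import math

def product_unit(x, exponents):
    """One product term prod_j x_j ** d_j, computed as exp(sum_j d_j * ln x_j)."""
    return math.exp(sum(d * math.log(xj) for xj, d in zip(x, exponents)))

def polynomial_form(x, coeffs, exponent_rows):
    """Evaluate y = sum_i c_i * prod_j x_j ** d_ij for positive inputs x."""
    return sum(c * product_unit(x, row) for c, row in zip(coeffs, exponent_rows))

# Example: y = 2 * x1^2 * x2^-1 + 3 * x1^0.5, evaluated at x1 = 4, x2 = 2.
y = polynomial_form((4.0, 2.0), [2.0, 3.0], [[2.0, -1.0], [0.5, 0.0]])
```

Because the exponents are ordinary real-valued weights, a search over the weight space can adjust them continuously, and the learned weights can then be read back as the exponents and coefficients of an equation.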

Page 14: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Improving the NPPc Portion of CASA

This suggests an approach to revising the NPPc model to better fit the observed data:

1. Transform the NPPc model into a multilayer neural network that predicts the same behavior.

2. Identify portions of the NPPc model that are likely candidates for improvement.

3. Run the RF6 algorithm to revise those portions of the model (e.g., specified parameters or equations).

4. Transform the revised multilayer network back into numeric equations using the improved components.

Page 15: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Three Facets of Model Revision

We have adapted RF6 to revise an existing quantitative model in three distinct ways:

• Altering the value of parameters in a specified equation;
• Changing the associated values for an intrinsic property; and
• Replacing the equation for a term with another expression.

Rather than initializing weights randomly, the system starts with weights based on parameters in the original model.

We have applied this strategy to improve three different portions of the NPPc submodel.
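The idea of starting the search from the original model's parameters, rather than from random weights, can be illustrated with a deliberately tiny example. The sketch below fits a single scale parameter c in y = c · x by plain gradient descent on mean squared error, starting from 0.56 (the constant in the original equation for E). Plain gradient descent is a stand-in for the BPQ search, and the data are synthetic, generated with c = 0.5 purely for illustration.

```python
def fit_scale(xs, ys, start, rate=0.01, steps=500):
    """Gradient descent on mean squared error for y = c * x, starting from `start`."""
    c = start
    n = len(xs)
    for _ in range(steps):
        # d/dc of mean((c*x - y)^2) is mean(2 * (c*x - y) * x)
        grad = sum(2 * (c * x - y) * x for x, y in zip(xs, ys)) / n
        c -= rate * grad
    return c

# Synthetic data (an assumption for illustration, not CASA observations).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5 * x for x in xs]

# Start from the original model's parameter instead of a random value.
revised_c = fit_scale(xs, ys, start=0.56)
```

Starting near the original value keeps the revised parameter interpretable as a change to the existing model, since the search only moves as far as the data require.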

Page 16: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Altering Parameters in the NPPc Model

Initial model: $T2 = 1.18 / [(1 + e^{0.2 \cdot (\mathrm{Topt} - \mathrm{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\mathrm{Tempc} - \mathrm{Topt} - 10)})]$
Cross-validated RMSE = 467.910
Behavior: Gaussian-like function of temperature difference.

Revised model: $T2 = 1.80 / [(1 + e^{0.05 \cdot (\mathrm{Topt} - \mathrm{Tempc} - 10.8)}) \cdot (1 + e^{0.3 \cdot (\mathrm{Tempc} - \mathrm{Topt} - 90.33)})]$
Cross-validated RMSE = 461.466 [one percent reduction]
Behavior: nearly flat function in actual range of temperature difference.

Conclusion: The T2 temperature stress term contributes little to the overall predictive ability of the NPPc submodel.
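The two parameterizations of T2 can be tabulated side by side with a short sketch. The parameter values come from the slide; the function names and the sample temperature differences (Topt − Tempc of −10, 0, and 10) are assumptions chosen for illustration, not the actual range in the data.

```python
import math

def t2(diff, scale, k1, shift1, k2, shift2):
    """T2 as a function of diff = Topt - Tempc.

    scale / [(1 + e^(k1 (diff - shift1))) (1 + e^(k2 (-diff - shift2)))]
    """
    return scale / ((1 + math.exp(k1 * (diff - shift1)))
                    * (1 + math.exp(k2 * (-diff - shift2))))

def t2_initial(diff):
    return t2(diff, 1.18, 0.2, 10.0, 0.3, 10.0)

def t2_revised(diff):
    return t2(diff, 1.80, 0.05, 10.8, 0.3, 90.33)

# Print both versions at a few illustrative temperature differences.
for diff in (-10.0, 0.0, 10.0):
    print(diff, round(t2_initial(diff), 3), round(t2_revised(diff), 3))
```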

Page 17: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Revising Intrinsic Values in the Model

The NPPc submodel includes one intrinsic property, SR, associated with the variable for vegetation type, UMD-VEG.

The corresponding RF6 network includes one hidden node for SR and one dummy input variable for each vegetation type.

Veg type   A     B     C     D     E     F     G     H     I     J     K
Initial    3.06  4.35  4.35  4.05  5.09  3.06  4.05  4.05  4.05  5.09  4.05
Revised    2.57  4.77  2.20  3.99  3.70  3.46  2.34  0.34  2.72  3.46  1.60

RMSE = 467.910 for the original model;
RMSE = 448.376 for the revised model, an improvement of four percent.

Observation: Nearly all intrinsic values are lower in the revised model.
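The dummy-variable encoding can be sketched as follows: each vegetation type becomes a one-hot input vector, and the SR node's output is the weighted sum of those inputs, so each weight is exactly the intrinsic value for one vegetation type. The SR values are those in the table above; the function names are hypothetical and the network is reduced to this single node.

```python
VEG_TYPES = "ABCDEFGHIJK"
INITIAL_SR = [3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05]
REVISED_SR = [2.57, 4.77, 2.20, 3.99, 3.70, 3.46, 2.34, 0.34, 2.72, 3.46, 1.60]

def one_hot(veg):
    """One dummy input per vegetation type: 1.0 for the given type, else 0.0."""
    return [1.0 if v == veg else 0.0 for v in VEG_TYPES]

def sr_node(veg, weights):
    """Output of the SR node: weighted sum of the dummy inputs."""
    return sum(w * x for w, x in zip(weights, one_hot(veg)))

# Vegetation types whose intrinsic value decreased after revision.
lowered = [v for v, old, new in zip(VEG_TYPES, INITIAL_SR, REVISED_SR) if new < old]
```

Revising the intrinsic property then amounts to adjusting these eleven weights, starting from the initial row of the table.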

Page 18: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Revising Equations in the NPPc Model

Initial model: $E = 0.56 \cdot T1 \cdot T2 \cdot W$
Cross-validated RMSE = 467.910
Behavior: Each stress term decreases the photosynthetic efficiency E.

Revised model: $E = 0.521 \cdot T1^{0.00} \cdot T2^{0.03} \cdot W^{0.00}$
Cross-validated RMSE = 446.270 [five percent reduction]
Behavior: T1 and W have no effect on E and T2 has only a minor effect.

Conclusion: The stress terms are not useful to the NPPc model, most likely because of recent improvements in NDVI measures.
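A short sketch makes the effect of the revised exponents explicit: with exponents of 0.00, the T1 and W factors equal 1 whatever their values, so only T2 (raised to 0.03) still influences E. The function names and the sample stress values are assumptions for illustration.

```python
def e_initial(t1, t2, w):
    # E = 0.56 * T1 * T2 * W
    return 0.56 * t1 * t2 * w

def e_revised(t1, t2, w):
    # E = 0.521 * T1^0.00 * T2^0.03 * W^0.00
    return 0.521 * t1 ** 0.00 * t2 ** 0.03 * w ** 0.00

# Changing T1 and W alters the initial model's E but not the revised model's.
strong_stress = (0.5, 0.9, 0.2)
no_stress = (1.0, 0.9, 1.0)
initial_gap = e_initial(*no_stress) - e_initial(*strong_stress)
revised_gap = e_revised(*no_stress) - e_revised(*strong_stress)
```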

Page 19: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Future Work on Ecological Modeling

Apply revision method to other parts of NPPc submodel and other static parts of CASA model.

Extend revision method to improve parts of CASA that involve difference equations.

Develop software for visualizing both spatial and temporal anomalies, as well as relating them to model.

Implement an interactive system that lets scientists direct high-level search for improved ecosystem models.

Page 20: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Visualizing Errors in the Model

We can easily plot an improved model’s errors in spatial terms.

Such displays can help suggest causes for prediction errors and thus ways to further improve the model.

Page 21: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Some Interesting Biological Questions

How do organisms acclimate to increased temperature or ultraviolet radiation?

Why do we observe bleaching of plant cells under high light conditions?

What differences in biological processes exist between a mutant organism and the original?

What are the effects on an organism’s biological processes when one of its important genes is removed?

Page 22: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Modeling Results in Microarray Experiments

Given: Qualitative knowledge about an organism’s reactions and regulations for some environmental setting.

Given: A mutated organism with different macroscopic behavior in that environmental setting.

Given: Observed expression levels, over time, of the mutant’s enzymes in the setting.

Find: A revised model with altered reactions and regulations that explains the expression levels.

Page 23: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Modeling Microarray Results on Photosynthesis

Given: Qualitative knowledge about reactions and regulations for Cyanobacteria in a high ultraviolet situation.

Given: A mutated strain of Cyanobacteria that does not bleach when exposed to high ultraviolet light.

Given: Observed expression levels, over time, of the mutant’s enzymes in the presence of high ultraviolet light.

Find: A revised model with altered reactions and regulations that explains the expression levels and the failure to bleach.

Page 24: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

A Model of Photosynthesis Regulation

Why do plants modify their photosynthetic apparatus in high light?

[Figure: regulation diagram with node labels HL-N-S-P-Cl, nblS, RRcpcXhliApsbx..., Blue/UV-A Photoreceptor, nblR, nblB, nblA, Degradation of psaF, psaA, psaB, Modification of Photosynthesis, and Survival in High Light.]

Page 25: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Collecting Data on Photosynthetic Processes

[Figure: data collection setup showing a continuous culture (chemostat), sampling of mRNA/cDNA, and a microarray trace, together with a plot of health of culture against time that marks the equilibrium period, the stress (e.g., high light), and the adaptation period. Image sources: wwwscience.murdoch.edu.au/teach, www.affymetrix.com/]

Page 26: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Microarray Data on Photosynthetic Regulation

Page 27: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Six Steps in Revising Regulation Models

Our approach to revising an existing model involves six steps:

1. Generate candidate models with a single process removed.

2. Predict qualitative correlations between enzymes for each model.

3. Calculate the observed correlations between enzymes over time.

4. Measure the percentage of correct predictions for each model.

5. Select the revised model with the highest predictive accuracy.

6. Repeat this strategy until no revision leads to improvement.

Thus, our system carries out heuristic search through the space of models, guided by candidates’ abilities to explain the data.
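The six steps amount to greedy hill climbing over models, which can be sketched as below. This is a minimal sketch under strong simplifying assumptions: a model is just a set of named processes, `accuracy` is any function returning the fraction of correct predictions, and the toy scoring function treats each process as directly predicting a positive correlation for one gene pair. How the actual system derives predicted correlations from a regulatory model is not shown on the slide.

```python
from typing import Callable, FrozenSet, Tuple

Model = FrozenSet[str]

def revise(model: Model, accuracy: Callable[[Model], float]) -> Tuple[Model, float]:
    """Repeatedly remove the single process whose removal most improves accuracy."""
    best, best_score = model, accuracy(model)
    while best:
        # Step 1: candidate models with a single process removed.
        candidates = [best - {process} for process in best]
        # Steps 2-5: score each candidate and keep the most accurate one.
        top = max(candidates, key=accuracy)
        top_score = accuracy(top)
        # Step 6: stop when no revision leads to improvement.
        if top_score <= best_score:
            break
        best, best_score = top, top_score
    return best, best_score

# Toy data: "+" means a positive correlation, "x" means none was observed.
OBSERVED = {"nblS,nblR": "+", "nblR,nblA": "x", "nblA,psaF": "+"}

def toy_accuracy(model: Model) -> float:
    """Fraction of pairs whose predicted sign matches the observed sign."""
    hits = sum(("+" if pair in model else "x") == sign
               for pair, sign in OBSERVED.items())
    return hits / len(OBSERVED)

initial = frozenset(OBSERVED)          # initial model predicts "+" for every pair
revised, score = revise(initial, toy_accuracy)
```

Because each pass only keeps the best single removal, the search is cheap but can stop at a local optimum, which is the usual trade-off for this kind of heuristic search.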

Page 28: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Heuristic Search Through a Space of Models

Initial model

Revision 1.1   Revision 1.2   Revision 1.3   Revision 1.4

Revision 2.1   Revision 2.2   Revision 2.3   Revision 2.4

Revision 3.1   Revision 3.2   Revision 3.3   Revision 3.4

Page 29: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

A Revised Model of Photosynthesis Regulation

The mutant is NblR deficient, so it does not down-regulate NblA/B.

[Figure: revised regulation diagram with node labels HL-N-S-P-Cl, nblS, RRcpcXhliApsbx..., Blue/UV-A Photoreceptor, nblR, nblB, nblA, Modification of Photosynthesis, and Survival in High Light; the process Degradation of psaF, psaA, psaB is marked with an X.]

Page 30: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Observed and Predicted Correlations

Pair        Observed  Original  Revised
nblS,nblR   +         +         +
nblS,nblA   ×         +         ×
nblS,nblB   ×         +         ×
nblS,psaF   ×         +         ×
nblS,psaA   ×         +         ×
nblS,psaB   ×         +         ×
nblR,nblA   ×         +         ×
nblR,nblB   ×         +         ×
nblR,psaF   ×         +         ×
nblR,psaA   ×         +         ×
nblR,psaB   ×         +         ×
nblA,psaF   +         +         +
nblA,psaA   +         +         +
nblA,psaB   +         +         +
nblA,psaF   +         +         +
nblA,psaA   +         +         +
• • •

Page 31: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Future Work on Biological Modeling

Add more knowledge about photosynthetic pathways and use to interpret additional microarray data.

Incorporate ability to introduce new regulation influences in addition to removing existing ones.

Expand modeling formalism to include abstract processes like signal transduction and allosteric modulation.

Implement an interactive system that lets scientists direct high-level search for improved biological process models.

Page 32: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Concluding Remarks

In summary, unlike work in the data mining paradigm, our research on computational discovery:

• attempts to move beyond description and prediction to both explanation and understanding;
• uses domain knowledge to initialize search and to characterize differences from revised model;
• presents the new knowledge in some communicable notation that is familiar to domain experts.

Such techniques will improve the way we manipulate, utilize, and understand complex scientific and engineering data.

Page 33: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and
Page 34: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Improving the Prediction of NDVI

The Normalized Difference Vegetative Index (NDVI) is a central part of CASA that is measured by satellite sensors.

Unfortunately, NDVI is only available for the years since 1983, when satellites with these sensors were launched.

Potter and Brooks (1998) report a predictive model of NDVI that is a piecewise linear function of temperature, rainfall, and moisture.

We hoped to improve this model using Cubist, which induces a set of regression rules from continuous data.

Page 35: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Form of the CASA NPPc Data

[Figure: layout of the data, with columns for NPPc, Temp, Topt, EET, PET, NDVI, AHI, and Veg, one layer per month from January through December, and rows for grid cells from Grid 1,1 to Grid 360,360.]

Page 36: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

An Improved Piecewise Linear Model

Cubist produced a revised NDVI model with five piecewise linear components rather than two, all based on rainfall.

This model explains 88% of the variance, compared with 74% of the variance for the Potter and Brooks model.

Page 37: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Visualizing the Improved Model

One way to visualize the model involves plotting rules spatially.

Our Earth science collaborators found this useful, as regions often correspond to recognizable ecological zones.

Page 38: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

The Task of Metabolic Modeling

Given: Knowledge about the metabolism of an organism stated as biochemical reactions.

Given: Observed environmental situations and expression levels of enzymes from microarrays.

Find: A complete metabolic model that explains the observed expression levels.

[Figure: pathway fragment with metabolites Acetoacetyl-CoA, Acetoacetate, Acetyl-CoA, and Intermediate, and enzymes EC2.8.3.5, EC4.1.3.5, and EC4.1.3.4.]

Page 39: Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and

Six Steps in Metabolic Model Revision

Our general approach to metabolic modeling involves six steps:

1. Represent biochemical reactions known for the organism.

2. Find complete metabolic pathways through heuristic search.

3. Order metabolic pathways using matches to microarray data.

4. Simulate natural or experimental knockouts of genes/enzymes.

5. Propose bridging reactions that explain the observed behavior.

6. Order reactions using reaction analogy and DNA sequences.

We will illustrate these steps with an example from glycolysis and the TCA cycle.


Step 1. Represent Biochemical Reactions


CYTOSOLIC: glucose + ATP ---[Hexokinase]--> glucose 6-phosphate + ADP
CYTOSOLIC: 1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
MITOCHONDRIAL: isocitrate + NAD+ ---[Isocitrate dehydrogenase]--> a-ketoglutarate + NADH + H+ + CO2
MITOCHONDRIAL: succinyl CoA + GDP + phosphate ---[Succinyl CoA synthase]--> succinate + GTP + CoA
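One way to hold reactions written in this notation is a small record type plus a parser. The sketch below is an illustration only: the `Reaction` class and `parse_reaction` function are hypothetical names, not the representation used in the system described in the talk.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Reaction:
    compartment: str   # e.g. "CYTOSOLIC"; empty if the text gives none
    reactants: tuple
    enzyme: str        # empty for an unnamed step such as "---[]-->"
    products: tuple

# Matches "COMPARTMENT: a + b ---[Enzyme]--> c + d" as written on the slides.
_PATTERN = re.compile(r"^(?:(\w+):)?\s*(.+?)\s*---\[(.*?)\]-->\s*(.+)$")

def _split(side):
    """Split on ' + ' with surrounding spaces so 'NAD+' and 'H+' stay whole."""
    return tuple(part.strip() for part in re.split(r"\s\+\s", side))

def parse_reaction(text):
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a reaction: {text!r}")
    compartment, left, enzyme, right = match.groups()
    return Reaction(compartment or "", _split(left), enzyme, _split(right))

r = parse_reaction(
    "MITOCHONDRIAL:isocitrate + NAD+ ---[Isocitrate dehydrogenase]--> "
    "a-ketoglutarate + NADH + H+ + CO2"
)
print(r.enzyme, r.products)
```

Splitting on a plus sign only when it has whitespace on both sides is what keeps charged species such as NAD+ and H+ from being broken apart.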



Step 2. Find Pathways by Heuristic Search

Target = Malate

Solution for Fructose environment:
fructose ---[Fructokinase]--> fructose 1-phosphate
fructose 1-phosphate ---[Fructose 1-phosphate aldolase]--> glyceraldehyde + dihydroxyacetone phosphate
dihydroxyacetone phosphate ---[Isomerase]--> glyceraldehyde 3-phosphate
phosphate + NAD+ + glyceraldehyde 3-phosphate ---[Triose phosphate dehydrogenase]--> 1,3-bisphosphoglycerate
1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
3-phosphoglycerate ---[Phosphoglyceromutase]--> 2-phosphoglycerate
2-phosphoglycerate ---[Enolase]--> phosphoenolpyruvate + H2O
phosphoenolpyruvate + ATP ---[Pyruvate kinase]--> pyruvate + ADP
malate + NAD+ ---[Malate dehydrogenase]--> oxaloacetate + NADH + H+
pyruvate + NAD+ + CoA ---[NIL]--> NADH + H+ + CO2 + acetyl CoA
acetyl CoA + oxaloacetate ---[Citrate synthase]--> citrate + CoA
citrate ---[Aconitase]--> isocitrate
isocitrate + NAD+ ---[Isocitrate dehydrogenase]--> a-ketoglutarate + NADH + H+ + CO2
a-ketoglutarate + NAD+ + CoA ---[a-ketoglutarate dehydrogenase complex]--> succinyl CoA + NADH + H+ + CO2
succinyl CoA + GDP + phosphate ---[Succinyl CoA synthase]--> succinate + GTP + CoA
succinate + FAD ---[Succinate dehydrogenase]--> fumarate + FADH2
fumarate + H2O ---[Fumarase]--> malate

Solution for Glucose environment:
glucose + ATP ---[Hexokinase]--> glucose 6-phosphate + ADP
glucose 6-phosphate ---[Phosphoglucomutase]--> fructose 6-phosphate
fructose 6-phosphate + ATP ---[Phosphofructokinase]--> fructose 1,6 bisphosphate + ADP
fructose 1,6 bisphosphate ---[Aldolase]--> dihydroxyacetone phosphate + glyceraldehyde 3-phosphate
phosphate + NAD+ + glyceraldehyde 3-phosphate ---[Triose phosphate dehydrogenase]--> 1,3-bisphosphoglycerate
1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
[...same as above from this point onward...]
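To make the idea of searching for a pathway from an environment to a target concrete, here is a minimal sketch that uses plain forward chaining over a handful of the reactions above. Forward chaining is a stand-in chosen for brevity, not necessarily the heuristic search the talk refers to, and `find_pathway` is a hypothetical name.

```python
# Each reaction is (enzyme, reactants, products); a small subset of the
# reactions listed on the slides.
REACTIONS = [
    ("Hexokinase", {"glucose", "ATP"}, {"glucose 6-phosphate", "ADP"}),
    ("Fructokinase", {"fructose"}, {"fructose 1-phosphate"}),
    ("Fructose 1-phosphate aldolase", {"fructose 1-phosphate"},
     {"glyceraldehyde", "dihydroxyacetone phosphate"}),
    ("Isomerase", {"dihydroxyacetone phosphate"}, {"glyceraldehyde 3-phosphate"}),
]

def find_pathway(available, target, reactions):
    """Return the enzymes needed to reach `target`, in firing order, or None."""
    have = set(available)
    producer = {}   # metabolite -> index of the reaction that first produced it
    fired = []      # reaction indices in the order they were applied
    progress = True
    while progress and target not in have:
        progress = False
        for i, (_, reactants, products) in enumerate(reactions):
            if i not in fired and reactants <= have:
                fired.append(i)
                for metabolite in products - have:
                    producer[metabolite] = i
                have |= products
                progress = True
    if target not in have:
        return None
    # Walk back from the target to keep only the reactions actually needed.
    needed, stack = set(), [target]
    while stack:
        i = producer.get(stack.pop())
        if i is not None and i not in needed:
            needed.add(i)
            stack.extend(reactions[i][1])
    return [reactions[i][0] for i in fired if i in needed]

print(find_pathway({"fructose"}, "glyceraldehyde 3-phosphate", REACTIONS))
```

Changing the starting set from fructose to glucose plus ATP changes which reactions can fire first, which is how different environments yield different solutions.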


Step 3. Order Pathways by Likelihood Given Data

www.affymetrix.com/

fructose ---[Fructokinase]--> fructose 1-phosphate
fructose 1-phosphate ---[Fructose 1-phosphate aldolase]--> glyceraldehyde + dihydroxyacetone phosphate
dihydroxyacetone phosphate ---[Isomerase]--> glyceraldehyde 3-phosphate
phosphate + NAD+ + glyceraldehyde 3-phosphate ---[Triose phosphate dehydrogenase]--> NADH + H+ + 1,3-bisphosphoglycerate
1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
3-phosphoglycerate ---[Phosphoglyceromutase]--> 2-phosphoglycerate
2-phosphoglycerate ---[Enolase]--> phosphoenolpyruvate + H2O
phosphoenolpyruvate + ATP ---[Pyruvate kinase]--> pyruvate + ADP
malate + NAD+ ---[Malate dehydrogenase]--> oxaloacetate + NADH + H+
pyruvate + NAD+ + CoA ---[NIL]--> NADH + H+ + CO2 + acetyl CoA
acetyl CoA + oxaloacetate ---[Citrate synthase]--> citrate + CoA
citrate ---[Aconitase]--> isocitrate
isocitrate + NAD+ ---[Isocitrate dehydrogenase]--> a-ketoglutarate + NADH + H+ + CO2
a-ketoglutarate + NAD+ + CoA ---[a-ketoglutarate dehydrogenase complex]--> succinyl CoA + NADH + H+ + CO2
succinyl CoA + GDP + phosphate ---[Succinyl CoA synthase]--> succinate + GTP + CoA
succinate + FAD ---[Succinate dehydrogenase]--> fumarate + FADH2
fumarate + H2O ---[Fumarase]--> malate
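The slides do not spell out how pathways are scored against microarray data. As one plausible sketch, a pathway could be scored by the average expression level of its enzymes and the candidates sorted by that score. The scoring rule, the function names, and the expression values below are all assumptions made for illustration.

```python
def pathway_score(enzymes, expression):
    """Average expression over the pathway's enzymes that were measured.

    Enzymes missing from `expression` (for example an unnamed NIL step)
    are skipped; a pathway with no measured enzymes scores 0.0.
    """
    levels = [expression[e] for e in enzymes if e in expression]
    return sum(levels) / len(levels) if levels else 0.0

def rank_pathways(pathways, expression):
    """Return pathway names sorted from best to worst supported."""
    return sorted(
        pathways,
        key=lambda name: pathway_score(pathways[name], expression),
        reverse=True,
    )

# Invented expression levels, as might be seen in a fructose-rich environment.
expression = {"Fructokinase": 0.9, "Hexokinase": 0.2, "Isomerase": 0.7, "Aldolase": 0.4}
pathways = {
    "via fructose": ["Fructokinase", "Isomerase"],
    "via glucose": ["Hexokinase", "Aldolase"],
}
print(rank_pathways(pathways, expression))
```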


Step 4. Simulate Natural or Experimental Knockouts

glucose + ATP ---[Hexokinase]--> glucose 6-phosphate + ADP
glucose 6-phosphate ---[Phosphoglucomutase]--> fructose 6-phosphate
fructose 6-phosphate + ATP ---[Phosphofructokinase]--> fructose 1,6 bisphosphate + ADP
fructose 1,6 bisphosphate ---[Aldolase]--> dihydroxyacetone phosphate + glyceraldehyde 3-phosphate
phosphate + NAD+ + glyceraldehyde 3-phosphate ---[Triose phosphate dehydrogenase]--> 1,3-bisphosphoglycerate
1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
3-phosphoglycerate ---[Phosphoglyceromutase]--> 2-phosphoglycerate
2-phosphoglycerate ---[Enolase]--> phosphoenolpyruvate + H2O
phosphoenolpyruvate + ATP ---[Pyruvate kinase]--> pyruvate + ADP
malate + NAD+ ---[Malate dehydrogenase]--> oxaloacetate + NADH + H+
pyruvate + NAD+ + CoA ---[NIL]--> NADH + H+ + CO2 + acetyl CoA
acetyl CoA + oxaloacetate ---[Citrate synthase]--> citrate + CoA
citrate ---[Aconitase]--> isocitrate
isocitrate + NAD+ ---[Isocitrate dehydrogenase]--> a-ketoglutarate + NADH + H+ + CO2
a-ketoglutarate + NAD+ + CoA ---[a-ketoglutarate dehydrogenase complex]--> succinyl CoA + NADH + H+ + CO2
succinyl CoA + GDP + phosphate ---[Succinyl CoA synthase]--> succinate + GTP + CoA
succinate + FAD ---[Succinate dehydrogenase]--> fumarate + FADH2
fumarate + H2O ---[Fumarase]--> malate

Knockout: 1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP
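Simulating a knockout can be sketched as deleting every reaction catalyzed by the chosen enzyme and then checking which metabolites are still producible. The code below is a minimal illustration under that assumption; `knock_out` and `reachable` are hypothetical names, and the starting metabolites are chosen only to keep the example small.

```python
# Three consecutive glycolysis steps from the slides, as
# (enzyme, reactants, products).
REACTIONS = [
    ("Triose phosphate dehydrogenase",
     {"phosphate", "NAD+", "glyceraldehyde 3-phosphate"},
     {"1,3-bisphosphoglycerate"}),
    ("Phosphoglycerate kinase",
     {"1,3-bisphosphoglycerate", "ADP"},
     {"3-phosphoglycerate", "ATP"}),
    ("Phosphoglyceromutase", {"3-phosphoglycerate"}, {"2-phosphoglycerate"}),
]

def knock_out(reactions, enzyme):
    """Return the reaction list with every step catalyzed by `enzyme` removed."""
    return [r for r in reactions if r[0] != enzyme]

def reachable(available, reactions):
    """Forward-chain to collect every metabolite producible from `available`."""
    have = set(available)
    grew = True
    while grew:
        grew = False
        for _, reactants, products in reactions:
            if reactants <= have and not products <= have:
                have |= products
                grew = True
    return have

start = {"glyceraldehyde 3-phosphate", "phosphate", "NAD+", "ADP"}
wild_type = reachable(start, REACTIONS)
mutant = reachable(start, knock_out(REACTIONS, "Phosphoglycerate kinase"))
print(sorted(wild_type - mutant))  # metabolites lost to the knockout
```

The metabolites that disappear after the knockout mark the gap that a bridging reaction would have to close if the organism is observed to survive.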


Step 5. Propose Bridging Reactions

Figure: abstract chemical knowledge plus abstract balance yields a constrained search. Example: glucose (6 carbons, 0 phosphates) + ATP (3 phosphates) ---[Hexokinase]--> glucose 6-phosphate (6 carbons, 1 phosphate) + ADP (2 phosphates).
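The abstract-balance idea can be sketched by describing each metabolite only by its carbon and phosphate counts and keeping candidate reactions whose two sides agree. In the sketch below, the phosphate counts for ATP and ADP and the counts for glucose and glucose 6-phosphate follow the slide; the other entries are standard chemistry added for the example, and the function names and the simple enumeration are assumptions rather than the system's actual search.

```python
# Abstract description of each metabolite: (carbons, phosphate groups).
ABSTRACT = {
    "glucose": (6, 0),
    "glucose 6-phosphate": (6, 1),
    "ATP": (10, 3),
    "ADP": (10, 2),
    "glyceraldehyde 3-phosphate": (3, 1),
    "3-phosphoglycerate": (3, 1),
    "fructose 1,6 bisphosphate": (6, 2),
}

def totals(side):
    """Sum carbons and phosphates over one side of a reaction."""
    carbons = sum(ABSTRACT[m][0] for m in side)
    phosphates = sum(ABSTRACT[m][1] for m in side)
    return carbons, phosphates

def is_balanced(reactants, products):
    return totals(reactants) == totals(products)

def propose_bridges(sources, target, coproducts):
    """Enumerate source -> target (+ optional co-product); keep balanced ones."""
    found = []
    for source in sources:
        for extra in [None] + list(coproducts):
            products = [target] if extra is None else [extra, target]
            if is_balanced([source], products):
                found.append(([source], products))
    return found

bridges = propose_bridges(
    ["glyceraldehyde 3-phosphate", "fructose 1,6 bisphosphate", "glucose"],
    "3-phosphoglycerate",
    ["3-phosphoglycerate", "glyceraldehyde 3-phosphate"],
)
for reactants, products in bridges:
    print(" + ".join(reactants), "---[]-->", " + ".join(products))
```

Even this coarse check discards most combinations, which is the sense in which the balance constraint keeps the search for bridging reactions small.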


25 plausible (single) "bridging" reactions are proposed:
<CYTOSOLIC:glyceraldehyde 3-phosphate ---[]--> 3-phosphoglycerate>
<CYTOSOLIC:dihydroxyacetone phosphate ---[]--> 3-phosphoglycerate>
<CYTOSOLIC:fructose 1,6 bisphosphate ---[]--> phosphoenolpyruvate + 3-phosphoglycerate>
<CYTOSOLIC:fructose 1,6 bisphosphate ---[]--> 2-phosphoglycerate + 3-phosphoglycerate>
<CYTOSOLIC:fructose 1,6 bisphosphate ---[]--> 3-phosphoglycerate + 3-phosphoglycerate>
<CYTOSOLIC:ATP + fructose 1,6 bisphosphate ---[]--> ADP + 1,3-bisphosphoglycerate + 3-phosphoglycerate>
<CYTOSOLIC:fructose 1,6 bisphosphate ---[]--> glyceraldehyde 3-phosphate + 3-phosphoglycerate>
<CYTOSOLIC:fructose 1,6 bisphosphate ---[]--> dihydroxyacetone phosphate + 3-phosphoglycerate>
<CYTOSOLIC:ADP + fructose 1,6 bisphosphate ---[]--> ATP + CO2 + acetyl + 3-phosphoglycerate>
<CYTOSOLIC:ADP + 1,3-bisphosphoglycerate ---[]--> ATP + 3-phosphoglycerate>
<CYTOSOLIC:ADP + fructose 1,6 bisphosphate ---[]--> ATP + pyruvate + 3-phosphoglycerate>
<CYTOSOLIC:ADP + fructose 1,6 bisphosphate ---[]--> ATP + glycerate + 3-phosphoglycerate>
<CYTOSOLIC:ADP + fructose 1,6 bisphosphate ---[]--> ATP + glyceraldehyde + 3-phosphoglycerate>
<CYTOSOLIC:ADP + fructose 1,6 bisphosphate ---[]--> ATP + dihydroxyacetone + 3-phosphoglycerate>
<CYTOSOLIC:ATP + glucose 6-phosphate ---[]--> ADP + phosphoenolpyruvate + 3-phosphoglycerate>
<CYTOSOLIC:ATP + glucose 6-phosphate ---[]--> ADP + 2-phosphoglycerate + 3-phosphoglycerate>
<CYTOSOLIC:ATP + glucose 6-phosphate ---[]--> ADP + 3-phosphoglycerate + 3-phosphoglycerate>
<CYTOSOLIC:ATP + glucose 6-phosphate ---[]--> ADP + glyceraldehyde 3-phosphate + 3-phosphoglycerate>
<CYTOSOLIC:ATP + glucose 6-phosphate ---[]--> ADP + dihydroxyacetone phosphate + 3-phosphoglycerate>
<CYTOSOLIC:glucose 6-phosphate ---[]--> CO2 + acetyl + 3-phosphoglycerate>
<CYTOSOLIC:glucose 6-phosphate ---[]--> pyruvate + 3-phosphoglycerate>
<CYTOSOLIC:glucose 6-phosphate ---[]--> glycerate + 3-phosphoglycerate>
<CYTOSOLIC:glucose 6-phosphate ---[]--> glyceraldehyde + 3-phosphoglycerate>
<CYTOSOLIC:glucose 6-phosphate ---[]--> dihydroxyacetone + 3-phosphoglycerate>
<CYTOSOLIC:glucose + ATP ---[]--> 1,3-bisphosphoglycerate + 3-phosphoglycerate>

Knockout: 1,3-bisphosphoglycerate + ADP ---[Phosphoglycerate kinase]--> 3-phosphoglycerate + ATP


www.bio.davidson.edu/Biology

Step 6. Order Bridging Reactions by Likelihood

Homology of hexokinase across species:

We also measure similarity in structure between each bridging reaction and the knocked out reaction.
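The slide does not say how structural similarity between reactions is measured. As a hedged sketch, one simple option is the Jaccard overlap between the reactant sets and between the product sets of a candidate and the knocked-out reaction; this measure and the names below are assumptions, not the method used in the talk.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union; 1.0 for two empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def reaction_similarity(candidate, reference):
    """Mean Jaccard overlap of the reactant sets and of the product sets."""
    (c_in, c_out), (r_in, r_out) = candidate, reference
    return (jaccard(c_in, r_in) + jaccard(c_out, r_out)) / 2

# The knocked-out reaction and two of the proposed bridging reactions.
knocked_out = (["1,3-bisphosphoglycerate", "ADP"], ["3-phosphoglycerate", "ATP"])
candidates = {
    "A": (["glyceraldehyde 3-phosphate"], ["3-phosphoglycerate"]),
    "B": (["ADP", "1,3-bisphosphoglycerate"], ["ATP", "3-phosphoglycerate"]),
}
ranked = sorted(
    candidates,
    key=lambda name: reaction_similarity(candidates[name], knocked_out),
    reverse=True,
)
print(ranked)
```

A score of this kind could be combined with sequence homology evidence when ordering candidates, though how the two are weighted is not stated in the slides.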


Microarray Data on Photosynthetic Regulation