Computational Discovery of Communicable Scientific Knowledge

Pat Langley

Institute for the Study of Learning and Expertise, Palo Alto, California

and

Center for the Study of Language and Information, Stanford University, Stanford, California

http://www.isle.org/~langley

[email protected]

Thanks to S. Bay, V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, J. Shrager, M. Schwabacher, and A. Torregrosa.

Motivations for Computational Discovery

Humans strive to discover new knowledge from experience so that they can:

• better predict and control future events

• understand both previous and future events

• communicate that understanding to others

Computational techniques should let us automate and/or assist this discovery process.

Recent research on computer-aided discovery has focused on some of these issues but downplayed others.

The Data Mining Paradigm

One computational discovery paradigm, known as data mining or KDD, can be best characterized as:

• emphasizing the availability of vast amounts of data;

• drawing on heuristic search methods to find regularities in these data;

• using formalisms like decision trees, association rules, and Bayes nets to describe those regularities.

Thus, most KDD researchers favor their own formalisms over those used by scientists and engineers.

As a result, their discoveries are seldom very communicable to members of those communities.

Myths About Understandability

Within the data mining paradigm, one quite popular myth is that:

• decision trees and rules are inherently understandable because logical formalisms are easier to interpret than other notations.

However, Kononenko found that doctors felt that naïve Bayesian classifiers were easier to interpret than decision trees.

Conclusion: Any formalism’s understandability depends on the interpreter’s familiarity with that formalism.

Myths About Understandability

Another popular myth in the data mining community is that:

• connectionist methods produce results that are opaque because the set of weights they learn cannot be easily interpreted.

However, Saito and Nakano (1997) have shown that one can use such methods to discover explicit numeric equations.

Conclusion: Understandability depends on the resulting formalism, not on the search method used to discover knowledge.

Computational Scientific Discovery

An older paradigm, computational scientific discovery, can be characterized as:

• drawing on heuristic search to find regularities in scientific data, either historical or novel;

• using formalisms like numeric laws, structural models, and reaction pathways to describe regularities.

Thus, researchers in this framework favor representations used by scientists and engineers.

As a result, their systems’ discoveries are more communicable to members of those communities.

Time Line for Research on Computational Scientific Discovery

[Figure: time line spanning 1979 to 2000 that places discovery systems by year and by the type of knowledge they discover. Systems shown: Dendral, AM, Bacon.1–Bacon.5, Glauber, NGlauber, Abacus, Coper, Fahrenheit, E*, Tetrad, IDSN, IDSQ, Live, Hume, ARC, DST, GPN, Dalton, Stahl, Stahlp, Revolver, IE, Coast, Phineas, AbE, Kekada, Gell-Mann, BR-3, BR-4, Mendel, Pauli, Mechem, CDP, Astra, GPM, LaGrange, SDS, SSF, RF5, LaGramge, RL, Progol, HR. Legend: numeric laws, qualitative laws, structural models, process models.]

Successes of Computational Scientific Discovery

Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:

• stellar taxonomies from infrared spectra (Cheeseman et al., 1989)

• qualitative chemical factors in mutagenesis (King et al., 1996)

• quantitative laws of metallic behavior (Sleeman et al., 1997)

• qualitative conjectures in number theory (Colton et al., 2000)

• temporal laws of ecological behavior (Todorovski et al., 2000)

• reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)

Each of these has led to publications in the refereed literature of the relevant scientific field (see Langley, 2000).

The Developer’s Role in Computational Discovery

[Figure: diagram of the developer’s activities. Labels: problem formulation, representation engineering, data manipulation, algorithm manipulation, algorithm invocation, filtering and interpretation.]

Themes of the Research

We aim to extend previous approaches to computational scientific discovery by:

• generating explanations that involve hidden objects/variables

• revising existing models rather than starting from scratch

• drawing on domain knowledge to constrain the search process

• developing interactive discovery tools for use by scientists

Two promising fields in which to pursue this research agenda are Earth science and molecular biology.

As in earlier work, the notation for discovered knowledge will be the same as that used by domain scientists.

Some Interesting Questions in Earth Science

• What environmental variables determine the production of carbon and the generation of various gases?

• What functional forms relate these predictive variables to the ones they influence?

• How do extreme values of these variables affect behavior of the ecosystem?

• Are the Earth ecosystem parameters constant or have values changed in recent years?

The Task of Ecological Model Revision

Given: A model of Earth’s ecosystem (CASA) stated as equations that involve observable and hidden variables.

Given: Observations about numeric variables (rainfall, sunlight, temperature, NPPc) as they change over space and time.

Given: Inferred values for global parameters and intrinsic properties associated with discrete variables (e.g., ground cover).

Find: A revised ecosystem model with altered equations and/or parametric values that fits the data better.

The NPPc Portion of CASA

\[
\begin{aligned}
\text{NPPc} &= \sum_{\text{month}} \max(E \cdot \text{IPAR},\ 0) \\
E &= 0.56 \cdot \text{T1} \cdot \text{T2} \cdot W \\
\text{T1} &= 0.8 + 0.02 \cdot \text{Topt} - 0.0005 \cdot \text{Topt}^2 \\
\text{T2} &= 1.18 \,/\, \left[ \left(1 + e^{0.2 \cdot (\text{Topt} - \text{Tempc} - 10)}\right) \cdot \left(1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 10)}\right) \right] \\
W &= 0.5 + 0.5 \cdot \text{EET} / \text{PET} \\
\text{PET} &= 1.6 \cdot (10 \cdot \text{Tempc} / \text{AHI})^{A} \cdot \text{PET-TW-M} \quad \text{if } \text{Tempc} > 0 \\
\text{PET} &= 0 \quad \text{if } \text{Tempc} < 0 \\
A &= 0.00000068 \cdot \text{AHI}^3 - 0.000077 \cdot \text{AHI}^2 + 0.018 \cdot \text{AHI} + 0.49 \\
\text{IPAR} &= 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver} \\
\text{FPAR-FAS} &= \min\left[ (\text{SR-FAS} - 1.08) \,/\, \text{SR}(\text{UMD-VEG}),\ 0.95 \right] \\
\text{SR-FAS} &= (\text{Mon-FAS-NDVI} + 1000) \,/\, (\text{Mon-FAS-NDVI} - 1000)
\end{aligned}
\]
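The efficiency part of these equations can be sketched in a few lines of Python. This is a minimal illustration rather than the CASA implementation: the function names are hypothetical, IPAR is taken as an input instead of being derived from NDVI, and the handling of PET = 0 inside `w` is an assumption, since the slide does not say what W should be in that case.

```python
import math

def t1(topt):
    # Temperature stress term T1, a quadratic in the optimal temperature.
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    # Temperature stress term T2, a product of two logistic factors.
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def pet(tempc, ahi, pet_tw_m):
    # The slide defines PET = 0 for Tempc < 0; the formula itself also
    # gives 0 at Tempc = 0 when A > 0, so both cases are treated alike here.
    if tempc <= 0:
        return 0.0
    a = 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49
    return 1.6 * (10 * tempc / ahi) ** a * pet_tw_m

def w(eet, pet_value):
    # ASSUMPTION: when PET is 0 the ratio EET / PET is treated as 0.
    if pet_value == 0:
        return 0.5
    return 0.5 + 0.5 * eet / pet_value

def efficiency(topt, tempc, eet, pet_value):
    # Photosynthetic efficiency E = 0.56 * T1 * T2 * W.
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet_value)

def nppc(monthly):
    # monthly: iterable of (E, IPAR) pairs, one per month.
    return sum(max(e * ipar, 0.0) for e, ipar in monthly)
```

Because every term is a plain function of its inputs, each one can be swapped out or re-parameterized on its own, which is the property the revision method relies on.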

The NPPc Portion of CASA

[Figure: dependency graph of the NPPc submodel. Nodes: NPPc, E, IPAR, e_max, T1, T2, W, EET, PET, Tempc, Topt, A, AHI, PETTWM, FPAR, SR, NDVI, SOLAR, VEG.]

Improving the NPPc Portion of CASA

One way to improve the NPPc model’s fit to observed data is to:

1. Transform the model into a multilayer neural network that makes the same predictions.

2. Identify portions of the model that are candidates for revision.

3. Use an error-driven connectionist learning algorithm to revise those portions of the model.

4. Transform the revised multilayer network back into numeric equations using the improved components.

This approach is similar to Towell’s (1991) method for revising qualitative models.

The RF6 Discovery Algorithm

Saito and Nakano (2000) describe RF6, a discovery system that:

1. Creates a multilayer neural network that links predictive with predicted variables using additive and product units.

2. Invokes the BPQ algorithm to search through the weight space defined by this network.

3. Transforms the resulting network into a polynomial equation of the form

\[ y = \sum_i c_i \prod_j x_j^{d_{ij}} . \]

They have shown this approach can discover an impressive class of numeric equations from noisy data.
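To make the link between network and equation concrete, the sketch below evaluates a network of product units and prints the equation it encodes. It covers only the forward computation and the read-out step; the BPQ weight search is not implemented, and all function and variable names are hypothetical. It assumes strictly positive inputs, because each product unit is computed through logarithms.

```python
import math

def product_unit(x, exponents):
    # prod_j x_j ** d_j, computed as exp(sum_j d_j * ln x_j).
    return math.exp(sum(d * math.log(xj) for d, xj in zip(exponents, x)))

def network_output(x, coeffs, exponent_rows):
    # Additive output unit: sum_i c_i * (product unit i).
    return sum(c * product_unit(x, row)
               for c, row in zip(coeffs, exponent_rows))

def as_equation(coeffs, exponent_rows, names):
    # Read the weights back out as an explicit numeric equation.
    terms = []
    for c, row in zip(coeffs, exponent_rows):
        factors = [f"{n}^{d:g}" for n, d in zip(names, row) if d != 0]
        terms.append(" * ".join([f"{c:g}"] + factors))
    return "y = " + " + ".join(terms)

# Illustrative weights only, not values learned from data.
coeffs = [2.0, 3.0]
exponent_rows = [[1.0, -1.0], [0.5, 0.0]]
print(as_equation(coeffs, exponent_rows, ["x1", "x2"]))
print(network_output((4.0, 2.0), coeffs, exponent_rows))
```

The exponents play the role of hidden-layer weights and the coefficients the role of output weights, so a trained network of this shape can be written down directly as an equation.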

Three Facets of Model Revision

We have adapted RF6 to revise an existing quantitative model in three distinct ways:

• Altering the value of parameters in a specified equation;

• Changing the associated values for an intrinsic property; and

• Replacing the equation for a term with another expression.

Rather than initializing weights randomly, the system starts with weights based on parameters in the original model.

We have applied this strategy to revise six different portions of the NPPc submodel.

Altering Parameters in the NPPc Model

Initial model:

\[ \text{T2} = 1.18 \,/\, \left[ \left(1 + e^{0.2 \cdot (\text{Topt} - \text{Tempc} - 10)}\right) \cdot \left(1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 10)}\right) \right] \]

Cross-validated RMSE = 467.910

Behavior: Gaussian-like function of temperature difference.

Revised model:

\[ \text{T2} = 1.80 \,/\, \left[ \left(1 + e^{0.05 \cdot (\text{Topt} - \text{Tempc} - 10.8)}\right) \cdot \left(1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 90.33)}\right) \right] \]

Cross-validated RMSE = 461.466 [one percent reduction]

Behavior: nearly flat function in actual range of temperature difference.

Conclusion: The T2 temperature stress term contributes little to the overall predictive ability of the NPPc submodel.
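The two versions of T2 can be tabulated side by side with a short script. This is only a sketch for comparing the curves: the range of temperature differences printed here (−10 to +10) is an assumption, since the slide does not state the actual range in the data, and the helper names are hypothetical.

```python
import math

def t2(diff, scale, k1, off1, k2, off2):
    # T2 as a function of diff = Topt - Tempc.
    return scale / ((1 + math.exp(k1 * (diff - off1)))
                    * (1 + math.exp(k2 * (-diff - off2))))

def t2_initial(diff):
    return t2(diff, 1.18, 0.2, 10.0, 0.3, 10.0)

def t2_revised(diff):
    return t2(diff, 1.80, 0.05, 10.8, 0.3, 90.33)

def percent_reduction(before, after):
    # Relative improvement in cross-validated RMSE, in percent.
    return 100.0 * (before - after) / before

# ASSUMED range of temperature differences, for illustration only.
for diff in range(-10, 11, 5):
    print(f"{diff:+3d}  initial={t2_initial(diff):.3f}  "
          f"revised={t2_revised(diff):.3f}")
print(f"RMSE reduction: {percent_reduction(467.910, 461.466):.2f}%")
```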

Revising Intrinsic Values in the Model

The NPPc submodel includes one intrinsic property, SR, associated with the variable for vegetation type, UMD-VEG.

The corresponding RF6 network includes one hidden node for SR and one dummy input variable for each vegetation type.

Veg type   A     B     C     D     E     F     G     H     I     J     K
Initial    3.06  4.35  4.35  4.05  5.09  3.06  4.05  4.05  4.05  5.09  4.05
Revised    2.57  4.77  2.20  3.99  3.70  3.46  2.34  0.34  2.72  3.46  1.60

RMSE = 467.910 for the original model; RMSE = 448.376 for the revised model, an improvement of four percent.

Observation: Nearly all intrinsic values (nine of eleven) are lower in the revised model.
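One way to picture the dummy-variable encoding is as a lookup implemented by a weighted sum: exactly one input is 1, so the SR node outputs the weight attached to that vegetation type, and revising the intrinsic values amounts to revising those weights. The sketch below uses the values from the table; the function names are hypothetical and no learning is performed.

```python
VEG_TYPES = list("ABCDEFGHIJK")
INITIAL = [3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05]
REVISED = [2.57, 4.77, 2.20, 3.99, 3.70, 3.46, 2.34, 0.34, 2.72, 3.46, 1.60]

def dummy_inputs(veg_type):
    # One dummy (0/1) input per vegetation type.
    return [1.0 if v == veg_type else 0.0 for v in VEG_TYPES]

def sr_node(veg_type, weights):
    # Hidden node for SR: weighted sum over the dummy inputs.
    return sum(w * x for w, x in zip(weights, dummy_inputs(veg_type)))

# Vegetation types whose intrinsic value dropped after revision.
lowered = [v for v, a, b in zip(VEG_TYPES, INITIAL, REVISED) if b < a]
print(sr_node("H", INITIAL), sr_node("H", REVISED))
print(lowered)
```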

Revising Equations in the NPPc Model

Initial model:

\[ E = 0.56 \cdot \text{T1} \cdot \text{T2} \cdot W \]

Cross-validated RMSE = 467.910

Behavior: Each stress term decreases the photosynthetic efficiency E.

Revised model:

\[ E = 0.521 \cdot \text{T1}^{0.00} \cdot \text{T2}^{0.03} \cdot W^{0.00} \]

Cross-validated RMSE = 446.270 [five percent reduction]

Behavior: T1 and W have no effect on E and T2 has only a minor effect.

Conclusion: The stress terms are not useful to the NPPc model, most likely because of recent improvements in NDVI measures.
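As a worked illustration of why exponents near zero remove a term's influence, taking logarithms of the revised equation gives a linear form. The stress value T2 = 0.5 used here is an assumed number chosen for illustration, not a measurement.

```latex
\[
\ln E = \ln 0.521 + 0.00 \cdot \ln \text{T1} + 0.03 \cdot \ln \text{T2} + 0.00 \cdot \ln W
\]
\[
\text{T2} = 0.5 \;\Rightarrow\; E = 0.521 \cdot 0.5^{0.03} \approx 0.510
\]
```

This holds for any positive T1 and W, whereas in the initial model the same T2 would halve E.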

Future Work on Ecological Model Revision

• Apply the revision method to other parts of the NPPc submodel and other static parts of the CASA model.

• Extend the revision method to improve parts of CASA that involve difference equations.

• Develop software for visualizing both spatial and temporal anomalies, as well as relating them to the model.

• Implement an interactive system that lets scientists direct high-level search for improved ecosystem models.

Visualizing an Improved Model

One way to visualize a model involves plotting its rules spatially.

Our Earth science collaborators found this useful, as regions often correspond to recognizable ecological zones.

Some Interesting Biological Questions

• How do organisms acclimate to increased temperature or ultraviolet radiation?

• Why do we observe bleaching of plant cells under high light conditions?

• What differences in biological processes exist between a mutant organism and the original?

• What are the effects on an organism’s biological processes when one of its important genes is removed?

Modeling Microarray Results on Photosynthesis

How do plants modify their photosynthetic apparatus in high light?

Given: Qualitative knowledge about reactions and regulations for Cyanobacteria in a high light situation.

Given: Knowledge about the genes in Cyanobacteria relevant to the photosynthetic process.

Given: Observed expression levels, over time, of the organism’s genes in the presence of high ultraviolet light.

Find: A revised model with altered reactions and regulations that explains the expression levels and bleaching.

A Model of Photosynthesis Regulation

[Figure: initial regulatory model drawn as a graph with signed (+/−) links. Nodes: Light, DFR, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, cpcB.]

Collecting Data on Photosynthetic Processes

[Figure: data collection protocol. Labels: Continuous Culture (Chemostat), Equilibrium Period, Stress (e.g., High Light), Adaptation Period, Sampling mRNA/cDNA, Microarray Trace. The accompanying plot shows Health of Culture against Time. Image sources: /wwwscience.murdoch.edu.au/teach, www.affymetrix.com/]

Microarray Data on Photosynthetic Regulation

[Figure: plot with a vertical axis from 0 to 4 and a horizontal axis from 1 to 6, with one series each for NBLR, NBLA, cpcB, psbA2, psbA1, DFR, and PBS.]

Revising a Model of Gene Regulation

Our approach carries out heuristic search through the model space, guided by candidates’ abilities to explain the data:

• Starting state: Initial model proposed by the biologist

• Operators: Add a link, delete a link, determine sign on a link

• Control: Greedy search for N steps to determine link structure; exhaustive search to determine best signs on links

• Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad

To reduce variance, the system repeats this process using bootstrap sampling and only makes changes that occur in 75% of the models.
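The control structure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: the evaluation function is a simple stand-in that compares links with marginal correlations (the slide describes constraints on partial correlations), link signs are not searched, links are undirected pairs, "occur in 75% of the models" is read as "at least 75%", and all names and the toy data are hypothetical.

```python
import random
from itertools import combinations

def corr(xs, ys):
    # Pearson correlation; returns 0.0 when either variable is constant.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def score(model, rows, names, threshold=0.5):
    # STAND-IN evaluation: count variable pairs for which "linked in the
    # model" agrees with "strongly correlated in the data".
    agree = 0
    for a, b in combinations(range(len(names)), 2):
        r = corr([row[a] for row in rows], [row[b] for row in rows])
        linked = (names[a], names[b]) in model
        agree += linked == (abs(r) > threshold)
    return agree

def greedy_revise(initial, rows, names, steps):
    # Links are (name, name) tuples ordered as in `names`.
    model = frozenset(initial)
    pairs = list(combinations(names, 2))
    for _ in range(steps):
        # Each neighbor toggles one link: add if absent, delete if present.
        neighbors = [model ^ {p} for p in pairs]
        best = max(neighbors, key=lambda m: score(m, rows, names))
        if score(best, rows, names) <= score(model, rows, names):
            break  # no neighbor improves the match to the data
        model = best
    return model

def bootstrap_revise(initial, rows, names, steps=3, runs=20,
                     keep=0.75, seed=0):
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        revised = greedy_revise(initial, sample, names, steps)
        for change in revised ^ frozenset(initial):
            counts[change] = counts.get(change, 0) + 1
    # Keep only changes proposed in at least `keep` of the bootstrap runs.
    stable = {c for c, k in counts.items() if k >= keep * runs}
    return frozenset(initial) ^ stable

# Toy data: Y tracks X exactly, Z alternates independently of both.
names = ["X", "Y", "Z"]
rows = [(i, 2 * i, 1 if i % 2 == 0 else -1) for i in range(40)]
print(sorted(bootstrap_revise({("X", "Z")}, rows, names)))
```

Counting changes across bootstrap samples, rather than accepting the result of a single run, is what filters out revisions that depend on a few unusual observations.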

Greedy Search Through a Space of Models

[Figure: search tree rooted at the initial model, with candidate revisions 1.1 to 1.4, 2.1 to 2.4, and 3.1 to 3.4 at successive levels; the example models carry + and − signs on their links.]

Changes to the model improve its match to the expression data.

A Revised Model of Photosynthesis Regulation

[Figure: the regulatory model after revision, drawn as a graph with signed (+/−) links, two of which are marked ×. Nodes: Light, DFR, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, cpcB.]

Similar changes adapt the model to expression data from mutants.

Future Work on Biological Modeling

• Add more knowledge about biochemical pathways and use it to interpret other microarray data (e.g., rat metabolism, cancer).

• Introduce taxonomic knowledge to limit the search process and improve final models.

• Expand the modeling formalism to support biological mechanisms in addition to abstract processes.

• Implement an interactive system that lets scientists direct high-level search for improved biological process models.

Concluding Remarks

In summary, unlike work in the data mining paradigm, our research on computational discovery:

• attempts to move beyond description and prediction to both explanation and understanding;

• uses domain knowledge to initialize search and to characterize differences from the revised model;

• presents the new knowledge in some communicable notation that is familiar to domain experts.

This approach seems especially appropriate for manipulating and understanding complex scientific and engineering data.

In Memoriam

Earlier this year, computational scientific discovery lost two of its founding fathers:

• Herbert A. Simon (1916 – 2001)

• Jan M. Zytkow (1945 – 2001)

Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings.

Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics.

Herb Simon and Jan Zytkow were excellent role models that we should aim to emulate.

A Closing Quotation

We would like to imagine that the great discoverers, the scientists whose behavior we are trying to understand, would be pleased with this interpretation of their activity as normal (albeit high-quality) human thinking. . .

But science is concerned with the way the world is, not with how we would like it to be. So we must continue to try new experiments, to be guided by new evidence, in a heuristic search that is never finished but always fascinating.

Herbert A. Simon, Envoi to Scientific Discovery, 1987.

Visualizing Errors in the Model

We can easily plot an improved model’s errors in spatial terms.

Such displays can help suggest causes for prediction errors and thus ways to further improve the model.

Related Research on Discovery

Our approach to computational scientific discovery borrows ideas from earlier work on:

• equation discovery (Langley et al., 1983; Zytkow et al., 1990; Washio & Motoda, 1998; Todorovski & Dzeroski, 1997);

• revision of qualitative models (Ourston & Mooney, 1990; Towell, 1991);

• revision of quantitative models (Glymour et al., 1987; Chown & Dietterich, 2000).

However, our work combines these ideas in novel ways to produce a discovery system with new functionality.