
Causal Models for Scientific Discovery: Research Challenges and Opportunities

David Jensen

College of Information and Computer Sciences

Computational Social Science Institute; Center for Data Science

University of Massachusetts Amherst

Symposium on Accelerating Science, 18 November 2016

Sources: The Guardian, July 2005; Wallace Kirkland, for Time

Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)

Main points

• Representing and reasoning about causality is central to science and scientific discovery.

• Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.

• Several emerging opportunities and challenges exist:
  • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
  • Critique: Inferring errors in modeling assumptions or problem construction
  • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling

Causality is central to science

Explanation ⇒ Causality

• Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.

• Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).

• “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”

Control & design ⇒ Causality

Sources: Wikipedia (pile)

Models

• Because of this, “models” in most scientific fields have causal implications (they are used to infer how a system would behave under intervention).

• In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.

• This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.
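The gap between associational and causal semantics can be made concrete with a toy model. The sketch below is illustrative only and is not taken from the talk: the three binary variables, the confounder `Z`, and all probabilities are assumed values. It computes P(Y=1 | X=1) (what an associational model estimates) and P(Y=1 | do(X=1)) (what a causal model must answer) by exact enumeration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical system (an assumption for illustration): Z confounds X and Y,
# and X also has a direct effect on Y.
P_Z1 = F(1, 2)                            # P(Z=1)
P_X1_GIVEN_Z = {0: F(1, 5), 1: F(4, 5)}   # P(X=1 | Z=z)

def p_y1(x, z):
    """P(Y=1 | X=x, Z=z) for the illustrative model."""
    return F(1, 10) + F(1, 5) * x + F(3, 5) * z

def joint(do_x=None):
    """Joint distribution over (z, x, y). If do_x is set, X's own mechanism
    is replaced by the constant do_x (an intervention)."""
    dist = {}
    for z, x, y in product((0, 1), repeat=3):
        pz = P_Z1 if z else 1 - P_Z1
        if do_x is None:
            px = P_X1_GIVEN_Z[z] if x else 1 - P_X1_GIVEN_Z[z]
        else:
            px = F(1) if x == do_x else F(0)
        py = p_y1(x, z) if y else 1 - p_y1(x, z)
        dist[(z, x, y)] = pz * px * py
    return dist

def prob_y1_given_x(dist, x):
    """P(Y=1 | X=x) computed from a joint distribution."""
    num = sum(p for (_, xx, y), p in dist.items() if xx == x and y == 1)
    den = sum(p for (_, xx, _), p in dist.items() if xx == x)
    return num / den

observational = prob_y1_given_x(joint(), 1)           # conditioning on X=1
interventional = prob_y1_given_x(joint(do_x=1), 1)    # setting X=1
print(observational, interventional)
```

With these assumed numbers the two quantities differ (39/50 versus 3/5), because conditioning on X also carries information about Z while intervening on X does not.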

Progress in causal modeling

• An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.

• The theory uses directed graphical models to represent causal dependence among variables.

• That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).

(Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)

Key concepts

• Only statistical dependence is directly observable in data. Causal dependence is not observable.

• Statistical dependence underdetermines causal dependence (“correlation is not causation”).

• The observable statistical consequences of a given causal model can be inferred from structure (d-separation).

• Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).

• However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
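The concepts above can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than anything from the talk: it tests d-separation with the standard moralized-ancestral-graph criterion, and the three example graphs are hypothetical. It then lists the conditional independencies implied by a chain, a fork, and a collider over the same three variables.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`
    (a dict mapping each child to a list of its parents)."""
    keep = ancestors(parents, {x, y, *given})
    # Moralize the ancestral subgraph: link every node to its parents,
    # and link parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pa, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Look for an undirected path from x to y that avoids the conditioning set.
    blocked, seen, stack = set(given), {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nb in adj[node] - blocked - seen:
            seen.add(nb)
            stack.append(nb)
    return True

def independencies(parents, nodes):
    """All (x, y, conditioning set) triples implied by d-separation."""
    found = set()
    for x, y in combinations(sorted(nodes), 2):
        rest = [n for n in nodes if n not in (x, y)]
        for k in range(len(rest) + 1):
            for given in combinations(rest, k):
                if d_separated(parents, x, y, given):
                    found.add((x, y, given))
    return found

# Three hypothetical DAGs over A, B, C, written as child -> list of parents.
chain = {"B": ["A"], "C": ["B"]}      # A -> B -> C
fork = {"A": ["B"], "C": ["B"]}       # A <- B -> C
collider = {"B": ["A", "C"]}          # A -> B <- C

for name, dag in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, sorted(independencies(dag, "ABC")))
```

The chain and the fork imply the same single independence (A and C given B), so observational data alone cannot distinguish them: they are Markov equivalent. The collider implies a different one (A and C marginally independent), which is why collider patterns can help orient edges.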

Expressiveness

Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda

[Figure labels: Causal Analysis; Automated Discovery; Relational, Temporal and Spatial Models]

• Manual Scientific Practice: Rarely searches large spaces of formally represented models

• Machine Learning: Rarely analyzes causal dependence

• Causal Discovery: Rarely discovers relational, temporal, or spatial models

Causal models of independent outcomes

[Figure: a causal process producing outcome variables A, B, ..., Z; example causal graph over nodes A through J]

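Key assumption of simple CGMs

[Figure: a causal process producing outcome variables A, B, ..., Z]

[Figure: a causal process producing multiple dependent outcomes; example causal graph over nodes A through K, with node K shown twice and annotations "x", "x", and "?"]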

Causal models of dependent outcomes

(Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)

[Figure: example causal graph over nodes A through T, with two annotations "x"]

(Maier, Marazopoulou, and Jensen 2013)

Causal models of general processes

Causal process, written as a probabilistic program:

    bool c1, c2;
    int count = 0;
    c1 = Bernoulli(0.5);
    if (c1 == true) then
        count = count + 1;
    c2 = Bernoulli(0.5);
    if (c2 == true) then
        count = count + 1;
    observe(c1 == true || c2 == true);
    return(count);
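To make the semantics of the program above concrete, here is a minimal Python sketch that computes the distribution it returns by exact enumeration. It assumes, as is usual for such programs, that `observe` discards every run in which its condition is false; the function name `posterior_count` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def posterior_count():
    """Exact distribution of `count`, conditioned on the observation c1 or c2."""
    weights = {}
    for c1, c2 in product((False, True), repeat=2):
        # Two independent Bernoulli(0.5) draws.
        prior = Fraction(1, 2) * Fraction(1, 2)
        # observe(c1 || c2): runs that violate the condition are rejected.
        if not (c1 or c2):
            continue
        count = int(c1) + int(c2)
        weights[count] = weights.get(count, Fraction(0)) + prior
    # Renormalize over the runs that survived the observation.
    total = sum(weights.values())
    return {count: w / total for count, w in weights.items()}

posterior = posterior_count()
print(posterior)
```

Under that assumption the program returns 1 with probability 2/3 and 2 with probability 1/3; the outcome 0 is ruled out by the observation.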

Critique

“[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named … Criticism.”

(George Box, emphasis added)

Example assumptions

• Faithfulness
• Causal Markov assumption
• Definitions of variables, entities, relationships, etc.
• Measurement process
• Temporal granularity of measurement
• Latent variables, entities, relationships, etc.
• Structural form of causal dependence
• Functional form of probabilistic dependence
• Compositional form
• Closed world (or form of open world)
• …and many others
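As one example of how such an assumption can fail, the sketch below shows a faithfulness violation in a small linear model. Everything in it is assumed for illustration (the structure X → Z → Y plus X → Y, the coefficients, and unit-variance Gaussian noise): when the direct effect exactly cancels the indirect one, X and Y are uncorrelated even though X causes Y.

```python
import random

def simulate(a, b, c, n=20000, seed=0):
    """Draw n samples from the hypothetical linear model
    X = e_x, Z = b*X + e_z, Y = a*X + c*Z + e_y (independent N(0, 1) noise)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = b * x + rng.gauss(0, 1)
        y = a * x + c * z + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys

def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Direct effect a = -1 cancels the indirect effect b * c = 0.5 * 2 = 1.
a, b, c = -1.0, 0.5, 2.0
analytic_cov = a + b * c                               # Cov(X, Y) when Var(X) = 1
unfaithful_cov = sample_cov(*simulate(a, b, c))        # close to 0 despite X -> Y
faithful_cov = sample_cov(*simulate(1.0, b, c))        # no cancellation: about 2
print(analytic_cov, unfaithful_cov, faithful_cov)
```

A method that relies on faithfulness would read the near-zero covariance in the first case as evidence of no causal dependence, which is one reason to critique the assumption rather than take it for granted.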

Empirical evaluation

Goals for Empirical Evaluation Approaches

• Empirical: A pre-existing system created by someone other than the researchers.

• Stochastic: Produces non-deterministic experimental results.

• Identifiable: Amenable to direct experimental investigation to estimate interventional distributions.

• Recoverable: Lacks memory or irreversible effects, which enables complete state recovery during experiments.

• Efficient: Generates large amounts of data with relatively few resources.

• Reproducible: Fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.

Simple example: Database configuration

ML for database configuration (setup)

• Assume a fixed database and DB server hardware

• Questions
  • For a given query, what is the expected performance under each set of configuration parameters?
  • For a given query, which configuration will give me the best performance?

• Data
  • Run 11,252 queries that were actually run against the Stack Exchange Data Explorer
  • Each query is run using one of many different joint values of the configuration parameters, using Postgres 9.2.2

(Garant & Jensen 2016)

[Figure: variables in the database configuration example: Indexing, Page Cost, Memory Level, Block Writes to RAM, Year Created, Join Count, Group-by Count, Block Reads from RAM, Runtime, Retrieved Row Count, Table Count, Total Row Count, Length, Total Queries by User, Block Reads from Disk, Block Hits in Cache]

CGM for database configuration

[Figure, shown in several build steps: causal graphical model over the variables Level, Block Writes to RAM, Year Created, Join Count, Group-by Count, Block Reads from RAM, Runtime, Retrieved Row Count, Table Count, Total Row Count, Length, and Block Hits in Cache; group labels: Database, Query, Processing, User]

Comparing associational and causal models

• Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002)

• Evaluate by comparing to “ground truth” (experimental results for all queries obtained using a specific joint setting of the configuration parameters).

[Figures: comparison results for three outcomes: Cache Hits, Disk Reads, Runtime]

(Garant & Jensen 2016)
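One way to score a model against experimental ground truth is to compare the outcome distribution it predicts under a given configuration with the distribution actually observed when that configuration is applied. The slides do not state which distance measure was used; the sketch below assumes total variation distance, and the binned runtimes and model predictions are made-up values for illustration.

```python
from collections import Counter

def empirical(samples):
    """Empirical distribution of a list of discrete outcomes."""
    n = len(samples)
    return {value: c / n for value, c in Counter(samples).items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# Hypothetical ground truth: binned runtimes measured under one configuration.
ground_truth = empirical(["fast"] * 6 + ["slow"] * 4)

# Hypothetical predictions of the same interventional distribution.
model_a = {"fast": 0.5, "slow": 0.5}
model_b = {"fast": 0.9, "slow": 0.1}

tv_a = total_variation(ground_truth, model_a)
tv_b = total_variation(ground_truth, model_b)
print(tv_a, tv_b)
```

A smaller distance means the predicted interventional distribution is closer to the experimental one; with these assumed numbers `model_a` is closer than `model_b`.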

Thanks

David Arbour: Recent developments in learning causal dependence from bivariate joint distributions in relational data (UAI & KDD 2016)

Dan Garant: Empirical evaluation of algorithms for learning causal models (UAI 2016)

Amanda Gentzel: Granger causality methods and empirical evaluation

Katerina Marazopoulou: Extending causal semantics to temporal models (UAI 2015; 2016)

Kaleigh Clary: Additive noise models for learning causal dependence from bivariate joint distributions

[email protected] kdl.cs.umass.edu

cs.umass.edu/~jensen/

All opinions are mine and not those of any company, agency of the US Government, or the University of Massachusetts Amherst.
