Causal Models for Scientific Discovery: Research Challenges and Opportunities
David Jensen, College of Information and Computer Sciences
Computational Social Science Institute, Center for Data Science
University of Massachusetts Amherst
Symposium on Accelerating Science, 18 November 2016
Sources: The Guardian, July 2005; Wallace Kirkland, for Time
Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)
Main points
• Representing and reasoning about causality is central to science and scientific discovery.
• Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
• Several emerging opportunities and challenges exist:
  • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
  • Critique: Inferring errors in modeling assumptions or problem construction
  • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling
Causality is central to science
Explanation ⇒ Causality
• Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.
• Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).
• “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”
Control & design ⇒ Causality
Sources: Wikipedia (pile)
Models
• Because of this, “models” in most scientific fields have causal implications (they are used to infer how a system would behave under intervention).
• In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.
• This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.
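The gap between associational and causal semantics can be made concrete with a small worked example. The sketch below is illustrative only: the three binary variables (a confounder Z, a treatment X, an outcome Y) and every probability in the tables are assumptions invented for this example, not values from the talk. It computes the associational quantity P(Y=1 | X=1) and the interventional quantity P(Y=1 | do(X=1)) exactly and shows that they differ when Z influences both X and Y.

```python
from fractions import Fraction as F

# Hypothetical model (all numbers are assumptions): Z -> X, Z -> Y, X -> Y.
P_Z = {0: F(1, 2), 1: F(1, 2)}
P_X1_GIVEN_Z = {0: F(1, 4), 1: F(3, 4)}
P_Y1_GIVEN_XZ = {(0, 0): F(1, 10), (1, 0): F(3, 10),
                 (0, 1): F(6, 10), (1, 1): F(8, 10)}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]


def p_y1_given_x(x):
    """Associational semantics: P(Y=1 | X=x), obtained by conditioning."""
    joint = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)
    marginal = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return joint / marginal


def p_y1_do_x(x):
    """Causal semantics: P(Y=1 | do(X=x)); setting X leaves P(Z) unchanged."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)


print("P(Y=1 | X=1)     =", p_y1_given_x(1))  # 27/40
print("P(Y=1 | do(X=1)) =", p_y1_do_x(1))     # 11/20
```

A model with only associational semantics answers the first question; questions about intervention need the second.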
Progress in causal modeling
• An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.
• The theory uses directed graphical models to represent causal dependence among variables.
• That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).
(Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)
Key concepts
• Only statistical dependence is directly observable in data. Causal dependence is not observable.
• Statistical dependence underdetermines causal dependence (“correlation is not causation”).
• The observable statistical consequences of a given causal model can be inferred from structure (d-separation).
• Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).
• However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
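The ideas of d-separation and Markov equivalence can be illustrated on three-variable graphs. The following is a minimal sketch, not an implementation from the cited work: it tests d-separation using the standard moralized-ancestral-graph criterion and lists every conditional independence implied by a chain, a fork, and a collider. The function names and the dictionary encoding of a DAG (child mapped to its set of parents) are choices made for this example.

```python
from itertools import combinations


def ancestors(dag, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag[n])
    return seen


def d_separated(dag, x, y, given):
    """True if x and y are d-separated by `given` in `dag` (child -> parents)."""
    keep = ancestors(dag, {x, y} | set(given))
    # Moralize the ancestral graph: link parents to children and co-parents to each other.
    nbrs = {n: set() for n in keep}
    for child in keep:
        for p in dag[child]:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(dag[child], 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # x and y are d-separated iff no path avoids the conditioning set.
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in given:
            continue
        seen.add(n)
        stack.extend(nbrs[n])
    return True


def independencies(dag):
    """All (x, y, conditioning set) triples implied by the structure."""
    nodes = sorted(dag)
    found = set()
    for x, y in combinations(nodes, 2):
        rest = [n for n in nodes if n not in (x, y)]
        for k in range(len(rest) + 1):
            for z in combinations(rest, k):
                if d_separated(dag, x, y, z):
                    found.add((x, y, z))
    return found


chain = {"A": set(), "B": {"A"}, "C": {"B"}}          # A -> B -> C
fork = {"B": set(), "A": {"B"}, "C": {"B"}}           # A <- B -> C
collider = {"A": set(), "C": set(), "B": {"A", "C"}}  # A -> B <- C

for name, dag in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, sorted(independencies(dag)))
```

The chain and the fork imply the same single independence (A and C given B), so observational data alone cannot tell them apart. The collider implies a different one (A and C marginally), which is the kind of pattern that constrains or identifies causal structure.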
Expressiveness
Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda
• Manual scientific practice: Rarely searches large spaces of formally represented models
• Machine learning: Rarely analyzes causal dependence
• Causal discovery: Rarely discovers relational, temporal, or spatial models
[Figure labels: Causal Analysis; Automated Discovery; Relational, Temporal, and Spatial Models]
Causal models of independent outcomes
[Figure: a causal process generating outcome variables A, B, ..., Z]
[Figure: causal graph over variables A through J]
Key assumption of simple CGMs
[Figure: a causal process generating outcome variables A, B, ..., Z]
[Figure: a causal process generating multiple dependent outcomes]
Causal models of independent outcomes
[Figure: causal graph over variables A through K]
Causal models of dependent outcomes
(Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)
[Figure: causal graph over variables A through T]
(Maier, Marazopoulou, and Jensen 2013)
Causal models of general processes
Causal Process
Probabilistic Program:
    bool c1, c2;
    int count = 0;
    c1 = Bernoulli(0.5);
    if (c1 == true) then
        count = count + 1;
    c2 = Bernoulli(0.5);
    if (c2 == true) then
        count = count + 1;
    observe(c1 == true || c2 == true);
    return(count);
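To show what the probabilistic program on this slide computes, here is a minimal Python sketch that evaluates it exactly by enumerating all executions. This is an illustration, not the semantics of any particular probabilistic programming system: `observe` is modeled simply as discarding executions that violate the condition and renormalizing, and the function name `posterior_count` is made up for this example.

```python
from fractions import Fraction
from itertools import product


def posterior_count(p=Fraction(1, 2)):
    """Exact distribution of `count` given observe(c1 or c2)."""
    weights = {}
    for c1, c2 in product([False, True], repeat=2):
        # Prior probability of this execution: two independent Bernoulli(p) draws.
        prior = (p if c1 else 1 - p) * (p if c2 else 1 - p)
        if not (c1 or c2):
            continue  # observe(...) rules this execution out
        count = int(c1) + int(c2)
        weights[count] = weights.get(count, 0) + prior
    total = sum(weights.values())
    return {count: w / total for count, w in weights.items()}


for count, prob in sorted(posterior_count().items()):
    print(count, prob)  # 1 2/3, then 2 1/3
```

Conditioning on the observation removes the execution in which both coins are false, so the returned count is 1 with probability 2/3 and 2 with probability 1/3.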
Critique
“[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named … Criticism.”
George Box (emphasis added)
Example assumptions
• Faithfulness
• Causal Markov assumption
• Definitions of variables, entities, relationships, etc.
• Measurement process
• Temporal granularity of measurement
• Latent variables, entities, relationships, etc.
• Structural form of causal dependence
• Functional form of probabilistic dependence
• Compositional form
• Closed world (or form of open world)
• … and many others
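One way to see why such assumptions deserve critique is to break one on purpose. The sketch below simulates a hypothetical linear system, with coefficients chosen for this example only, in which X affects Y along two paths whose effects cancel exactly. X is a cause of Y, yet the two are uncorrelated, so a method that assumes faithfulness would wrongly conclude there is no dependence.

```python
import random


def simulate(n=20000, seed=0):
    """Draw samples from a hypothetical linear model that violates faithfulness."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = 2.0 * x + rng.gauss(0, 1)              # X -> Z with coefficient 2
        y = -1.0 * x + 0.5 * z + rng.gauss(0, 1)   # direct effect -1, indirect 2 * 0.5 = +1
        xs.append(x)
        ys.append(y)
    return xs, ys


def correlation(a, b):
    """Pearson correlation computed from scratch."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


xs, ys = simulate()
print("corr(X, Y) =", round(correlation(xs, ys), 3))  # close to zero despite X -> Y
```

Exact cancellation is a contrived case, but near-cancellation in finite samples produces the same practical problem.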
Empirical evaluation
Goals for Empirical Evaluation Approaches
• Empirical: A pre-existing system created by someone other than the researchers.
• Stochastic: Produces non-deterministic experimental results.
• Identifiable: Amenable to direct experimental investigation to estimate interventional distributions.
• Recoverable: Lacks memory or irreversible effects, which enables complete state recovery during experiments.
• Efficient: Generates large amounts of data with relatively few resources.
• Reproducible: Fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.
Simple example: Database configuration
ML for database configuration (setup)
• Assume a fixed database and DB server hardware
• Questions
  • For a given query, what is the expected performance under each set of configuration parameters?
  • For a given query, which configuration will give me the best performance?
• Data
  • 11,252 queries that were actually run against the Stack Exchange Data Explorer
  • Each query was run using one of many different joint values of the configuration parameters, on Postgres 9.2.2
(Garant & Jensen 2016)
[Figure: variables in the database configuration example: Indexing, Page Cost, Memory Level, Block Writes to RAM, Year Created, Join Count, Group-by Count, Block Reads from RAM, Runtime, Retrieved Row Count, Table Count, Total Row Count, Length, Total Queries by User, Block Reads from Disk, Block Hits in Cache]
CGM for database configuration
[Figure: causal graphical model over the same variables, with the labels Database, Query, Processing, and User]
Comparing associational and causal models
• Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002)
• Evaluate by comparing to “ground truth” (experimental results for all queries obtained using a specific joint setting of the configuration parameters).
[Figures: results for Cache Hits, Disk Reads, and Runtime]
(Garant & Jensen 2016)
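The evaluation idea, scoring estimates against experimental ground truth, can be sketched on a toy system. This example does not reproduce the random forest, GES, or the Garant & Jensen data. The `runtime` function, the single `indexed` setting, the query `complexity` variable, and all coefficients are hypothetical. Ground truth comes from running every simulated query under both settings. An associational estimate and a confounder-adjusted estimate, both computed from a biased observational log, are then compared with it.

```python
import random


def runtime(complexity, indexed, rng):
    """Hypothetical system: indexing saves 2 time units; complex queries run slower."""
    return 10.0 * complexity - 2.0 * indexed + rng.gauss(0, 1)


def mean(values):
    return sum(values) / len(values)


def evaluate(n=20000, bins=10, seed=0):
    rng = random.Random(seed)
    complexity = [rng.random() for _ in range(n)]

    # Ground truth by experiment: run every query under both settings.
    truth = mean([runtime(c, 1, rng) - runtime(c, 0, rng) for c in complexity])

    # Observational log: complex queries are more likely to run with indexing on.
    log = []
    for c in complexity:
        t = 1 if rng.random() < c else 0
        log.append((c, t, runtime(c, t, rng)))

    # Associational estimate: raw difference in mean runtime between settings.
    naive = (mean([y for _, t, y in log if t == 1])
             - mean([y for _, t, y in log if t == 0]))

    # Adjusted estimate: difference within complexity bins, weighted by bin size.
    adjusted = 0.0
    for b in range(bins):
        rows = [(t, y) for c, t, y in log if int(c * bins) == b]
        on = [y for t, y in rows if t == 1]
        off = [y for t, y in rows if t == 0]
        if on and off:
            adjusted += (mean(on) - mean(off)) * len(rows) / n
    return truth, naive, adjusted


truth, naive, adjusted = evaluate()
print("experimental effect of indexing:", round(truth, 2))
print("associational estimate:         ", round(naive, 2))
print("adjusted estimate:              ", round(adjusted, 2))
```

In this toy setting the associational estimate even gets the sign of the effect wrong, while the adjusted estimate lands near the experimental value. The binning leaves a small residual bias, which is one reason this is only a sketch.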
Thanks
David Arbour: Recent developments in learning causal dependence from bivariate joint distributions in relational data (UAI & KDD 2016)
Dan Garant: Empirical evaluation of algorithms for learning causal models (UAI 2016)
Amanda Gentzel: Granger causality methods and empirical evaluation
Katerina Marazopoulou: Extending causal semantics to temporal models (UAI 2015; 2016)
Kaleigh Clary: Additive noise models for learning causal dependence from bivariate joint distributions
[email protected] kdl.cs.umass.edu
cs.umass.edu/~jensen/
All opinions are mine and not those of any company, agency of the US Government, or the University of Massachusetts Amherst.