Causal Models for Scientific Discovery Research Challenges and Opportunities David Jensen College of Information and Computer Sciences Computational Social Science Institute Center for Data Science University of Massachusetts Amherst Symposium on Accelerating Science 18 November 2016


  • Causal Models for Scientific Discovery: Research Challenges and Opportunities

    David Jensen
    College of Information and Computer Sciences
    Computational Social Science Institute
    Center for Data Science
    University of Massachusetts Amherst

    Symposium on Accelerating Science
    18 November 2016

  • Sources: The Guardian, July 2005; Wallace Kirkland, for Time

  • Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)

  • Main points

    • Representing and reasoning about causality is central to science and scientific discovery.

    • Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.

    • Several emerging opportunities and challenges exist:

      • Expressiveness: combining data and knowledge from multiple sources to understand complex phenomena

      • Critique: inferring errors in modeling assumptions or problem construction

      • Empirical evaluation: providing realistic empirical tests of methods for causal modeling

  • Causality is central to science

  • Explanation ⇒ Causality

    • Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.

    • Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).

    • “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”

  • Control & design ⇒ Causality

    Sources: Wikipedia (pile)

  • Models

    • Because of this, “models” in most scientific fields have causal implications (they are used to infer how a system would behave under intervention).

    • In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.

    • This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.

  • Progress in causal modeling

    • An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.

    • The theory uses directed graphical models to represent causal dependence among variables.

    • That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).

    (Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)
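As a rough illustration of how a directed graphical model separates associational from causal semantics, the sketch below computes P(Y=1 | X=1) and P(Y=1 | do(X=1)) exactly for a three-variable CGM. The graph (Z → X, Z → Y, X → Y) and every probability in it are hypothetical values chosen for illustration; they do not come from the talk.

```python
from fractions import Fraction
from itertools import product

# Hypothetical CGM over binary variables: Z -> X, Z -> Y, X -> Y.
# All numbers are illustrative assumptions.
P_Z = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_X1_GIVEN_Z = {0: Fraction(1, 5), 1: Fraction(4, 5)}       # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): Fraction(1, 10), (1, 0): Fraction(3, 10),
                 (0, 1): Fraction(5, 10), (1, 1): Fraction(7, 10)}  # P(Y=1 | X=x, Z=z)

def joint(z, x, y, do_x=None):
    """Joint probability of (z, x, y).

    With do_x set, the factor P(X | Z) is dropped (the edge Z -> X is cut)
    and X is forced to the value do_x.
    """
    p = P_Z[z]
    if do_x is None:
        px1 = P_X1_GIVEN_Z[z]
        p *= px1 if x == 1 else 1 - px1
    elif x != do_x:
        return Fraction(0)
    py1 = P_Y1_GIVEN_XZ[(x, z)]
    return p * (py1 if y == 1 else 1 - py1)

def prob_y1(x, do=False):
    """P(Y=1 | X=x) if do is False, else P(Y=1 | do(X=x))."""
    do_x = x if do else None
    num = sum(joint(z, x, 1, do_x) for z in (0, 1))
    den = sum(joint(z, x, y, do_x) for z, y in product((0, 1), repeat=2))
    return num / den

observational = prob_y1(1)            # conditioning on X=1
interventional = prob_y1(1, do=True)  # setting X=1 by intervention
print(observational, interventional)
```

In this toy model the two quantities differ because Z influences both X and Y. That gap is what an associational model cannot see.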

  • Key concepts

    • Only statistical dependence is directly observable in data. Causal dependence is not observable.

    • Statistical dependence underdetermines causal dependence (“correlation is not causation”).

    • The observable statistical consequences of a given causal model can be inferred from structure (d-separation).

    • Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).

    • However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
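The two structural ideas above can be sketched in a few lines. This minimal version tests d-separation with the standard moralized-ancestral-graph construction. It tests Markov equivalence with the usual criterion of same skeleton and same v-structures. The dictionary representation (child mapped to its set of parents) and the three example graphs are assumptions made for illustration.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """Return `nodes` plus all their ancestors. `dag` maps child -> set of parents."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, ()))
    return seen

def d_separated(dag, x, y, given=()):
    """True if x and y are d-separated by `given` in `dag`."""
    keep = ancestors(dag, {x, y, *given})
    adj = {n: set() for n in keep}
    for child in keep:
        parents = dag.get(child, set())
        for p in parents:                        # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents, 2):    # "marry" parents of a common child
            adj[p].add(q)
            adj[q].add(p)
    blocked, seen, stack = set(given), set(), [x]
    while stack:                                 # search that avoids conditioned nodes
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return True

def skeleton(dag):
    return {frozenset((c, p)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    skel = skeleton(dag)
    return {(frozenset((p, q)), c)
            for c, ps in dag.items()
            for p, q in combinations(sorted(ps), 2)
            if frozenset((p, q)) not in skel}

def markov_equivalent(g, h):
    """Same skeleton and same v-structures."""
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

chain = {"B": {"A"}, "C": {"B"}}       # A -> B -> C
fork = {"A": {"B"}, "C": {"B"}}        # A <- B -> C
collider = {"B": {"A", "C"}}           # A -> B <- C
```

The chain and the fork imply the same independence (A and C are independent given B), so data alone cannot tell them apart. The collider implies a different pattern and is distinguishable.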


  • Expressiveness

  • Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda

  • [Figure labels: Causal Analysis; Automated Discovery; Relational, Temporal, and Spatial Models]

    • Manual scientific practice: rarely searches large spaces of formally represented models

    • Machine learning: rarely analyzes causal dependence

    • Causal discovery: rarely discovers relational, temporal, or spatial models

  • Causal models of independent outcomes

    [Figure: a causal process generating outcome variables A, B, ..., Z]

  • Causal models of independent outcomes

    [Figure: causal graph over the variables A through J]

  • Key assumption of simple CGMs

    [Figure: a causal process generating outcome variables A, B, ..., Z]

  • Key assumption of simple CGMs

    [Figure: a causal process generating multiple dependent outcomes; figure marks: x, x, ?]


  • Causal models of dependent outcomes

    (Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)

    [Figure: causal graph over the variables A through T]

  • (Maier, Marazopoulou, and Jensen 2013)

  • Causal models of general processes

    Causal process (probabilistic program):

    bool c1, c2;
    int count = 0;
    c1 = Bernoulli(0.5);
    if (c1 == true) then
      count = count + 1;
    c2 = Bernoulli(0.5);
    if (c2 == true) then
      count = count + 1;
    observe(c1 == true || c2 == true);
    return(count);
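To make the semantics of the program above concrete, here is a minimal Python sketch that computes the distribution of the returned `count` by exact enumeration. It treats `observe(...)` as discarding every execution in which the condition is false and renormalizing over the rest. That reading of `observe`, and the function name `posterior_count`, are assumptions for illustration rather than anything specified on the slide.

```python
from fractions import Fraction
from itertools import product

def posterior_count(p=Fraction(1, 2)):
    """Exact distribution of `count` given observe(c1 or c2)."""
    weights = {}
    for c1, c2 in product([False, True], repeat=2):
        if not (c1 or c2):
            continue                      # execution rejected by observe(...)
        prob = (p if c1 else 1 - p) * (p if c2 else 1 - p)
        count = int(c1) + int(c2)
        weights[count] = weights.get(count, 0) + prob
    total = sum(weights.values())         # probability that the observation holds
    return {k: v / total for k, v in weights.items()}

posterior = posterior_count()
print(posterior)  # count = 1 with probability 2/3, count = 2 with probability 1/3
```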

  • Critique

  • “[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named … Criticism.”

    George Box (emphasis added)

  • Example assumptions

    • Faithfulness

    • Causal Markov assumption

    • Definitions of variables, entities, relationships, etc.

    • Measurement process

    • Temporal granularity of measurement

    • Latent variables, entities, relationships, etc.

    • Structural form of causal dependence

    • Functional form of probabilistic dependence

    • Compositional form

    • Closed world (or form of open world)

    • … and many others

  • Empirical evaluation

  • Goals for Empirical Evaluation Approaches

    • Empirical: a pre-existing system created by someone other than the researchers.

    • Stochastic: produces non-deterministic experimental results.

    • Identifiable: amenable to direct experimental investigation to estimate interventional distributions.

    • Recoverable: lacks memory or irreversible effects, which enables complete state recovery during experiments.

    • Efficient: generates large amounts of data with relatively few resources.

    • Reproducible: fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.

  • Simple example: Database configuration

  • ML for database configuration (setup)

    • Assume a fixed database and DB server hardware

    • Questions

      • For a given query, what is the expected performance under each set of configuration parameters?

      • For a given query, which configuration will give me the best performance?

    • Data

      • Run 11,252 queries that were actually run against the Stack Exchange Data Explorer

      • Each query is run using one of many different joint values of the configuration parameters, using Postgres 9.2.2

    (Garant & Jensen 2016)

  • CGM for database configuration

    [Figure: causal graphical model over the variables Indexing, Page Cost, Memory Level, Year Created, Join Count, Group-by Count, Table Count, Total Row Count, Retrieved Row Count, Length, Total Queries by User, Block Reads from Disk, Block Reads from RAM, Block Writes to RAM, Block Hits in Cache, and Runtime. Later builds of the slide add the labels Database, Query, Processing, and User.]

  • Comparing associational and causal models

    • Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002)

    • Evaluate by comparing to “ground truth” (experimental results for all queries obtained using a specific joint setting of the configuration parameters).

    [Figures: comparison plots for Cache Hits, Disk Reads, and Runtime]

    (Garant & Jensen 2016)
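The evaluation idea can be sketched on a toy system: estimate the effect of a configuration change from observational logs, then score the estimate against results obtained by actually running every query under each setting. Everything below is a hypothetical stand-in. The simulated `runtime` function, the single confounder, and the two simple estimators take the place of the Postgres data, the random forest, and the GES-learned CGM. None of the numbers are results from the study.

```python
import random
from statistics import mean

def runtime(complex_query, indexed, rng):
    """Hypothetical system: runtime (ms) of one query execution."""
    base = 100.0 if complex_query else 20.0
    return base - (10.0 if indexed else 0.0) + rng.gauss(0.0, 1.0)

rng = random.Random(0)
queries = [rng.random() < 0.5 for _ in range(20000)]   # True = complex query

# Observational log: indexing is mostly enabled for complex queries,
# so query complexity confounds indexing and runtime.
log = []
for z in queries:
    x = rng.random() < (0.9 if z else 0.1)
    log.append((z, x, runtime(z, x, rng)))

# "Ground truth": run every query under both settings (an experiment).
truth = mean(runtime(z, True, rng) - runtime(z, False, rng) for z in queries)

# Associational estimate: difference of conditional means in the log.
assoc = mean(y for _, x, y in log if x) - mean(y for _, x, y in log if not x)

def adjusted_effect(rows):
    """Causal estimate: adjust for query complexity, stratum by stratum."""
    effect = 0.0
    for z_val in (False, True):
        stratum = [(x, y) for z, x, y in rows if z == z_val]
        diff = (mean(y for x, y in stratum if x)
                - mean(y for x, y in stratum if not x))
        effect += len(stratum) / len(rows) * diff
    return effect

causal = adjusted_effect(log)
print(f"truth={truth:.1f}  associational={assoc:.1f}  causal={causal:.1f}")
```

In this simulation the associational estimate even has the wrong sign, while the adjusted estimate lands close to the experimental value. Only a system that can be intervened on directly makes this kind of check possible.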

  • Main points

    • Representing and reasoning about causality is central to science and scientific discovery.

    • Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.

    • Several emerging opportunities and challenges exist:

      • Expressiveness: combining data and knowledge from multiple sources to understand complex phenomena

      • Critique: inferring errors in modeling assumptions or problem construction

      • Empirical evaluation: providing realistic empirical tests of methods for causal modeling

  • Thanks

    David Arbour: Recent developments in learning causal dependence from bivariate joint distributions in relational data (UAI & KDD 2016)

    Dan Garant: Empirical evaluation of algorithms for learning causal models (UAI 2016)

    Amanda Gentzel: Granger causality methods and empirical evaluation

    Katerina Marazopoulou: Extending causal semantics to temporal models (UAI 2015; 2016)

    Kaleigh Clary: Additive noise models for learning causal dependence from bivariate joint distributions

  • kdl.cs.umass.edu

    cs.umass.edu/~jensen/

    All opinions are mine and not those of any company, agency of the US Government, or the University of Massachusetts Amherst.