Course files
http://www.andrew.cmu.edu/~ddanks/NASSLLI/
Principles Underlying Causal Search Algorithms
Fundamental problem
As we have all heard many times…
“Correlation is not causation!”
Fundamental problem
Why is this slogan correct?
Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables
Hypotheses about association or correlation make no such claims
Correlation or probabilistic dependence can be produced in many ways
Fundamental problem
Some of the possible reasons why X and Y might be associated:
Sheer chance
X causes Y
Y causes X
Some third variable Z influences X and Y
The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
Fundamental problem
Fundamental problem of causal search: for any particular set of data, there are often many different causal structures that could have produced that data
The Causation → Association map is many → one
Fundamental problem
Okay, so what can we do about this?
Use the data to figure out as much as possible (though it usually won't be everything); this requires developing search procedures
And then try to narrow the possibilities:
Use other knowledge (e.g., time order, interventions)
Get better / different data (e.g., run an experiment)
Always remember…
Even if we cannot discover the whole truth,
we might be able to find some of the truth!
Markov equivalence
Formally, we say that:
Two causal graphs are members of the same Markov equivalence class iff they imply exactly the same (un)conditional independence relations among the observed variables
(by the Markov and Faithfulness assumptions)
Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies (see the sketch below)
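To make that criterion concrete, here is a minimal sketch of a d-separation test using the moralization formulation: X and Y are d-separated by Z iff they are disconnected once we take the ancestral subgraph of X ∪ Y ∪ Z, moralize it ("marry" co-parents and drop edge directions), and delete Z. The parent-dictionary representation and function names are our own illustrative choices, not from the course materials.

```python
# A d-separation test via the moralization criterion. Graphs are dicts
# mapping each node to its set of parents (a DAG).

def ancestors(parents, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, xs, ys, zs):
    """True iff every x in xs is d-separated from every y in ys given zs."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: undirect edges, "marry" co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, ()) if p in keep]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Delete zs, then check whether xs can still reach ys.
    reach, stack = set(xs), list(xs)
    while stack:
        for n in adj[stack.pop()]:
            if n not in reach and n not in zs:
                reach.add(n)
                stack.append(n)
    return reach.isdisjoint(ys)

# Chain X -> Z -> Y: dependent unconditionally, independent given Z.
g = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
print(d_separated(g, {"X"}, {"Y"}, set()))   # False
print(d_separated(g, {"X"}, {"Y"}, {"Z"}))   # True
```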
Markov equivalence
The "Fundamental Problem of Causal Inference" can be restated as: for some sets of independence relations, the Markov equivalence class is not a singleton
Markov equivalence classes give a precise characterization of what can be inferred from independencies alone
Markov equivalence
Examples:
X ⊥ {Y, Z} ⇒ [diagram: the graphs with X disconnected from the Y — Z edge]
X ⊥ Y | Z ⇒ [diagram: the three equivalent graphs X → Z → Y, X ← Z ← Y, and X ← Z → Y]
X ⊥ Y ⇒ [diagram: the collider X → Z ← Y, alone in its equivalence class]
Markov equivalence
Two more examples:
Are these graphs Markov equivalent? [diagram: a pair of graphs over X, Y, Z]
Are these two graphs? [diagram: a second pair of graphs over X, Y, Z]
Shared structure
What is shared by all of the graphs in a Markov equivalence class?
Same "skeleton": they all have the same adjacency relations
Same "unshielded colliders": X → Y ← Z with no edge between X and Z
Sometimes, other edges have the same direction
In these last two cases, we can infer that the true graph contains the shared directed edges
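The first two shared features are in fact the classical graphical test for Markov equivalence (due to Verma & Pearl): two DAGs are Markov equivalent iff they have the same skeleton and the same unshielded colliders. A minimal sketch of that test, again with graphs as parent dictionaries (an illustrative representation of our own):

```python
def skeleton(parents):
    """Adjacencies, ignoring direction, as a set of frozen pairs."""
    return {frozenset((p, v)) for v in parents for p in parents[v]}

def unshielded_colliders(parents):
    """Triples (x, z, y) with x -> z <- y and x, y non-adjacent."""
    skel = skeleton(parents)
    found = set()
    for z, ps in parents.items():
        ordered = sorted(ps)
        for i, x in enumerate(ordered):
            for y in ordered[i + 1:]:
                if frozenset((x, y)) not in skel:
                    found.add((x, z, y))
    return found

def markov_equivalent(g1, g2):
    """Verma & Pearl: same skeleton and same unshielded colliders."""
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

chain = {"X": set(), "Z": {"X"}, "Y": {"Z"}}       # X -> Z -> Y
fork = {"Z": set(), "X": {"Z"}, "Y": {"Z"}}        # X <- Z -> Y
coll = {"X": set(), "Y": set(), "Z": {"X", "Y"}}   # X -> Z <- Y
print(markov_equivalent(chain, fork))   # True
print(markov_equivalent(chain, coll))   # False
```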
Shared structure as patterns
Since every Markov equivalent graph has the same adjacencies, we can represent the whole class using a pattern
A pattern is itself a graph, but its edges represent edges in other graphs
Shared structure as patterns
A pattern can have directed and undirected edges
It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) a new unshielded collider
Let's try some examples…
Shared structure as patterns
Pattern: Nitrogen — PlantGrowth — Bees
Represents:
Nitrogen → PlantGrowth → Bees
Nitrogen ← PlantGrowth → Bees
Nitrogen ← PlantGrowth ← Bees
Shared structure as patterns
Pattern: Nitrogen → PlantGrowth ← Bees
Represents only: Nitrogen → PlantGrowth ← Bees
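Here is a sketch of how a pattern could be expanded into the graphs it represents, reusing unshielded_colliders from the sketch above; the function names and the edge-list representation are hypothetical choices of ours:

```python
from itertools import product

def acyclic(parents):
    """True iff the graph has no directed cycle (strip parentless nodes)."""
    g = {v: set(ps) for v, ps in parents.items()}
    while g:
        roots = [v for v, ps in g.items() if not ps]
        if not roots:
            return False
        for v in roots:
            del g[v]
        for ps in g.values():
            ps.difference_update(roots)
    return True

def parents_of(nodes, arcs):
    """Build a parent dict from directed edges (a, b), meaning a -> b."""
    g = {v: set() for v in nodes}
    for a, b in arcs:
        g[b].add(a)
    return g

def pattern_members(nodes, directed, undirected):
    """Yield every orientation of the undirected edges that creates
    neither a cycle nor a new unshielded collider."""
    base = unshielded_colliders(parents_of(nodes, directed))
    for flips in product((False, True), repeat=len(undirected)):
        arcs = set(directed)
        arcs |= {(b, a) if f else (a, b)
                 for (a, b), f in zip(undirected, flips)}
        g = parents_of(nodes, arcs)
        if acyclic(g) and unshielded_colliders(g) == base:
            yield sorted(arcs)

# The first pattern above: Nitrogen — PlantGrowth — Bees.
und = [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")]
for arcs in pattern_members(["Nitrogen", "PlantGrowth", "Bees"], set(), und):
    print(arcs)   # exactly the three non-collider orientations
```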
Formal problem of search
Given some dataset D, find the Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data
More colloquially: find the causal graphs that could have produced data like this
Hard to find a pattern
“Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”
Big problem: the number of independencies to test grows exponentially in the number of variables:
2 variables ⇒ 1 test
3 variables ⇒ 6 tests
4 variables ⇒ 24 tests
5 variables ⇒ 80 tests
6 variables ⇒ 240 tests
and so on…
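These counts are C(n, 2) · 2^(n-2) for n variables: choose a pair, then any subset of the remaining n - 2 variables as a conditioning set. A quick check of the numbers above:

```python
from math import comb

# C(n, 2) pairs, times 2^(n - 2) possible conditioning sets per pair.
def n_tests(n):
    return comb(n, 2) * 2 ** (n - 2)

print([(n, n_tests(n)) for n in range(2, 7)])
# [(2, 1), (3, 6), (4, 24), (5, 80), (6, 240)]
```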
General features of causal search
Huge model and parameter spaces
Even when we (necessarily) use prior information about the family of probability distributions
Relevant statistics must be rapidly computed
But substantive knowledge about the domain may restrict the space of alternative models:
Time order of variables
Required cause/effect relationships
Existence or non-existence of latent variables
Three schemata for search
Bayesian / score-based: find the graph(s) with highest P(graph | data)
Constraint-based: find the graph(s) that predict exactly the observed associations and independencies
Combined: get "close" with constraint-based, and then find the best graph using score-based
Bayesian / score-based
Informally:
Give each model an initial score using "prior beliefs"
Update each score based on the likelihood of the data if the model were true
Output the highest-scoring model
Formally:
Specify P(M, v) for all models M and possible parameter values v of M
For any data D, P(D | M, v) can easily be calculated
P(M | D) ∝ ∫ P(D | M, v) P(M, v) dv
Bayesian / score-based
In practice, this strategy is completely computationally intractable: there are too many graphs to check them all
So, we use a greedy search strategy (see the sketch below):
Start with an initial graph
Iteratively compare the current graph's score (∝ posterior probability) with that of each 1- or 2-step modification of that graph
By edge addition, deletion, or reversal
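A minimal sketch of such a greedy loop over single-edge moves (addition, deletion, reversal). The score function is left abstract, and the names and the toy score in the demo are illustrative assumptions, not any particular published algorithm:

```python
from itertools import permutations

def has_cycle(parents):
    """DFS cycle check on a parent-dictionary graph."""
    grey, black = set(), set()
    def visit(v):
        if v in black: return False
        if v in grey: return True
        grey.add(v)
        if any(visit(p) for p in parents[v]): return True
        black.add(v)
        return False
    return any(visit(v) for v in parents)

def neighbors(parents):
    """All graphs one edge addition, deletion, or reversal away."""
    for a, b in permutations(parents, 2):
        g = {v: set(ps) for v, ps in parents.items()}
        if a in g[b]:                       # edge a -> b exists
            g[b].discard(a)                 # deletion
            yield g
            g2 = {v: set(ps) for v, ps in g.items()}
            g2[a].add(b)                    # reversal: now b -> a
            if not has_cycle(g2):
                yield g2
        else:
            g[b].add(a)                     # addition: a -> b
            if not has_cycle(g):
                yield g

def greedy_search(variables, score):
    current = {v: set() for v in variables}   # start from the empty graph
    best, improved = score(current), True
    while improved:
        improved = False
        for g in neighbors(current):
            if score(g) > best:
                current, best, improved = g, score(g), True
    return current

# Toy demo: a made-up score that rewards only the edge X -> Y and
# penalizes extra edges; the search recovers exactly that edge.
score = lambda g: ("X" in g["Y"]) - 0.1 * sum(len(ps) for ps in g.values())
print(greedy_search(["X", "Y", "Z"], score))
# {'X': set(), 'Y': {'X'}, 'Z': set()}
```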
Bayesian / score-based
Problem #1: Local maxima
Often, greedy searches get stuck
Solution: greedy search over Markov equivalence classes, rather than graphs (Meek)
Has a proof of correctness and convergence (Chickering)
But it gets to the right answer slowly
Bayesian / score-based
Problem #2: Unobserved variables
Huge number of graphs
Huge number of different parameterizations
No fast, general way to compute likelihoods from latent variable models
Partial solution: focus on a small, "plausible" set of models for which we can compute scores
Constraint-based
Implementation of the earlier idea: "build" the Markov equivalence class that predicts the pattern of association actually found in the data
Compatible with a variety of statistical techniques
Note that we might have to introduce a latent variable to explain the pattern of statistics
Important constraints on search:
Minimize the number of statistical tests
Minimize the size of the conditioning sets (Why?)
Constraint-based
Algorithm step #1: Discover the adjacencies (see the sketch below)
Create the complete graph with undirected edges
Test all pairs X, Y for unconditional independence; remove the X—Y edge if they are independent
Test all adjacent X, Y for independence given a single neighbor N; remove the X—Y edge if they are independent
Test adjacent pairs given two neighbors …
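Here is a sketch of this adjacency phase in the style of the PC algorithm, assuming an oracle indep(x, y, cond) in place of real statistical tests; the names and bookkeeping are our own. Note how it honors the constraints above: conditioning sets grow from size 0, and only subsets of a variable's current neighbors are ever tried.

```python
from itertools import combinations

def find_skeleton(variables, indep):
    """Start complete; remove X—Y when x, y are independent given some
    subset of x's current neighbors, growing the subset size from 0."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}                    # pair -> the set that separated it
    size = 0
    while any(len(adj[v]) - 1 >= size for v in variables):
        for x in variables:
            for y in list(adj[x]):
                for s in combinations(adj[x] - {y}, size):
                    if indep(x, y, frozenset(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = frozenset(s)
                        break
        size += 1
    return adj, sepset
```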
Constraint-based
Algorithm step #2: (Try to) Orient edges (see the sketch below)
"Unshielded triple": X — C — Y, with X and Y not adjacent
If X & Y are independent given some S containing C, then C must be a non-collider
Since we have to condition on it to achieve d-separation
If X & Y are independent given some S not containing C, then C must be a collider
Since the path is not active when not conditioning on C
And then do further orientations to ensure acyclicity and that known non-colliders stay non-colliders
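A sketch of the collider-orientation rule, continuing from find_skeleton above: the recorded sepset tells us whether C appeared in the set that separated X and Y.

```python
from itertools import combinations

def orient_colliders(adj, sepset):
    """For each unshielded triple X — C — Y, orient X -> C <- Y iff C is
    absent from the set that separated X and Y."""
    arrows = set()                 # (a, b) means a -> b
    for c in adj:
        for x, y in combinations(sorted(adj[c]), 2):
            unshielded = y not in adj[x]
            if unshielded and c not in sepset[frozenset((x, y))]:
                arrows.add((x, c))
                arrows.add((y, c))
    return arrows
```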
Constraint-based example
Variables are {X, Y, Z, W}
Only independencies are:
X ⊥ Y
X ⊥ W | Z
Y ⊥ W | Z
Constraint-based example
Step 1: Form the complete graph using undirected edges
[diagram: complete undirected graph over X, Y, Z, W]
Constraint-based example
Step 2: For each pair of variables, remove the edge between them if they're unconditionally independent
X ⊥ Y ⇒
[diagram: X—Y removed; edges X—Z, X—W, Y—Z, Y—W, Z—W remain]
Constraint-based example
Step 3: For each adjacent pair, remove the edge if they're independent conditional on some variable adjacent to one of them
{X, Y} ⊥ W | Z ⇒
[diagram: X—W and Y—W removed; edges X—Z, Y—Z, Z—W remain]
Constraint-based example
Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables
[diagram: no further removals; skeleton X—Z, Y—Z, Z—W]
Constraint-based example
Step 5: Orientation
For X – Z – Y, since X ⊥ Y without conditioning on Z, make Z a collider
Since Z is a non-collider between X and W, though, we must orient Z – W away from Z
[diagram: final output X → Z ← Y, Z → W]
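Running the two sketches above on this example's independence oracle reproduces the result (the oracle encoding is our own):

```python
facts = {
    (frozenset({"X", "Y"}), frozenset()),       # X ⊥ Y
    (frozenset({"X", "W"}), frozenset({"Z"})),  # X ⊥ W | Z
    (frozenset({"Y", "W"}), frozenset({"Z"})),  # Y ⊥ W | Z
}
indep = lambda x, y, cond: (frozenset({x, y}), frozenset(cond)) in facts

adj, sepset = find_skeleton(["X", "Y", "Z", "W"], indep)
print({v: sorted(adj[v]) for v in adj})
# {'X': ['Z'], 'Y': ['Z'], 'Z': ['W', 'X', 'Y'], 'W': ['Z']}
print(sorted(orient_colliders(adj, sepset)))
# [('X', 'Z'), ('Y', 'Z')], i.e. X -> Z <- Y; the remaining Z—W edge is
# then oriented Z -> W, since Z must stay a non-collider between X and W.
```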
Constraint-based output
Searches that allow for latent variables can also have edges of the form X o→ Y
This indicates one of three possibilities:
X → Y
At least one unobserved common cause of X and Y
Both of these
Interventions to the rescue?
Interventions helped us solve an earlier equivalence class problem
Randomization meant that: Treatment-Effect association ⇒ T → E
Interventions alter equivalence classes, but don't make them all into singletons
The fundamental problem of search remains
Before X-intervention
[diagrams: the candidate causal graphs over X, Y, and Z before the intervention]
After X-intervention
[diagrams: the same candidate graphs after the intervention on X]
Search with interventions
Search with interventions is the same as search with observations, except we adjust the graphs in the search space to account for the intervention (see the sketch below)
For multiple experiments, we search for graphs in every output equivalence class
More complicated than this in the real world due to sampling variation
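One way to picture this adjustment is graph surgery: an ideal intervention on X deletes the edges into X and leaves the rest of the graph unchanged. A sketch (the function name is our own), which combines with the d_separated sketch from earlier:

```python
def intervene(parents, target):
    """Graph surgery: an ideal intervention on `target` cuts its
    incoming edges, leaving the rest of the graph unchanged."""
    return {v: set() if v == target else set(ps)
            for v, ps in parents.items()}
```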
Example
Observation: Y ⊥ Z | X ⇒ [diagram: the three non-collider orientations of Y — X — Z]
Intervention on X: Y ⊥ {X, Z} ⇒
Only possible graph: Y → X → Z
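Checking that conclusion with the earlier sketches: intervene on X in each of the three observationally equivalent orientations of Y — X — Z, and only the chain Y → X → Z predicts Y ⊥ {X, Z} afterwards.

```python
# The three observationally equivalent orientations of Y — X — Z:
candidates = {
    "Y -> X -> Z": {"Y": set(), "X": {"Y"}, "Z": {"X"}},
    "Y <- X -> Z": {"X": set(), "Y": {"X"}, "Z": {"X"}},
    "Y <- X <- Z": {"Z": set(), "X": {"Z"}, "Y": {"X"}},
}
for name, g in candidates.items():
    post = intervene(g, "X")                          # cut edges into X
    print(name, d_separated(post, {"Y"}, {"X", "Z"}, set()))
# Only "Y -> X -> Z" prints True: it alone predicts Y ⊥ {X, Z}
# after the intervention.
```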
Looking ahead…
Have:
Basic formal representation for causation
Fundamental causal asymmetry (of intervention)
Inference & reasoning methods
Search & causal discovery principles
Need:
Search & causal discovery methods that work in the real world