A menu of topics

Some real-world challenges:
- Convergence & error bounds
- Sample selection bias
- Simpson’s paradox

Some real-world successes:
- Learning based on more than just independence
- Learning about latents & their structure
Short-run causal search
Bayes net learning algorithms can give the wrong answer if the data fail to reflect the “true” associations and independencies.
- Of course, this is a problem for all inference: we might just be really unlucky.
- Note: this is not (really) the problem of unrepresentative samples (e.g., black swans).
Convergence in search
In search, we would like to bound our possible error as we acquire data. I.e., we want search procedures that have uniform convergence.

Without uniform convergence:
- we cannot set confidence intervals for inference;
- Bayesians with different priors over hypotheses need not agree on probability bounds, no matter how loose.
Pointwise convergence
Assume hypothesis H is true. Then:
- for any standard of “closeness” to H, and
- for any standard of “successful refutation,”
- for every hypothesis that is not “close” to H, there is a sample size at which that hypothesis is refuted.
Uniform convergence
Assume hypothesis H is true. Then:
- for any standard of “closeness” to H, and
- for any standard of “successful refutation,”
- there is a sample size such that all hypotheses H* that are not “close” to H are refuted at that sample size.
Two theorems about convergence

There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)

There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)
Two theorems about convergence

What if we didn’t care about “small” causes?

ε-Faithfulness: if X & Y are d-connected given S, then |ρ_XY.S| > ε. That is, every association predicted by d-connection has magnitude greater than ε.

For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness. So we have error bounds, confidence intervals, etc.
Sample selection bias
Sometimes, a variable of interest is a cause of whether people get into the sample.
- E.g., measuring various skills or knowledge in college students
- Or measuring joblessness by a phone survey during the middle of the day

Simple problem: you might get a skewed picture of the population.
Sample selection bias
If two variables both matter for inclusion, then we have:

[Diagram: Factor A → Sample ← Factor B]

Sample = 1 for everyone we measure. That is equivalent to conditioning on Sample ⇒ induces an association between A and B!
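The induced association is easy to see in a small simulation. This is a sketch with hypothetical variables: two independent Gaussian factors, and an illustrative threshold rule for sample inclusion.

```python
# Collider / selection bias: two independent causes of sample inclusion
# become associated once we condition on being in the sample.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(size=n)          # Factor A
b = rng.normal(size=n)          # Factor B, independent of A
# Inclusion in the sample depends on both factors (illustrative rule):
sampled = (a + b + rng.normal(size=n)) > 1.0

r_full = np.corrcoef(a, b)[0, 1]                   # ~0 in the population
r_sel = np.corrcoef(a[sampled], b[sampled])[0, 1]  # clearly negative in the sample

print(f"population corr: {r_full:.3f}, sampled corr: {r_sel:.3f}")
```

Selecting on (roughly) the sum of the two factors makes them trade off against each other within the sample, so the conditional correlation is negative even though the population correlation is zero.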
Simpson’s Paradox

Consider the following data:

Men:
          Treated  Untreated
  Alive      3        20
  Dead       3        24
  P(A | T) = 0.5,  P(A | U) = 0.45…

Women:
          Treated  Untreated
  Alive     16         3
  Dead      25         6
  P(A | T) = 0.39,  P(A | U) = 0.333

Treatment is superior in both groups!
Simpson’s Paradox

Consider the following data:

Pooled:
          Treated  Untreated
  Alive     19        23
  Dead      28        30
  P(A | T) = 0.404,  P(A | U) = 0.434

In the “full” population, you’re better off not being Treated!
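The reversal can be verified directly from the counts in the tables above:

```python
# Simpson's paradox: treatment is better within each group,
# but worse in the pooled data. Counts taken from the tables above.
men = {"T": (3, 3), "U": (20, 24)}      # (alive, dead)
women = {"T": (16, 25), "U": (3, 6)}

def p_alive(alive, dead):
    return alive / (alive + dead)

# Treatment looks better within each group...
assert p_alive(*men["T"]) > p_alive(*men["U"])      # 0.500 > 0.455
assert p_alive(*women["T"]) > p_alive(*women["U"])  # 0.390 > 0.333

# ...but worse once the groups are pooled.
pooled_T = tuple(m + w for m, w in zip(men["T"], women["T"]))  # (19, 28)
pooled_U = tuple(m + w for m, w in zip(men["U"], women["U"]))  # (23, 30)
assert p_alive(*pooled_T) < p_alive(*pooled_U)      # 0.404 < 0.434
```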
More than independence
Independence & association can reveal only the Markov equivalence class. But our data contain more statistical information!

Algorithms that exploit this additional info can sometimes learn more (including unique graphs).
- Example: the LiNGAM algorithm for non-Gaussian data
Non-Gaussian data
Assume linearity & independent non-Gaussian noise.

Linear causal DAG functions are:
  D = BD + ε
where B is permutable to lower triangular (because the graph is acyclic).
Non-Gaussian data
Assume linearity & independent non-Gaussian noise.

Linear causal DAG functions are:
  D = Aε
where A = (I − B)⁻¹.
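A quick numerical check of this identity, with an illustrative 3-variable lower-triangular B:

```python
# If D = B D + eps with B strictly lower triangular (acyclic graph in a
# causal order), then D = A eps with A = (I - B)^{-1}.
import numpy as np

B = np.array([[0.0, 0.0, 0.0],
              [0.7, 0.0, 0.0],    # X2 <- X1
              [0.3, 0.5, 0.0]])   # X3 <- X1, X2
eps = np.array([1.0, -0.5, 2.0])  # illustrative noise values

A = np.linalg.inv(np.eye(3) - B)
D = A @ eps
assert np.allclose(D, B @ D + eps)  # D solves the structural equations
```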
Non-Gaussian data
ICA is an efficient estimator for A ⇒ efficient causal search that reveals direction! C ⟶ E iff the corresponding entry in A is non-zero.
Non-Gaussian data
Why can we learn the directions in this case?
[Figure: scatterplots of A vs. B under Gaussian noise and under uniform noise]
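The asymmetry can be illustrated in a few lines (this is only the intuition, not the full LiNGAM procedure): with uniform noise, only regression in the true causal direction leaves residuals independent of the regressor. The dependence score used below (correlation of squared values) is one simple illustrative choice.

```python
# With non-Gaussian (here uniform) noise, the wrong regression direction
# leaves residuals that depend on the regressor; the true direction does not.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a = rng.uniform(-1, 1, n)            # true model: A -> B
b = a + rng.uniform(-1, 1, n)        # uniform noise

def dep_after_regression(x, y):
    """Regress y on x; return a simple nonlinear dependence score
    between the residuals and x (correlation of squares)."""
    coef = np.cov(x, y)[0, 1] / np.var(x)
    resid = y - coef * x
    return abs(np.corrcoef(resid**2, x**2)[0, 1])

forward = dep_after_regression(a, b)   # true direction: ~0
backward = dep_after_regression(b, a)  # wrong direction: clearly nonzero
print(f"forward: {forward:.3f}, backward: {backward:.3f}")
```

With Gaussian noise in place of the uniform draws, both scores would be near zero, which is exactly why direction is unidentifiable in the linear-Gaussian case.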
Learning about latents
Sometimes, our real interest is in variables that are:
- only indirectly observed,
- or observed by their effects,
- or unknown altogether but influencing things behind the scenes.
[Diagram: General IQ, Sociability, and Other factors influencing Test score, Reading level, Math skills, and Size of social network]
Factor analysis
Assume linear equations. Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error.
Factor analysis
If we have one factor, then we find coefficients to minimize error in:
  F_i = a_i + b_i U
where U is the unobserved variable (with fixed mean and variance).

Two factors ⇒ minimize error in:
  F_i = a_i + b_i,1 U_1 + b_i,2 U_2
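One way to see the structure the one-factor model imposes, a sketch assuming Var(U) = 1, independent noise, and illustrative loadings: the model implies cov(F_i, F_j) = b_i b_j for i ≠ j, so the loadings are recoverable from pairwise covariances alone.

```python
# One-factor model F_i = b_i U + eps_i (a_i = 0 for simplicity, Var(U) = 1):
# off-diagonal covariances factor as b_i * b_j, so |b_1| = sqrt(c12*c13/c23).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
b = np.array([1.0, 0.8, 0.6])                       # true loadings (assumed)
U = rng.normal(size=n)                              # the unobserved factor
F = b[:, None] * U + 0.5 * rng.normal(size=(3, n))  # observed features

C = np.cov(F)
b1_hat = np.sqrt(C[0, 1] * C[0, 2] / C[1, 2])
print(f"estimated |b_1|: {b1_hat:.3f}")             # close to the true 1.0
```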
Factor analysis
The decision about exactly how many factors to use is typically based on some “simplicity vs. fit” tradeoff.

Also, the interpretation of the unobserved factors must be provided by the scientist. The data do not dictate the meaning of the unobserved factors (though it can sometimes be “obvious”).
Factor analysis as graph search

One-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:

[Diagram: U → F1, F2, …, Fn]
Factor analysis as graph search

Two-variable factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:

[Diagram: U1 → F1, F2, …, Fn and U2 → F1, F2, …, Fn]
Better methods for latents
Two different types of algorithms:
1. Determine which observed variables are caused by shared latents (BPC, FOFC, FTFC, …)
2. Determine the causal structure among the latents (MIMBuild)

Note: these need additional parametric assumptions, usually linearity, though it can be done with weaker information.
Discovering latents
Key idea: for many parameterizations, the association between X & Y can be decomposed.

Linearity ⇒ we can use patterns in the precise associations to discover the number of latents, using the ranks of different sub-matrices of the covariance matrix.
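A sketch of one such rank constraint, with illustrative loadings: if a single latent drives four observed variables, every 2×2 off-diagonal covariance submatrix has rank 1, so the “tetrad” differences of covariance products vanish.

```python
# Tetrad constraint: one latent U driving four indicators implies
# c12*c34 - c13*c24 ~ 0 (and similarly for the other tetrads).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
b = np.array([1.0, 0.9, 0.7, 0.5])          # illustrative loadings
U = rng.normal(size=n)                       # single latent factor
X = b[:, None] * U + rng.normal(size=(4, n)) # four observed variables

C = np.cov(X)
tetrad = C[0, 1] * C[2, 3] - C[0, 2] * C[1, 3]
print(f"tetrad difference: {tetrad:.4f}")    # near zero under the one-latent model
```

If a second latent also influenced the indicators, this difference would in general be non-zero, which is how the pattern of vanishing/non-vanishing constraints reveals the number of latents.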
Discovering latents
There are many instantiations of this type of search, for different parametric knowledge, numbers of observed variables (⇒ numbers of discoverable latents), etc.

And once we have one of these “clean” models, we can use “traditional” search algorithms (with modifications) to learn the structure among the latents.
Other Algorithms
- CCD: learn a directed cyclic graph (with non-obvious semantics)
- ION: learn global features from overlapping local sets (including relations between variables never measured together)
- SAT-solver: learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints
- LoSST: learn causal structure while that structure potentially changes over time
And lots of other ongoing research!