Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables
Jonas Peters, ETH Zurich
Tutorial, NIPS 2013 Workshop on Causality, 9th December 2013
F. H. Messerli: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012
Jonas Peters (ETH Zurich) Distinguishing between Cause and Effect 9th December 2013
Problem: Given P(X, Y), can we infer whether
X → Y or Y → X?
Difficulty: So much symmetry:
P(X) · P(Y | X) = P(X, Y) = P(X | Y) · P(Y)
We need assumptions! (E.g., Markov and faithfulness do not suffice.)
Surprise (for some assumptions):
2 variables ⇒ p variables: identifiability in the bivariate case carries over to the multivariate case.
J. Peters, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, arXiv:1309.6779
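The symmetry can be checked numerically. A minimal sketch (my own illustration; any joint table works): both factorizations reconstruct the same joint distribution, so P(X, Y) alone cannot orient the edge.

```python
import numpy as np

# A joint distribution P(X, Y) over two binary variables, as a 2x2 table
# indexed by (x, y).
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)               # P(X)
p_y = p_xy.sum(axis=0)               # P(Y)
p_y_given_x = p_xy / p_x[:, None]    # P(Y | X)
p_x_given_y = p_xy / p_y[None, :]    # P(X | Y)

forward = p_x[:, None] * p_y_given_x   # P(X) * P(Y | X)
backward = p_x_given_y * p_y[None, :]  # P(X | Y) * P(Y)

# Both directions reproduce the joint exactly.
print(np.allclose(forward, p_xy), np.allclose(backward, p_xy))  # → True True
```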
Idea No. 1: Linear Non-Gaussian Additive Models (LiNGAM)
Structural assumptions like additive non-Gaussian noise models break the symmetry:
Y = βX + N_Y,  N_Y ⊥⊥ X,
with N_Y non-Gaussian.
Asymmetry No. 1
Consider a distribution corresponding to
Y = βX + N_Y
• N_Y ⊥⊥ X
• N_Y non-Gaussian
[Graph: X → Y]
Then there is no
X = φY + N_X
• N_X ⊥⊥ Y
• N_X non-Gaussian
[Graph: Y → X]
S. Shimizu, P. O. Hoyer, A. Hyvärinen and A. J. Kerminen: A linear non-Gaussian acyclic model for causal discovery, JMLR 2006
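This asymmetry can be seen numerically. Below is a minimal sketch (my own illustration, not from the slides; the crude squared-correlation dependence score stands in for a proper independence test): fit ordinary least squares in both directions for a linear model with uniform (hence non-Gaussian) noise and check whether the residual is independent of the regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Forward model: Y = beta*X + N_Y with non-Gaussian (uniform) noise.
beta = 1.0
x = rng.uniform(-1, 1, n)
y = beta * x + rng.uniform(-1, 1, n)

def ols_residual(target, regressor):
    """Residual of a least-squares fit target ~ regressor."""
    coef = np.cov(target, regressor)[0, 1] / np.var(regressor)
    return target - coef * regressor

def dependence(a, b):
    """Crude dependence score: |corr(a^2, b^2)|, near zero for
    independent variables. (OLS already forces corr(a, b) = 0.)"""
    return abs(np.corrcoef(a**2, b**2)[0, 1])

res_fwd = ols_residual(y, x)  # candidate N_Y for the model X -> Y
res_bwd = ols_residual(x, y)  # candidate N_X for the model Y -> X

# Forward residual is independent of X; backward residual is not.
print(dependence(res_fwd, x), dependence(res_bwd, y))
```

In the Gaussian case both scores would be near zero, which is exactly why non-Gaussianity is needed for this asymmetry.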
Idea No. 2: Additive noise models
Nonlinear functions are also fine!
Y = f(X) + N_Y,  N_Y ⊥⊥ X
P. Hoyer, D. Janzing, J. Mooij, J. Peters and B. Schölkopf: Nonlinear causal discovery with additive noise models, NIPS 2008
Asymmetry No. 2
Consider a distribution corresponding to
Y = f(X) + N_Y
with N_Y ⊥⊥ X
[Graph: X → Y]
Then for “most combinations” (f, P(X), P(N_Y)) there is no
X = g(Y) + M_X
with M_X ⊥⊥ Y
[Graph: Y → X]
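A minimal numerical sketch of the nonlinear case (my own illustration; a cubic polynomial fit stands in for a general nonparametric regression): generate data from a nonlinear additive noise model and compare residual dependence in both directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Forward model: Y = f(X) + N_Y with f(x) = x^3 and uniform noise.
x = rng.uniform(-1, 1, n)
y = x**3 + rng.uniform(-0.3, 0.3, n)

def residual(target, regressor, deg=3):
    """Residual of a polynomial least-squares fit target ~ regressor."""
    coefs = np.polyfit(regressor, target, deg)
    return target - np.polyval(coefs, regressor)

def dependence(a, b):
    """Crude dependence score: |corr(a^2, b^2)|, near zero if independent."""
    return abs(np.corrcoef(a**2, b**2)[0, 1])

res_fwd = residual(y, x)  # recovers N_Y, independent of X
res_bwd = residual(x, y)  # residual of the (non-existent) backward model

print(dependence(res_fwd, x), dependence(res_bwd, y))
```

The backward residual is strongly heteroscedastic in Y (the noise is squeezed where f is steep and stretched where it is flat), which is what the dependence score picks up.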
Y = f(X) + N_Y,  N_Y ⊥⊥ X  [scatter plot with the forward regression fit]
X = g(Y) + N_X,  N_X ⊥⊥ Y  [scatter plot with the backward regression fit]
Idea No. 3: Gaussian Process Inference (GPI)
We can always write∗
Y = f(X, N_Y),  N_Y ⊥⊥ X
and
X = g(Y, N_X),  N_X ⊥⊥ Y
Which model is more “complex”? Use Bayesian model comparison.
J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, B. Schölkopf:
Probabilistic latent variable models for distinguishing between cause and effect, NIPS 2010
∗E.g., J. Peters: Restricted Structural Equation Models for Causal Inference, PhD Thesis
Asymmetry No. 3
1. Fix the noise distribution to be N(0, 1).
2. Put a prior p(θ_X) on the input distribution p(x | θ_X) (→ complexity of X).
3. Put a prior p(θ_f) on the functions p(f | θ_f) (→ complexity of f).
4. Approximate the marginal likelihood for X → Y:
p(x, y) = p(x) · p(y | x) = [∫ p(x | θ_X) p(θ_X) dθ_X] · [∫ δ(y − f(x, e)) p(e) p(f) de df]
[Graphical model with nodes θ_X → X, θ_f → f, E, and Y = f(X, E)]
5. Approximate the marginal likelihood for Y → X.
6. Compare.
J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, B. Schölkopf:
Probabilistic latent variable models for distinguishing between cause and effect, NIPS 2010
Idea No. 4: Information Geometric Causal Inference (IGCI)
Assume a deterministic relationship
Y = f (X )
and that f and P(X ) are “independent”.
D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, B. Schölkopf:
Information-geometric approach to inferring causal directions, Artificial Intelligence 2012
[Figure: f(x) mapping the density p(x) on the x-axis to p(y) on the y-axis]
Asymmetry No. 4
Consider Y = f(X) with invertible f : [0, 1] → [0, 1], f ≠ id, and X = g(Y). If
“cov”(log f′, p_X) = ∫ log(f′(x)) p_X(x) dx − ∫ log f′(x) dx = 0,
then
“cov”(log g′, p_Y) = ∫ log(g′(y)) p_Y(y) dy − ∫ log g′(y) dy > 0.
D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, B. Schölkopf:
Information-geometric approach to inferring causal directions, Artificial Intelligence 2012
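A worked instance of this asymmetry (my own example, not from the slides): take f(x) = x² on [0, 1] with X uniform, so p_X ≡ 1 and the forward “covariance” vanishes by construction.

```latex
f(x) = x^2,\qquad f'(x) = 2x,\qquad p_X \equiv 1 \text{ on } [0,1]:
\qquad \text{“cov”}(\log f', p_X)
  = \int_0^1 \log(2x)\,p_X(x)\,dx - \int_0^1 \log(2x)\,dx = 0.

\text{Backward: } g(y) = \sqrt{y},\qquad g'(y) = \frac{1}{2\sqrt{y}},\qquad
p_Y(y) = \frac{1}{2\sqrt{y}}:
\qquad \int_0^1 \log g'(y)\, p_Y(y)\,dy = 1 - \log 2,\qquad
\int_0^1 \log g'(y)\,dy = \tfrac{1}{2} - \log 2,

\text{so } \text{“cov”}(\log g', p_Y)
  = (1 - \log 2) - \bigl(\tfrac{1}{2} - \log 2\bigr) = \tfrac{1}{2} > 0,
```

exactly the strict inequality the proposition predicts for the anticausal direction.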
Open Questions 1: Quantifying Identifiability
Open Questions 1: Quantifying Identifiability
Proposition
Assume P(X, Y) is generated by
Y = f(X) + N_Y
with independent X and N_Y. Then
inf_{Q ∈ {Q : Y → X}} KL(P ‖ Q) = ?
• first steps to understand the geometry
• gives us finite-sample guarantees
Open Questions 2: Robustness
What happens if assumptions are violated? E.g., in case of confounding?
[Graph: X → Y confounded by a hidden Z, with Z → X and Z → Y]
Can we still infer X → Y ? How useful is this?
Conclusions
In theory, we can break the symmetry between cause and effect.
• restricted structural equation models:
  – linear functions, additive non-Gaussian noise
  – nonlinear functions, additive noise
• complexity measures on functions and distributions
• “independence” between function and input distribution
• ... principles behind new methods from the challenge?
Causal inference problem of climate change is solved! Fight the cause!
Don’t fly (Zurich–SFO: 5.4 t CO2)! Compensate!
IGCI
It turns out that if X → Y, then
∫ log |f′(x)| p(x) dx < ∫ log |g′(y)| p(y) dy.
Estimator:
C_{X→Y} := (1/m) Σ_{j=1}^{m} log |(y_{j+1} − y_j) / (x_{j+1} − x_j)| ≈ ∫ log |f′(x)| p(x) dx
Infer X → Y if C_{X→Y} < C_{Y→X}.
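A minimal sketch of this estimator (my own implementation, on a synthetic example: uniform X and f(x) = x², for which uniform p(x) satisfies the IGCI independence condition):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000

# Deterministic, invertible relation on [0, 1].
x = rng.uniform(0, 1, m)
y = x**2

def igci_score(a, b):
    """Slope-based IGCI estimator C_{A->B}: mean of log|Δb/Δa|
    over consecutive points after sorting by a."""
    order = np.argsort(a)
    a, b = a[order], b[order]
    da, db = np.diff(a), np.diff(b)
    keep = (da != 0) & (db != 0)
    return np.mean(np.log(np.abs(db[keep] / da[keep])))

c_xy = igci_score(x, y)  # approximates ∫ log|f'(x)| p(x) dx = log 2 - 1 < 0
c_yx = igci_score(y, x)  # approximates ∫ log|g'(y)| p(y) dy = 1 - log 2 > 0
print("infer X -> Y" if c_xy < c_yx else "infer Y -> X")
```

The analytic values of both integrals for this example (±(1 − log 2) ≈ ±0.307) match the estimates, and the smaller score correctly points to the causal direction.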
Y = βX + N_Y,  N_Y ⊥⊥ X,  N_Y non-Gaussian
[Scatter plot of X (0 to 1) against Y (0 to 2)]
X = φY + N_X,  N_X ⊥⊥ Y,  N_X non-Gaussian
[Scatter plot of X against Y, backward direction]
Does X cause Y or vice versa? Real Data
Does X cause Y or vice versa?
No (not enough) data for chocolate
... but we have data for coffee!
Does X cause Y or vice versa?
[Scatter plot: coffee consumption per capita (kg) vs. # Nobel laureates per 10 mio]
Correlation: 0.698, p-value: < 2.2 · 10−16.
Nobel Prize → Coffee: dependent residuals (p-value of 0).
Coffee → Nobel Prize: dependent residuals (p-value of 0).
⇒ Model class too small? Causally insufficient?
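A residual-independence test of this kind can be sketched as follows (my own illustration on synthetic data; the statistic, kernels, regression method, and data used in the talk may differ): fit a regression in each direction and run a permutation test of residual-vs-cause independence with an HSIC statistic.

```python
import numpy as np

def gram(v):
    """Gaussian-kernel Gram matrix with median-heuristic bandwidth."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def anm_pvalue(cause, effect, rng, n_perm=200):
    """Fit a cubic regression effect ~ cause, then permutation-test the
    independence of cause and residual with a (biased) HSIC statistic.
    A small p-value means the additive noise model is rejected."""
    res = effect - np.polyval(np.polyfit(cause, effect, 3), cause)
    n = len(cause)
    H = np.eye(n) - 1.0 / n            # centring matrix I - J/n
    Kc = H @ gram(cause) @ H           # centred kernel on the cause
    L = gram(res)
    stat = np.sum(Kc * L)              # proportional to biased HSIC
    null = [np.sum(Kc * L[p][:, p])    # permuting res permutes its Gram matrix
            for p in (rng.permutation(n) for _ in range(n_perm))]
    return np.mean([s >= stat for s in null])

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-2, 2, n)
y = x**3 + rng.uniform(-1, 1, n)

p_fwd = anm_pvalue(x, y, rng)  # forward ANM: residuals look independent
p_bwd = anm_pvalue(y, x, rng)  # backward ANM: residuals clearly dependent
print(p_fwd, p_bwd)
```

On the real coffee/Nobel data the slide reports that both directions are rejected, i.e. both p-values behave like `p_bwd` here, pointing to a too-small model class or hidden confounding.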
The linear Gaussian case
Y = βX + N_Y
with independent
X ∼ N(0, σ_X²) and N_Y ∼ N(0, σ_{N_Y}²).
Then there is a linear SEM with
X = αY + M_X.
How can we find α and M_X?
[Graph: X → Y with Y = βX + N_Y]
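A sketch of the standard answer (my own numerical check, with assumed parameter values): take α as the population regression coefficient of X on Y; then M_X = X − αY is uncorrelated with Y, and since (M_X, Y) is jointly Gaussian, uncorrelated means independent, so the backward model is an equally valid linear SEM.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta, s_x, s_n = 2.0, 1.0, 0.5   # assumed example parameters

x = rng.normal(0, s_x, n)
y = beta * x + rng.normal(0, s_n, n)

# Backward coefficient from population covariances:
# alpha = cov(X, Y) / var(Y) = beta * s_x^2 / (beta^2 * s_x^2 + s_n^2)
alpha = beta * s_x**2 / (beta**2 * s_x**2 + s_n**2)
m_x = x - alpha * y

# M_X is uncorrelated with Y, and jointly Gaussian with it, hence
# independent -- the linear Gaussian case is not identifiable.
print(np.corrcoef(m_x, y)[0, 1])        # near zero
print(np.corrcoef(m_x**2, y**2)[0, 1])  # near zero: no higher-order signal
```

Contrast this with the non-Gaussian case above, where the analogous backward residual shows clear higher-order dependence on Y.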