67
Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter B¨ uhlmann Seminar f ¨ ur Statistik, ETH Z ¨ urich joint work with Jonas Peters MPI T ¨ ubingen Nicolai Meinshausen SfS ETH Z ¨ urich

Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Blessing of heterogeneous large-scale datafor high-dimensional causal inference

Peter Buhlmann

Seminar fur Statistik, ETH Zurich

joint work with

Jonas PetersMPI Tubingen

Nicolai MeinshausenSfS ETH Zurich

Page 2: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Heterogeneous large-scale data

Big Data

the talk is not (yet) on “really big data”

but we will take advantage of heterogeneityoften arising with large-scale data where

i.i.d./homogeneity assumption is not appropriate

Page 3: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

causal inference = intervention analysis

in genomics (for yeast or plants):if we would make an intervention at a single (or many) gene(s),what would be its (their) effect on a response of interest?

want to infer/predict such effects without actually doing theinterventione.g. from observational data (cf. Pearl; Spirtes, Scheines & Glymour)(from observations of a “steady-state system”)

or, as is mainly our focus: fromI observational and interventional data with well-specified

interventionsI changing environments or experimental settings with

“vaguely” or “un-” specified interventionsthat is, from “heterogeneous” data

Page 4: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

causal inference = intervention analysis

in genomics (for yeast or plants):if we would make an intervention at a single (or many) gene(s),what would be its (their) effect on a response of interest?

want to infer/predict such effects without actually doing theinterventione.g. from observational data (cf. Pearl; Spirtes, Scheines & Glymour)(from observations of a “steady-state system”)

or, as is mainly our focus: fromI observational and interventional data with well-specified

interventionsI changing environments or experimental settings with

“vaguely” or “un-” specified interventionsthat is, from “heterogeneous” data

Page 5: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Genomics

1. Flowering of Arabidopsis Thaliana

phenotype/response variable of interest:Y = days to bolting (flowering)“covariates” X = gene expressions from p = 21′326 genes

question: infer/predict the effect of knocking-out a single geneon the phenotype/response variable Y?

using statistical method based on n = 47 observational data; validated the top-predictions with randomized experiments

with some moderate success(Stekhoven, Moraes, Sveinbjornsson, Hennig, Maathuis & PB, 2012)

Page 6: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Genomics

1. Flowering of Arabidopsis Thaliana

phenotype/response variable of interest:Y = days to bolting (flowering)“covariates” X = gene expressions from p = 21′326 genes

question: infer/predict the effect of knocking-out a single geneon the phenotype/response variable Y?

using statistical method based on n = 47 observational data; validated the top-predictions with randomized experiments

with some moderate success(Stekhoven, Moraes, Sveinbjornsson, Hennig, Maathuis & PB, 2012)

Page 7: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

2. Gene expressions of yeast

p = 5360 genesphenotype of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes

and thenphenotype of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes

and so on

infer/predict the effects of single gene deletions on all othergenes

Page 8: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Effects of single gene deletions on all other genes (yeast)

(Maathuis, Colombo, Kalisch & PB, 2010)• p = 5360 genes (expression of genes)• 231 single gene deletions ; 1.2 · 106 intervention effects• the truth is “known in good approximation”

(thanks to intervention experiments)

goal: prediction of the true large intervention effectsbased on observational data with no knock-downs

n = 63observational data

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800

1,000 IDALassoElastic−netRandom

Page 9: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

REGRESSION

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800

1,000 IDALassoElastic−netRandom

Page 10: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

REGRESSION

because: for

Y =

p∑j=1

βjX (j) + ε

βj measures effect of X (j) on Y whenkeeping all other variables {X (k); k 6= j} fixed

but when doing an intervention at a gene ; some/many othergenes might change as well and cannot be kept fixed

causal inference framework: allows to definedynamic notion of intervention effect“without keeping other variables fixed”

Page 11: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

False positives

True

pos

itive

s

0 1,000 2,000 3,000 4,000

0

200

400

600

800

1,000 IDALassoElastic−netRandom

“predictions of single genedeletions effects in yeast”

was a good finding for a difficultproblem...

... but some criticisms apply:I not very “robust”:

depends somewhat how we define the “truth”(has been found recently by J. Mooij, J. Peters and others)

I old, publicly available data (Hughes et al., 2000)I no collaborator... just (publicly available︸ ︷︷ ︸

this is good!

) data

I we didn’t use the interventional data for training

Page 12: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

A new and “better” attempt for the “same” problemsingle gene knock-down in yeast and measure genomewideexpression (observational and interventional data)

goal: predict unseen gene knock-down effectsbased on both observational and interventional data

collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)

data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)

Page 13: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Graphical and Structural equation modelslate 1980s: Pearl; Spirtes, Glymour, Scheines; Dawid; Lauritzen;. . .

powerful language of graphs make formulations and modelsmore transparent (Evans and Richardson, 2011)

Page 14: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

variables X1, . . . ,Xp+1 (Xp+1 = Y is the response of interest)

directed acyclic graph (DAG) D0 encoding the true underlyingcausal influence diagram

structural equation model (SEM):

Xj ← f 0j (XpaD0 (j), εj), j = 1, . . . ,p + 1,

ε1, . . . , εp+1 independent

e.g. linear Xj ←∑

k∈paD0 (j)

β0jkXk + εj j = 1, . . . ,p + 1

causal variables for Y = Xp+1: S0 = {j ; j ∈ paD0(Y )}causal coefficients for Y in linear SEM: β0

Yj , j ∈ paD0(Y )

Page 15: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

severe issues of identifiability !

(X ,Y ) ∼ N2(0,Σ)

X XY Y

X causes Y Y causes X

Page 16: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

agenda for estimation(Chickering, 2002; Shimizu, 2005; Kalisch & PB, 2007;... )

1. estimate the Markov equivalence class of DAGs D0: Dsevere issues of identifiability !

2. derive causal variables: the ones which are causal in allDAGs from D; derive bounds for causal effects based on D(Maathuis, Kalisch & PB, 2009)

goal: construction of confidence statements(without knowing the structure of the underlying graph)

problem:direct likelihood-based inference:

(as in e.g. Mladen Kolar’s talk for undirected graphs)difficult, because of severe identifiability issues !

Page 17: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

nevertheless: our goal is to construct a procedureS ⊆ {1, . . . ,p} such that

P[ S ⊆ S0︸ ︷︷ ︸no false positives

] ≥ 1− α

without the need to specify identifiable componentsthat is: if a variable Xj cannot be identified to be causal⇒ j /∈ S

we do this with an entirely new framework(which might be “much simpler” for causal inference)NOT or AVOIDINGgraphical model fitting, potential outcome models,...and maybe more accessible to experts in regression modeling?

Page 18: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

nevertheless: our goal is to construct a procedureS ⊆ {1, . . . ,p} such that

P[ S ⊆ S0︸ ︷︷ ︸no false positives

] ≥ 1− α

without the need to specify identifiable componentsthat is: if a variable Xj cannot be identified to be causal⇒ j /∈ S

we do this with an entirely new framework(which might be “much simpler” for causal inference)NOT or AVOIDINGgraphical model fitting, potential outcome models,...and maybe more accessible to experts in regression modeling?

Page 19: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Causal inference using invariant prediction

Peters, PB and Meinshausen (2015)

a main message:

causal structure/components remain the samefor different sub-populations

while the non-causal components can change acrosssub-populations

thus:; look for “stability” of structures among

different sub-populations

Page 20: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Causal inference using invariant prediction

Peters, PB and Meinshausen (2015)

a main message:

causal structure/components remain the samefor different sub-populations

while the non-causal components can change acrosssub-populations

thus:; look for “stability” of structures among

different sub-populations

Page 21: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

goal:find the causal variables (components) among a p-dimensionalpredictor variable X for a specific response variable Y

consider data

(X e,Y e) ∼ F e, e ∈ E

with response variables Y e and predictor variables X e

here: e ∈ E︸︷︷︸space of exp. sett.

denotes an experimental setting

“heterogeneous” data fromdifferent environments/experiments e ∈ E

(aspect of “Big Data”)

Page 22: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

goal:find the causal variables (components) among a p-dimensionalpredictor variable X for a specific response variable Y

consider data

(X e,Y e) ∼ F e, e ∈ E

with response variables Y e and predictor variables X e

here: e ∈ E︸︷︷︸space of exp. sett.

denotes an experimental setting

“heterogeneous” data fromdifferent environments/experiments e ∈ E

(aspect of “Big Data”)

Page 23: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

data

(X e,Y e) ∼ F e, e ∈ E

example 1: E = {1,2} encoding observational (1) and allpotentially unspecific interventional data (2)

example 2: E = {1,2} encoding observational data (1) and(repeated) data from one specific intervention (2)

example 3: E = {1,2,3} ... or E = {1,2,3, . . . ,26} ...

do not need data from carefullydesigned (randomized) experiments

Page 24: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Invariance Assumption

for a set S∗ ⊆ {1, . . . ,p}: L(Y e|X eS∗) is invariant across e ∈ E

for linear model setting: there exists a vector γ∗ such that

∀e ∈ E : Y e = X eγ∗ + εe, εe ⊥ X eS∗ , S∗ = {j ; γ∗j 6= 0}

εe ∼ Fε the same for all eX e has an arbitrary distribution, different across e

γ∗, S∗ is interesting in its own right!

namely the parameter and structure which remain invariantacross experimental settings, or across heterogeneous groups

Page 25: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Invariance Assumption

for a set S∗ ⊆ {1, . . . ,p}: L(Y e|X eS∗) is invariant across e ∈ E

for linear model setting: there exists a vector γ∗ such that

∀e ∈ E : Y e = X eγ∗ + εe, εe ⊥ X eS∗ , S∗ = {j ; γ∗j 6= 0}

εe ∼ Fε the same for all eX e has an arbitrary distribution, different across e

γ∗, S∗ is interesting in its own right!

namely the parameter and structure which remain invariantacross experimental settings, or across heterogeneous groups

Page 26: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Moreover: link to causality

assume:

in short : L(Y e|X eS∗) is invariant across e ∈ E

Proposition (Peters, PB & Meinshausen, 2015)If E does not affect the structural equation for Y in a SEM:

e.g. linear SEM: Y e ←∑

k∈pa(Y )

βYk︸︷︷︸∀e

X ek + εe

Y︸︷︷︸∼Fε∀e

then S0 = pa(Y )︸ ︷︷ ︸causal var.

satisfies the Invariance Assumption (w.r.t. E)

the causal variables lead to invariance (of conditional distr.)

Page 27: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Moreover: link to causality

assume:

in short : L(Y e|X eS∗) is invariant across e ∈ E

Proposition (Peters, PB & Meinshausen, 2015)If E does not affect the structural equation for Y in a SEM:

e.g. linear SEM: Y e ←∑

k∈pa(Y )

βYk︸︷︷︸∀e

X ek + εe

Y︸︷︷︸∼Fε∀e

then S0 = pa(Y )︸ ︷︷ ︸causal var.

satisfies the Invariance Assumption (w.r.t. E)

the causal variables lead to invariance (of conditional distr.)

Page 28: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

if E does not affect structural equation for Y :

S0 = pa(Y ) satisfies the Invariance Assumption

this assumption holds for example for:I do-intervention (Pearl) at variables different than YI noise (or “soft”) intervention (Eberhardt & Scheines, 2007)

at variables different than Y

in addition: there might be many other S∗ satisfying theInvariance Assumptionbut uniqueness is not really important (see later)

Page 29: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

how do we know whetherE is not affecting structural equation for Y ?

if E does affect structural equation for Y :we will argue:

“robustness” of our procedure (proposed later)

; no causal statementsno false positivesconservative, but on the safe side

Page 30: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

how do we know whetherE is not affecting structural equation for Y ?

if E does affect structural equation for Y :we will argue:

“robustness” of our procedure (proposed later)

; no causal statementsno false positivesconservative, but on the safe side

Page 31: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Invariance Assumption: plausible to hold with real data

two-dimensional conditional distributions of observational (blue)and interventional (orange) data(no intervention at displayed variables X ,Y )

seeminglyno invarianceof conditional d.

plausibleinvarianceof conditional d.

Page 32: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

A procedure: population caserequire and exploit the Invariance Assumption

L(Y e|X eS∗) the same across e ∈ E

H0,γ,S(E) : γk = 0 if k /∈ S and∃Fε such that ∀ e ∈ E :

Y e = X eγ + εe, εe ⊥ X eS , ε

e ∼ Fε the same for all e

if H0,γ,S(E) is true:I model is correctI S, γ are plausible causal variables/predictors and

coefficientsand

H0,S(E) : there exists γ such that H0,γ,S(E) holds

S is called “plausible causal predictors” if H0,S(E) holds

Page 33: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

A procedure: population caserequire and exploit the Invariance Assumption

L(Y e|X eS∗) the same across e ∈ E

H0,γ,S(E) : γk = 0 if k /∈ S and∃Fε such that ∀ e ∈ E :

Y e = X eγ + εe, εe ⊥ X eS , ε

e ∼ Fε the same for all e

if H0,γ,S(E) is true:I model is correctI S, γ are plausible causal variables/predictors and

coefficientsand

H0,S(E) : there exists γ such that H0,γ,S(E) holds

S is called “plausible causal predictors” if H0,S(E) holds

Page 34: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

identifiable causal predictors under E :

is defined as the set S(E), where

S(E) =⋂{ S; H0,S(E) holds︸ ︷︷ ︸plausible causal predictors

}

the intersection of all plausible causal predictors

under the Invariance Assumption we have: for any S∗,

S(E) ⊆ S∗

and this is key to obtain confidence bounds for identifiablecausal predictors

Page 35: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

identifiable causal predictors under E :

is defined as the set S(E), where

S(E) =⋂{ S; H0,S(E) holds︸ ︷︷ ︸plausible causal predictors

}

the intersection of all plausible causal predictors

under the Invariance Assumption we have: for any S∗,

S(E) ⊆ S∗

and this is key to obtain confidence bounds for identifiablecausal predictors

Page 36: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

we have by definition:

S(E1) ⊆ S(E2) if E1 ⊆ E2

withI more interventionsI more “heterogeneity”I more “diversity in complex data”

we can identify more causal predictors

Page 37: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

identifiable causal predictors : S(E)↗ as E ↗

question: when is S(E) = S0 ?but it is not important that it is “equal to” or unique(see later)

Page 38: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Theorem (Peters, PB and Meinshausen, 2015)

S(E) = S0 = (parental set of Y in the causal DAG)

if there is:I a single do-intervention for each variable other than Y and |E| = pI a single noise intervention for each variable other than Y and |E| = pI a simultaneous noise intervention and |E| = 2

the conditions can be relaxed such that it is not necessary to intervene at allthe variables

Page 39: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Statistical confidence sets for causal predictors

“the finite sample version of S(E) =⋂

S{S; H0,S(E) is true}”

for “any” S ⊆ {1, . . . ,p}:test whether H0,S(E) is accepted or rejected

S(E) =⋂S

{H0,S accepted at level α}

Page 40: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

for H0,S(E):test constancy of regression param. and its residual error distr.across e ∈ E

by weakening H0,S(E) to H0,S(E):

∃β, σ : βepred(S) ≡ β, σe

pred(S) ≡ σ ∀e ∈ E

where

βepred(S) = argminβ;βk=0 (k /∈S)E|Y e − X eβ|2,

σepred(S) = (E|Y e − X eβe

pred(S)|2)1/2

note: H0,S(E) true =⇒ H0,S(E) true

Page 41: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

testing H0,S(E): assuming Gaussian errors

DT Σ−1D D

σ2ne∼ F (ne,n−e − |S| − 1)

D = Y e − Y e, Y e based on data \{e}ΣD = Ine + Xe,S(X T

−e,SX−e,S)−1X Te,S

reject H0,S(E) if p-value < α/|E|

Page 42: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

S(E) =⋂S

{H0,S accepted at level α}

for some significance level 0 < α < 1

going through all sets S?

Page 43: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

S(E) =⋂S

{H0,S accepted at level α}

for some significance level 0 < α < 1

going through all sets S?

Page 44: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

going through all sets S?

1. start with S = ∅: if H0,∅(E) accepted =⇒ S(E) = ∅2. consider small sets S of cardinality 1,2, . . .

and construct corresponding intersections S∩ withpreviously considered accepted sets S (H0,S(E) accepted)

for S with H0,S accepted :

S∩ ← S∩ ∩ S

if intersection S∩ = ∅ =⇒ S(E) = ∅if not:

discard all S with S ⊇ S∩

and continue with the remaining sets3. for large p:

restrict search space by variables from Lasso regression;need a faithfulness assumption (and sparsity and assumptionson X e for justification)

Page 45: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

confidence sets with invariant prediction

1. for each S ⊆ {1, . . . ,p} construct a set ΓS(E) as follows:

if H0,S(E) rejected at α/(2|E|) ; set ΓS(E) = ∅otherw. ΓS(E) = classic. (1− α/2) CI for βS based on XS

2. set

Γ(E) =⋃

S⊆{1,...,p}

ΓS(E) (est. plausible causal coeff.)

Page 46: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

we then obtain:

confidence set for S∗ : S(E)

confidence set for γ∗ : Γ(E)

Page 47: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Theorem (Peters, PB and Meinshausen, 2015)

assume: linear model, Gaussian errors

then, for any γ∗,S∗ satisfying the Invariance Assumption:

P[S(E) ⊆ S∗] ≥ 1− α : confidence w.r.t. true causal var.P[γ∗ ∈ Γ(E)] ≥ 1− α : confidence set for true causal param.

and can choose S∗ = S0 = causal variables in linear SEMif E does not affect struct. eqn. of Y

“on the safe side” (conservative)

we do not need to care about identifiability: if the effect is notidentifiable, the method will not wrongly claim an effect

we do not require that S∗ is minimal and unique

Page 48: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Theorem (Peters, PB and Meinshausen, 2015)

assume: linear model, Gaussian errors

then, for any γ∗,S∗ satisfying the Invariance Assumption:

P[S(E) ⊆ S∗] ≥ 1− α : confidence w.r.t. true causal var.P[γ∗ ∈ Γ(E)] ≥ 1− α : confidence set for true causal param.

and can choose S∗ = S0 = causal variables in linear SEMif E does not affect struct. eqn. of Y

“on the safe side” (conservative)

we do not need to care about identifiability: if the effect is notidentifiable, the method will not wrongly claim an effect

we do not require that S∗ is minimal and unique

Page 49: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

“the first” result on statistical confidence for potentiallynon-identifiable causal predictors when structure is unknown(route via graphical modeling for confidence sets seems awkward)

leading to (hopefully) morereliable causal inferential statements

Page 50: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

“Robustness”

if Invariance Assumption does not holdbecause E has a direct effect on Y

; for all S: L(Y e|X eS) is not invariant across e ∈ E

; S(E) = ∅ (at least as n→∞)still on the safe side but no power

Page 51: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Empirical results: simulations100 different scenarios, 1000 data sets per scenario:|E| = 2, nobs = ninterv ∈ {100, . . . ,500}, p ∈ {5, . . . ,40}

SU

CC

ES

S P

RO

BA

BIL

ITY

Mar

gina

lM

argi

nal

● ●

●●

●●

●●

●●

● ●

●●

●●

Reg

ress

ion

Reg

ress

ion

● ●

●●

●●

●●

●●

● ●

●●

● ●

●●

● ●

●●

Ges

Ges

● ●

● ●

●●

●●

● ●

Gie

s (u

nkno

wn)

Gie

s (u

nkno

wn)

●●

●●●

Gie

s (k

now

n)G

ies

(kno

wn) ●

●●

●●

●●

●● ●

● ●

●●

●●

●●

Ling

amLi

ngam

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

Inva

riant

pre

dict

ion

Inva

riant

pre

dict

ion

0.00

0.25

0.50

0.75

1.00

power todetect causal predictors

FW

ER

Mar

gina

l Mar

gina

l

● ●

●●

●●

●●

● ●

●●

●●

●●

Reg

ress

ion R

egre

ssio

n

● ●

●●

●●

●●

●●

● ●

●●

● ●

●●

● ●

●●

Ges

Ges

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

Gie

s (u

nkno

wn)

Gie

s (u

nkno

wn)

●●

●●

●●

Gie

s (k

now

n)G

ies

(kno

wn)

●●

●●

●●

●● ●

● ●

●●

●●

●●

Ling

amLi

ngam

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●● ●●● ●● ●●● ● ●●●● ●●● ●●●●●● ●●

●●●●

●●●● ●●●●● ●●● ●● ●●● ●●

●● ●● ● ●●

●●● ●●● ●●

●●● ●● ●●●●● ●●● ●●● ● ●●●●●

●● ●● ●● ●● ●●

Inva

riant

pre

dict

ion In

varia

nt p

redi

ctio

n

0.00

0.50

1.00

familywise error rate:P[S(E) 6⊆ S∗], aimed at 0.05

Page 52: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Single gene deletion experiments in yeast

p = 6170 genesresponse of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes

and thenresponse of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes

and so on

infer/predict the effects of a single gene knock-down on allother genes

Page 53: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)

data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)

Page 54: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Kemmeren et al. (2014):genome-wide mRNA expressions in yeast: p = 6170 genes

I nobs = 160 “observational” samples of wild-typesI nint = 1479 “interventional” samples

each of them corresponds to a single gene deletion strain

for our method:I we use |E| = 2 (observational and interventional data)I training-test data:• training: all observational and 2/3 of interventional data• test: other 1/3 of gene deletion interventions• repeat this for the three blocks of interventional data

I since every interventional data point is used once as aresponse variable:we use coverage 1− α/nint with α = 0.01 and nint = 1479

Page 55: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Results

8 genes are significant (α = 0.01 level) causal variables(each of the 8 genes “causes” another gene)

validation with test datamethod invar.pred. GIES PC-IDA marg.corr. rand.guess.

no. true pos. 6 2 2 2 *(out of 8)

*: quantiles for selecting true positives among 7 random draws2 (95%), 3 (99%)

; our invariant prediction method has most power !and it should exhibit control against false positive selections

Page 56: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

# INTERVENTION PREDICTIONS

# S

TR

ON

G IN

TE

RV

EN

TIO

N E

FF

EC

TS

0 5 10 15 20 25

02

46

8

PERFECTINVARIANTHIDDEN−INVARIANTPCRFCIREGRESSION (CV−Lasso)GES and GIESRANDOM (99% prediction− interval)

I : invariant prediction methodH: invariant prediction with some hidden variables

Page 57: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Validation (Meinshausen, Hauser, Mooij, Peters, Versteeg & PB, 2015)with intervention experiments: strong intervention effect (SIE)with yeastgenome.org database: scores A-F

rank cause effect SIE A B C D E F1 YMR104C YMR103C X2 YPL273W YMR321C X3 YCL040W YCL042W X4 YLL019C YLL020C X5 YMR186W YPL240C X X X X X X6 YDR074W YBR126C X X X X X X7 YMR173W YMR173W-A X8 YGR162W YGR264C9 YOR027W YJL077C X

10 YJL115W YLR170C11 YOR153W YDR011W X X12 YLR270W YLR345W13 YOR153W YBL005W14 YJL141C YNR007C15 YAL059W YPL211W16 YLR263W YKL098W17 YGR271C-A YDR339C18 YLL019C YGR130C19 YCL040W YML100W20 YMR310C YOR224C

SIE: correctly predicting a strong intervention effect which is inthe 1%- or 99% tail of the observational data

Page 58: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Flow cytometry data (Sachs et al., 2005)

I p = 11 abundances of chemical reagentsI 8 different environments (not “well-defined” interventions)

(one of them observational; 7 different reagents added)I each environment contains ne ≈ 700− 1′000 samples

goal:recover network of causal relations (linear SEM)

Raf

Mek

PLCg

PIP2

PIP3

Erk

Akt

PKA

PKC

p38

JNK

approach: invariant causal prediction(one variable the response Y ; the other 10 the covariates X ;

do this 11 times with every variable once the response)

main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail

Page 59: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail

instead of requiring invariance among 8 environments; require invariance for pairs of environments Eij (28 pairs intotal) and do Bonferroni correction by taking the union

S(E) = ∪i<j Sα/28(Eij)

this is weakening the assumption that an intervention shouldnot directly influence the response Y

and if a pair of interventions does nevertheless; S = ∅ (“robustness” as discussed before)

Page 60: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Raf

Mek

PLCg

PIP2

PIP3

Erk

Akt

PKA

PKC

p38

JNK

blue edges: only invariant causal prediction approach (ICP)red: only ICP allowing hidden variables and feedbackpurple: both ICP with and without hidden variablessolid: all relations that have been reported in literaturebroken: new findings not reported in the literature

; reasonable consensus with existing resultsbut no real ground-truth available

serves as an illustration that we can work with “vaguely definedinterventions”

Page 61: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Concluding thoughts

generalize Invariance Assumption and statistical testing tononparametric/nonlinear modelsin particular additive models

∀e ∈ E : Y e = f ∗(X eS∗) + εe, εe ∼ Fε, εe ⊥ XS∗

∀e ∈ E : Y e =∑j∈S∗

f ∗j (X ej ) + εe, εe ∼ Fε, εe ⊥ XS∗

the statistical significance testing becomes more difficultimproved identifiability with nonlinear SEMs (Mooij et al., 2009)

Page 62: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

generalize to include hidden variables

S0 = pa(Y )∩observed variables, S0H = pa(Y )∩hidden variables

presented procedure still OK if (essentially):- no interventions at Y , at S0

H and ancestors(S0H)

- no hidden confounder between Y and S0

Y

H1 H2H3

S0X1

X8

X12

Page 63: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

more general hidden variable models can be treated with a“related” technique

(Rothenhausler, Heinze, Peters & Meinshausen, 2015)

I

C

Y

E

W

I

X Y

W

1

(X ,Y ) form a DAGI

C

Y

E

W

I

X Y

W

1

X = (C,E)without knowing C,E

C ← Bc←i I + Bc←w W + εC

Y ← By←w W + By←cC + εY

E ← Be←i I + Be←w W + Be←cC + Be←y Y + εE

Page 64: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

provocative next step: how about using “Big Data”?

; structure the “large-scale” data into differentunknown groups of experimental settings Ethat is: learn E from data

“optimal” E is a trade-off betweenidentifiability and statistical power

Page 65: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

learning/estimating E

problem: given a “large bag of data”, can we estimate theunknown underlying different experimental conditions?

mathematically: denote byJ all experimental settings, satisfying Invariance Assumption; mixture distribution for e ∈ constr. E : F e =

∑j∈J we

j F j

when pooling two constr. experimental settings e1 and e2:; new mixture distribution with weights (we1 + we2)/2

- mixture modeling- change point modeling for (time) ordered datamight be useful to estimate E

Page 66: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

causal components remain the same fordifferent sub-populations or experimental settings

; exploit the power of heterogeneity in complex data!

and confidence bounds follow naturally

Page 67: Blessing of heterogeneous large-scale data for high ......Blessing of heterogeneous large-scale data for high-dimensional causal inference Peter Buhlmann¨ Seminar fur Statistik, ETH

Thank you!

SoftwareR-package: pcalg

(Kalisch, Machler, Colombo, Maathuis & PB, 2010–2015)R-package: InvariantCausalPrediction (Meinshausen, 2014)

References to some of our own work:I Peters, J., Buhlmann, P. and Meinshausen, N. (2015). Causal inference using invariant prediction:

identification and confidence intervals. To appear in J. Royal Statistical Society, Series B (with discussion).Preprint arXiv:1501.01332

I Meinshausen, N., Hauser, A. Mooij, J., Peters, J., Versteeg, P. and Buhlmann, P. and (2015). Causalinference from gene perturbation experiments: methods, software and validation. Preprint.

I Hauser, A. and Buhlmann, P. (2015). Jointly interventional and observational data: estimation ofinterventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal StatisticalSociety, Series B, 77, 291-318.

I Hauser, A. and Buhlmann, P. (2012). Characterization and greedy learning of interventional Markovequivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.

I Kalisch, M., Machler, M., Colombo, D., Maathuis, M.H. and Buhlmann, P. (2012). Causal inference usinggraphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.

I Stekhoven, D.J., Moraes, I., Sveinbjornsson, G., Hennig, L., Maathuis, M.H. and Buhlmann, P. (2012).Causal stability ranking. Bioinformatics 28, 2819-2823.

I Maathuis, M.H., Colombo, D., Kalisch, M. and Buhlmann, P. (2010). Predicting causal effects in large-scalesystems from observational data. Nature Methods 7, 247-248.

I Maathuis, M.H., Kalisch, M. and Buhlmann, P. (2009). Estimating high-dimensional intervention effects fromobservational data. Annals of Statistics 37, 3133-3164.