Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Blessing of heterogeneous large-scale datafor high-dimensional causal inference
Peter Buhlmann
Seminar fur Statistik, ETH Zurich
joint work with
Jonas PetersMPI Tubingen
Nicolai MeinshausenSfS ETH Zurich
Heterogeneous large-scale data
Big Data
the talk is not (yet) on “really big data”
but we will take advantage of heterogeneityoften arising with large-scale data where
i.i.d./homogeneity assumption is not appropriate
causal inference = intervention analysis
in genomics (for yeast or plants):if we would make an intervention at a single (or many) gene(s),what would be its (their) effect on a response of interest?
want to infer/predict such effects without actually doing theinterventione.g. from observational data (cf. Pearl; Spirtes, Scheines & Glymour)(from observations of a “steady-state system”)
or, as is mainly our focus: fromI observational and interventional data with well-specified
interventionsI changing environments or experimental settings with
“vaguely” or “un-” specified interventionsthat is, from “heterogeneous” data
causal inference = intervention analysis
in genomics (for yeast or plants):if we would make an intervention at a single (or many) gene(s),what would be its (their) effect on a response of interest?
want to infer/predict such effects without actually doing theinterventione.g. from observational data (cf. Pearl; Spirtes, Scheines & Glymour)(from observations of a “steady-state system”)
or, as is mainly our focus: fromI observational and interventional data with well-specified
interventionsI changing environments or experimental settings with
“vaguely” or “un-” specified interventionsthat is, from “heterogeneous” data
Genomics
1. Flowering of Arabidopsis Thaliana
phenotype/response variable of interest:Y = days to bolting (flowering)“covariates” X = gene expressions from p = 21′326 genes
question: infer/predict the effect of knocking-out a single geneon the phenotype/response variable Y?
using statistical method based on n = 47 observational data; validated the top-predictions with randomized experiments
with some moderate success(Stekhoven, Moraes, Sveinbjornsson, Hennig, Maathuis & PB, 2012)
Genomics
1. Flowering of Arabidopsis Thaliana
phenotype/response variable of interest:Y = days to bolting (flowering)“covariates” X = gene expressions from p = 21′326 genes
question: infer/predict the effect of knocking-out a single geneon the phenotype/response variable Y?
using statistical method based on n = 47 observational data; validated the top-predictions with randomized experiments
with some moderate success(Stekhoven, Moraes, Sveinbjornsson, Hennig, Maathuis & PB, 2012)
2. Gene expressions of yeast
p = 5360 genesphenotype of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes
and thenphenotype of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of single gene deletions on all othergenes
Effects of single gene deletions on all other genes (yeast)
(Maathuis, Colombo, Kalisch & PB, 2010)• p = 5360 genes (expression of genes)• 231 single gene deletions ; 1.2 · 106 intervention effects• the truth is “known in good approximation”
(thanks to intervention experiments)
goal: prediction of the true large intervention effectsbased on observational data with no knock-downs
n = 63observational data
False positives
True
pos
itive
s
0 1,000 2,000 3,000 4,000
0
200
400
600
800
1,000 IDALassoElastic−netRandom
REGRESSION
False positives
True
pos
itive
s
0 1,000 2,000 3,000 4,000
0
200
400
600
800
1,000 IDALassoElastic−netRandom
REGRESSION
because: for
Y =
p∑j=1
βjX (j) + ε
βj measures effect of X (j) on Y whenkeeping all other variables {X (k); k 6= j} fixed
but when doing an intervention at a gene ; some/many othergenes might change as well and cannot be kept fixed
causal inference framework: allows to definedynamic notion of intervention effect“without keeping other variables fixed”
False positives
True
pos
itive
s
0 1,000 2,000 3,000 4,000
0
200
400
600
800
1,000 IDALassoElastic−netRandom
“predictions of single genedeletions effects in yeast”
was a good finding for a difficultproblem...
... but some criticisms apply:I not very “robust”:
depends somewhat how we define the “truth”(has been found recently by J. Mooij, J. Peters and others)
I old, publicly available data (Hughes et al., 2000)I no collaborator... just (publicly available︸ ︷︷ ︸
this is good!
) data
I we didn’t use the interventional data for training
A new and “better” attempt for the “same” problemsingle gene knock-down in yeast and measure genomewideexpression (observational and interventional data)
goal: predict unseen gene knock-down effectsbased on both observational and interventional data
collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)
data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)
Graphical and Structural equation modelslate 1980s: Pearl; Spirtes, Glymour, Scheines; Dawid; Lauritzen;. . .
powerful language of graphs make formulations and modelsmore transparent (Evans and Richardson, 2011)
variables X1, . . . ,Xp+1 (Xp+1 = Y is the response of interest)
directed acyclic graph (DAG) D0 encoding the true underlyingcausal influence diagram
structural equation model (SEM):
Xj ← f 0j (XpaD0 (j), εj), j = 1, . . . ,p + 1,
ε1, . . . , εp+1 independent
e.g. linear Xj ←∑
k∈paD0 (j)
β0jkXk + εj j = 1, . . . ,p + 1
causal variables for Y = Xp+1: S0 = {j ; j ∈ paD0(Y )}causal coefficients for Y in linear SEM: β0
Yj , j ∈ paD0(Y )
severe issues of identifiability !
(X ,Y ) ∼ N2(0,Σ)
X XY Y
X causes Y Y causes X
agenda for estimation(Chickering, 2002; Shimizu, 2005; Kalisch & PB, 2007;... )
1. estimate the Markov equivalence class of DAGs D0: Dsevere issues of identifiability !
2. derive causal variables: the ones which are causal in allDAGs from D; derive bounds for causal effects based on D(Maathuis, Kalisch & PB, 2009)
goal: construction of confidence statements(without knowing the structure of the underlying graph)
problem:direct likelihood-based inference:
(as in e.g. Mladen Kolar’s talk for undirected graphs)difficult, because of severe identifiability issues !
nevertheless: our goal is to construct a procedureS ⊆ {1, . . . ,p} such that
P[ S ⊆ S0︸ ︷︷ ︸no false positives
] ≥ 1− α
without the need to specify identifiable componentsthat is: if a variable Xj cannot be identified to be causal⇒ j /∈ S
we do this with an entirely new framework(which might be “much simpler” for causal inference)NOT or AVOIDINGgraphical model fitting, potential outcome models,...and maybe more accessible to experts in regression modeling?
nevertheless: our goal is to construct a procedureS ⊆ {1, . . . ,p} such that
P[ S ⊆ S0︸ ︷︷ ︸no false positives
] ≥ 1− α
without the need to specify identifiable componentsthat is: if a variable Xj cannot be identified to be causal⇒ j /∈ S
we do this with an entirely new framework(which might be “much simpler” for causal inference)NOT or AVOIDINGgraphical model fitting, potential outcome models,...and maybe more accessible to experts in regression modeling?
Causal inference using invariant prediction
Peters, PB and Meinshausen (2015)
a main message:
causal structure/components remain the samefor different sub-populations
while the non-causal components can change acrosssub-populations
thus:; look for “stability” of structures among
different sub-populations
Causal inference using invariant prediction
Peters, PB and Meinshausen (2015)
a main message:
causal structure/components remain the samefor different sub-populations
while the non-causal components can change acrosssub-populations
thus:; look for “stability” of structures among
different sub-populations
goal:find the causal variables (components) among a p-dimensionalpredictor variable X for a specific response variable Y
consider data
(X e,Y e) ∼ F e, e ∈ E
with response variables Y e and predictor variables X e
here: e ∈ E︸︷︷︸space of exp. sett.
denotes an experimental setting
“heterogeneous” data fromdifferent environments/experiments e ∈ E
(aspect of “Big Data”)
goal:find the causal variables (components) among a p-dimensionalpredictor variable X for a specific response variable Y
consider data
(X e,Y e) ∼ F e, e ∈ E
with response variables Y e and predictor variables X e
here: e ∈ E︸︷︷︸space of exp. sett.
denotes an experimental setting
“heterogeneous” data fromdifferent environments/experiments e ∈ E
(aspect of “Big Data”)
data
(X e,Y e) ∼ F e, e ∈ E
example 1: E = {1,2} encoding observational (1) and allpotentially unspecific interventional data (2)
example 2: E = {1,2} encoding observational data (1) and(repeated) data from one specific intervention (2)
example 3: E = {1,2,3} ... or E = {1,2,3, . . . ,26} ...
do not need data from carefullydesigned (randomized) experiments
Invariance Assumption
for a set S∗ ⊆ {1, . . . ,p}: L(Y e|X eS∗) is invariant across e ∈ E
for linear model setting: there exists a vector γ∗ such that
∀e ∈ E : Y e = X eγ∗ + εe, εe ⊥ X eS∗ , S∗ = {j ; γ∗j 6= 0}
εe ∼ Fε the same for all eX e has an arbitrary distribution, different across e
γ∗, S∗ is interesting in its own right!
namely the parameter and structure which remain invariantacross experimental settings, or across heterogeneous groups
Invariance Assumption
for a set S∗ ⊆ {1, . . . ,p}: L(Y e|X eS∗) is invariant across e ∈ E
for linear model setting: there exists a vector γ∗ such that
∀e ∈ E : Y e = X eγ∗ + εe, εe ⊥ X eS∗ , S∗ = {j ; γ∗j 6= 0}
εe ∼ Fε the same for all eX e has an arbitrary distribution, different across e
γ∗, S∗ is interesting in its own right!
namely the parameter and structure which remain invariantacross experimental settings, or across heterogeneous groups
Moreover: link to causality
assume:
in short : L(Y e|X eS∗) is invariant across e ∈ E
Proposition (Peters, PB & Meinshausen, 2015)If E does not affect the structural equation for Y in a SEM:
e.g. linear SEM: Y e ←∑
k∈pa(Y )
βYk︸︷︷︸∀e
X ek + εe
Y︸︷︷︸∼Fε∀e
then S0 = pa(Y )︸ ︷︷ ︸causal var.
satisfies the Invariance Assumption (w.r.t. E)
the causal variables lead to invariance (of conditional distr.)
Moreover: link to causality
assume:
in short : L(Y e|X eS∗) is invariant across e ∈ E
Proposition (Peters, PB & Meinshausen, 2015)If E does not affect the structural equation for Y in a SEM:
e.g. linear SEM: Y e ←∑
k∈pa(Y )
βYk︸︷︷︸∀e
X ek + εe
Y︸︷︷︸∼Fε∀e
then S0 = pa(Y )︸ ︷︷ ︸causal var.
satisfies the Invariance Assumption (w.r.t. E)
the causal variables lead to invariance (of conditional distr.)
if E does not affect structural equation for Y :
S0 = pa(Y ) satisfies the Invariance Assumption
this assumption holds for example for:I do-intervention (Pearl) at variables different than YI noise (or “soft”) intervention (Eberhardt & Scheines, 2007)
at variables different than Y
in addition: there might be many other S∗ satisfying theInvariance Assumptionbut uniqueness is not really important (see later)
how do we know whetherE is not affecting structural equation for Y ?
if E does affect structural equation for Y :we will argue:
“robustness” of our procedure (proposed later)
; no causal statementsno false positivesconservative, but on the safe side
how do we know whetherE is not affecting structural equation for Y ?
if E does affect structural equation for Y :we will argue:
“robustness” of our procedure (proposed later)
; no causal statementsno false positivesconservative, but on the safe side
Invariance Assumption: plausible to hold with real data
two-dimensional conditional distributions of observational (blue)and interventional (orange) data(no intervention at displayed variables X ,Y )
seeminglyno invarianceof conditional d.
plausibleinvarianceof conditional d.
A procedure: population caserequire and exploit the Invariance Assumption
L(Y e|X eS∗) the same across e ∈ E
H0,γ,S(E) : γk = 0 if k /∈ S and∃Fε such that ∀ e ∈ E :
Y e = X eγ + εe, εe ⊥ X eS , ε
e ∼ Fε the same for all e
if H0,γ,S(E) is true:I model is correctI S, γ are plausible causal variables/predictors and
coefficientsand
H0,S(E) : there exists γ such that H0,γ,S(E) holds
S is called “plausible causal predictors” if H0,S(E) holds
A procedure: population caserequire and exploit the Invariance Assumption
L(Y e|X eS∗) the same across e ∈ E
H0,γ,S(E) : γk = 0 if k /∈ S and∃Fε such that ∀ e ∈ E :
Y e = X eγ + εe, εe ⊥ X eS , ε
e ∼ Fε the same for all e
if H0,γ,S(E) is true:I model is correctI S, γ are plausible causal variables/predictors and
coefficientsand
H0,S(E) : there exists γ such that H0,γ,S(E) holds
S is called “plausible causal predictors” if H0,S(E) holds
identifiable causal predictors under E :
is defined as the set S(E), where
S(E) =⋂{ S; H0,S(E) holds︸ ︷︷ ︸plausible causal predictors
}
the intersection of all plausible causal predictors
under the Invariance Assumption we have: for any S∗,
S(E) ⊆ S∗
and this is key to obtain confidence bounds for identifiablecausal predictors
identifiable causal predictors under E :
is defined as the set S(E), where
S(E) =⋂{ S; H0,S(E) holds︸ ︷︷ ︸plausible causal predictors
}
the intersection of all plausible causal predictors
under the Invariance Assumption we have: for any S∗,
S(E) ⊆ S∗
and this is key to obtain confidence bounds for identifiablecausal predictors
we have by definition:
S(E1) ⊆ S(E2) if E1 ⊆ E2
withI more interventionsI more “heterogeneity”I more “diversity in complex data”
we can identify more causal predictors
identifiable causal predictors : S(E)↗ as E ↗
question: when is S(E) = S0 ?but it is not important that it is “equal to” or unique(see later)
Theorem (Peters, PB and Meinshausen, 2015)
S(E) = S0 = (parental set of Y in the causal DAG)
if there is:I a single do-intervention for each variable other than Y and |E| = pI a single noise intervention for each variable other than Y and |E| = pI a simultaneous noise intervention and |E| = 2
the conditions can be relaxed such that it is not necessary to intervene at allthe variables
Statistical confidence sets for causal predictors
“the finite sample version of S(E) =⋂
S{S; H0,S(E) is true}”
for “any” S ⊆ {1, . . . ,p}:test whether H0,S(E) is accepted or rejected
S(E) =⋂S
{H0,S accepted at level α}
for H0,S(E):test constancy of regression param. and its residual error distr.across e ∈ E
by weakening H0,S(E) to H0,S(E):
∃β, σ : βepred(S) ≡ β, σe
pred(S) ≡ σ ∀e ∈ E
where
βepred(S) = argminβ;βk=0 (k /∈S)E|Y e − X eβ|2,
σepred(S) = (E|Y e − X eβe
pred(S)|2)1/2
note: H0,S(E) true =⇒ H0,S(E) true
testing H0,S(E): assuming Gaussian errors
DT Σ−1D D
σ2ne∼ F (ne,n−e − |S| − 1)
D = Y e − Y e, Y e based on data \{e}ΣD = Ine + Xe,S(X T
−e,SX−e,S)−1X Te,S
reject H0,S(E) if p-value < α/|E|
S(E) =⋂S
{H0,S accepted at level α}
for some significance level 0 < α < 1
going through all sets S?
S(E) =⋂S
{H0,S accepted at level α}
for some significance level 0 < α < 1
going through all sets S?
going through all sets S?
1. start with S = ∅: if H0,∅(E) accepted =⇒ S(E) = ∅2. consider small sets S of cardinality 1,2, . . .
and construct corresponding intersections S∩ withpreviously considered accepted sets S (H0,S(E) accepted)
for S with H0,S accepted :
S∩ ← S∩ ∩ S
if intersection S∩ = ∅ =⇒ S(E) = ∅if not:
discard all S with S ⊇ S∩
and continue with the remaining sets3. for large p:
restrict search space by variables from Lasso regression;need a faithfulness assumption (and sparsity and assumptionson X e for justification)
confidence sets with invariant prediction
1. for each S ⊆ {1, . . . ,p} construct a set ΓS(E) as follows:
if H0,S(E) rejected at α/(2|E|) ; set ΓS(E) = ∅otherw. ΓS(E) = classic. (1− α/2) CI for βS based on XS
2. set
Γ(E) =⋃
S⊆{1,...,p}
ΓS(E) (est. plausible causal coeff.)
we then obtain:
confidence set for S∗ : S(E)
confidence set for γ∗ : Γ(E)
Theorem (Peters, PB and Meinshausen, 2015)
assume: linear model, Gaussian errors
then, for any γ∗,S∗ satisfying the Invariance Assumption:
P[S(E) ⊆ S∗] ≥ 1− α : confidence w.r.t. true causal var.P[γ∗ ∈ Γ(E)] ≥ 1− α : confidence set for true causal param.
and can choose S∗ = S0 = causal variables in linear SEMif E does not affect struct. eqn. of Y
“on the safe side” (conservative)
we do not need to care about identifiability: if the effect is notidentifiable, the method will not wrongly claim an effect
we do not require that S∗ is minimal and unique
Theorem (Peters, PB and Meinshausen, 2015)
assume: linear model, Gaussian errors
then, for any γ∗,S∗ satisfying the Invariance Assumption:
P[S(E) ⊆ S∗] ≥ 1− α : confidence w.r.t. true causal var.P[γ∗ ∈ Γ(E)] ≥ 1− α : confidence set for true causal param.
and can choose S∗ = S0 = causal variables in linear SEMif E does not affect struct. eqn. of Y
“on the safe side” (conservative)
we do not need to care about identifiability: if the effect is notidentifiable, the method will not wrongly claim an effect
we do not require that S∗ is minimal and unique
“the first” result on statistical confidence for potentiallynon-identifiable causal predictors when structure is unknown(route via graphical modeling for confidence sets seems awkward)
leading to (hopefully) morereliable causal inferential statements
“Robustness”
if Invariance Assumption does not holdbecause E has a direct effect on Y
; for all S: L(Y e|X eS) is not invariant across e ∈ E
; S(E) = ∅ (at least as n→∞)still on the safe side but no power
Empirical results: simulations100 different scenarios, 1000 data sets per scenario:|E| = 2, nobs = ninterv ∈ {100, . . . ,500}, p ∈ {5, . . . ,40}
SU
CC
ES
S P
RO
BA
BIL
ITY
Mar
gina
lM
argi
nal
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Reg
ress
ion
Reg
ress
ion
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
Ges
Ges
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
Gie
s (u
nkno
wn)
Gie
s (u
nkno
wn)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
Gie
s (k
now
n)G
ies
(kno
wn) ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
Ling
amLi
ngam
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
Inva
riant
pre
dict
ion
Inva
riant
pre
dict
ion
0.00
0.25
0.50
0.75
1.00
power todetect causal predictors
FW
ER
Mar
gina
l Mar
gina
l
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Reg
ress
ion R
egre
ssio
n
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
Ges
Ges
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●●
●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Gie
s (u
nkno
wn)
Gie
s (u
nkno
wn)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
Gie
s (k
now
n)G
ies
(kno
wn)
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
Ling
amLi
ngam
●
●
● ●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
● ●● ●●● ●● ●●● ● ●●●● ●●● ●●●●●● ●●
●●●●
●●●● ●●●●● ●●● ●● ●●● ●●
●
●● ●● ● ●●
●●● ●●● ●●
●●● ●● ●●●●● ●●● ●●● ● ●●●●●
●● ●● ●● ●● ●●
●
●
Inva
riant
pre
dict
ion In
varia
nt p
redi
ctio
n
0.00
0.50
1.00
familywise error rate:P[S(E) 6⊆ S∗], aimed at 0.05
Single gene deletion experiments in yeast
p = 6170 genesresponse of interest: Y = expression of first gene“covariates” X = gene expressions from all other genes
and thenresponse of interest: Y = expression of second gene“covariates” X = gene expressions from all other genes
and so on
infer/predict the effects of a single gene knock-down on allother genes
collaborators:Frank Holstege, Patrick Kemmeren et al. (Utrecht)
data from modern technologyKemmeren, ..., and Holstege (Cell, 2014)
Kemmeren et al. (2014):genome-wide mRNA expressions in yeast: p = 6170 genes
I nobs = 160 “observational” samples of wild-typesI nint = 1479 “interventional” samples
each of them corresponds to a single gene deletion strain
for our method:I we use |E| = 2 (observational and interventional data)I training-test data:• training: all observational and 2/3 of interventional data• test: other 1/3 of gene deletion interventions• repeat this for the three blocks of interventional data
I since every interventional data point is used once as aresponse variable:we use coverage 1− α/nint with α = 0.01 and nint = 1479
Results
8 genes are significant (α = 0.01 level) causal variables(each of the 8 genes “causes” another gene)
validation with test datamethod invar.pred. GIES PC-IDA marg.corr. rand.guess.
no. true pos. 6 2 2 2 *(out of 8)
*: quantiles for selecting true positives among 7 random draws2 (95%), 3 (99%)
; our invariant prediction method has most power !and it should exhibit control against false positive selections
# INTERVENTION PREDICTIONS
# S
TR
ON
G IN
TE
RV
EN
TIO
N E
FF
EC
TS
0 5 10 15 20 25
02
46
8
PERFECTINVARIANTHIDDEN−INVARIANTPCRFCIREGRESSION (CV−Lasso)GES and GIESRANDOM (99% prediction− interval)
I : invariant prediction methodH: invariant prediction with some hidden variables
Validation (Meinshausen, Hauser, Mooij, Peters, Versteeg & PB, 2015)with intervention experiments: strong intervention effect (SIE)with yeastgenome.org database: scores A-F
rank cause effect SIE A B C D E F1 YMR104C YMR103C X2 YPL273W YMR321C X3 YCL040W YCL042W X4 YLL019C YLL020C X5 YMR186W YPL240C X X X X X X6 YDR074W YBR126C X X X X X X7 YMR173W YMR173W-A X8 YGR162W YGR264C9 YOR027W YJL077C X
10 YJL115W YLR170C11 YOR153W YDR011W X X12 YLR270W YLR345W13 YOR153W YBL005W14 YJL141C YNR007C15 YAL059W YPL211W16 YLR263W YKL098W17 YGR271C-A YDR339C18 YLL019C YGR130C19 YCL040W YML100W20 YMR310C YOR224C
SIE: correctly predicting a strong intervention effect which is inthe 1%- or 99% tail of the observational data
Flow cytometry data (Sachs et al., 2005)
I p = 11 abundances of chemical reagentsI 8 different environments (not “well-defined” interventions)
(one of them observational; 7 different reagents added)I each environment contains ne ≈ 700− 1′000 samples
goal:recover network of causal relations (linear SEM)
Raf
Mek
PLCg
PIP2
PIP3
Erk
Akt
PKA
PKC
p38
JNK
approach: invariant causal prediction(one variable the response Y ; the other 10 the covariates X ;
do this 11 times with every variable once the response)
main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail
main concern: different environments might have a directinfluence on the response ; Invariance Assumption would fail
instead of requiring invariance among 8 environments; require invariance for pairs of environments Eij (28 pairs intotal) and do Bonferroni correction by taking the union
S(E) = ∪i<j Sα/28(Eij)
this is weakening the assumption that an intervention shouldnot directly influence the response Y
and if a pair of interventions does nevertheless; S = ∅ (“robustness” as discussed before)
Raf
Mek
PLCg
PIP2
PIP3
Erk
Akt
PKA
PKC
p38
JNK
blue edges: only invariant causal prediction approach (ICP)red: only ICP allowing hidden variables and feedbackpurple: both ICP with and without hidden variablessolid: all relations that have been reported in literaturebroken: new findings not reported in the literature
; reasonable consensus with existing resultsbut no real ground-truth available
serves as an illustration that we can work with “vaguely definedinterventions”
Concluding thoughts
generalize Invariance Assumption and statistical testing tononparametric/nonlinear modelsin particular additive models
∀e ∈ E : Y e = f ∗(X eS∗) + εe, εe ∼ Fε, εe ⊥ XS∗
∀e ∈ E : Y e =∑j∈S∗
f ∗j (X ej ) + εe, εe ∼ Fε, εe ⊥ XS∗
the statistical significance testing becomes more difficultimproved identifiability with nonlinear SEMs (Mooij et al., 2009)
generalize to include hidden variables
S0 = pa(Y )∩observed variables, S0H = pa(Y )∩hidden variables
presented procedure still OK if (essentially):- no interventions at Y , at S0
H and ancestors(S0H)
- no hidden confounder between Y and S0
Y
H1 H2H3
S0X1
X8
X12
more general hidden variable models can be treated with a“related” technique
(Rothenhausler, Heinze, Peters & Meinshausen, 2015)
I
C
Y
E
W
I
X Y
W
1
(X ,Y ) form a DAGI
C
Y
E
W
I
X Y
W
1
X = (C,E)without knowing C,E
C ← Bc←i I + Bc←w W + εC
Y ← By←w W + By←cC + εY
E ← Be←i I + Be←w W + Be←cC + Be←y Y + εE
provocative next step: how about using “Big Data”?
; structure the “large-scale” data into differentunknown groups of experimental settings Ethat is: learn E from data
“optimal” E is a trade-off betweenidentifiability and statistical power
learning/estimating E
problem: given a “large bag of data”, can we estimate theunknown underlying different experimental conditions?
mathematically: denote byJ all experimental settings, satisfying Invariance Assumption; mixture distribution for e ∈ constr. E : F e =
∑j∈J we
j F j
when pooling two constr. experimental settings e1 and e2:; new mixture distribution with weights (we1 + we2)/2
- mixture modeling- change point modeling for (time) ordered datamight be useful to estimate E
causal components remain the same fordifferent sub-populations or experimental settings
; exploit the power of heterogeneity in complex data!
and confidence bounds follow naturally
Thank you!
SoftwareR-package: pcalg
(Kalisch, Machler, Colombo, Maathuis & PB, 2010–2015)R-package: InvariantCausalPrediction (Meinshausen, 2014)
References to some of our own work:I Peters, J., Buhlmann, P. and Meinshausen, N. (2015). Causal inference using invariant prediction:
identification and confidence intervals. To appear in J. Royal Statistical Society, Series B (with discussion).Preprint arXiv:1501.01332
I Meinshausen, N., Hauser, A. Mooij, J., Peters, J., Versteeg, P. and Buhlmann, P. and (2015). Causalinference from gene perturbation experiments: methods, software and validation. Preprint.
I Hauser, A. and Buhlmann, P. (2015). Jointly interventional and observational data: estimation ofinterventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal StatisticalSociety, Series B, 77, 291-318.
I Hauser, A. and Buhlmann, P. (2012). Characterization and greedy learning of interventional Markovequivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
I Kalisch, M., Machler, M., Colombo, D., Maathuis, M.H. and Buhlmann, P. (2012). Causal inference usinggraphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
I Stekhoven, D.J., Moraes, I., Sveinbjornsson, G., Hennig, L., Maathuis, M.H. and Buhlmann, P. (2012).Causal stability ranking. Bioinformatics 28, 2819-2823.
I Maathuis, M.H., Colombo, D., Kalisch, M. and Buhlmann, P. (2010). Predicting causal effects in large-scalesystems from observational data. Nature Methods 7, 247-248.
I Maathuis, M.H., Kalisch, M. and Buhlmann, P. (2009). Estimating high-dimensional intervention effects fromobservational data. Annals of Statistics 37, 3133-3164.