A statistical criterion for reducing indeterminacy in linear causal modeling

Gianluca Bontempi
Machine Learning Group, Computer Science Department
ULB, Université Libre de Bruxelles
Boulevard de Triomphe - CP 212, Bruxelles, Belgium
http://mlg.ulb.ac.be



Causality in science

A major goal of scientific activity is to model real phenomena by studying the dependencies between entities, objects or, more generally, variables.
Sometimes the goal of the modeling activity is simply to predict future behaviors. Sometimes the goal is to understand the causes of a phenomenon (e.g. a disease).
Understanding the causes of a phenomenon means understanding the mechanisms by which the observed variables take their values, and predicting how the values of those variables would change if the mechanisms were subject to manipulation.
Applications: understanding which actions to perform on a system to obtain a desired effect (e.g. understanding the causes of a tumor, of the activation of a gene, or of different survival rates in a cohort of patients).


Graphical models

Graphical models are a conventional formalism used to represent causal dependencies.

[Figure: causal graph with nodes Allergy, Smoking, Anxiety, Genetic factor, Hormonal factor, Coughing, Lung cancer, Metastasis, Other cancers.]

Figure 6: Markov blanket. The central node represents a disease of interest, which is our target of prediction. The nodes in the shaded area include members of the Markov blanket. Given these nodes, the target is independent of the other nodes in the network. The letters identify local three-variable causal templates: (a) and (b) colliders, (c) fork, and (d) chain.


Inferring causes from observations

Inferring causal relationships from observational data is an open challenge in machine learning.
State-of-the-art approaches often rely on algorithms which detect v-structures in triplets of nodes in order to orient arcs.
Bayesian network techniques search for unshielded colliders, i.e. patterns where two variables are both direct causes of a third one, without being each a direct cause of the other.
Under the assumptions of the Causal Markov Condition and Faithfulness, this structure is statistically distinguishable, and so-called constraint-based algorithms (notably the PC and the SGS algorithms) rely on conditional independence tests to orient, at least partially, a graph.


Indeterminacy

Conditional independence algorithms are destined to fail when confronted with completely connected triplets. In this case the lack of independencies makes conditional independence tests ineffective for inferring the direction of the arrows.
When there are no independencies, the direction of the arrows can be anything. In other terms, distinct causal patterns are not distinguishable.
If some causal patterns are not distinguishable in terms of conditional independence, this does not necessarily mean that they cannot be (at least partially) distinguished by using other measures.
Are there statistics or measures that can help in making the problem more distinguishable or, in other terms, provide information about which configuration is more probable?


Completely connected triplet

[Figure: completely connected triplet with arcs x1 → x2 (b1), x2 → x3 (b2), x1 → x3 (b3) and disturbances w1, w2, w3.]


Contributions

Our paper proposes a criterion to deal with arc orientation also in the presence of completely linearly connected triplets.
This criterion is then used in a Relevance-Causal (RC) algorithm, which combines the original causal criterion with a relevance measure, to infer causal dependencies from observational data.
A set of simulated experiments on the inference of the causal structure of linear and nonlinear networks shows the effectiveness of the proposed approach.


Motivations

We define a data-dependent measure able to reduce the statistical indistinguishability of completely and linearly connected triplets.
It is a modification of the covariance formula of a Structural Equation Model (SEM) which results in a statistic taking opposite signs for different causal patterns when the unexplained variations of the variables are of the same magnitude.
Though this assumption could appear too strong, assumptions of comparable strength (e.g. the existence of unshielded colliders) have been commonly used so far in causal inference.
We expect that this alternative approach could shed additional light on the issue of causality, in the perspective of extending it to more general configurations.


Completely connected triplet

[Figure: completely connected triplet with arcs x1 → x2 (b1), x2 → x3 (b2), x1 → x3 (b3) and disturbances w1, w2, w3.]


Structural equation models (SEM)

Let us consider the linear causal structure represented by the completely connected triplet. In algebraic notation this corresponds to the system of linear equations

x1 = w1

x2 = b1x1 + w2

x3 = b3x1 + b2x2 + w3

where it is assumed that each variable has mean 0, the bi ≠ 0 are also known as structural coefficients, and the disturbances, supposed to be independent, are designated by wi.
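As a quick sketch, the triplet SEM above can be simulated directly. The coefficient values, the disturbance scale and the sample size below are illustrative choices, not taken from the paper:

```python
import numpy as np

# Illustrative simulation of the triplet SEM:
#   x1 = w1,  x2 = b1*x1 + w2,  x3 = b3*x1 + b2*x2 + w3
# Coefficients and the (equal) disturbance scale are arbitrary choices.
rng = np.random.default_rng(0)
N = 100_000
b1, b2, b3 = 0.8, -0.5, 0.6
sigma = 1.0

w1, w2, w3 = (rng.normal(0.0, sigma, N) for _ in range(3))
x1 = w1
x2 = b1 * x1 + w2
x3 = b3 * x1 + b2 * x2 + w3

# Each variable has sample mean close to 0, as assumed by the model.
means = [x1.mean(), x2.mean(), x3.mean()]
```

With independent disturbances, the sample covariance between x1 and x2 approaches b1, consistent with the structural coefficients.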


SEM matrix form

This set of equations can be put in the matrix form

x = Ax + w

where x = [x1, x2, x3]^T,

A = | 0   0   0 |
    | b1  0   0 |
    | b3  b2  0 |

and w = [w1, w2, w3]^T. The multivariate variance-covariance matrix has no zero entries and is given by

Σ = (I − A)^(-1) G ((I − A)^T)^(-1)    (1)

where I is the identity matrix and

G = | σ1^2  0     0    |
    | 0     σ2^2  0    |
    | 0     0     σ3^2 |

is the diagonal covariance matrix of the disturbances.
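A numerical check of formula (1), with illustrative coefficients and unit disturbance variances (both choices are mine, not the paper's):

```python
import numpy as np

# Model-implied covariance Sigma = (I - A)^(-1) G ((I - A)^T)^(-1)
# for the completely connected triplet; coefficients are illustrative.
b1, b2, b3 = 0.8, -0.5, 0.6
A = np.array([[0.0, 0.0, 0.0],
              [b1,  0.0, 0.0],
              [b3,  b2,  0.0]])
G = np.diag([1.0, 1.0, 1.0])  # diagonal disturbance covariance

M = np.linalg.inv(np.eye(3) - A)
Sigma = M @ G @ M.T

# As stated above, Sigma has no zero entries when all bi are nonzero.
no_zero_entries = bool(np.all(np.abs(Sigma) > 1e-12))
```

With G equal to the identity, Sigma[1, 0] reduces to b1, matching the hand computation Cov(x1, x2) = b1 Var(x1).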


Structural equation models and causal inference

Structural equation modeling techniques for causal inference proceed by

1. making some assumptions on the structure underlying the data,

2. performing the related parameter estimation, usually based on maximum likelihood, and

3. assessing by significance testing the discrepancy between the observed (sample) covariance matrix Σ and the covariance matrix Σh implied by the hypothesis.

Conventional SEM is not able to reconstruct the right directionality of the connections in a completely connected triplet.


Indistinguishability in SEM

Suppose we want to test two alternative hypotheses, represented by the two directed graphs in the following slide. Note that the hypothesis on the left is correct, while the hypothesis on the right reverses the directionality of the link between x2 and x3 and consequently misses the causal role of the variable x2.
Let us consider the following question: is it possible to discriminate between structures 1 and 2 by simply relying on parameter estimation (in this case regression fitting) according to the hypothesized dependencies?
The answer is unfortunately negative. Both covariance matrices Σ1 and Σ2 coincide with Σ.


Two hypotheses

[Figure: the two hypothesized graphs. Hypothesis 1 (left): x1 → x2 (b1), x2 → x3 (b2), x1 → x3 (b3). Hypothesis 2 (right): x1 → x2 (b1), x3 → x2 (b2), x1 → x3 (b3).]


Our criterion

We propose here an alternative criterion able to perform this distinction.
The computation of our criterion requires the fitting of the two hypothetical structures, as in conventional SEM.
What is different is that, instead of computing the covariance term, we consider the term

S = (I − A)^(-1) ((I − A)^T)^(-1).

NOTA BENE: we will make the assumption that σ1 = σ2 = σ3 = σ, i.e. that the unexplained variations of the variables are of comparable magnitude.


Collider pattern

[Figure: collider pattern. x1 → x2 (b1), x2 → x3 (b2), x1 → x3 (b3); x3 collects the arrows from x1 and x2; disturbances w1, w2, w3.]


Two hypotheses

[Figure: the two hypothesized graphs, repeated. Hypothesis 1 (left): x1 → x2 (b1), x2 → x3 (b2), x1 → x3 (b3). Hypothesis 2 (right): x1 → x2 (b1), x3 → x2 (b2), x1 → x3 (b3).]


Collider case

Let us suppose that data are generated according to a collider structure where the node x3 is a collider.
We fit hypothesis 1 to the data by using least-squares. We obtain the matrix A1 and we compute the term

S1 = (I − A1)^(-1) ((I − A1)^T)^(-1).

We fit hypothesis 2 to the data by using least-squares. We obtain the matrix A2 and we compute the term S2.
We showed that the quantity

C(x1, x2, x3) = (S1[3, 3] − S2[3, 3]) + (S1[2, 2] − S2[2, 2])

is greater than zero for any sign of the structural coefficients.
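A minimal numerical sketch of this collider case, assuming Gaussian disturbances, illustrative coefficients, and plain least-squares for the two fits (all choices mine):

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients of y on the columns of X (zero-mean data)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def S_term(A):
    """S = (I - A)^(-1) ((I - A)^T)^(-1), as defined in the criterion."""
    M = np.linalg.inv(np.eye(3) - A)
    return M @ M.T

rng = np.random.default_rng(1)
N = 50_000
b1, b2, b3 = 0.7, 0.9, -0.8  # illustrative, with mixed signs

# Generate data with x3 as a collider: x1 -> x3 <- x2, plus x1 -> x2.
x1 = rng.normal(size=N)
x2 = b1 * x1 + rng.normal(size=N)
x3 = b3 * x1 + b2 * x2 + rng.normal(size=N)

# Hypothesis 1: x1 -> x2 and (x1, x2) -> x3.
A1 = np.zeros((3, 3))
A1[1, 0] = ols(x2, x1[:, None])[0]
A1[2, :2] = ols(x3, np.column_stack([x1, x2]))

# Hypothesis 2: x1 -> x3 and (x1, x3) -> x2 (the x2-x3 arc reversed).
A2 = np.zeros((3, 3))
A2[2, 0] = ols(x3, x1[:, None])[0]
A2[1, ::2] = ols(x2, np.column_stack([x1, x3]))  # columns 0 and 2

S1, S2 = S_term(A1), S_term(A2)
# Indices [2, 2] and [1, 1] are the 0-based versions of [3, 3] and [2, 2].
C = (S1[2, 2] - S2[2, 2]) + (S1[1, 1] - S2[1, 1])
```

With these coefficients the statistic comes out clearly positive, in agreement with the collider result stated above.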


Chain pattern

[Figure: chain pattern x1 → x3 → x2, together with a direct link between x1 and x2; disturbances w1, w2, w3.]


Chain pattern

Let us suppose that data are generated according to a chain structure where x3 is part of the chain pattern x1 → x3 → x2.
We fit hypothesis 1 to the data and we compute the term S1.

We fit hypothesis 2 to the data and we compute the term S2.

We showed that the quantity

C(x1, x2, x3) = (S1[3, 3] − S2[3, 3]) + (S1[2, 2] − S2[2, 2])

is less than zero, whatever the sign of the structural coefficients bi.
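The same numerical check, adapted to chain-generated data (self-contained, with the same illustrative coefficients as before), yields a negative statistic:

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients of y on the columns of X (zero-mean data)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def S_term(A):
    """S = (I - A)^(-1) ((I - A)^T)^(-1)."""
    M = np.linalg.inv(np.eye(3) - A)
    return M @ M.T

rng = np.random.default_rng(2)
N = 50_000
b1, b2, b3 = 0.7, 0.9, -0.8  # illustrative

# Generate data along the chain x1 -> x3 -> x2, plus the direct arc x1 -> x2.
x1 = rng.normal(size=N)
x3 = b3 * x1 + rng.normal(size=N)
x2 = b1 * x1 + b2 * x3 + rng.normal(size=N)

# Fit the same two hypotheses as in the collider case.
A1 = np.zeros((3, 3))
A1[1, 0] = ols(x2, x1[:, None])[0]
A1[2, :2] = ols(x3, np.column_stack([x1, x2]))

A2 = np.zeros((3, 3))
A2[2, 0] = ols(x3, x1[:, None])[0]
A2[1, ::2] = ols(x2, np.column_stack([x1, x3]))  # columns 0 and 2

S1, S2 = S_term(A1), S_term(A2)
C = (S1[2, 2] - S2[2, 2]) + (S1[1, 1] - S2[1, 1])
```

Here C comes out below zero, matching the chain result and mirroring the collider case with the opposite sign.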


Fork pattern

[Figure: fork pattern where x3 is a common cause of x1 and x2, together with a direct link between x1 and x2; disturbances w1, w2, w3.]


Fork pattern

Let us suppose that data are generated according to a fork structure where x3 is a common cause of x1 and x2.
We fit hypothesis 1 to the data and we compute the term S1.

We fit hypothesis 2 to the data and we compute the term S2.

We showed that the quantity

C(x1, x2, x3) = (S1[3, 3] − S2[3, 3]) + (S1[2, 2] − S2[2, 2])

is less than zero, whatever the sign of the structural coefficients bi.


Causal statistic

The previous results show that the computation of the quantity C on the basis of observational data only can help in discriminating between the collider configuration, where the nodes x1 and x2 are direct causes of x3 (C > 0), and non-collider configurations (i.e. fork or chain) (C < 0).
In other terms, given a completely connected triplet of variables, the quantity C(x1, x2, x3) returns useful information about the causal role of x1 and x2 with respect to x3, whatever the strength or the direction of the link between x1 and x2.
The properties of the quantity C encourage its use in an algorithm to infer directionality from observational data.


RC (Relevance Causal) algorithm

We propose then an RC (Relevance-Causal) algorithm for linear causal modeling, inspired by the mIMR causal filter selection algorithm.
It is a forward selection algorithm which, given a set XS of d already selected variables, updates this set by adding the (d+1)-th variable which satisfies

x*_{d+1} = arg max_{xk ∈ X − XS} [ (1 − λ) R({XS, xk}; y) + (λ/d) Σ_{xi ∈ XS} C(xi; xk; y) ]    (2)

where λ ∈ [0, 1] weights the R and the C contributions, the R term quantifies the relevance of the subset {XS, xk}, and the C term quantifies the causal role of an input xk with respect to the set of selected variables xi ∈ XS.
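A hedged sketch of the forward-selection update (2). The relevance term below is a simple in-sample R^2 and the causal term reuses the triplet statistic from the earlier slides; both are stand-ins for the paper's exact choices (e.g. leave-one-out relevance), and all function names are mine:

```python
import numpy as np

def _ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def _S(A):
    return np.linalg.inv(np.eye(3) - A) @ np.linalg.inv(np.eye(3) - A).T

def C_stat(xi, xk, y):
    """Triplet statistic: positive suggests (xi, xk) -> y collider."""
    A1, A2 = np.zeros((3, 3)), np.zeros((3, 3))
    A1[1, 0] = _ols(xk, xi[:, None])[0]
    A1[2, :2] = _ols(y, np.column_stack([xi, xk]))
    A2[2, 0] = _ols(y, xi[:, None])[0]
    A2[1, ::2] = _ols(xk, np.column_stack([xi, y]))
    S1, S2 = _S(A1), _S(A2)
    return (S1[2, 2] - S2[2, 2]) + (S1[1, 1] - S2[1, 1])

def relevance(Xs, y):
    """In-sample R^2 of a linear fit (stand-in for the paper's R term)."""
    resid = y - Xs @ _ols(y, Xs)
    return 1.0 - resid.var() / y.var()

def rc_select(X, y, lam, n_select):
    """Forward selection maximizing (1 - lam)*R + (lam/d) * sum_i C(xi; xk; y)."""
    selected = []
    while len(selected) < n_select:
        d = max(len(selected), 1)
        def score(k):
            rel = relevance(X[:, selected + [k]], y)
            caus = sum(C_stat(X[:, i], X[:, k], y) for i in selected)
            return (1.0 - lam) * rel + lam * caus / d
        best = max((k for k in range(X.shape[1]) if k not in selected), key=score)
        selected.append(best)
    return selected

# Toy usage: y has two real parents (columns 0 and 1) and one irrelevant input.
rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 3))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=20_000)
picked = rc_select(X, y, lam=0.5, n_select=2)
```

On this toy problem the two true parents are selected first: they score high on both relevance and, once one parent is in, on the collider statistic C.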


Experimental setting

The aim of the experiment is to reverse engineer both linear and nonlinear scale-free causal networks, i.e. networks where the distribution of the degree follows a power law, from a limited amount of observational data.
We consider networks with n = 5000 nodes, and the degree α of the power law ranges between 2.1 and 3. The inference is done on the basis of N = 200 observations.
We compare the accuracy of several algorithms in terms of the mean F-measure (the higher, the better), averaged over 10 runs and over all the nodes with a number of parents and children greater than or equal to two.


Experimental setting

We considered the following algorithms for comparison:

- the IAMB algorithm, implemented by the Causal Explorer software, which estimates for a given variable the set of variables belonging to its Markov blanket,
- the mIMR causal algorithm,
- the mRMR algorithm, and
- three versions of the RC algorithm with three different values λ = 0, 0.5, 1.

Note that the RC algorithm with λ = 0 boils down to a conventional wrapper algorithm based on the leave-one-out assessment of the variables' subsets.


Linear inference

α     IAMB   mIMR   mRMR   RC0    RC0.5  RC1
2.2   0.375  0.324  0.319  0.386  0.421  0.375
2.3   0.378  0.337  0.333  0.387  0.437  0.401
2.4   0.376  0.342  0.342  0.385  0.441  0.414
2.5   0.348  0.322  0.313  0.358  0.422  0.413
2.6   0.347  0.318  0.311  0.355  0.432  0.414
2.7   0.344  0.321  0.311  0.352  0.424  0.423
2.8   0.324  0.304  0.293  0.334  0.424  0.422
2.9   0.342  0.333  0.321  0.353  0.448  0.459
3.0   0.321  0.319  0.297  0.326  0.426  0.448

F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred networks.


Nonlinear inference

α     IAMB   mIMR   mRMR   RC0    RC0.5  RC1
2.2   0.312  0.310  0.304  0.314  0.356  0.324
2.3   0.317  0.328  0.316  0.320  0.375  0.349
2.4   0.304  0.317  0.304  0.306  0.366  0.351
2.5   0.321  0.327  0.328  0.325  0.379  0.359
2.6   0.306  0.325  0.306  0.309  0.379  0.365
2.7   0.313  0.319  0.303  0.316  0.380  0.359
2.8   0.297  0.326  0.300  0.300  0.392  0.382
2.9   0.310  0.329  0.313  0.313  0.389  0.377
3.0   0.299  0.324  0.300  0.303  0.399  0.392

F-measure (averaged over all nodes with a number of parents and children ≥ 2 and over 10 runs) of the accuracy of the inferred networks.


Discussion

The results show the potential of the criterion C and of the RC algorithm in network inference tasks where dependencies between parents are frequent because of direct links or common ancestors.
According to the F-measures reported in the tables, the RC accuracy with λ = 0.5 and λ = 1 is consistently better than those of the mIMR, mRMR and IAMB algorithms for all the considered degrees of the distribution.
RC outperforms a conventional wrapper approach which targets only prediction accuracy (λ = 0) when a causal criterion C is taken into account together with a predictive one (λ = 0.5).


Conclusions

Causal inference from complex, large-dimensional data is taking a growing importance in machine learning and knowledge discovery.
Currently, most of the existing algorithms are limited by the fact that the discovery of causal directionality is subordinated to the detection of a limited set of distinguishable patterns, like unshielded colliders.
However, the scarcity of data and the intricacy of dependencies in networks could make the detection of such patterns so rare that the resulting precision would be unacceptable.
This paper shows that it is possible to identify new statistical measures helping to reduce indistinguishability, under the assumption of equal variances of the unexplained variations of the three variables.


Conclusions

Though this assumption could be questioned, we deem that it is important to define new statistics to help discriminate between causal structures for completely connected triplets in linear causal modeling.
Future work will focus on assessing whether such a statistic is useful in reducing indeterminacy also when the assumption of equal variance is not satisfied.