
More Causal Inference with Graphical Models in R

Package pcalg

Markus Kalisch, ETH Zurich

Martin Mächler, ETH Zurich

Diego Colombo, ETH Zurich

Alain Hauser, University of Bern

Marloes H. Maathuis, ETH Zurich

Peter Bühlmann, ETH Zurich

Abstract

The pcalg package for R (R Development Core Team 2014) can be used for the following two purposes: Causal structure learning and estimation of causal effects from observational and/or interventional data. In this document, we give a brief overview of the methodology, and demonstrate the package's functionality in both toy examples and applications.

This vignette is an updated and extended (FCI, RFCI, etc.) version of Kalisch et al. (2012), which was written for pcalg 1.1-4.

Keywords: IDA, PC, RFCI, FCI, GES, GIES, do-calculus, causality, graphical model, R.

1. Introduction

Understanding cause-effect relationships between variables is of primary interest in many fields of science. Usually, experimental intervention is used to find these relationships. In many settings, however, experiments are infeasible because of time, cost or ethical constraints.

We therefore consider the problem of inferring causal information from observational data. Under some assumptions, the algorithms PC (see Spirtes et al. 2000), FCI (see Spirtes et al. 2000, 1999a), RFCI (see Colombo et al. 2012) and GES (see Chickering 2002) can infer information about the causal structure from observational data; there also exists a generalization of GES to interventional data, GIES (see Hauser and Bühlmann 2012). These algorithms tell us which variables could or could not be a cause of some variable of interest. They do not, however, give information about the size of the causal effects. We therefore developed the IDA method (Maathuis et al. 2009), which can infer bounds on causal effects from observational data under some assumptions, in particular that no hidden and selection variables are present. IDA is a two-step approach that combines the PC algorithm and Pearl's backdoor criterion (Pearl 1993), which was designed for DAGs without latent variables. IDA was validated on a large-scale biological system (see Maathuis et al. 2010). Since the assumption of no latent variables is strong when working with real data and therefore often violated in practice, we generalized Pearl's backdoor criterion (see Maathuis and Colombo 2013) to more general types of graphs, i.e., CPDAGs, MAGs, and PAGs, that describe Markov equivalence classes of DAGs with and without latent variables but without selection variables.

For broader use of these methods, well documented and easy to use software is indispensable. We therefore wrote the R package pcalg, which contains implementations of the algorithms PC, FCI, RFCI, GES and GIES, as well as of the IDA method and the generalized Pearl's backdoor criterion. The objective of this paper is to introduce the R package pcalg, explain the range of functions on simulated data sets and summarize some applications.

To get started, we show how two of the main functions (one for causal structure learning and one for estimating causal effects from observational data) can be used in a typical application. Suppose we have a system described by some variables and many observations of this system. Furthermore, assume that it seems plausible that there are no hidden variables and no feedback loops in the underlying causal system. The causal structure of such a system can be conveniently represented by a directed acyclic graph (DAG), where each node represents a variable and each directed edge represents a direct cause. To fix ideas, we have simulated an example data set with p = 8 continuous variables with Gaussian noise and n = 5000 observations, which we will now analyze. First, we load the package pcalg and the data set.

> library("pcalg")

> data("gmG")

In the next step, we use the function pc() to produce an estimate of the underlying causal structure. Since this function is based on conditional independence tests, we need to define two things. First, we need a function that can compute conditional independence tests in a way that is suitable for the data at hand. For standard data types (Gaussian, discrete and binary) we provide predefined functions. See the example section in the help file of pc() for more details. Secondly, we need a summary of the data (sufficient statistic) on which the conditional independence function can work. Each conditional independence test can be performed at a certain significance level alpha, which can be treated as a tuning parameter. In the following code, we use the predefined function gaussCItest() as conditional independence test and create the corresponding sufficient statistic, consisting of the correlation matrix of the data and the sample size. Then we use the function pc() to estimate the causal structure and plot the result.

> suffStat <- list(C = cor(gmG8$x), n = nrow(gmG8$x))
> pc.gmG <- pc(suffStat, indepTest = gaussCItest,
+              p = ncol(gmG8$x), alpha = 0.01)

As can be seen in Fig. 1, there are directed and bidirected edges in the estimated causal structure. The directed edges show the presence and direction of direct causal effects. A bidirected edge means that the PC-algorithm was unable to decide whether the edge orientation should be → or ←. Thus, bidirected edges represent some uncertainty in the resulting model. They reflect the fact that in general one cannot estimate a unique DAG from observational data, not even with an infinite amount of data, since several DAGs can describe the same conditional independence information.

On the inferred causal structure, we can estimate the causal effect of an intervention. Denote the variable corresponding to node i in the graph by Vi. For example, suppose that, by external intervention, we first set the variable V1 to some value x, and then to the value x + 1. The recorded average change in variable V6 is the (total) causal effect of V1 on V6.


Figure 1: True underlying causal DAG (left) and estimated causal structure (right), representing a Markov equivalence class of DAGs that all encode the same conditional independence information. (Due to the large sample size, there were no sampling errors.)

More precisely, the causal effect C(V1, V6, x) of V1 on V6 at V1 = x is defined as

C(V1, V6, x) = E(V6 | do(V1 = x + 1)) − E(V6 | do(V1 = x))   or

C(V1, V6, x) = (∂/∂x') E(V6 | do(V1 = x')) |_{x' = x},

where do(V1 = x) denotes Pearl's do-operator (see Pearl 2000). If the causal relationships are linear, these two expressions are equivalent and do not depend on x.

Since the causal structure was not identified uniquely in our example, we cannot expect to get a unique number for the causal effect. Instead, we get a set of possible causal effects. This set can be computed by using the function ida(). To provide full quantitative information, we need to pass the covariance matrix in addition to the estimated causal structure.

> ida(1, 6, cov(gmG8$x), pc.gmG@graph)

[1] 0.5321 0.5304

Since we simulated the data, we know that the true value of the causal effect is 0.52. Thus, one of the two estimates is indeed close to the true value. Since both values are larger than zero, we can conclude that variable V1 has a positive causal effect on variable V6. Thus, we can always estimate a lower bound for the absolute value of the causal effect. (Note that at this point we have no p-value to control the sampling error.)
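Because ida() returns the whole multiset of possible effects, this lower bound can be read off directly from its output; a minimal sketch using the call from above:

> min(abs(ida(1, 6, cov(gmG8$x), pc.gmG@graph)))

[1] 0.5304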

If we would like to know the effect of a unit increase in variable V1 on variables V4, V5 and V6, we could simply call ida() three times. However, a faster way is to call the function idaFast(), which was tailored for such situations.

> idaFast(1, c(4,5,6), cov(gmG8$x), pc.gmG@graph)

       [,1]       [,2]
4 -0.010329 -0.0098746
5  0.043770 -0.0056205
6  0.532096  0.5303967

Each row in the output shows the estimated set of possible causal effects on the target variable indicated by the row name. The true values of the causal effects are 0, 0.05 and 0.52 for variables V4, V5 and V6, respectively. The first row, corresponding to variable V4, quite accurately indicates a causal effect that is very close to zero or no effect at all. The second row, corresponding to variable V5, is rather uninformative: Although one entry comes close to the true value, the other estimate is close to zero. Thus, we cannot be sure if there is a causal effect at all. The third row, corresponding to V6, was already discussed above.

2. Methodological background

In Section 2.1 we propose methods for estimating the causal structure. In particular, we discuss algorithms for structure learning

  • in the absence of hidden variables from observational data, such as PC (see Spirtes et al. 2000), GES (see Chickering 2002), and the dynamic programming approach of Silander and Myllymäki (2006),

  • from observational data accounting for hidden variables, such as FCI (see Spirtes et al. 2000, 1999a) and RFCI (see Colombo et al. 2012),

  • in the absence of hidden variables from jointly observational and interventional data, such as GIES (see Hauser and Bühlmann 2012).

In Section 2.2 we first describe the IDA method (see Maathuis et al. 2009) to obtain bounds on causal effects from observational data when no latent and selection variables are present. This method is based on first estimating the causal structure and then applying the do-calculus (see Pearl 2000). We then propose the generalized Pearl's backdoor criterion (see Maathuis and Colombo 2013), which works with DAGs, CPDAGs, MAGs, and PAGs as input and assumes that there are arbitrarily many latent variables but no selection variables. This method is based on two steps: first it checks whether the total causal effect of one variable X on another variable Y is identifiable via the generalized backdoor criterion in the given type of graph, and if this is the case it explicitly gives a set of variables that satisfies the generalized backdoor criterion with respect to X and Y in the given graph.

2.1. Estimating causal structures with graphical models

Graphical models can be thought of as maps of the dependence structure of a given probability distribution or a sample thereof (see for example Lauritzen 1996). To illustrate the analogy, consider a road map. To be able to use a road map, one needs two things. First, one needs the physical map with symbols such as dots and lines. Second, one needs a rule for interpreting the symbols. For instance, a railroad map and a map for electric circuits might look very much alike, but their interpretation differs a lot. In the same sense, a graphical model is a map. First, a graphical model consists of a graph with dots, lines and potentially edge marks like arrowheads or circles. Second, a graphical model always comes with a rule for interpreting this graph. In general, nodes in the graph represent (random) variables and edges represent some kind of dependence.


Without hidden and selection variables

An example of a graphical model is the DAG model. The physical map here is a graph consisting of nodes and directed edges (→ or ←). As a further restriction, the edges must be directed in such a way that it is not possible to trace a cycle when following the arrowheads (i.e., no directed cycles). The interpretation rule is called d-separation. This rule is a bit intricate and we refer the reader to Lauritzen (1996) for more details. The interpretation rule can be used in the following way: If two nodes x and y are d-separated by a set of nodes S, then the corresponding random variables Vx and Vy are conditionally independent given the set of random variables VS. In the following, we only deal with distributions whose list of conditional independencies perfectly matches the list of d-separation relations of some DAG; such distributions are called faithful. It has been shown that the set of distributions that are faithful is the overwhelming majority (Meek 1995), so that the assumption does not seem to be very strict in practice.
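The package also lets one query d-separation directly via the function dsep(); a small sketch of a hypothetical query, assuming that dsep() accepts two node labels, a conditioning set of labels and a graphNEL object, as described in its help file (here using the true DAG gmG8$g with the node labels shown in Fig. 1):

> ## Sketch: are "Author" and "V5" d-separated given "Bar" in the true DAG?
> dsep("Author", "V5", "Bar", gmG8$g)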

Since the DAG model encodes conditional independencies, it seems plausible that information on the latter helps to infer aspects of the former. This intuition is made precise in the PC algorithm (see Spirtes et al. 2000; PC stands for the initials of its inventors Peter Spirtes and Clark Glymour), which was proven to reconstruct the structure of the underlying DAG model, given a conditional independence oracle, up to its Markov equivalence class, which is discussed in more detail below. In practice, the conditional independence oracle is replaced by a statistical test for conditional independence. For situations without hidden variables and under some further conditions, it has been shown that the PC algorithm using statistical tests instead of an independence oracle is computationally feasible and consistent even for very high-dimensional sparse DAGs (see Kalisch and Bühlmann 2007).

As mentioned before, several DAGs can encode the same list of conditional independencies. One can show that such DAGs must share certain properties. To be more precise, we define a v-structure as the subgraph i → j ← k on the nodes i, j and k where i and k are not adjacent (i.e., there is no edge between i and k). Furthermore, let the skeleton of a DAG be the graph that is obtained by removing all arrowheads from the DAG. It was shown that two DAGs encode the same conditional independence statements if and only if they have the same skeleton and the same v-structures (see Verma and Pearl 1990). Such DAGs are called Markov equivalent. In this way, the space of DAGs can be partitioned into equivalence classes, where all members of an equivalence class encode the same conditional independence information. Conversely, given a conditional independence oracle, one can only determine a DAG up to its equivalence class. Therefore, the PC algorithm cannot determine the DAG uniquely, but only the corresponding equivalence class of the DAG.

An equivalence class can be visualized by a graph that has the same skeleton as every DAG in the equivalence class and directed edges only where all DAGs in the equivalence class have the same directed edge. Edges that point in one direction for some DAGs in the equivalence class and in the other direction for other DAGs in the equivalence class are visualized by bidirected edges (sometimes, undirected edges are used instead). This graph is called a completed partially directed acyclic graph, CPDAG (Spirtes et al. 2000), or essential graph (Andersson et al. 1997).
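The package provides helpers to move between these representations; for instance, dag2cpdag() computes the CPDAG of a given DAG. A brief sketch using a randomly generated DAG:

> set.seed(42)
> g <- randomDAG(4, prob = 0.5)   # a random DAG on 4 nodes
> cpdag <- dag2cpdag(g)           # CPDAG of its Markov equivalence class
> par(mfrow = c(1,2)); plot(g); plot(cpdag)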

We now describe the PC-algorithm, which is shown in Algorithm 1, in more detail. The PC-algorithm starts with a complete undirected graph, G0, as stated in (P1) of Algorithm 1. In stage (P2), a series of conditional independence tests is done and edges are deleted in the following way.


Algorithm 1 Outline of the PC-algorithm

Input: Vertex set V, conditional independence information, significance level α
Output: Estimated CPDAG G, separation sets S
Edge types: — (undirected), → (directed)

(P1) Form the complete undirected graph on the vertex set V
(P2) Test conditional independence given subsets of adjacency sets at the given significance level α and delete edges if conditionally independent
(P3) Orient v-structures
(P4) Orient remaining edges.

First, all pairs of nodes are tested for marginal independence. If two nodes i and j are judged to be marginally independent at level α, the edge between them is deleted and the empty set is saved as separation sets S[i, j] and S[j, i]. After all pairs have been tested for marginal independence and some edges might have been removed, a graph results which we denote by G1. In the second step, all pairs of nodes (i, j) still adjacent in G1 are tested for conditional independence given any single node in adj(G1, i)\{j} or adj(G1, j)\{i}, where adj(G, i) denotes the set of nodes in graph G that are adjacent to node i. If there is any node k such that Vi and Vj are conditionally independent given Vk, the edge between i and j is removed and node k is saved as separation set (sepset) S[i, j] and S[j, i]. When all adjacent pairs have been tested given one adjacent node, a new graph results which we denote by G2. The algorithm continues in this way by increasing the size of the conditioning set step by step, and stops when all adjacency sets in the current graph are smaller than the size of the conditioning set. The result is the skeleton, in which every edge is still undirected. Within (P3), each triple of vertices (i, k, j) such that the pairs (i, k) and (j, k) are each adjacent in the skeleton but (i, j) is not (such a triple is called an "unshielded triple") is oriented based on the information saved in the separation sets S[i, j] and S[j, i]. More precisely, an unshielded triple i — k — j is oriented as i → k ← j if k is not in S[j, i] = S[i, j]. Finally, in (P4) it may be possible to orient some of the remaining edges, since one can deduce that one of the two possible directions of an edge is invalid because it would introduce a new v-structure or a directed cycle. Such edges are found by repeatedly applying the rules described in Spirtes et al. (2000), p. 85. The resulting output is the equivalence class (CPDAG) that describes the conditional independence information in the data, in which every edge is either undirected or directed. (To simplify visual presentation, undirected edges are depicted as bidirected edges in the output as soon as at least one directed edge is present. If no directed edge is present, all edges are undirected.)

It is known that the PC algorithm is order-dependent in steps (P2)–(P4), meaning that the output depends on the order in which the variables are given. Colombo and Maathuis (2013) proposed several modifications of the PC algorithm (see Sections 3.1 and 3.2) that partly or fully remove these order-dependence issues in each step.

The PC algorithm presented so far is based on conditional independence tests. Score-based methods form an alternative approach to causal inference. They try to find a CPDAG that maximizes a score, typically a model selection criterion, which is calculated from data.

One of the most popular scores is the Bayesian information criterion (BIC) because its maximization leads to consistent model selection in the classical large-sample limit (Haughton 1988; Geiger et al. 2001). However, computing its maximum is an NP-hard problem (Chickering 1996). An exhaustive search is computationally infeasible due to the size of the search space, the space of DAGs or CPDAGs, respectively. Silander and Myllymäki (2006) have presented an exact dynamic programming algorithm with exponential time complexity. Its execution is feasible for models with a few dozen variables.

The greedy equivalence search (GES) of Chickering (2002) makes the maximization of the BIC computationally feasible for much larger graphs. As the name of the algorithm implies, GES maximizes the BIC in a greedy way, but still guarantees consistency in the large-sample limit. It still has exponential-time complexity in the worst case, but only polynomial complexity in the average case, where the size of the largest clique in a graph grows only logarithmically with the number of nodes (Grimmett and McDiarmid 1975).

GES greedily optimizes the BIC in two phases:

  • In the forward phase, the algorithm starts with the empty graph. It then sequentially moves to larger CPDAGs by operations that correspond to adding single arrows in the space of DAGs. This phase is aborted when no augmentation of the BIC is possible any more.

  • In the backward phase, the algorithm moves again in the direction of smaller graphs by operations that correspond to removing single arrows in the space of DAGs. The algorithm terminates as soon as no augmentation of the BIC is possible any more.

A key ingredient for the fast exploration of the search space in GES is an evaluation of the greedy steps in a local fashion, which avoids enumerating all representative DAGs of an equivalence class and which exploits the decomposability of the BIC score (Chickering 2002).

A causal structure without feedback loops and without hidden or selection variables can be visualized using a DAG where the edges indicate direct cause-effect relationships. Under some assumptions, Pearl (2000) showed (Theorem 1.4.1) that there is a link between causal structures and graphical models. Roughly speaking, if the underlying causal structure is a DAG, we observe data generated from this DAG and then estimate a DAG model (i.e., a graphical model) on this data, then the estimated CPDAG represents the equivalence class of the DAG describing the causal structure. This holds if we have enough samples and the true underlying causal structure is indeed a DAG without latent or selection variables. Note that even given an infinite amount of data, we usually cannot identify the true DAG itself, but only its equivalence class. Every DAG in this equivalence class can be the true causal structure.

With hidden or selection variables

When discovering causal relations from nonexperimental data, two difficulties arise. One is the problem of hidden (or latent) variables: Factors influencing two or more measured variables may not themselves be measured. The other is the problem of selection bias: Values of unmeasured variables or features may influence whether a unit is included in the data sample.

In the case of hidden or selection variables, one could still visualize the underlying causal structure with a DAG that includes all observed, hidden and selection variables. However, when inferring the DAG from observational data, we do not know all hidden and selection variables.


We therefore seek a structure that represents all conditional independence relationships among the observed variables given the selection variables of the underlying causal structure. It turns out that this is possible. However, the resulting object is in general not a DAG, for the following reason. Suppose we have a DAG including observed, latent and selection variables, and we would like to visualize the conditional independencies among the observed variables only. We could marginalize out all latent variables and condition on all selection variables. It turns out that the resulting list of conditional independencies can in general not be represented by a DAG, since DAGs are not closed under marginalization or conditioning (see Richardson and Spirtes 2002).

A class of graphical independence models that is closed under marginalization and conditioning and that contains all DAG models is the class of ancestral graphs. A detailed discussion of this class of graphs can be found in Richardson and Spirtes (2002). In this text, we only give a brief introduction.

Ancestral graphs have nodes, which represent random variables, and edges, which represent some kind of dependence. The edges can be either directed (→ or ←), undirected (—) or bidirected (↔) (note that in the context of ancestral graphs, undirected and bidirected edges do not mean the same). There are two rules that restrict the direction of edges in an ancestral graph:

1: If i and j are joined by an edge with an arrowhead at i, then there is no directed path from i to j. (A path is a sequence of adjacent vertices, and a directed path is a path along directed edges that follows the direction of the arrowheads.)

2: There are no arrowheads present at a vertex which is an endpoint of an undirected edge.

Maximal ancestral graphs (MAG), which we will use from now on, also obey a third rule:

3: Every missing edge corresponds to a conditional independence.

The conditional independence statements of MAGs can be read off using the concept of m-separation, which is a generalization of the concept of d-separation. Furthermore, part of the causal information in the underlying DAG is represented in the MAG. If in the MAG there is an edge between node i and node j with an arrowhead at node i, then there is no directed path from node i to node j nor to any of the selection variables in the underlying DAG (i.e., i is not a cause of j or of the selection variables). If, on the other hand, there is a tail at node i, then there is a directed path from node i to node j or to one of the selection variables in the underlying DAG (i.e., i is a cause of j or of a selection variable).

Recall that finding a unique DAG from an independence oracle is in general impossible. Therefore, one only reports the equivalence class of DAGs in which the true DAG must lie. The equivalence class is visualized using a CPDAG. The same is true for MAGs: Finding a unique MAG from an independence oracle is in general impossible. One only reports the equivalence class in which the true MAG lies.

An equivalence class of a MAG can be uniquely represented by a partial ancestral graph (PAG) (see, e.g., Zhang 2008). A PAG contains the following types of edges: o—o, o—, o→, →, ↔, —. Roughly, the bidirected edges come from hidden variables, and the undirected edges come from selection variables. The edges have the following interpretation: (i) There is an edge between x and y if and only if Vx and Vy are conditionally dependent given VS for all sets VS consisting of all selection variables and a subset of the observed variables; (ii) a tail on an edge means that this tail is present in all MAGs in the equivalence class; (iii) an arrowhead on an edge means that this arrowhead is present in all MAGs in the equivalence class; (iv) a ◦ edge mark means that there is at least one MAG in the equivalence class where the edge mark is a tail, and at least one where the edge mark is an arrowhead.

An algorithm for finding the PAG given an independence oracle is the FCI algorithm ("fast causal inference"; see Spirtes et al. (2000) and Spirtes et al. (1999b)). The orientation rules of this algorithm were slightly extended and proven to be complete in Zhang (2008). FCI is very similar to PC but performs additional conditional independence tests and uses more orientation rules (see Section 3.4 for more details). We refer the reader to Zhang (2008) or Colombo et al. (2012) for a detailed discussion of the FCI algorithm. It turns out that the FCI algorithm is computationally infeasible for large graphs. The RFCI algorithm ("really fast causal inference"; see Colombo et al. (2012)) is much faster than FCI. The output of RFCI is in general slightly less informative than the output of FCI, in particular with respect to conditional independence information. However, it was shown in Colombo et al. (2012) that any causal information in the output of RFCI is correct and that both FCI and RFCI are consistent in (different) sparse high-dimensional settings. Finally, in simulations the estimation performances of the two algorithms are very similar.

Since both these algorithms are built upon the PC algorithm, they are also order-dependent, meaning that the output depends on the order in which the variables are given. Starting from the solution proposed for the PC algorithm, Colombo and Maathuis (2013) proposed several modifications of the FCI and RFCI algorithms (see Sections 3.1, 3.4, and 3.5) that partly or fully remove these order-dependence issues in each of their steps.

From a mixture of observational and interventional data

We often have to deal with interventional data in causal inference. In cell biology, for example, data is often measured in different mutants, or collected from gene knockdown experiments, or simply measured under different experimental conditions. An intervention, written with Pearl's do-operator (see Section 1), changes the joint probability distribution of the system; therefore, data samples collected from different intervention experiments are not identically distributed (although still independent).

The algorithms PC and GES both rely on the i.i.d. assumption and are not suited for causal inference from interventional data. The GIES algorithm, which stands for "greedy interventional equivalence search", is a generalization of GES to interventional data (see Hauser and Bühlmann 2012). It not only makes sure that interventional data points are handled correctly (instead of being wrongly treated as observational data points), but also accounts for the improved identifiability of causal models under interventional data by returning an interventional essential graph. Just as in the observational case, an interventional essential graph is a partially directed graph representing an (interventional) Markov equivalence class of DAGs: a directed edge between two vertices stands for an arrow with common orientation among all representatives of the equivalence class, an undirected edge stands for an arrow that has different orientations among different representative DAGs; for more details, see Hauser and Bühlmann (2012).

GIES traverses the search space of interventional essential graphs in a similar way as GES traverses the search space of observational essential graphs. In addition, a new search phase was introduced by Hauser and Bühlmann (2012), with moves that correspond to turning single arrows in the space of DAGs.
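In pcalg, this algorithm is available through the function gies() (Section 3.6); as a sketch of the calling pattern, where int.data, targets and target.index are placeholders for the interventional data matrix, the list of intervention targets and the per-sample target index, following the GaussL0penIntScore help page:

> ## Sketch: score object for mixed observational/interventional data
> score <- new("GaussL0penIntScore", data = int.data,
+              targets = targets, target.index = target.index)
> gies.fit <- gies(score)
> plot(gies.fit$essgraph, main = "")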

2.2. Estimating bounds on causal effects

One way of quantifying the causal effect of variable Vx on Vy is to measure the state of Vy if Vx is forced to take the value Vx = x, and compare this to the value of Vy if Vx is forced to take the value Vx = x + 1 or Vx = x + δ. If Vx and Vy are random variables, forcing Vx = x could have the effect of changing the distribution of Vy. Following the conventions in Pearl (2000), the resulting distribution after manipulation is denoted by P[Vy | do(Vx = x)]. Note that this is different from the conditional distribution P[Vy | Vx = x]. To illustrate this, imagine the following simplistic situation. Suppose we observe a particular spot on the street during some hour. The random variable Vx denotes whether it rained during that hour (Vx = 1 if it rained, Vx = 0 otherwise). The random variable Vy denotes whether the street was wet at the end of that hour (Vy = 1 if it was wet, Vy = 0 otherwise). If we assume P(Vx = 1) = 0.1 (a rather dry region), P(Vy = 1 | Vx = 1) = 0.99 (the street is almost always still wet at the end of the hour when it rained during that hour) and P(Vy = 1 | Vx = 0) = 0.02 (other reasons for making the street wet are rare), we can compute the conditional probability P(Vx = 1 | Vy = 1) = 0.85. So, if we observe the street to be wet, the probability that there was rain in the last hour is about 0.85. However, if we take a garden hose and force the street to be wet at a randomly chosen hour, we get P(Vx = 1 | do(Vy = 1)) = P(Vx = 1) = 0.1. Thus, the distribution of the random variable describing rain is quite different when making an observation versus making an intervention.
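The value 0.85 follows from Bayes' rule; a quick check in R:

> p.x1 <- 0.1                        # P(Vx = 1)
> p.y1.x1 <- 0.99; p.y1.x0 <- 0.02   # P(Vy = 1 | Vx = 1), P(Vy = 1 | Vx = 0)
> p.y1.x1 * p.x1 / (p.y1.x1 * p.x1 + p.y1.x0 * (1 - p.x1))

[1] 0.8461538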

Oftentimes, only the change of the target distribution under intervention is reported. We use the change in mean, i.e., ∂/∂x E[Vy | do(Vx = x)], as a general measure for the causal effect of Vx on Vy. For multivariate Gaussian random variables, E[Vy | do(Vx = x)] depends linearly on x. Therefore, the derivative is constant, which means that the causal effect does not depend on x and can also be interpreted as E[Vy | do(Vx = x + 1)] − E[Vy | do(Vx = x)]. For binary random variables (with domain {0, 1}), we define the causal effect of Vx on Vy as E(Vy | do(Vx = 1)) − E(Vy | do(Vx = 0)) = P(Vy = 1 | do(Vx = 1)) − P(Vy = 1 | do(Vx = 0)).

The goal in the remainder of this section is to estimate the effect of an intervention if only observational data is available.

Without hidden and selection variables

If the causal structure is a known DAG and there are no hidden and selection variables, Pearl (2000) (Theorem 3.4.1) suggested a set of inference rules known as the "do-calculus", whose application transforms an expression involving a "do" into an expression involving only conditional distributions. Thus, information on the interventional distribution can be obtained by combining observational information with knowledge of the underlying causal structure.

Unfortunately, the causal structure is rarely known in practice. However, as discussed in Section 2.1, we can estimate the Markov equivalence class of the true causal DAG. Taking this into account, we conceptually apply the do-calculus to each DAG within the equivalence class and thus obtain a possible causal effect for each DAG in the equivalence class (in practice, we developed a local method that is faster but yields a similar result; see Section 3.7 for more details). Therefore, even if we have an infinite amount of observations, we can in general only report a multiset of possible causal values (it is a multiset rather than a set because it can contain duplicate values). One of these values is the true causal effect. Despite the inherent ambiguity, this result can still be very useful when the multiset has certain properties (e.g., all values are much larger than zero). These ideas are incorporated in the IDA method (Intervention calculus when the DAG is Absent).
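In the package, this multiset is what ida() returns; its method argument switches between enumerating the DAGs in the equivalence class and the faster local method. A sketch with the objects from Section 1:

> ## One estimate per DAG in the equivalence class:
> ida(1, 6, cov(gmG8$x), pc.gmG@graph, method = "global")
> ## The faster local method (same values, possibly other multiplicities):
> ida(1, 6, cov(gmG8$x), pc.gmG@graph, method = "local")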

In addition to this fundamental limitation in estimating a causal effect, errors due to finite sample size blur the result, as with every statistical method. Thus, we can typically only get an estimate of the set of possible causal values. It was shown by Maathuis et al. (2009) that this estimate is consistent in sparse high-dimensional settings under some assumptions.

It has recently been shown empirically that, despite the described fundamental limitations in identifying the causal effect uniquely and despite potential violations of the underlying assumptions, the method performs well in identifying the most important causal effects in a high-dimensional yeast gene expression data set (see Maathuis et al. 2010).

With hidden but no selection variables

If the causal DAG is known and no latent and selection variables are present, one can estimate causal effects from observational data using, for example, Pearl's backdoor criterion, as done in IDA.

However, in practice the assumption of no latent variables is often violated. Therefore, Maathuis and Colombo (2013) generalized Pearl's backdoor criterion to more general types of graphs that describe Markov equivalence classes of DAGs when allowing arbitrarily many latent but no selection variables. This generalization works with DAGs, CPDAGs, MAGs, and PAGs as input and is based on a two-step approach. In a first step, the causal effect of one variable X on another variable Y under investigation is checked to be identifiable via the generalized backdoor criterion, meaning that there exists a set of variables W for which the generalized backdoor criterion is satisfied with respect to X and Y in the given graph. If the effect is indeed identifiable, in a second step the set W is explicitly given.
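In pcalg, this two-step procedure is implemented in the function backdoor() (see Section 3); a sketch of a call, where amat is assumed to hold the adjacency matrix of an estimated CPDAG:

> ## Sketch: check identifiability of the effect of variable 1 on variable 6
> ## via the generalized backdoor criterion; a valid set W is returned if one exists
> backdoor(amat, x = 1, y = 6, type = "cpdag")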

2.3. Summary of assumptions

For all proposed methods, we assume that the data is faithful to the unknown underlying causal DAG. For the individual methods, further assumptions are made.

PC algorithm: No hidden or selection variables; consistent in high-dimensional settings (the number of variables grows with the sample size) if the underlying DAG is sparse, the data is multivariate normal and satisfies some regularity conditions on the partial correlations, and α is taken to zero appropriately. See Kalisch and Bühlmann (2007) for full details. Consistency in a standard asymptotic regime with a fixed number of variables follows as a special case.

GES algorithm: No hidden or selection variables; consistency in a standard asymptotic regime with a fixed number of variables (see Chickering 2002).

FCI algorithm: Allows for hidden and selection variables; consistent in high-dimensional settings if the so-called Possible-D-SEP sets (see Spirtes et al. 2000) are sparse, the data is multivariate normal and satisfies some regularity conditions on the partial correlations, and α is taken to zero appropriately. See Colombo et al. (2012) for full details. Consistency in a standard asymptotic regime with a fixed number of variables follows as a special case.

RFCI algorithm: Allows for hidden and selection variables; consistent in high-dimensional settings if the underlying MAG is sparse (this is a much weaker assumption than the one needed for FCI), the data is multivariate normal and satisfies some regularity conditions on the partial correlations, and α is taken to zero appropriately. See Colombo et al. (2012) for full details. Consistency in a standard asymptotic regime with a fixed number of variables follows as a special case.

GIES algorithm: No hidden or selection variables; mix of observational and interventional data. Interventional data alone is sufficient if there is no variable that is intervened on in all data points.

IDA: No hidden or selection variables; all conditional expectations are linear; consistent in high-dimensional settings if the underlying DAG is sparse, the data is multivariate normal and satisfies some regularity conditions on the partial correlations and conditional variances, and α is taken to zero appropriately. See Maathuis et al. (2009) for full details.

Generalized backdoor criterion: Allows for arbitrarily many hidden but no selection variables. See Maathuis and Colombo (2013) for more details.

3. Package pcalg

This package has two goals. First, it is intended to provide fast, flexible and reliable implementations of the PC, FCI, RFCI, GES and GIES algorithms for estimating causal structures and graphical models. Second, it provides an implementation of the IDA method, which estimates bounds on causal effects from observational data when no causal structure is known and hidden or selection variables are absent, and it also provides a generalization of Pearl's backdoor criterion to DAGs, CPDAGs, MAGs, and PAGs, when hidden but no selection variables are allowed.

In the following, we describe the main functions of our package for achieving these goals. The functions skeleton(), pc(), fci(), rfci(), ges(), gies() and simy() are intended for estimating graphical models. The functions ida() and idaFast() are intended for estimating causal effects from observational data, and the function backdoor() is intended for checking whether a causal effect is identifiable via the generalized backdoor criterion and, if so, for estimating a set that actually satisfies the generalized backdoor criterion.

Alternatives to this package for estimating graphical models in R include Scutari (2010); Bøttcher and Dethlefsen (2011); Højsgaard (2012); Højsgaard et al. (2012) and Højsgaard and Lauritzen (2011).

3.1. skeleton

The function skeleton() estimates the skeleton of a DAG without latent and selection variables using the PC algorithm (steps (P1) and (P2) in Algorithm 1), and it estimates an initial skeleton of a DAG with arbitrarily many latent and selection variables using the FCI and RFCI algorithms.

> ## using data("gmG", package = "pcalg")
> suffStat <- list(C = cor(gmG8$x), n = nrow(gmG8$x))
> skel.gmG <- skeleton(suffStat, indepTest = gaussCItest,
+                      p = ncol(gmG8$x), alpha = 0.01)
> par(mfrow = c(1,2))
> plot(gmG8$g, main = ""); plot(skel.gmG, main = "")


Figure 2: True underlying DAG (left) and estimated skeleton (right) fitted on the simulated Gaussian data set gmG.

The function can be called with the following arguments:

skeleton(suffStat, indepTest, alpha, labels, p,
         method = c("stable", "original", "stable.fast"), m.max = Inf,
         fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE,
         numCores = 1, verbose = FALSE)

As was discussed in Section 2.1, the main task in finding the skeleton is to compute and test several conditional independencies. To keep the function flexible, skeleton() takes as argument a function indepTest() that performs these conditional independence tests and returns a p-value. All information that is needed in the conditional independence test can be passed in the argument suffStat. The only exceptions are the number of variables p and the significance level alpha for the conditional independence tests, which are passed separately. For convenience, we have preprogrammed versions of indepTest() for Gaussian data (gaussCItest()), discrete data (disCItest()), and binary data (binCItest()). Each of these independence test functions needs different arguments as input, described in the respective help files. For example, when using gaussCItest(), the input has to be a list containing the correlation matrix and the sample size of the data. In the code shown above, we estimate the skeleton on the data set gmG (which consists of p = 8 variables and n = 5000 samples) and plot the results. The estimated skeleton and the true underlying DAG are shown in Fig. 2.
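Each such test function is called with two variable positions, a conditioning set and the sufficient statistic, and returns a p-value; for instance, a single test of the kind performed internally:

> ## p-value for testing variable 1 independent of variable 8 given {5}:
> gaussCItest(1, 8, 5, suffStat)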

To give another example, we show how to fit a skeleton to the example data set gmD (which consists of p = 5 discrete variables with 3, 2, 3, 4 and 2 levels and n = 10000 samples).


Figure 3: True underlying DAG (left) and estimated skeleton (right) fitted on the simulated discrete data set gmD.

The predefined test function disCItest() is based on the G² statistic and takes as input a list containing the data matrix, a vector specifying the number of levels for each variable, and an option indicating whether the degrees of freedom must be lowered by one for each zero count. Finally, we plot the result. The estimated skeleton and the true underlying DAG are shown in Fig. 3.
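The corresponding call could look as follows (a sketch; the list element names dm, nlev and adaptDF are taken from the disCItest() help page, with adaptDF toggling the degrees-of-freedom adjustment just described):

> suffStat.d <- list(dm = gmD$x, nlev = c(3, 2, 3, 4, 2), adaptDF = FALSE)
> skel.gmD <- skeleton(suffStat.d, indepTest = disCItest,
+                      p = ncol(gmD$x), alpha = 0.01)
> plot(skel.gmD, main = "")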

In some situations, one may have prior information about the underlying DAG, for example that certain edges are absent or present. Such information can be incorporated into the algorithm via the arguments fixedGaps (absent edges) and fixedEdges (present edges). The information in fixedGaps and fixedEdges is used as follows. The gaps given in fixedGaps are introduced at the very beginning of the algorithm by removing the corresponding edges from the complete undirected graph. Thus, these edges are guaranteed to be absent in the resulting graph. Pairs (i, j) in fixedEdges are skipped in all steps of the algorithm, so that these edges are guaranteed to be present in the resulting graph.
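For example, to declare the edge between the first and the fifth variable of the gmG data as absent, one could pass a symmetric logical matrix whose TRUE entries mark forbidden edges (a sketch):

> fg <- matrix(FALSE, 8, 8)
> fg[1, 5] <- fg[5, 1] <- TRUE    # edge 1-5 is known to be absent
> skel.fg <- skeleton(suffStat, indepTest = gaussCItest,
+                     p = ncol(gmG8$x), alpha = 0.01, fixedGaps = fg)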

If indepTest() returns NA and the option NAdelete is TRUE, the corresponding edge is deleted. If this option is FALSE, the edge is not deleted.

The argument m.max is the maximum size of the conditioning sets that are considered in the conditional independence tests.

Throughout, the function works with the column positions of the variables in the adjacency matrix, and not with the names of the variables.

The PC algorithm is known to be order-dependent, in the sense that the output depends on the order in which the variables are given. Therefore, Colombo and Maathuis (2013) proposed a simple modification, called PC-stable, that yields order-independent adjacencies in the skeleton. In this function we implement their modified algorithm (the old order-dependent implementation can be found in version 1.1-5).

Since the FCI and RFCI algorithms are built upon the PC algorithm, they are also order-dependent in the skeleton. Resolving their order-dependence issues in the skeleton is more involved, see Colombo and Maathuis (2013). However, this function estimates an initial order-independent skeleton in these algorithms (for additional details on how to make the final skeleton of FCI fully order-independent, see Section 3.4 and Colombo and Maathuis (2013)).

3.2. pc

The function pc() implements all steps (P1) to (P4) of the PC algorithm shown in Algorithm 1. First, the skeleton is computed using the function skeleton() (steps (P1) and (P2)). Then, as many edges as possible are oriented (steps (P3) and (P4)). The function can be called as

pc(suffStat, indepTest, alpha, labels, p, fixedGaps = NULL, fixedEdges = NULL,
   NAdelete = TRUE, m.max = Inf, u2pd = c("relaxed", "rand", "retry"),
   skel.method = c("stable", "original", "stable.fast"), conservative = FALSE,
   maj.rule = FALSE, solve.confl = FALSE, numCores = 1, verbose = FALSE)

where the arguments suffStat, indepTest, p, alpha, fixedGaps, fixedEdges, NAdelete and m.max are identical to those of skeleton().

The conservative PC algorithm (conservative = TRUE) is a slight variation of the PC algorithm (see Ramsey et al. 2006). After the skeleton is computed, all unshielded triples a — b — c are checked in the following way. We test whether Va and Vc are independent given any subset of the neighbors of a or any subset of the neighbors of c. If b is in no such conditioning set (and not in the original sepset) or in all such conditioning sets (and in the original sepset), the triple is marked as unambiguous, no further action is taken, and the usual PC algorithm is continued. If, however, b is in only some conditioning sets, or if there was no subset S of the adjacency set of a nor of c such that Va and Vc are conditionally independent given VS, the triple a — b — c is marked as ambiguous. An ambiguous triple is not oriented as a v-structure. Furthermore, no later orientation rule that needs to know whether a — b — c is a v-structure or not is applied. Instead of the conservative version, which is quite strict towards the v-structures, Colombo and Maathuis (2013) introduced a less strict version for the v-structures, called the majority rule. This adaptation can be called using maj.rule = TRUE. In this case, the triple a — b — c is marked as ambiguous if b is in exactly 50 percent of such conditioning sets; if it is in less than 50 percent, it is set as a v-structure, and if in more than 50 percent, as a non-v-structure; for more details see Colombo and Maathuis (2013). Using either the conservative or the majority rule version resolves the order-dependence issues in the determination of the v-structures, see Colombo and Maathuis (2013) for more details.

Sampling errors (or hidden variables) can lead to conflicting information about edge directions. For example, one may find that a → b ← c and b → c ← d should both be directed as v-structures. This gives conflicting information about the edge b — c, since it should be directed as c → b in the v-structure a → b ← c, while it should be directed as b → c in the v-structure b → c ← d. With the option solve.confl = FALSE, in such cases we simply overwrite the direction of the conflicting edge. In the example above this means that we obtain a → b → c ← d if a → b ← c was visited first, and a → b ← c ← d if b → c ← d was visited first, so that the final orientation of the edge depends on the order in which the edges are oriented. With the option solve.confl = TRUE (which is only supported with option u2pd = "relaxed"), we first generate a list of all (unambiguous) v-structures (in the example above a → b ← c and b → c ← d), and then we orient them allowing both directions on the edge b — c, namely we allow the bi-directed edge b ↔ c, resolving the order-dependence issues on the edge orientations. We denote bi-directed edges in the adjacency matrix M of the graph by M[b, c] = 2 and M[c, b] = 2. In a similar way, using lists of candidate edges for each orientation rule and allowing bi-directed edges, the order-dependence issues in the orientation rules can be solved. Note that bi-directed edges merely represent a conflicting orientation and should not be interpreted causally. Using these lists of candidate edges and allowing bi-directed edges resolves the order-dependence issues in the orientation of the v-structures and in the three orientation rules, see Colombo and Maathuis (2013) for more details.

Note that calling conservative = TRUE or maj.rule = TRUE together with solve.confl = TRUE produces a fully order-independent output, see Colombo and Maathuis (2013).
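A fully order-independent call thus combines these options; a sketch on the Gaussian example data:

> pc.stable <- pc(suffStat, indepTest = gaussCItest, p = ncol(gmG8$x),
+                 alpha = 0.01, skel.method = "stable", u2pd = "relaxed",
+                 maj.rule = TRUE, solve.confl = TRUE)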

Sampling errors, non-faithfulness, or hidden variables can also lead to an invalid CPDAG, meaning that there does not exist a DAG that has the same skeleton and v-structures as the graph found by the algorithm. An example of this is an undirected cycle consisting of the edges a — b — c — d and d — a. In this case it is impossible to direct the edges without creating a cycle or a new v-structure. The optional argument u2pd specifies what should be done in such a situation. If it is set to "relaxed", the algorithm simply outputs the invalid CPDAG. If u2pd is set to "rand", all direction information is discarded, a random DAG is generated on the skeleton, and the corresponding CPDAG is returned. If u2pd is set to "retry", up to 100 combinations of possible directions of the ambiguous edges are tried, and the first combination that results in a valid CPDAG is chosen. If no valid combination is found, an arbitrary CPDAG is generated on the skeleton, as with u2pd = "rand".

As with the skeleton, the PC algorithm works with the column positions of the variables in the adjacency matrix, and not with the names of the variables. When plotting the object, undirected and bidirected edges are equivalent.

As an example, we estimate a CPDAG of the Gaussian data used in the example for the skeleton in Section 3.1. Again, we choose the predefined gaussCItest() as conditional independence test and create the corresponding sufficient statistic. Finally, we plot the result. The estimated CPDAG and the true underlying DAG are shown in Fig. 4.

3.3. ges

The PC algorithm presented in the previous section is based on conditional independence tests. To apply it, we first had to calculate a sufficient statistic and to specify a conditional independence test function. For the score-based GES algorithm, we have to define a score object before applying the inference algorithm.

A score object is an instance of a class derived from the base class Score. This base class is implemented as a virtual reference class. At the moment, the pcalg package only contains classes derived from Score for Gaussian data: GaussL0penObsScore for purely observational data, and GaussL0penIntScore for a mixture of observational and interventional data; for the GES algorithm, we only need the first one here, but we will encounter the second one in Section 3.6 again. The implementation of score classes for discrete data is planned for future versions of the pcalg package. However, the flexible implementation using class inheritance allows the user to implement their own score classes for different scores.

The predefined score-class GaussL0penObsScore implements a an ℓ0-penalized maximum-likelihood estimator for observational data from a Gaussian causal model. A instance is


> suffStat <- list(C = cor(gmG8$x), n = nrow(gmG8$x))

> pc.fit <- pc(suffStat, indepTest=gaussCItest, p = ncol(gmG8$x), alpha = 0.01)

> par(mfrow= c(1,2)); plot(gmG8$g, main = ""); plot(pc.fit, main = "")


Figure 4: True underlying DAG (left) and estimated CPDAG (right) fitted on the simulated Gaussian data set gmG8.

An instance is generated as follows:

> score <- new("GaussL0penObsScore", data = matrix(1, 1, 1),

lambda = 0.5*log(nrow(data)), intercept = FALSE, use.cpp = TRUE, ...)

The data matrix is provided by the argument data. The penalization constant is specified by lambda. The default value of lambda corresponds to the BIC score; the AIC score is obtained by setting lambda to 1. The argument intercept indicates whether the model should allow for intercepts in the linear structural equations or, equivalently, for a non-zero mean of the multivariate normal distribution. The last argument, use.cpp, indicates whether the internal C++ library should be used for the calculations, which is in most cases the best choice for reasons of speed.
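For instance, a minimal sketch (our own, assuming pcalg is loaded and the gmG8 data are available as in the examples above) of score objects with the default BIC-type penalty and an AIC-type penalty:

score.bic <- new("GaussL0penObsScore", gmG8$x)             # default lambda = log(n)/2, i.e., BIC
score.aic <- new("GaussL0penObsScore", gmG8$x, lambda = 1) # lambda = 1 corresponds to AIC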

Once a score object is defined, the GES algorithm is called as follows:

ges(score, labels = score$getNodes(), fixedGaps = NULL, adaptive = c("none",

"vstructures", "triples"), phase = c("forward", "backward", "turning"),

iterate = length(phase) > 1, turning = NULL, maxDegree = integer(0),

verbose = FALSE, ...)

The argument score is a score object as defined above. The argument turning indicates whether the novel turning phase (see Section 2.1), which is not present in the original implementation of Chickering (2002), should be used, and maxDegree can be used to bound the vertex degree of the estimated graph. More details can be found in the help file of ges().
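As a hedged illustration of these arguments (a call of our own, based only on the signature above), GES can be run without the turning phase, mimicking the original forward-backward scheme of Chickering (2002):

## run only the forward and backward phases, once each
ges.fb <- ges(score, phase = c("forward", "backward"), iterate = FALSE)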

In Fig. 5, we re-analyze the data set used in the example of Fig. 4 with the GES algorithm. The estimated graph is exactly the same in this case.


> score <- new("GaussL0penObsScore", gmG8$x)

> ges.fit <- ges(score)

> par(mfrow=1:2); plot(gmG8$g, main = ""); plot(ges.fit$essgraph, main = "")


Figure 5: True underlying DAG (left) and essential graph (right) estimated with the GES algorithm fitted on the simulated Gaussian data set gmG8.

3.4. fci

The FCI algorithm is a generalization of the PC algorithm, in the sense that it allows arbitrarily many latent and selection variables. Under the assumption that the data are faithful to a DAG that includes all latent and selection variables, the FCI algorithm estimates the equivalence class of MAGs that describe the conditional independence relationships between the observed variables given the selection variables.

The first part of the FCI algorithm is analogous to the PC algorithm. It starts with a complete undirected graph and estimates an initial skeleton using the function skeleton(), which produces an initial order-independent skeleton. All edges of this skeleton are of the form ◦−◦. However, due to the presence of hidden variables, it is no longer sufficient to consider only subsets of the adjacency sets of nodes x and y to decide whether the edge x ◦−◦ y should be removed. Therefore, the initial skeleton may contain some superfluous edges. These edges are removed in the next step of the algorithm, which requires some orientations. Therefore, the v-structures are determined using the conservative method (see the discussion of conservative below). All potential v-structures a − b − c are checked in the following way. We test whether Va and Vc are independent conditioning on any subset of the neighbors of a or any subset of the neighbors of c. If b is in no such conditioning set or in all such conditioning sets, no further action is taken. If, however, b is in only some conditioning sets, the triple a − b − c is marked as ambiguous. If Va is independent of Vc given some S in the skeleton (i.e., the edge a − c dropped out), but Va and Vc remain dependent given all subsets of neighbors of either a or c, we call all such triples a − b − c unambiguous. This is because in the FCI algorithm, the true separating set might be outside the neighborhood of either a or c. An ambiguous triple is not oriented as a v-structure. After the v-structures have been oriented, Possible-D-SEP sets for each node in the graph are computed at once. To decide whether the edge x ◦−◦ y should be removed, one performs conditional independence tests of Vx and Vy given all subsets of Possible-D-SEP(x) and of Possible-D-SEP(y) (see the help file of the function pdsep()). The edge is removed if a conditional independence is found. This produces a fully order-independent final skeleton, as explained in Colombo and Maathuis (2013). Subsequently, all edges are transformed into ◦−◦ again and the v-structures are newly determined (using the information in sepset). Finally, as many undetermined edge marks (◦) as possible are determined using (a subset of) the 10 orientation rules given by Zhang (2008).

The function can be called with the following arguments:

fci(suffStat, indepTest, alpha, labels, p, skel.method = c("stable",

"original", "stable.fast"), type = c("normal", "anytime", "adaptive"),

fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf,

pdsep.max = Inf, rules = rep(TRUE, 10), doPdsep = TRUE, biCC = FALSE,

conservative = FALSE, maj.rule = FALSE, numCores = 1, verbose = FALSE)

where the arguments suffStat, indepTest, p, alpha, fixedGaps, fixedEdges, NAdelete and m.max are identical to those in skeleton().

The argument pdsep.max indicates the maximum size of Possible-D-SEP for which subsets are considered as conditioning sets in the conditional independence tests. If the nodes x and y are adjacent in the graph and the size of Possible-D-SEP(x) \ {x, y} is bigger than pdsep.max, the edge is simply left in the graph. Note that if pdsep.max is less than Inf, the final PAG may be a supergraph of the one computed with pdsep.max = Inf, because fewer tests may have been performed in the former.

The option rules contains a logical vector of length 10 indicating which rules should be used when directing edges, where the order of the rules is taken from Zhang (2008).

The option doPdsep indicates whether Possible-D-SEP should be computed for all nodes; in that case, all subsets of Possible-D-SEP are considered as conditioning sets in the conditional independence tests, unless restricted by pdsep.max. If FALSE, Possible-D-SEP is not computed, and the algorithm simplifies to the Modified PC algorithm of Spirtes et al. (2000).

By setting the argument biCC = TRUE, Possible-D-SEP(a, c) is defined as the intersection of the original Possible-D-SEP(a, c) and the set of nodes that lie on a path between a and c. This method uses biconnected components to find all nodes on a path between a and c. The smaller Possible-D-SEP sets lead to faster computing times, and Colombo et al. (2012) showed that, given oracle information on the conditional independence relationships, the algorithm is identical to the original FCI algorithm.

Conservative versions of FCI, Anytime FCI, and Adaptive Anytime FCI (see below) are computed if the argument conservative is TRUE. After the final skeleton is computed, all potential v-structures a − b − c are checked in the following way. We test whether Va and Vc are independent conditioning on any subset of the neighbors of a or any subset of the neighbors of c. When a subset makes Va and Vc conditionally independent, we call it a separating set. If b is in no such separating set or in all such separating sets, no further action is taken and the normal version of the FCI, Anytime FCI, or Adaptive Anytime FCI algorithm is continued. If, however, b is in only some separating sets, the triple a − b − c is marked as ambiguous. If Va is independent of Vc given some S in the skeleton (i.e., the edge a − c dropped out), but Va and Vc remain dependent given all subsets of neighbors of either a or c, we call all such triples a − b − c unambiguous. This is because in the FCI algorithm, the true separating set might be outside the neighborhood of either a or c. An ambiguous triple is not oriented as a v-structure. Furthermore, no orientation rule that needs to know whether a − b − c is a v-structure or not is applied. Instead of the conservative version, which is quite strict towards the v-structures, Colombo and Maathuis (2013) introduced a less strict version called the majority rule. This adaptation is invoked with maj.rule = TRUE. In this case, the triple a − b − c is marked as ambiguous if and only if b is in exactly 50 percent of such separating sets or no separating set was found. If b is in less than 50 percent of the separating sets, it is set as a v-structure, and if it is in more than 50 percent, it is set as a non-v-structure; for more details see Colombo and Maathuis (2013). Colombo and Maathuis (2013) showed that with both these modifications, the final skeleton and the decisions about the v-structures of the FCI algorithm are fully order-independent.

Using the argument labels, one can pass names for the vertices of the estimated graph. By default, this argument is set to NA, in which case the node names as.character(1:p) are used.

The argument type specifies the version of the FCI algorithm to be used. By default it is "normal", so the normal FCI algorithm is called. If set to "anytime", the Anytime FCI of Spirtes (2001) is called, and in this case m.max must be specified by the user. The Anytime FCI algorithm can be viewed as a modification of the FCI algorithm that only performs conditional independence tests up to and including order m.max when finding the initial skeleton, using the function skeleton(), and the final skeleton, using the function pdsep(). Thus, Anytime FCI performs fewer conditional independence tests than FCI. If set to "adaptive", the Adaptive Anytime FCI of Colombo et al. (2012) is called, and in this case m.max is not used. The first part of the algorithm is identical to the normal FCI described above. But in the second part, when the final skeleton is estimated using the function pdsep(), the Adaptive Anytime FCI algorithm only performs conditional independence tests up to and including order m.max, where m.max is the maximum size of the conditioning sets that were considered to determine the initial skeleton using the function skeleton().

As an example, we estimate the PAG of a graph with five nodes using the function fci() and the predefined function gaussCItest() as conditional independence test. In Fig. 6 the true DAG and the PAG estimated with fci() are shown. Random variable V1 is latent. We can read off that both V4 and V5 cannot be a cause of V2 and V3, which can be confirmed in the true DAG.
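The code for this figure is not reproduced here; a call along the following lines, mirroring the rfci() example in Section 3.5, should produce it (since V1 is latent, the observed columns are labeled 2 to 5):

data("gmL")
suffStat <- list(C = cor(gmL$x), n = nrow(gmL$x))
fci.est <- fci(suffStat, indepTest = gaussCItest,
               p = ncol(gmL$x), alpha = 0.01, labels = as.character(2:5))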

3.5. rfci

The function rfci() is rather similar to fci(). However, it does not compute any Possible-D-SEP sets and thus does not make tests conditioning on them. This makes rfci() much faster than fci(). The orientation rule for v-structures and the orientation rule for so-called discriminating paths (rule 4) were modified in order to produce a PAG which, in the oracle version, is guaranteed to have correct ancestral relationships.

The function can be called in the following way:

rfci(suffStat, indepTest, alpha, labels, p, skel.method = c("stable",

"original", "stable.fast"), fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE,

m.max = Inf, rules = rep(TRUE, 10), conservative = FALSE, maj.rule = FALSE,

numCores = 1, verbose = FALSE)

where the arguments suffStat, indepTest, p, alpha, fixedGaps, fixedEdges, NAdelete and m.max are identical to those in skeleton().

Figure 6: True underlying DAG (left) and estimated PAG (right), when applying the FCI and RFCI algorithms to the data set gmL. The output of FCI and RFCI is identical. Variable V1 of the true underlying DAG is latent.

The argument rules is similar to the one in fci(); however, the rules are modified so as to produce a PAG with correct ancestral relationships in the oracle version.

The first part of the RFCI algorithm is analogous to the PC and FCI algorithms. It starts with a complete undirected graph and estimates an initial skeleton using the function skeleton(), which produces an initial order-independent skeleton; see skeleton() for more details. All edges of this skeleton are of the form ◦−◦. Due to the presence of hidden variables, it is no longer sufficient to consider only subsets of the neighborhoods of nodes x and y to decide whether the edge x ◦−◦ y should be removed. The FCI algorithm performs independence tests conditioning on subsets of Possible-D-SEP to remove those edges. Since this procedure is computationally infeasible, the RFCI algorithm uses a different approach to remove some of those superfluous edges before orienting the v-structures and the discriminating paths in orientation rule 4.

Before orienting the v-structures, we perform the following additional conditional independence tests. For each unshielded triple a − b − c in the initial skeleton, we check whether both Va and Vb, and Vb and Vc, are conditionally dependent given the separating set of a and c (sepset(a, c)). These conditional dependencies may not have been checked while estimating the initial skeleton, since sepset(a, c) need not be a subset of the neighbors of a nor of the neighbors of c. If both conditional dependencies hold and b is not in sepset(a, c), the triple is oriented as a v-structure a ∗→ b ←∗ c. On the other hand, if an additional conditional independence relationship is detected, say Va is independent of Vb given sepset(a, c), the edge between a and b is removed from the graph and the set responsible for that is saved in sepset(a, b). The removal of an edge can destroy or create new unshielded triples in the graph. To solve this problem we work with lists (for details see Colombo et al. 2012).

Before orienting discriminating paths, we perform the following additional conditional independence tests. For each triple a − b − c in which a and c are adjacent, the algorithm searches for a discriminating path p = ⟨d, ..., a, b, c⟩ for b of minimal length, and checks that the vertices in every consecutive pair (Vf, Vg) on p are conditionally dependent given all subsets of sepset(d, c) \ {Vf, Vg}. If we do not find any conditional independence relationship, the path is oriented as in rule (R4). If one or more conditional independence relationships are found, the corresponding edges are removed and their minimal separating sets are stored.

Conservative RFCI can be computed if the argument conservative is TRUE. After the final skeleton is computed and the additional local tests on all unshielded triples, as described above, have been done, all potential v-structures a − b − c are checked in the following way. We test whether Va and Vc are independent conditioning on any subset of the neighbors of a or any subset of the neighbors of c. When a subset makes Va and Vc conditionally independent, we call it a separating set. If b is in no such separating set or in all such separating sets, no further action is taken and the normal version of the RFCI algorithm is continued. If, however, b is in only some separating sets, the triple a − b − c is marked as ambiguous. If Va is independent of Vc given some S in the skeleton (i.e., the edge a − c dropped out), but Va and Vc remain dependent given all subsets of neighbors of either a or c, we call all such triples a − b − c unambiguous. This is because in the RFCI algorithm, the true separating set might be outside the neighborhood of either a or c. An ambiguous triple is not oriented as a v-structure. Furthermore, no orientation rule that needs to know whether a − b − c is a v-structure or not is applied. Instead of the conservative version, which is quite strict towards the v-structures, Colombo and Maathuis (2013) introduced a less strict version called the majority rule. This adaptation is invoked with maj.rule = TRUE. In this case, the triple a − b − c is marked as ambiguous if and only if b is in exactly 50 percent of such separating sets or no separating set was found. If b is in less than 50 percent of the separating sets, it is set as a v-structure, and if it is in more than 50 percent, it is set as a non-v-structure; for more details see Colombo and Maathuis (2013).

The implementation uses the stabilized skeleton function skeleton(), which produces an initial order-independent skeleton. The final skeleton and edge orientations can still be order-dependent; see Colombo and Maathuis (2013).

As an example, we re-run the example from Section 3.4 and show the PAG estimated with rfci() in Figure 6. The PAG estimated with fci() and the PAG estimated with rfci() are the same.

> data("gmL")

> suffStat1 <- list(C = cor(gmL$x), n = nrow(gmL$x))

> pag.est <- rfci(suffStat1, indepTest = gaussCItest,

p = ncol(gmL$x), alpha = 0.01, labels = as.character(2:5))

A note on implementation: As pc(), fci() and rfci() are similar in the results they produce, their resulting values are of (S4) class pcAlgo (for pc()) and fciAlgo (for both fci() and rfci()), respectively. Both classes extend the class gAlgo, which collects their commonalities.

3.6. gies and simy

As we noted in Section 2.1, the GIES algorithm is a generalization of the GES algorithm to a mix of interventional and observational data. Hence the usage of gies() is similar to that of ges() (see Section 3.3). Actually, the function ges() is only an internal wrapper for gies().


A data set with jointly interventional and observational data points is not i.i.d. In order to use it for causal inference, we must specify the intervention target to which each data point belongs. This is done by specifying the arguments targets and target.index when generating an instance of GaussL0penIntScore (see Section 3.3):

> score <- new("GaussL0penIntScore", data = matrix(1, 1, 1),

targets = list(integer(0)), target.index = rep(as.integer(1), nrow(data)),

lambda = 0.5*log(nrow(data)), intercept = FALSE, use.cpp = TRUE, ...)

The argument targets is a list of all (mutually different) targets that have been intervened on in the experiments generating the data set. The allocation of sample indices to intervention targets is specified by the argument target.index. This is an integer vector whose first entry specifies the index (in the list targets) of the intervention target of the first data point, whose second entry specifies the target index of the second data point, and so on. An example is given in Figure 7.

Once a score object for interventional data is defined, the GIES algorithm is called as follows:

gies(score, labels = score$getNodes(), targets = score$getTargets(),

fixedGaps = NULL, adaptive = c("none", "vstructures", "triples"), phase = c("forward",

"backward", "turning"), iterate = length(phase) > 1, turning = NULL,

maxDegree = integer(0), verbose = FALSE, ...)

Most arguments coincide with those of ges() (see Section 3.3). The only additional argument is targets: it takes the same list of (unique) intervention targets as the constructor of the class GaussL0penIntScore (see above). This list of targets specifies the space of corresponding interventional Markov equivalence classes or essential graphs (see Section 2.1).

The data set gmInt consists of 5000 data points sampled from the DAG in Figure 2, among them 3000 observational ones, 1000 originating from an intervention at vertex 3 and 1000 originating from an intervention at vertex 5. It can be loaded by calling

> data(gmInt)
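Given this composition, the intervention annotation stored in gmInt presumably has the following structure; this is a hedged sketch of our own (assuming the rows are grouped as listed above), not code taken from the package:

targets <- list(integer(0), 3L, 5L)            # observational, do(X3), do(X5)
target.index <- rep(1:3, c(3000, 1000, 1000))  # target index for each of the 5000 rows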

The underlying causal model (or its interventional essential graph, respectively) is estimated in Figure 7.

As an alternative to GIES, we can also use the dynamic programming approach of Silander and Myllymaki (2006) to estimate the interventional essential graph from this interventional data set. This algorithm is implemented in the function simy(), which has the same arguments as gies(). As noted in Section 2.1, this approach yields the exact optimum of the BIC score at the price of a computational complexity that is exponential in the number of variables. On this small example with 8 variables, using the algorithm is feasible; however, it is not feasible for more than approximately 20 variables, depending on the processor and memory of the machine. In this example, we get exactly the same result as with gies() (see Figure 7).

3.7. ida

To illustrate the function ida(), consider the following example. We simulated 10000 samples from seven multivariate Gaussian random variables with a causal structure given on the left of Fig. 8. We assume that the causal structure is unknown and want to infer the causal effect of V2 on V5. First, we estimate the equivalence class of DAGs that describe the conditional independence relationships in the data, using the function pc() (see Section 3.2).

> data("gmI")


> score <- new("GaussL0penIntScore", gmInt$x, targets = gmInt$targets,

target.index = gmInt$target.index)

> gies.fit <- gies(score)

> simy.fit <- simy(score)

> par(mfrow = c(1, 3)) ; plot(gmInt$g, main = "")

> plot(gies.fit$essgraph, main = "")

> plot(simy.fit$essgraph, main = "")


Figure 7: The underlying DAG (left) and the essential graph estimated with the GIES algorithm (middle) and the dynamic programming approach of Silander and Myllymaki (2006) (right), applied on the simulated interventional Gaussian data set gmInt. This data set contains data from interventions at vertices 3 and 5; accordingly, the orientation of all arrows incident to these two vertices becomes identifiable (see also Figure 5 for a comparison with the observational case).



Figure 8: True DAG (left) and estimated CPDAG (right).

> suffStat <- list(C = cor(gmI$x), n = nrow(gmI$x))

> pc.gmI <- pc(suffStat, indepTest=gaussCItest,

p = ncol(gmI$x), alpha = 0.01)

Comparing the true DAG with the CPDAG in Fig. 8, we see that the CPDAG and the true DAG have the same skeleton. Moreover, the directed edges in the CPDAG are also directed in that way in the true DAG. Three edges in the CPDAG are bidirected. Recall that undirected and bidirected edges bear the same meaning in a CPDAG, so they can be used interchangeably.

Since there are three undirected edges in the CPDAG, there might be up to 2^3 = 8 DAGs in the corresponding equivalence class. However, the undirected edges 2 − 3 − 5 could in principle be oriented as a new v-structure. As mentioned in Section 2.1, DAGs in the equivalence class must have exactly the same v-structures as the corresponding CPDAG. Thus, 2 − 3 − 5 can only be oriented as 2 → 3 → 5, 2 ← 3 → 5 or 2 ← 3 ← 5, and not as 2 → 3 ← 5. There is only one remaining undirected edge (1 − 6), which can be oriented in two ways. Thus, there are six valid DAGs (i.e., they have no new v-structures and no directed cycles) and these form the equivalence class represented by the CPDAG. In Fig. 9, all DAGs in the equivalence class are shown. DAG 6 is the true DAG.

For each DAG G in the equivalence class, we apply Pearl's do-calculus to estimate the total causal effect of Vx on Vy. Since we assume Gaussianity, this can be done via a simple linear regression: If y is not a parent of x, we take the regression coefficient of Vx in the regression lm(Vy ~ Vx + pa(Vx)), where pa(Vx) denotes the parents of x in the DAG G (z is called a parent of x if G contains the edge z → x); if y is a parent of x in G, we set the estimated causal effect to zero.
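As a minimal sketch of the regressions behind this step (our own illustration, using the gmI data loaded above and the two parent sets of node 2 that occur in this example, namely the empty set and {3}, as discussed below):

dat <- as.data.frame(gmI$x)
## pa(V2) empty: coefficient of V2 in lm(V5 ~ V2)
coef(lm(dat[, 5] ~ dat[, 2]))[2]
## pa(V2) = {V3}: coefficient of V2 in lm(V5 ~ V2 + V3)
coef(lm(dat[, 5] ~ dat[, 2] + dat[, 3]))[2]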

If the equivalence class contains k DAGs, this yields k estimated total causal effects. Since we do not know which DAG is the true causal DAG, we do not know which estimated total causal effect of Vx on Vy is the correct one. Therefore, we return the entire multiset of k estimated effects.



Figure 9: All DAGs belonging to the same equivalence class as the true DAG shown in Fig. 8.

In our example, there are six DAGs in the equivalence class. Therefore, the function ida() (with method = "global") produces six possible values of causal effects, one for each DAG.

> ida(2, 5, cov(gmI$x), pc.gmI@graph, method = "global", verbose = FALSE)

[1] -0.0049012 -0.0049012 0.5421360 -0.0049012 -0.0049012 0.5421360

Among these six values, there are only two unique values: −0.0049 and 0.5421. This is because we compute lm(V5 ~ V2 + pa(V2)) for each DAG and report the regression coefficient of V2. Note that there are only two possible parent sets of node 2 in all six DAGs: In DAGs 3 and 6, there are no parents of node 2. In DAGs 1, 2, 4 and 5, however, the parent of node 2 is node 3. Thus, exactly the same regressions are computed for DAGs 3 and 6, and the same regressions are computed for DAGs 1, 2, 4 and 5. Therefore, we end up with two unique values, one of which occurs twice, while the other occurs four times.

Since the data were simulated from a model, we know that the true value of the causal effect of V2 on V5 is 0.5529. Thus, one of the two unique values is indeed very close to the true causal effect (the slight discrepancy is due to sampling error).

The function ida() can be called as

ida(x.pos, y.pos, mcov, graphEst, method = c("local", "optimal",

"global"), y.notparent = FALSE, verbose = FALSE, all.dags = NA, type = c("cpdag",

"pdag"))

where x.pos denotes the column position of the cause variable, y.pos denotes the column position of the effect variable, mcov is the covariance matrix of the original data, and graphEst is a graph object describing the causal structure (this could be given by experts or estimated by pc()).

If method = "global", the method is carried out as described above, where all DAGs in the equivalence class of the estimated CPDAG are computed. This method is suitable for small graphs (say, up to 10 nodes). The DAGs can (but need not) be precomputed using the function allDags() and passed via the argument all.dags.

If method = "local", we do not determine all DAGs in the equivalence class of the CPDAG. Instead, we only consider the local neighborhood of x in the CPDAG. In particular, we consider all possible directions of undirected edges that have x as endpoint, such that no new v-structure is created. For each such configuration, we estimate the total causal effect of Vx on Vy as above, using linear regression.

At first sight, it is not clear that such a local configuration corresponds to a DAG in the equivalence class of the CPDAG, since it may be impossible to direct the remaining undirected edges without creating a directed cycle or a new v-structure. However, Maathuis et al. (2009) showed that there is at least one DAG in the equivalence class for each such local configuration. As a result, it follows that the multisets of total causal effects of the global and the local method have the same unique values. They may, however, have different multiplicities.

Recall that in the example using the global method, we obtained two unique values with multiplicities two and four, yielding six numbers in total. Applying the local method, we obtain the same unique values, but with multiplicity 1 for each.

> ida(2,5, cov(gmI$x), pc.gmI@graph, method = "local")

[1] 0.5421360 -0.0049012

One can take summary measures of the multiset. For example, the minimum absolute value provides a lower bound on the size of the true causal effect: If the minimum absolute value of all values in the multiset is larger than one, then we know that the size of the true causal effect (up to sampling error) must be larger than one. The fact that the unique values of the multisets of the global and local methods are identical implies that summary measures of the multiset that only depend on the unique values (such as the minimum absolute value) can be found using the local method.
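As a one-line sketch of this summary (our own, reusing the objects from the example above):

## lower bound on the size of the causal effect of V2 on V5
min(abs(ida(2, 5, cov(gmI$x), pc.gmI@graph, method = "local")))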

In some applications, it is clear that some variable is definitely a cause of other variables. Consider for example a bacterium producing a certain substance, taking the amount of produced substance as response variable. Knocking out genes in the bacterium might change its ability to produce the substance. By measuring the expression levels of genes, we want to know which genes have a causal effect on the product. In this setting, it is clear that the amount of substance is the effect and the activity of the genes is the cause. Thus, in the causal structure, the response variable cannot be a parent node of any variable describing the expression level of a gene. This background knowledge can easily be incorporated: By setting the option y.notparent = TRUE, all edges in the CPDAG that have the response variable as endpoint (no matter whether directed or undirected) are overwritten so that they are oriented towards the response variable.
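A hedged sketch of such a call on the example data (our own; it treats V5 as a pure response):

## forbid V5 from being a cause of any other variable
ida(2, 5, cov(gmI$x), pc.gmI@graph, method = "local", y.notparent = TRUE)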

3.8. idaFast

In some applications it is desirable to estimate the causal effect of one variable on a set of response variables. In these situations, the function idaFast() should be used. Imagine, for example, that we have data on several variables, that we have no background knowledge about the causal effects among the variables, and that we want to estimate the causal effect of each variable on each other variable. To this end, we could consider for each variable the problem: What is the causal effect of this variable on all other variables? Of course, one could solve the problem by using ida() on each pair of variables. However, there is a more efficient way, which uses the fact that a linear regression of a fixed set of explanatory variables on several different response variables can be computed very efficiently, as sketched below.
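The efficiency rests on the fact that R's lm() accepts a matrix of responses, so one decomposition of the design matrix serves all response variables at once. A minimal sketch of this idea (our own, with illustrative column choices):

dat <- gmI$x
## regress V5, V6 and V7 jointly on V2 and V3: one fit, three coefficient columns
fit <- lm(dat[, c(5, 6, 7)] ~ dat[, 2] + dat[, 3])
coef(fit)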

The function idaFast() can be called with the following arguments

idaFast(x.pos, y.pos.set, mcov, graphEst)

The arguments x.pos, mcov and graphEst have the same meaning as the corresponding arguments in ida(). The argument y.pos.set is a vector containing the column positions of all response variables of interest.

This call performs ida(x.pos, y.pos, mcov, graphEst, method = "local", y.notparent = FALSE, verbose = FALSE) for all values of y.pos in y.pos.set at the same time and in an efficient way. Note that the option y.notparent = TRUE is not implemented.

Consider the example from Section 3.7, where we computed the causal effect of V2 on V5. Now, we want to compute the effect of V2 on V5, V6 and V7 using idaFast() and compare the results with the output of ida(). As expected, both methods lead to the same results.

> (eff.est1 <- ida(2,5, cov(gmI$x), pc.gmI@graph, method="local"))

[1] 0.5421360 -0.0049012

> (eff.est2 <- ida(2,6, cov(gmI$x), pc.gmI@graph, method="local"))

[1] -0.0058532 -0.0062310


> (eff.est3 <- ida(2,7, cov(gmI$x), pc.gmI@graph, method="local"))

[1] 1.00861 0.89708

> (eff.estF <- idaFast(2, c(5,6,7), cov(gmI$x), pc.gmI@graph))

[,1] [,2]

5 0.5421360 -0.0049012

6 -0.0058532 -0.0062310

7 1.0086138 0.8970766

3.9. backdoor

This function is a generalization of Pearl's backdoor criterion (see Pearl 1993), defined for directed acyclic graphs (DAGs), single interventions and a single outcome variable, to more general types of graphs (CPDAGs, MAGs, and PAGs) that describe Markov equivalence classes of DAGs with and without latent variables but without selection variables; for more details see Maathuis and Colombo (2013).

The motivation for finding a set W that satisfies the generalized backdoor criterion with respect to X and Y in the given graph relies on the generalized backdoor adjustment, which states: “If a set of variables W satisfies the generalized backdoor criterion relative to X and Y in the given graph, then the causal effect of X on Y is identifiable and is given by P(Y | do(X = x)) = ∑_W P(Y | X, W) · P(W).” This result allows one to write post-intervention densities (the ones written using Pearl's do-calculus) using only observational densities estimated from the data.

This function can be called in the following way:

backdoor(amat, x, y, type = "pag", max.chordal = 10, verbose = FALSE)

where amat is the adjacency matrix of the given graph, x denotes the column position of the cause variable, and y denotes the column position of the effect variable.

The argument type specifies the type of graph of the given adjacency matrix amat. If the input graph is a DAG (type = "dag"), this function reduces to Pearl's backdoor criterion for single interventions and a single outcome variable, and the parent set of X in the DAG satisfies the backdoor criterion unless Y is a parent of X. Therefore, if Y is a parent of X, there is no set W that satisfies the generalized backdoor criterion relative to X and Y in the DAG, and NA is output. Otherwise, the causal effect is identifiable and a set W that satisfies the generalized backdoor criterion relative to X and Y in the DAG is given. If the input graph is a CPDAG C (type = "cpdag"), a MAG M, or a PAG P (with both M and P not allowing selection variables), this function first checks whether the total causal effect of X on Y is identifiable via the generalized backdoor criterion (see Maathuis and Colombo (2013), Theorem 4.1). If the effect is not identifiable, the output is NA. Otherwise, an explicit set W that satisfies the generalized backdoor criterion relative to X and Y in the given graph is found.

Note that if the set W is equal to the empty set, the output is NULL.

At the moment, this function is not able to work with PAGs estimated with the RFCI algorithm.

It is important to note that there can be pairs of nodes x and y for which there is no set W that satisfies the generalized backdoor criterion, although the total causal effect might be identifiable via some other technique.



Figure 10: True DAG (left) and estimated CPDAG (right).

To illustrate this function, we use as an example the CPDAG displayed in Figure 4, page 15 of Maathuis and Colombo (2013). The R code below is used to generate a DAG g that belongs to the required equivalence class, which is uniquely represented by the CPDAG myCPDAG.

> p <- 6

> amat <- t(matrix(c(0,0,1,1,0,1, 0,0,1,1,0,1, 0,0,0,0,1,0,

0,0,0,0,1,1, 0,0,0,0,0,0, 0,0,0,0,0,0), 6,6))

> V <- as.character(1:6)

> colnames(amat) <- rownames(amat) <- V

> edL <- vector("list",length=6)

> names(edL) <- V

> edL[[1]] <- list(edges=c(3,4,6),weights=c(1,1,1))

> edL[[2]] <- list(edges=c(3,4,6),weights=c(1,1,1))

> edL[[3]] <- list(edges=5,weights=c(1))

> edL[[4]] <- list(edges=c(5,6),weights=c(1,1))

> g <- new("graphNEL", nodes=V, edgeL=edL, edgemode="directed")

> cov.mat <- trueCov(g)

> myCPDAG <- dag2cpdag(g)

> true.amat <- as(myCPDAG, "matrix")

> ## true.amat[true.amat != 0] <- 1

The DAG g and the CPDAG myCPDAG are shown in Figure 10.

Now, we want to check whether the effect of V6 on V3 in the given CPDAG is identifiable using backdoor() and, if this is the case, find out which set W satisfies the generalized backdoor criterion. As explained in Example 4 in Maathuis and Colombo (2013), the causal effect of V6 on V3 in the CPDAG myCPDAG is identifiable via the generalized backdoor criterion, and there is a set W that satisfies the generalized criterion:

> backdoor(true.amat, 6, 3, type="cpdag")


[1] 1 2
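As a hedged follow-up (our own sketch, assuming backdoor() accepts a 0/1 adjacency matrix of a DAG, as the commented conversion above suggests), the same query can be posed on the true DAG g; with type = "dag", the parents of X are returned unless Y is a parent of X:

dag.amat <- as(g, "matrix")
dag.amat[dag.amat != 0] <- 1      # binarize the weighted adjacency matrix
backdoor(dag.amat, 6, 3, type = "dag")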

3.10. Using a user-specific conditional independence test

In some cases it might be desirable to use a user-specific conditional independence test instead of the provided ones. The pcalg package allows the use of any conditional independence test defined by the user. In this section, we illustrate this feature by using a conditional independence test for Gaussian data that is not predefined in the package.

The functions skeleton(), pc() and fci() all need the argument indepTest, a function of the form indepTest(x, y, S, suffStat) to test conditional independence relationships. This function must return the p-value of the conditional independence test of Vx and Vy given VS and some information on the data in the form of a sufficient statistic (this might simply be the data frame with the original data), where x, y and S indicate column positions of the original data matrix. We will show an example that illustrates how to construct such a function.

A simple way to compute the partial correlation of Vx and Vy given VS for some data is to solve the two associated linear regression problems lm(Vx ~ VS) and lm(Vy ~ VS), get the residuals, and compute the correlation between the residuals. A correlation test between the residuals then yields a p-value that can be returned. The argument suffStat is an arbitrary object containing several pieces of information that are all used within the function to produce the p-value. In the predefined function gaussCItest(), for example, a list containing the correlation matrix and the number of observations is passed. This has the advantage that any favorite (e.g., robust) method of computing the correlation matrix can be used before partial correlations are computed. Oftentimes, however, it suffices to just pass the complete data set in suffStat. We choose this simple method in our example. An implementation of the function myCItest() could look like this:

> myCItest <- function(x, y, S, suffStat) {
    ## suffStat is simply the full data matrix here
    if (length(S) == 0) {
      x. <- suffStat[, x]
      y. <- suffStat[, y]
    } else {
      ## regress Vx and Vy jointly on VS (with intercept) and keep the residuals
      rxy <- resid(lm.fit(y = suffStat[, c(x, y)], x = cbind(1, suffStat[, S])))
      x. <- rxy[, 1]
      y. <- rxy[, 2]
    }
    ## p-value of the correlation test between the residuals
    cor.test(x., y.)$p.value
  }

We can now use this function together with pc().

> pc.myfit <- pc(suffStat = gmG8$x, indepTest = myCItest,

p = 8, alpha = 0.01)

> par(mfrow = c(1,2)); plot(pc.gmG, main = ""); plot(pc.myfit, main = "")

As expected, the resulting CPDAG (see Fig. 11) is the same as in Section 3.2, where we used the function gaussCItest() as conditional independence test. Note, however, that using gaussCItest() is considerably faster than using myCItest() (on our computer, 0.059 seconds using gaussCItest() versus 1.05 seconds using myCItest()).



Figure 11: The estimated CPDAGs using the predefined conditional independence test gaussCItest() (left) and the user-specified conditional independence test myCItest() (right) are identical for the gmG8 data.

4. Applications

The pcalg package has been used for applications in epidemiology (see Kalisch et al. 2010), biology (see Maathuis et al. 2010; Nagarajan et al. 2010) and the social sciences (see Danenberg and Marsella 2010). We will discuss two applications in more detail below.

4.1. Graphical Models and Causal Effects in Human Functioning

In Kalisch et al. (2010), the development of WHO's International Classification of Functioning, Disability and Health (ICF) on the one hand and recent developments in graphical modeling on the other hand were combined to deepen the understanding of human functioning. The objective of the paper was to explore how graphical models can be used in the study of ICF data. It was found that graphical models could be used successfully for visualization of the dependence structure of the data set, dimension reduction, and the comparison of subpopulations. Moreover, estimations of bounds on causal effects using the IDA method yielded plausible results. All analyses were done with the pcalg package.

4.2. Causal effects among genes

In Maathuis et al. (2010), the authors aim at quantifying the effects of single gene interventions on the expression of other genes in yeast, allowing for better insights into causal relations between genes. With n = 63 samples of observational data measuring the expression of p = 5361 genes (see Hughes et al. 2000), the goal was to identify the largest intervention effects between all pairs of genes. For the analysis, the pcalg package with version 1.1-5 was used.

Hughes et al. (2000) also provide gene expression measurements from 234 interventional experiments, namely from 234 single-gene deletion mutant strains. Using these data, we know the true causal effect of the knock-out genes on the remaining genes in good approximation. We can then quantify how well we can find the true intervention effects in the following way: We encode the largest 10% of the intervention effects computed from the interventional data as the target set of effects that we want to identify. We then check with an ROC curve how well the ranking of the causal effects estimated by applying ida() to the observational data identifies effects in the target set. For comparison, the authors also used the (conceptually wrong) Lasso and Elastic Net to obtain rankings. In Figure 12 one can see that ida(), using the function skeleton() of version 1.1-5, is clearly superior to the alternative methods (and random guessing) in terms of identifying effects in the target set. In Figure 13 one can see that ida() using the stable version of skeleton() produces even better results than with the old version.

We note that the yeast data set is very high-dimensional (n = 63, p = 5361). Thus, unlike in the toy examples used to illustrate the package in this manuscript, where n was much bigger than p and the causal structure was recovered exactly up to its equivalence class, the estimated causal structure for the yeast data is likely to contain many sampling errors. However, Figure 12 shows that it is still possible to extract useful information about causal effects.

5. Discussion

Causal structure learning and the estimation of causal effects from observational data have large potential. However, we emphasize that we do not propose causal inference methods based on observational data as a replacement for experiments. Rather, IDA should be used as a guide for prioritizing experiments, especially in situations where no clear preferences based on the context can be given.

Data from cell biology are usually interventional: different measurements could originate from different mutants or different cell strains, or from gene knockout or knockdown experiments. The GIES algorithm is designed for causal inference from data of this kind.

Since many assumptions of the proposed methods are uncheckable, it is important to further validate the methods in a range of applications. We hope that the pcalg package contributes to this important issue by providing well-documented and easy-to-use software.



Figure 12: The largest 10% of the causal effects found in experiments among yeast genes are identified much better from observational data with IDA than with Lasso, Elastic Net or random guessing. The figure is essentially taken from Maathuis et al. (2010).

References

Andersson SA, Madigan D, Perlman MD (1997). “A characterization of Markov equivalence classes for acyclic digraphs.” Annals of Statistics, 25(2), 505–541.

Bottcher SG, Dethlefsen C (2011). deal: Learning Bayesian Networks with Mixed Variables. R package version 1.2-34, URL http://CRAN.R-project.org/package=deal.

Chickering DM (1996). “Learning Bayesian networks is NP-complete.” Learning from Data: Artificial Intelligence and Statistics V, 112, 121–130.

Chickering DM (2002). “Optimal structure identification with greedy search.” Journal of Machine Learning Research, 3(3), 507–554.

Colombo D, Maathuis MH (2013). “Order-independent constraint-based causal structure learning.” ArXiv preprint 1211.3295. URL http://arxiv.org/abs/1211.3295.



Figure 13: The largest 10% of the causal effects found in experiments among yeast genes are identified much better from observational data with IDA than with Lasso, Elastic Net or random guessing. The figure is essentially taken from Colombo and Maathuis (2013).

Colombo D, Maathuis MH, Kalisch M, Richardson TS (2012). “Learning high-dimensional directed acyclic graphs with latent and selection variables.” Annals of Statistics, 40(1), 294–321.

Danenberg P, Marsella S (2010). “Data-Driven Coherence Models.” In Proceedings of the 19th Conference on Behavior Representation in Modeling and Simulation.

Geiger D, Heckerman D, King H, Meek C (2001). “Stratified exponential families: graphical models and model selection.” Annals of Statistics, 29(2), 505–529.

Grimmett GR, McDiarmid CJH (1975). “On colouring random graphs.” Mathematical Proceedings of the Cambridge Philosophical Society, 77, 313–324.

Haughton DDM (1988). “On the choice of a model to fit data from an exponential family.” Annals of Statistics, 16(1), 342–355.

Hauser A, Buhlmann P (2012). “Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs.” Journal of Machine Learning Research, 13, 2409–2464.

Hojsgaard S (2012). gRain: Graphical Independence Networks. R package version 0.8-12, URL http://CRAN.R-project.org/package=gRain.

Hojsgaard S, Dethlefsen C, Bowsher C (2012). gRbase: A Package for Graphical Modelling in R. R package version 1.4.4, URL http://CRAN.R-project.org/package=gRbase.

Hojsgaard S, Lauritzen SL (2011). gRc: Inference in Graphical Gaussian Models with Edge and Vertex Symmetries. R package version 0.3.0, URL http://CRAN.R-project.org/package=gRc.

Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, Kidd M, King A, Meyer M, Slade D, Lum P, Stepaniants S, Shoemaker D, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend S (2000). “Functional Discovery via a Compendium of Expression Profiles.” Cell, 102, 109–126.

Kalisch M, Buhlmann P (2007). “Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm.” Journal of Machine Learning Research, 8, 613–636.

Kalisch M, Fellinghauer B, Grill E, Maathuis MH, Mansmann U, Buhlmann P, Stucki G (2010). “Understanding Human Functioning Using Graphical Models.” BMC Medical Research Methodology, 10:14.

Kalisch M, Machler M, Colombo D, Maathuis MH, Buhlmann P (2012). “Causal Inference Using Graphical Models with the R Package pcalg.” Journal of Statistical Software, 47(11), 1–26. ISSN 1548-7660. URL http://www.jstatsoft.org/v47/i11.

Lauritzen S (1996). Graphical Models. Oxford University Press.

Maathuis MH, Colombo D (2013). “A generalized backdoor criterion.” ArXiv preprint 1307.5636. URL http://arxiv.org/abs/1307.5636.

Maathuis MH, Colombo D, Kalisch M, Buhlmann P (2010). “Predicting Causal Effects in Large-Scale Systems from Observational Data.” Nature Methods, 7, 261–278.

Maathuis MH, Kalisch M, Buhlmann P (2009). “Estimating High-Dimensional Intervention Effects from Observational Data.” Annals of Statistics, 37, 3133–3164.

Meek C (1995). “Strong Completeness and Faithfulness in Bayesian Networks.” In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 1995), pp. 411–418. Morgan Kaufmann.

Nagarajan R, Datta S, Scutari M, Beggs ML, Nolen GT, Peterson CA (2010). “Functional Relationships between Genes Associated with Differentiation Potential of Aged Myogenic Progenitors.” Frontiers in Physiology, 1.

Pearl J (1993). “Comment: Graphical models, causality and intervention.” Statistical Science, 8, 266–269.

Pearl J (2000). Causality. Cambridge University Press.


R Development Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Ramsey J, Zhang J, Spirtes P (2006). “Adjacency-Faithfulness and Conservative Causal Inference.” In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 401–408. AUAI Press, Arlington, Virginia.

Richardson TS, Spirtes P (2002). “Ancestral Graph Markov Models.” Annals of Statistics, 30, 962–1030.

Scutari M (2010). “Learning Bayesian Networks with the bnlearn R Package.” Journal of Statistical Software, 35(3), 1–22. URL http://www.jstatsoft.org/v35/i03/.

Silander T, Myllymaki P (2006). “A simple approach for finding the globally optimal Bayesian network structure.” In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 445–452.

Spirtes P (2001). “An anytime algorithm for causal inference.” In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, pp. 213–221. Morgan Kaufmann, San Francisco.

Spirtes P, Glymour C, Scheines R (2000). Causation, Prediction, and Search. Adaptive Computation and Machine Learning, second edition. MIT Press, Cambridge.

Spirtes P, Meek C, Richardson T (1999a). “An Algorithm for Causal Inference in the Presence of Latent Variables and Selection Bias.” In Computation, Causation and Discovery, pp. 211–252. MIT Press.

Spirtes P, Meek C, Richardson TS (1999b). An Algorithm for Causal Inference in the Presence of Latent Variables and Selection Bias, pp. 211–252. MIT Press.

Verma T, Pearl J (1990). “Equivalence and Synthesis of Causal Models.” In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (UAI 1990), pp. 255–270.

Zhang J (2008). “On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias.” Artificial Intelligence, 172, 1873–1896.

Affiliation:

Markus Kalisch
Seminar fur Statistik
ETH Zurich
8092 Zurich, Switzerland
E-mail: [email protected]