Part V: Causal Discovery 2: Linear, Non-Gaussian Models
• “Independence mechanism” instantiation 2: Independent noise condition
• Causal discovery based on structural equation models: linear non-Gaussian case
• Extensions to deal with confounders/cycles
Fully Identifiable Causal Structure? Two-Variable Case
• Structural equation model / functional causal model: Y = f(X, E), where E ⫫ X
• Related to this type of "independence": P(X) generates X, and P(Y|X) generates Y
• Start with the linear case: Y = aX + E, where E ⫫ X
• Determine causal direction in the two-variable case? Identifiability!
• Given data such as

      X     Y
     -1.1   1.0
      2.1   2.0
      3.1   4.2
      2.3  -0.6
      1.3   2.2
     -1.8   0.9
      ...   ...

  which structure generated it: X → Y, or X ← Y, or X ← Z → Y?
Gaussian vs. Non-Gaussian Distributions

[Figure: densities of three distributions with zero mean and unit variance (Gaussian, Laplacian, uniform), together with 500-point sample time series drawn from each.]
Causal Asymmetry in the Linear Case: Illustration

Data generated by Y = aX + E (i.e., X → Y). Fit linear regression in both directions: Y = aX + E_Y and X = bY + E_X.

[Figure: scatter plots of (X, Y), of X against the residual E_Y, and of Y against the residual E_X, for the Gaussian, uniform, and super-Gaussian cases. In the Gaussian case the residual is uncorrelated with, and in fact independent of, the regressor in both directions; in the uniform and super-Gaussian cases only the true direction X → Y yields an independent residual.]
More Generally, the LiNGAM Model
• Linear, Non-Gaussian, Acyclic causal Model (LiNGAM) (Shimizu et al., 2006):

      X_i = Σ_{j: parents of i} b_ij X_j + E_i,   or   X = BX + E

• Disturbances (errors) E_i are non-Gaussian (or at most one is Gaussian) and mutually independent
• Example (graph: X2 → X3 with weight 0.5; X2 → X1 with weight −0.2; X3 → X1 with weight 0.3):

      X2 = E2,
      X3 = 0.5 X2 + E3,
      X1 = −0.2 X2 + 0.3 X3 + E1.

Shimizu et al. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.
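As a sketch (pure NumPy; the variable names and the uniform noise are my choices, not part of the slide), the three-variable example above can be simulated and checked against the matrix form X = BX + E, i.e., X = (I − B)⁻¹E:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Non-Gaussian (uniform) disturbances, mutually independent
E2, E3, E1 = rng.uniform(-1, 1, (3, n))

# The example LiNGAM from the slide
X2 = E2
X3 = 0.5 * X2 + E3
X1 = -0.2 * X2 + 0.3 * X3 + E1

# Matrix form X = BX + E with variables ordered (X1, X2, X3)
B = np.array([[0.0, -0.2, 0.3],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.0]])
E = np.vstack([E1, E2, E3])

# Solving (I - B) X = E reproduces the recursively generated data
X = np.linalg.solve(np.eye(3) - B, E)
assert np.allclose(X, np.vstack([X1, X2, X3]))
```

Because B is (permutable to) strictly lower triangular, I − B is invertible and the recursive and matrix generations coincide exactly.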
Identifiability of Causal Direction in the Linear Case
• Supported by "independent component analysis" theory
• Later this will be approached in a more general nonlinear setting, and you'll see that the linear-Gaussian case is one of the few non-identifiable situations
Darmois-Skitovitch Theorem & Identifiability of Causal Direction in the Two-Variable Case

Darmois-Skitovitch theorem: Define two random variables, Y1 and Y2, as linear combinations of independent random variables S_i, i = 1, ..., n:

    Y1 = α1 S1 + α2 S2 + ... + αn Sn,
    Y2 = β1 S1 + β2 S2 + ... + βn Sn.

If Y1 and Y2 are statistically independent, then all variables S_j for which α_j β_j ≠ 0 are Gaussian.

Kagan et al., Characterization Problems in Mathematical Statistics. New York: Wiley, 1973.
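A quick numerical illustration of the theorem (my construction, not from the slide): take Y1 = S1 + S2 and Y2 = S1 − S2, so α_j β_j ≠ 0 for both sources. With Gaussian sources the pair is independent; with uniform sources it cannot be, which a simple dependence probe already detects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sq_corr(y1, y2):
    # Correlation between squares: a crude dependence probe
    # (zero for independent variables; not a full independence test)
    return np.corrcoef(y1**2, y2**2)[0, 1]

results = {}
for name, draw in [("Gaussian", rng.standard_normal),
                   ("uniform", lambda size: rng.uniform(-np.sqrt(3), np.sqrt(3), size))]:
    S1, S2 = draw((2, n))          # independent, zero-mean, unit-variance sources
    results[name] = sq_corr(S1 + S2, S1 - S2)
print(results)
```

For the Gaussian case the probe is near zero (Y1, Y2 are jointly Gaussian and uncorrelated, hence independent); for the uniform case it is clearly non-zero, consistent with the theorem.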
Generated by Y = aX + E (X → Y).
Assuming Y → X (fitting X = bY + E_Y):

    [E_Y]   [1  -b]   [X]             [X]   [1  0]   [X]
    [ Y ] = [0   1] · [Y] ,   where   [Y] = [a  1] · [E] ,

so

    [E_Y]   [1-ab  -b]   [X]
    [ Y ] = [  a    1] · [E] .

If E_Y ⫫ Y held, then, since both X and E enter both rows with non-zero coefficients (whenever a, b ≠ 0 and ab ≠ 1), the Darmois-Skitovitch theorem would force X and E to be Gaussian. With non-Gaussian X or E, the backward model therefore cannot have an independent residual.
Independent Component Analysis

[Diagram: independent sources S1, ..., Sn pass through an unknown mixing system A to give the observed signals X1, ..., Xm (X = A·S); the ICA system applies a de-mixing matrix W, an estimate related to A, to produce outputs Y1, ..., Yn (Y = W·X) that are as independent as possible.]

• Assumptions in ICA:
  • At most one of the S_i is Gaussian
  • #Sensors ≥ #Sources, and A is of full column rank
• E.g., the observed data matrix factorizes as

      [ .5  .3  1.1  -0.3 ... ]   [? ?]   [? ? ? ? ...]
      [ .8 -.7   .3    .5 ... ] = [? ?] · [? ? ? ? ...]

• Then A can be estimated up to column scale and permutation indeterminacies

Hyvärinen et al., Independent Component Analysis, 2001
Recovering the Data-Generating Process

X = A·S describes a causal process from the sources S1, ..., Sn to the observed signals X1, ..., Xm:
• This is a causal process, but may be less interesting because the S_k are usually not observable/manipulable
• When S_k is identical to X_j, we have causal relations among the X_j
Intuition: Why Does ICA Work?
• (After preprocessing) ICA aims to find a rotation transformation Y = W·X making the Y_i independent
• Achieved by maximum likelihood log p(X|A), mutual information MI(Y1,...,Ym) minimization, infomax, ...

[Figure: scatter plots of the observed (X1, X2) and of candidate rotated outputs (Y1, Y2).]
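The "find a rotation" intuition can be sketched directly (my construction; a brute-force grid search over rotation angles, not a practical ICA algorithm): whiten the mixed data, then pick the rotation whose outputs are maximally non-Gaussian, measured here by excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Independent uniform (sub-Gaussian) unit-variance sources, arbitrary mixing
S = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, n))
A = np.array([[0.9, 0.4],
              [0.2, 0.7]])
X = A @ S

# Preprocessing: whiten (zero mean, identity covariance); what remains
# undetermined after whitening is exactly a rotation
X = X - X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(np.cov(X))
Z = V @ np.diag(d**-0.5) @ V.T @ X

def excess_kurtosis(y):
    return np.mean(y**4) - 3.0   # y is zero-mean, unit-variance here

# Grid search: the separating rotation maximizes total non-Gaussianity
thetas = np.linspace(0, np.pi / 2, 600)
def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
scores = [sum(abs(excess_kurtosis(y)) for y in rot(t) @ Z) for t in thetas]
Y = rot(thetas[int(np.argmax(scores))]) @ Z
```

At the best angle, each output Y_i matches one source up to sign and permutation; at other angles the outputs are mixtures and, by the central limit theorem, closer to Gaussian.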
How Does ICA Work? By Maximum Likelihood
• From a maximum likelihood perspective: with X = A·S, Y = W·X, and p_S = ∏_{i=1}^n p_{S_i},

      p_X = ∏_{i=1}^n p_{S_i}(w_i^T X) · |det W|   (equivalently, divided by |det A|),

  so the log-likelihood is

      Σ_{t=1}^T log p_X(x^t) = Σ_{t=1}^T Σ_{i=1}^n log p_{S_i}(w_i^T x^t) + T log |det W|

  (x^t: the t-th data point; w_i^T: the i-th row of W.)
• To be maximized by a gradient-based or natural-gradient-based method
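A minimal sketch of the natural-gradient maximization (my implementation, assuming Laplacian sources; the slide does not prescribe a particular nonlinearity): for log p_S(y) = −|y| + const, the score is −sign(y), giving the classic relative-gradient update W ← W + lr·(I − E[sign(Y)Yᵀ])·W:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Laplacian (super-Gaussian) sources and a fixed mixing matrix
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.5],
              [0.3, 0.8]])
X = A @ S

# Natural-gradient ascent on the ICA log-likelihood with g(y) = sign(y)
W = np.eye(2)
lr = 0.05
for _ in range(1000):
    Y = W @ X
    W = W + lr * (np.eye(2) - (np.sign(Y) @ Y.T) / n) @ W
Y = W @ X   # recovered sources, up to scale and permutation
```

At convergence W·A is approximately a scaled permutation matrix, so each Y_i tracks one source S_j; the scaling is pinned down by the assumed unit-scale Laplacian density.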
Probabilities: Product and Sum Rules Are Fundamental
1. P(X, Y|I) = P(X|Y, I)·P(Y|I)
2. P(X|I) = Σ_Y P(X, Y|I)
(I: background information. In the continuous case the sum is replaced by an integral.)
How Does ICA Work? By Mutual Information Minimization
• Mutual information I(Y1,...,Yn) is the Kullback-Leibler divergence from P_Y to ∏_i P_{Y_i}:

      I(Y1,...,Yn) = ∫...∫ p_{Y1,...,Yn} log [ p_{Y1,...,Yn} / (p_{Y1} ··· p_{Yn}) ] dy1...dyn
                   = Σ_i H(Y_i) − H(Y)
                   = Σ_i H(Y_i) − H(X) − log |det W|,   because Y = WX

• Nonnegative, and zero iff the Y_i are independent
• H(·): differential entropy, a measure of how random the variable is

Hyvärinen et al., Independent Component Analysis, 2001
How Does ICA Work? Some Interpretation
• Some methods (e.g., FastICA, JADE) pre-whiten the data and then aim to find a rotation, for which |det W| = 1, so that

      I(Y1,...,Yn) = Σ_i H(Y_i) − H(X) − log |det W| = Σ_i H(Y_i) + const.

• Minimizing I ⇔ minimizing the entropies H(Y_i)
• Given the variance, the Gaussian distribution has the largest entropy (among all continuous distributions)
• ⇒ Maximizing non-Gaussianity!
• FastICA adopts approximations of the negentropy of each output Y_i
Non-Gaussianity is Informative in the Linear Case
• Smaller entropy: more structured, more interesting
• "Purer" according to the central limit theorem
• Which direction is more interesting?

From Hyvärinen et al., Independent Component Analysis, 2001 (Ch. 8):

Fig. 8.26 An illustration of projection pursuit and the "interesting" directions. The data in this figure is clearly divided into two clusters. The goal in projection pursuit is to find the projection (here, on the horizontal axis) that reveals the clustering or other structure of the data.

On the other hand, the projection on the vertical direction, which is also the direction of the first principal component, fails to show this structure. This also shows that PCA does not use the clustering structure. In fact, clustering structure is not visible in the covariance or correlation matrix on which PCA is based.

Thus projection pursuit is usually performed by finding the most nongaussian projections of the data. This is the same thing that we did in this chapter to estimate the ICA model. This means that all the nongaussianity measures and the corresponding ICA algorithms presented in this chapter could also be called projection pursuit "indices" and algorithms.

It should be noted that in the formulation of projection pursuit, no data model or assumption about independent components is made. If the ICA model holds, optimizing the ICA nongaussianity measures produces independent components; if the model does not hold, then what we get are the projection pursuit directions.

8.6 CONCLUDING REMARKS AND REFERENCES

A fundamental approach to ICA is given by the principle of nongaussianity. The independent components can be found by finding directions in which the data is maximally nongaussian. Nongaussianity can be measured by entropy-based measures or cumulant-based measures like kurtosis. Estimation of the ICA model can then be performed by maximizing such nongaussianity measures; this can be done by gradient methods or by fixed-point algorithms. Several independent components can be found by finding several directions of maximum nongaussianity under the constraint of decorrelation.
A Demo of the ICA Procedure
Why Was Gaussianity So Widely Used?
• Central limit theorem: an illustration
• "Simplicity" of the form; completely characterized by mean and covariance
• Marginals and conditionals are also Gaussian
• Has maximum entropy, given the mean and the covariance matrix

E. T. Jaynes. Probability Theory: The Logic of Science. 1994. Chapter 7.

[Figure: histograms of U_i, (U1+U2)/sqrt(2), and (U1+U2+U3)/sqrt(3) for independent uniform U_i; the standardized sums look increasingly Gaussian.]
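The histogram illustration can be reproduced quantitatively (a sketch of mine, measuring non-Gaussianity by excess kurtosis rather than by eye): for standardized sums of k iid uniforms, the excess kurtosis is −1.2/k, shrinking toward the Gaussian value 0 as k grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3.0   # 0 for a Gaussian

# Standardized sums (U1 + ... + Uk) / sqrt(k) of iid uniform variables
U = rng.uniform(-1, 1, (30, n))
results = {}
for k in (1, 2, 3, 30):
    results[k] = excess_kurtosis(U[:k].sum(axis=0) / np.sqrt(k))
print({k: round(v, 3) for k, v in results.items()})
```

This is the central limit theorem at work, and also the reason mixtures of independent non-Gaussian sources look "more Gaussian" than the sources themselves.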
Gaussianity or Non-Gaussianity?
• Non-Gaussianity is actually ubiquitous
• Linear closure property of the Gaussian distribution: if the sum of any finite number of independent variables is Gaussian, then all summands must be Gaussian (Cramér, 1970)
• The Gaussian distribution is "special" in the linear case
• Practical issue: how non-Gaussian are they?
LiNGAM Analysis by ICA
• LiNGAM: X_i = Σ_{j: parents of i} b_ij X_j + E_i, or X = BX + E ⇒ E = (I−B)X
• B has special structure: acyclic relations
• ICA gives Y = WX
• B can be seen from W by permutation and re-scaling
• The faithfulness assumption is avoided
• E.g., ICA may return

      [E1]   [  1     0    0 ]   [X2]        X2 = E1,
      [E3] = [ -0.5   1    0 ] · [X3]   ⇔   X3 = 0.5 X2 + E3,
      [E2]   [  0.2  -0.3  1 ]   [X1]        X1 = -0.2 X2 + 0.3 X3 + E2,

  so we have the causal relations (graph: X2 → X3 with 0.5; X2 → X1 with −0.2; X3 → X1 with 0.3)
• Question 1: How to find W?
• Question 2: How to see B from W?
Can You See Causal Relations from W?
• ICA gives Y = WX, e.g.,

      W = [ 0.6  -0.4  2    0
            1.5   0    0    0
            0     0.2  0    0.5
            1.5   3    0    0  ]

• Can we find the causal model?
  1. First permute the rows of W to make all diagonal entries non-zero, yielding Ẅ.
  2. Then divide each row of Ẅ by its diagonal entry, giving Ẅ′.
  3. B = I − Ẅ′.
• LiNGAM: X = BX + E, or E = (I−B)X; ICA gives Y = WX
• So W is a row-permuted and re-scaled version of (I−B). Can we determine B uniquely?
• (I−B): all diagonal entries are 1, hence the procedure above
• Uniqueness is implied by acyclicity of the causal relations (B can be permuted to strict lower-triangularity, i.e., if the X_i follow the causal ordering, B is strictly lower triangular)
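The three-step procedure can be sketched on the example W above (my implementation; the brute-force search over row permutations is only meant for small m):

```python
import numpy as np
from itertools import permutations

# Example de-mixing matrix from ICA (rows in arbitrary order and scale)
W = np.array([[0.6, -0.4, 2.0, 0.0],
              [1.5,  0.0, 0.0, 0.0],
              [0.0,  0.2, 0.0, 0.5],
              [1.5,  3.0, 0.0, 0.0]])

# Step 1: permute rows so that all diagonal entries are non-zero
for perm in permutations(range(4)):
    W2 = W[list(perm)]
    if np.all(np.abs(np.diag(W2)) > 1e-9):
        break

# Step 2: divide each row by its diagonal entry
W2p = W2 / np.diag(W2)[:, None]

# Step 3: B = I - W''
B = np.eye(4) - W2p
print(np.round(B, 2))
```

For this W the qualifying row permutation is unique, and the resulting B is strictly lower triangular, i.e., the variables are already in a causal order.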
ICA-Based LiNGAM: A Real Example

[Figure: scatter plot of height vs. arm span, with the two estimated mixing directions A(:,1) and A(:,2) overlaid.]

ICA gives

    (Y1, Y2)^T = W · (height, arm span)^T,  where  W = [ 1.33  -0.39
                                                        -1.56   2.02 ],
or

    A = W^{-1} = [ 0.97  0.19
                   0.75  0.64 ].

If W_{1,2} = 0, then

    height = (1/1.33) Y1,
    arm span = (1.56/2.02) height + (1/2.02) Y2.

For small-scale problems, we can compare the dependence between the residual and the hypothetical cause in both directions!
Independence Test / Dependence Measure
• Measure: mutual information MI(Y1, Y2) ≥ 0, with equality iff Y1 ⫫ Y2
• Statistical tests for independence:
  • Y1 ⫫ Y2 if and only if all functions of them are uncorrelated
  • The functional space can be narrowed down to a reproducing kernel Hilbert space
  • HSIC independence test; kernel-based (conditional) independence test; other tests also exist

Gretton et al. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 585–592.
Zhang et al. (2011). Kernel-based conditional independence test and application in causal discovery. In Proc. UAI, 804–813.
Real Examples: By Checking Independence in Both Directions

[Figure: for the height/arm-span data, regressions in both directions, each shown as a scatter of the hypothetical cause against the other variable and against the regression residuals.]
• One direction: p-value for the independence test is 0.93; MI = 0.002
• The other direction: p-value for the independence test is 0.53; MI = 0.042
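The "regress both ways, check the residual" recipe can be sketched on synthetic data (my stand-in for the height/arm-span example, with a crude squared-correlation dependence probe instead of a proper HSIC test):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Ground truth X -> Y with uniform (non-Gaussian) cause and noise
X = rng.uniform(-1, 1, n)
Y = 0.8 * X + rng.uniform(-1, 1, n)

def residual(cause, effect):
    # OLS residual of regressing effect on cause
    b = np.cov(cause, effect)[0, 1] / np.var(cause)
    return effect - b * cause

def dep(u, v):
    # Crude dependence probe: correlation between squares
    # (zero for independent variables; not a full independence test)
    return abs(np.corrcoef(u**2, v**2)[0, 1])

forward = dep(X, residual(X, Y))    # residual vs. cause, correct direction
backward = dep(Y, residual(Y, X))   # residual vs. cause, reversed direction
print(forward, backward)
```

Only in the true direction is the residual independent of the hypothetical cause; the reversed fit leaves a clearly dependent residual, which is what the p-values on the real data reflect.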
Some Estimation Methods for LiNGAM
• ICA-LiNGAM
• ICA with sparse connections
• DirectLiNGAM, ...

Shimizu et al. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.
Zhang et al. (2009). ICA with sparse connections: revisited. Lecture Notes in Computer Science, 5441:195–202.
Shimizu et al. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248.
ICA-Based LiNGAM: A Demonstration

[Graph: four variables X1, ..., X4 connected by linear edges with coefficients 0.5, −0.6, 0.3, 0.8, −0.5, with uniformly distributed noise.]
• What's the output of PC? Of LiNGAM?
• Data is in 'LiNGAM_4variables.txt'
Practical Issues beyond LiNGAM
• Confounding (SGS 1993; Hoyer et al., 2008)
• Feedback (Lacerda et al., 2008; Richardson, 1996)
• Nonlinearities (Zhang & Chan, ICONIP'06; Hoyer et al., NIPS'08; Zhang & Hyvärinen, UAI'09; Huang et al., KDD'18)
• Causality in time series
  • Time-delayed + instantaneous relations (Hyvärinen, ICML'08; Zhang et al., ECML'09; Hyvärinen et al., JMLR'10)
  • Subsampling / temporal aggregation (Danks & Plis, NIPS WS'14; Gong et al., ICML'15 & UAI'17)
  • From partially observable time series (Geiger et al., ICML'15)
• Nonstationary/heterogeneous data (Zhang et al., IJCAI'17; Huang et al., ICDM'17)
• Measurement error (Zhang et al., UAI'18; PSA'18)
• Selection bias (Zhang et al., UAI'16)
Are They Confounders?

[Graph: five variables X1, ..., X5 with noises E1, ..., E5 and linear coefficients 1.2, 3, 2, −0.3, 0.5. Which of X1, X4, X2, X5 act as confounders?]
Identifiability of Overcomplete ICA
• More independent sources than observed variables, i.e., n > m

Theorem: Suppose the random vector X = (X1, ..., Xm)^T is generated by X = AS, where the components of S, namely S1, ..., Sn, are statistically independent. Even when n > m, the columns of A are still identifiable up to a scale transformation if
• all S_i are non-Gaussian, or
• A is of full column rank and at most one of the S_i is Gaussian.

E.g., with n = 3 sources and m = 2 sensors:

      [ .5  .3  1.1  -0.3 ... ]   [? ? ?]   [? ? ? ? ...]
      [ .8 -.7   .3    .5 ... ] = [? ? ?] · [? ? ? ? ...]
                                            [? ? ? ? ...]

Kagan et al., Characterization Problems in Mathematical Statistics. New York: Wiley, 1973.
Eriksson and Koivunen (2004). Identifiability, separability and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604.
Overcomplete ICA: Illustration

[Figure: scatter plot of (X1, X2) generated by the overcomplete model below; the four source directions are visible in the data cloud.]

    [X1]   [0.8  0.4  -0.9  0]   [S1]
    [X2] = [0.3  0.8   0.8  1] · [S2]
                                 [S3]
                                 [S4]

What if they are Gaussian?
Causal Discovery under Confounders
• Model: X1 = a1·Z + E1, X2 = a3·X1 + a2·Z + E2, with hidden confounder Z
• Can we see the causal direction? Can we determine a3? a1 and a2?
• In matrix form,

      [X1]   [ 1   0       a1      ]   [E1]   [ 1   0       1       ]   [ E1  ]
      [X2] = [ a3  1  a1·a3 + a2   ] · [E2] = [ a3  1  a3 + a2/a1   ] · [ E2  ]
                                       [Z ]                             [a1·Z ]

• Observationally equivalent model: treat a1·Z as the hidden cause; then

      X1 = E1 + (a1·Z),
      X2 = (a3 + a2/a1)·X1 − (a2/a1)·E1 + E2,

  i.e., an X1 → X2 edge with coefficient a3 + a2/a1, plus a direct effect −a2/a1 of E1 on X2.

Hoyer et al. (2008). Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362–378.
Confounders: Example
• Same model with a1 = 0.6, a2 = 0.8, a3 = 0.5:

      [X1]   [ 1   0       a1     ]   [E1]   [ 1   0       1      ]   [ E1  ]
      [X2] = [ a3  1  a1·a3 + a2  ] · [E2] = [ a3  1  a3 + a2/a1  ] · [ E2  ]
                                      [Z ]                            [a1·Z ]

[Figure: 3-D scatter of (Z, E1, E2) and the resulting 2-D scatter of (X1, X2).]
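The observational equivalence of the two mixing representations can be checked numerically for the example values (a sketch of mine; note the second representation simply absorbs the scale a1 into the hidden source):

```python
import numpy as np

a1, a2, a3 = 0.6, 0.8, 0.5

# Two mixing matrices for the confounded model
# X1 = a1*Z + E1,  X2 = a3*X1 + a2*Z + E2
A_orig = np.array([[1.0, 0.0, a1],
                   [a3,  1.0, a1 * a3 + a2]])
A_equiv = np.array([[1.0, 0.0, 1.0],
                    [a3,  1.0, a3 + a2 / a1]])

rng = np.random.default_rng(0)
E1, E2, Z = rng.uniform(-1, 1, (3, 1000))
X_orig = A_orig @ np.vstack([E1, E2, Z])
X_equiv = A_equiv @ np.vstack([E1, E2, a1 * Z])   # re-scaled hidden source
assert np.allclose(X_orig, X_equiv)
```

Both representations produce identical observed data, which is why overcomplete ICA can only identify the columns of A up to scale, not the individual parameters a1 and a2.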
What If There Are Cycles/Loops?

[Graph: a cycle among X2, X3, X4 with coefficients 2, −1, −0.3.]
Cycles
• Causal relations may have cycles; consider an example:

  [Graph: X1 → X2 (1.2), X2 → X3 (2), X3 → X4 (−1), X4 → X2 (−0.3), X2 → X5 (3), with noises E1, ..., E5.]

      X1 = E1
      X2 = 1.2 X1 − 0.3 X4 + E2
      X3 = 2 X2 + E3
      X4 = −X3 + E4
      X5 = 3 X2 + E5

• Or in matrix form, X = BX + E, where

      B = [ 0    0    0    0    0
            1.2  0    0   -0.3  0
            0    2    0    0    0
            0    0   -1    0    0
            0    3    0    0    0 ]

Lacerda, Spirtes, Ramsey and Hoyer (2008). Discovering cyclic causal models by independent component analysis. In Proc. UAI.
A conditional-independence-based method is given in T. Richardson (1996). A polynomial-time algorithm for deciding Markov equivalence of directed cyclic graphical models. In Proc. UAI.
Why Cycles?
• Some situations where we can recover cycles with ICA:
  • Each process reaches its equilibrium state, and we observe the equilibrium states of multiple processes:

      Suppose X_t = B·X_{t−1} + E_t. At convergence we have X_t = X_{t−1} for each dynamical process, so
      X_t = B·X_t + E_t,  or  E_t = (I − B)·X_t.

  • On temporally aggregated data:

      Suppose the underlying process is X_t = B·X_{t−1} + E_t, but we just observe X̄_t = (1/L)·Σ_{k=1}^L X_{t+k}. Since
      (1/L)·Σ_{k=1}^L X_{t+k} = B·(1/L)·Σ_{k=1}^L X_{t+k−1} + (1/L)·Σ_{k=1}^L E_{t+k},
      we have X̄_t = B·X̄_t + Ē_t as L → ∞.

  [Diagram: the unrolled two-variable time series X_{1,t−1}, X_{2,t−1} → X_{1,t}, X_{2,t} → X_{1,t+1}, X_{2,t+1} with transition matrix B, collapsing into the instantaneous cycle X1 ⇄ X2.]
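The equilibrium argument can be sketched numerically on the five-variable cyclic example (my setup; I fix one noise draw per dynamical process and iterate to convergence, which is valid here because the cycle product 2·(−1)·(−0.3) = 0.6 has magnitude below 1):

```python
import numpy as np

# Coefficient matrix of the cyclic example (stable cycle X2 -> X3 -> X4 -> X2)
B = np.array([[0,   0,  0,    0, 0],
              [1.2, 0,  0, -0.3, 0],
              [0,   2,  0,    0, 0],
              [0,   0, -1,    0, 0],
              [0,   3,  0,    0, 0]], dtype=float)

rng = np.random.default_rng(0)
e = rng.uniform(-1, 1, 5)   # one fixed noise draw for one dynamical process

# Iterate X_t = B X_{t-1} + e to equilibrium
x = np.zeros(5)
for _ in range(200):
    x = B @ x + e

# At convergence the instantaneous cyclic model holds: x = Bx + e
assert np.allclose(x, B @ x + e)
assert np.allclose(x, np.linalg.solve(np.eye(5) - B, e))
```

Observing many such equilibrium states (one per independent noise draw) thus yields iid samples from the instantaneous cyclic model X = BX + E.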
Examples
• Equilibrium states of multiple processes: consider the price and demand of the same product in different states:

      price_t = b1 · price_{t−1} + b2 · demand_{t−1} + E1
      demand_t = b3 · price_{t−1} + b4 · demand_{t−1} + E2

• Temporally aggregated data: suppose the underlying process is X_t = B·X_{t−1} + E_t, but we just observe X̄_t = (1/L)·Σ_{k=1}^L X_{t+k}. Consider the causal relation between two stocks: the causal influence takes place very quickly (~1–2 minutes), but we only have daily returns.
Can We Recover Cyclic Relations?
• E = (I−B)X; ICA can give Y = WX
• Without cycles: unique solution for B
• With cycles: the solutions for B are no longer unique; why? :-(
• A 2-D example: suppose we have the process

      X_t = [ 0  b ] X_t + E_t        (call this matrix B)
            [ a  0 ]

  That is, (I−B)X = E, or

      [  1  -b ] X_t = E_t                                  (W)
      [ -a   1 ]

      ⇒ [ -a   1 ] X_t = [ 0  1 ] · E_t
        [  1  -b ]       [ 1  0 ]

      ⇒ [  1    -1/a ] X_t = [  0    -1/a ] · E_t           (W′)
        [ -1/b   1   ]       [ -1/b   0   ]

      ⇒ X_t = [ 0    1/a ] X_t + [  0    -1/a ] · E_t       (B′)
              [ 1/b  0   ]       [ -1/b   0   ]

  Equivalently,

      X1 = b X2 + E1          X2 = (1/b) X1 − (1/b) E1
      X2 = a X1 + E2    ⇔     X1 = (1/a) X2 − (1/a) E2

• Summary: 1. There are still m independent components; 2. W cannot be permuted to be lower triangular
• Only one solution is stable (assuming no self-loops), i.e., such that |product of coefficients over the cycle| < 1 :-)
Can You Find the Alternative Causal Model?
• For the cyclic example:

      X1 = E1
      X2 = 1.2 X1 − 0.3 X4 + E2
      X3 = 2 X2 + E3
      X4 = −X3 + E4
      X5 = 3 X2 + E5

• Or in matrix form, X = BX + E, where

      B = [ 0    0    0    0    0
            1.2  0    0   -0.3  0
            0    2    0    0    0
            0    0   -1    0    0
            0    3    0    0    0 ]

• So

      I − B = [  1    0    0    0    0
                -1.2  1    0    0.3  0
                 0   -2    1    0    0
                 0    0    1    1    0
                 0   -3    0    0    1 ]

• A different row permutation that also leaves the diagonal non-zero is

      W′ = [  1    0    0    0    0
              0   -2    1    0    0
              0    0    1    1    0
             -1.2  1    0    0.3  0
              0   -3    0    0    1 ]

  That is, after re-scaling the rows,

      B′ = [ 0    0    0    0    0
             0    0    0.5  0    0
             0    0    0   -1    0
             4   -3.3  0    0    0
             0    3    0    0    0 ]
• The alternative model B′ corresponds to the graph: X3 → X2 (0.5), X4 → X3 (−1), X1 → X4 (4), X2 → X4 (−3.3), X2 → X5 (3), with noises E1, E′2, E′3, E′4, E5.
• Intuition: X2 = 1.2 X1 − 0.3 X4 + E2 can be rewritten as

      X4 = (1.2/0.3) X1 − (1/0.3) X2 + (1/0.3) E2 ...

  i.e., X4 = 4 X1 − 3.3 X2 + E′4 with E′4 = E2/0.3, reversing edges around the cycle.
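The construction of B′ from B can be checked mechanically (my sketch; the permutation below is the alternative row order of I − B whose diagonal is non-zero):

```python
import numpy as np

B = np.array([[0,   0,  0,    0, 0],
              [1.2, 0,  0, -0.3, 0],
              [0,   2,  0,    0, 0],
              [0,   0, -1,    0, 0],
              [0,   3,  0,    0, 0]], dtype=float)
W = np.eye(5) - B

# Alternative row order that also leaves the diagonal non-zero
perm = [0, 2, 3, 1, 4]
Wp = W[perm]
Wp = Wp / np.diag(Wp)[:, None]   # re-scale rows so the diagonal is 1
B_alt = np.eye(5) - Wp
print(np.round(B_alt, 2))
```

The result matches the slide's B′: zero diagonal, with X4 now depending on X1 and X2 via coefficients 4 and −1/0.3 ≈ −3.3.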
Some Simulation Result
• Simulate 15000 data points with non-Gaussian noise using the cyclic five-variable model (X1 = E1; X2 = 1.2 X1 − 0.3 X4 + E2; X3 = 2 X2 + E3; X4 = −X3 + E4; X5 = 3 X2 + E5)
• Output of the algorithm:
Fig. 3: The output of LiNG-D: Candidate #1 and Candidate #2
...assuming linearity and no dependence between error terms:

• DGs G1 and G2 are zero partial correlation equivalent if and only if the set of zero partial correlations entailed for all values of the free parameters (non-zero linear coefficients, distribution of the error terms) of a linear SEM with DG G1 is the same as the set of zero partial correlations entailed for all values of the free parameters of a linear SEM with G2. For linear models, this is the same as d-separation equivalence. [13]

• DGs G1 and G2 are covariance equivalent if and only if for every set of parameter values for the free parameters of a linear SEM with DG G1, there is a set of parameter values for the free parameters of a linear SEM with DG G2 such that the two SEMs entail the same covariance matrix over the substantive variables, and vice-versa.

• DGs G1 and G2 are distribution equivalent if and only if for every set of parameter values for the free parameters of a linear SEM with DG G1, there is a set of parameter values for the free parameters of a linear SEM with DG G2 such that the two SEMs entail the same distribution over the substantive variables, and vice-versa. Do not confuse this with the notion of distribution-entailment equivalence between SEMs: two SEMs with fixed parameters are distribution-entailment equivalent iff they entail the same distribution.

It follows from well-known theorems about the Gaussian case [13], and some trivial consequences of known results about the non-Gaussian case [12], that the following relationships exist among the different senses of equivalence for acyclic graphs: If all of the error terms are assumed to be Gaussian, distribution equivalence is equivalent to covariance equivalence, which in turn is equivalent to d-separation equivalence. If not all of the error terms are assumed to be Gaussian, then distribution equivalence entails (but is not entailed by) covariance equivalence, which entails (but is not entailed by) d-separation equivalence.

So for example, given Gaussian error terms, A ← B and A → B are zero partial correlation equivalent, covariance equivalent, and distribution equivalent. But given non-Gaussian error terms, A ← B and A → B are zero-partial-correlation equivalent and covariance equivalent, but not distribution equivalent. So for Gaussian errors and this pair of DGs, no algorithm that relies only on observational data can reliably select a unique acyclic graph that fits the population distribution as the correct causal graph without making further assumptions; but for all (or all except one) non-Gaussian errors there will always be a unique acyclic graph that fits the population distribution.

While there are theorems about the case of cyclic graphs and Gaussian errors, we are not aware of any such theorems about cyclic graphs with non-Gaussian errors with respect to distribution equivalence. In the case of cyclic graphs with all Gaussian errors, distribution equivalence is equivalent to covariance equivalence, which entails (but is not entailed by) d-separation equivalence [14]. In the case of cyclic graphs in which at most one error term is non-Gaussian, distribution equivalence entails (but is not entailed by) covariance equivalence, which in turn entails (but is not entailed by) d-separation equivalence. However, given at most one Gaussian error term, the important difference between acyclic graphs and cyclic graphs is that no two different acyclic graphs are distribution equivalent, but there are different cyclic graphs that are distribution equivalent.

Hence, no algorithm that relies only on observational data can reliably select a unique cyclic graph that fits the data as the correct causal graph without making further assumptions. For example, the two cyclic graphs in Fig. 3 are distribution equivalent.

5.2 The output of LiNG-D is correct and as fine as possible

Theorem 1. The output of LiNG-D is a set of SEMs that comprise a distribution-entailment equivalence class.

Proof: First, we show that any two SEMs in the output of LiNG-D entail the same distribution.

The weight matrix output by ICA is determined only up to scaling and row permutation. Intuitively, then, permuting the error terms does not change the mixture. Now, more formally:
Lacerda, Spirtes, Ramsey and Hoyer (2008). Discovering cyclic causal models by independent component analysis. In Proc. UAI.