
A Visualizing Tool for Conditional Independence

by

Daoxing Wang

A thesis presented to the University of Waterloo

in fulfillment of the thesis requirement for the degree of

Master of Mathematics

in

Statistics

Waterloo, Ontario, Canada, 2012

© Daoxing Wang 2012


I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.


Abstract

Testing for conditional independence of two variables given another variable, especially a continuous one, is a common but challenging statistical problem. As a data visualization for conditional independence, the conditioning plot does a very good job of displaying the independence of two variables given one or more discrete variables. However, there is no feasible and efficient visualization tool for investigating the independence of two variables given one or more continuous variables. The difficulty lies in how to determine subsets of the conditioning set. In this thesis, I propose two methods to determine the subsets: in the first, each subset has the same size (k-nearest neighbours); in the second, each subset contains all data points in a partition of the control variable's space, with every partition having the same volume.

Keywords: data visualization, conditional independence, slicing, nearest neighbourhood, data partitioning.


Acknowledgements

haha


Table of Contents

List of Tables

List of Figures

1 Introduction

2 Quantitative Measurement of Dependence between Random Variables
2.1 Correlation Coefficient
2.1.1 Pearson's Correlation Coefficient
2.1.2 Spearman's Rank Correlation Coefficient
2.2 Preliminary
2.2.1 Probability via Expectation
2.2.2 Cramer-von Mises Type Statistic
2.3 Hoeffding's Independence Test
2.4 Test of Independence Based on the Empirical Copula Process
2.5 Distance Correlation Coefficient
2.6 Summary

3 Visual Display of Dependence between Random Variables
3.1 Mosaic Display and Its Extended Version
3.1.1 Construction of Mosaic Display
3.1.2 Extension of Mosaic Display
3.2 Picturing Probability: Eikosogram
3.3 Validation of Visual Statistical Inference
3.4 Visual Test of Independence
3.4.1 Visual Test of Independence via Contour

4 Simulation Study

5 Summary

APPENDICES

A Derivations
A.1 Derivation of Spearman's Correlation Coefficient
A.2 Derivation of $\Delta$ in Hoeffding's Independence Test
A.3 Derivation of $\mathrm{Var}(D_n)$ in Hoeffding's Independence Test
A.4 Derivation of Computational Form of $T_n$ in Test of Independence Based on the Empirical Copula Process
A.5 Simplification of $|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2$ in Distance Correlation Coefficient
A.6 Derivation of Computational Form of $\mathcal{V}^2_n(\mathbf{X},\mathbf{Y})$, $\mathcal{V}^2_n(\mathbf{X})$ and $\mathcal{V}^2_n(\mathbf{Y})$ in Distance Correlation Coefficient
B R Code

References


List of Tables

3.1 Hair Color vs Eye Color Data


List of Figures

2.1 Bad Examples of Pearson's Correlation Coefficient

3.1 Mosaic Plot of Hair Color vs Eye Color Data
3.2 Independent Mosaic Plot
3.3 Mosaic Plot
3.4 Colored Mosaic Plot of Hair Color vs Eye Color Data
3.5 A Univariate Example of Eikosogram
3.6 A Bivariate Eikosogram Conditional on X
3.7 A Bivariate Eikosogram Conditional on Y
3.8 Eikosogram Conditional on X of Perfectly Positive Association
3.9 Eikosogram Conditional on Y of Perfectly Positive Association
3.10 Eikosogram Conditional on X of Perfectly Negative Association
3.11 Eikosogram Conditional on Y of Perfectly Negative Association
3.12 Eikosogram Conditional on X of Part Perfectly Positive Association
3.13 Eikosogram Conditional on Y of Part Perfectly Positive Association
3.14 Eikosogram Conditional on X of Part Perfectly Negative Association
3.15 Eikosogram Conditional on Y of Part Perfectly Negative Association
3.16 Eikosogram Conditional on X of Positive Association
3.17 Eikosogram Conditional on Y of Positive Association
3.18 Eikosogram Conditional on X of Negative Association
3.19 Eikosogram Conditional on Y of Negative Association
3.20 Eikosogram Conditional on X of Independence
3.21 Eikosogram Conditional on Y of Independence
3.22 Eikosogram Conditional on X of Mutually Exclusive
3.23 Eikosogram Conditional on Y of Mutually Exclusive
3.24 Eikosogram Conditional on X
3.25 Eikosogram Conditional on Y
3.26 A Univariate Example of Eikosogram
3.27 3-D Eikosogram Conditional on X and Z
3.28 3-D Eikosogram Conditional on Y and Z
3.29 3-D Eikosogram Conditional on X and Y
3.30 3-D Eikosogram Conditional on X and Z
3.31 3-D Eikosogram Conditional on Y and Z
3.32 3-D Eikosogram Conditional on X and Y
3.33 3-D Eikosogram Conditional on X and Z
3.34 3-D Eikosogram Conditional on Y and Z
3.35 3-D Eikosogram Conditional on X and Y
3.36 3-D Eikosogram Conditional on X and Z
3.37 3-D Eikosogram Conditional on Y and Z
3.38 3-D Eikosogram Conditional on X and Y
3.39 Line-up of Example 3.3.1
3.40 Line-up of Example 3.3.2
3.41 Line-up of Example 3.3.3
3.42 An Extended Scatter Plot Allowing Overlap of Data Points
3.43 A Heat Map of 2-D Kernel Density Estimation
3.44 An Extended Scatter Plot Allowing Overlap of Data Points
3.45 A Heat Map of 2-D Kernel Density Estimation


Chapter 1

Introduction


Chapter 2

Quantitative Measurement of Dependence between Random Variables

2.1 Correlation coefficient

To measure the dependence of two variables, correlation is a natural candidate. Several correlation measures were introduced over the last two centuries, such as Pearson's correlation coefficient and Spearman's rank correlation coefficient. Pearson's correlation coefficient is the older of the two; Spearman's rank correlation coefficient was introduced later. Pearson's method is only able to detect a linear relationship between two random variables. Although Spearman's method improves on it, it still cannot detect non-linear relationships other than monotonic ones. Last but not least, the reason why both correlation coefficients fail is discussed.

2.1.1 Pearson’s Correlation Coefficient

The Pearson’s correlation coefficient is commonly taught introductory-level statistical course,and widely known in non-statistical area. Suppose we have two random variables X andY . The Pearson’s correlation coefficient is denoted by ρ and is defined by

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \cdot \sigma_Y} \tag{2.1}$$


where µ is the mean and σ is the standard deviation. ρ is bounded between −1 and 1.

Suppose we have n observations $\{(x_i, y_i),\ i = 1, \ldots, n\}$. The sample Pearson correlation coefficient is

$$r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2.2}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

A well-known fact is that $\rho = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \cdot \sigma_Y} = 0$ if X and Y are independent. However, the reverse is not true, which is a limitation of Pearson's correlation coefficient as a test of dependence. Pearson's correlation coefficient only indicates the strength of the linear relationship between two random variables; it is not able to detect non-linear relationships between them. The weaknesses of Pearson's correlation coefficient are shown in the following examples.


Figure 2.1: Bad Examples of Pearson’s Correlation Coefficient


The sample Pearson’s correlation coefficients of all above examples are about 0. Re-gardless of independence or dependence of two random variables, the Pearson’s correlationcoefficient cannot reveal the association between random variables. In fact, we well knowthat the Pearson’s correlation coefficient is only sensitive to linear relationship.

2.1.2 Spearman’s Rank Correlation Coefficient

Although the Pearson’s correlation coefficient furnish us a mathematical framework to mea-sure association between random variables, the defects restrict it to be applied in practice.Based on the framework, Charles Spearman provide a patch to the defects of the Pearson’scorrelation coefficient. Suppose we have n observations (xi, yi) i = 1, . . . , n. Instead ofcalculating the Pearson’s correlation coefficient of xi’s and yi’s, it may be interesting thatwe calculate for the ranks of xi’s and yi’s. The ranks of xi and yi are denoted by Rxi and

Ryi respectively. Let Rx = 1n

∑ni=1Rx,i = 1

n(1 + · · ·+ n) = 1

nn(n+1)

2= n+1

2and Ry = n+1

2,

so Rx = Ry. Then, the sample Pearson’s correlation coefficient of Rx and Ry is

$$r_{R_x,R_y} = \frac{\sum_{i=1}^{n}(R_{x_i} - \bar{R}_X)(R_{y_i} - \bar{R}_Y)}{\sqrt{\sum_{i=1}^{n}(R_{x_i} - \bar{R}_X)^2 \sum_{i=1}^{n}(R_{y_i} - \bar{R}_Y)^2}} = 1 - \frac{6\sum_{i=1}^{n}(R_{x_i} - R_{y_i})^2}{n(n^2 - 1)}$$

The details of the simplification are provided in the Appendix. The Pearson correlation coefficient of the ranks of X and Y is known as Spearman's rank correlation coefficient, first introduced by Charles Spearman.

Definition 2.1.1. The sample Spearman correlation coefficient is denoted by $r_s$ and is defined as

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad \text{where } d_i = R_{x_i} - R_{y_i} \tag{2.3}$$

Like Pearson's correlation coefficient, $r_s$ is bounded between −1 and 1. The following two examples reveal the improvements of Spearman's rank correlation coefficient over Pearson's correlation coefficient.


The Spearman’s rank correlation coefficient perfectly shows the strong relationshipbetween X and Y , and while Pearson’s correlation coefficient does not. In fact, the Spear-man’s rank correlation coefficient is not only sensitive to a linear relationship but alsomonotonic relationship between two variables. The ranks of x’s and the ranks of yi’s havea perfectly linear relationship if they are a monotonic function of each other. In addition,the Spearman’s correlation coefficient is less sensitive to outliers.

In the second column, outliers are added to the sample. It is very clear that Spearman's rank correlation coefficient is less sensitive to outliers than Pearson's. When there is an outlier, $(R_{x,\text{outlier}} - \bar{R}_x)(R_{y,\text{outlier}} - \bar{R}_y)$ has smaller magnitude than $(x_\text{outlier} - \bar{x})(y_\text{outlier} - \bar{y})$. Thus, outliers have less leverage in Spearman's rank correlation coefficient than in Pearson's correlation coefficient.


However, the Spearman’s rank correlation coefficient does a very poor job on the fol-lowing examples.


Although Spearman’s rank correlation coefficient makes a progress in terms of detectingnon-linear relationship, it still fails in examples as above shown. There are remarks for theSpearman’s correlation coefficient:

• The Spearman’s correlation coefficient can reveal the linear and monotonic relation-ship between random variables.

• The Spearman’s correlation coefficient is less sensitive to outliers than the Pearson’scorrelation coefficient.

• For more complicated relationship, such as

2.2 Preliminary

2.2.1 Probability via Expectation

Whittle [2000] provides a view of probability via expectation. In his formulation, independence between two variables is re-defined, and through this new definition we can explore why the two correlation coefficients above fail.

Definition 2.2.1. Two random variables X and Y are said to be independent if and only if

$$E[H(X)K(Y)] = E[H(X)]\,E[K(Y)] \tag{2.4}$$

for all functions H(·) and K(·) such that both individual expectations exist.

For Pearson’s correlation coefficient, we know that ρ = Cov(X,Y )√V ar(X)·V ar(Y )

= E[XY ]−E[X]E[Y ]√V ar(X)·V ar(Y )

=

0 implies E [XY ] = E [X]E [Y ], which is only a special case of E [H (X)K (Y )] = E [H (X)]E [K (Y )]when H (X) = X and K (Y ) = Y . Therefore, we cannot conclude that two random vari-ables are independent if the Pearson’s correlation coefficient is equal to 0. Similarly, theSpearman’s correlation coefficient is the Pearson’s correlation coefficient of ranks of data,and thus the problem is not solved as well as bad examples are shown above. To fix thisproblem, through the last century, studies on topic of test of independence propose newmethods based on different H (X) = X and K (Y ) = Y . There are three methods dis-cussed in this chapter, and all equivalent null hypotheses of independence are based onE [H (X)K (Y )] = E [H (X)]E [K (Y )]:


• Hoeffding’s Independence TestSuppose H (X) = I (X ≤ x) and K (Y ) = I (Y ≤ y) for all x, y, and thus

E [H (X)] = E [I (X ≤ x)] = P (X ≤ x) = FX (x)

E [K (Y )] = E [I (Y ≤ y)] = P (Y ≤ y) = FY (y)

andE [H (X)K (Y )] = E [I (X ≤ x) I (Y ≤ y)] = FX,Y (x, y)

Substitute above equation into Equation 2.4, and then we have that

FX,Y (x, y) = FX (x)FY (y)

Thus, the equivalent null hypothesis of independence is FX,Y (x, y) = FX (x)FY (y)for all x, y, which is the familiar definition of independence of two random variables.

• Test of Independence Based on the Empirical Copula. The equivalent null hypothesis used in this method is the same as in Hoeffding's independence test; however, the joint cumulative distribution function is transformed into a copula function. Each argument of the copula function is a marginal cumulative distribution function, and the dependence structure between the components is described by the copula itself. The definition of a copula function is given in Section 2.4. By that definition and Sklar's theorem,

$$F_{X,Y}(x, y) = C(F_X(x), F_Y(y)) = C(u, v) \quad \text{where } u = F_X(x),\ v = F_Y(y)$$

and C is a copula function. Thus, in the copula setting, the equivalent null hypothesis of independence is $C(u,v) = uv$ for all $0 \le u, v \le 1$.

• Distance Correlation Coefficient. Suppose $H(\mathbf{X}) = e^{i\mathbf{t}^T\mathbf{X}}$ and $K(\mathbf{Y}) = e^{i\mathbf{s}^T\mathbf{Y}}$ where $\mathbf{t}, \mathbf{X} \in \mathbb{R}^p$ and $\mathbf{s}, \mathbf{Y} \in \mathbb{R}^q$ (p and q positive integers), for all $\mathbf{t}, \mathbf{s}$. Therefore, we have

$$E[H(\mathbf{X})] = E[e^{i\mathbf{t}^T\mathbf{X}}] = \varphi_X(\mathbf{t})$$
$$E[K(\mathbf{Y})] = E[e^{i\mathbf{s}^T\mathbf{Y}}] = \varphi_Y(\mathbf{s})$$

and

$$E[H(\mathbf{X})K(\mathbf{Y})] = E[e^{i\mathbf{t}^T\mathbf{X}}e^{i\mathbf{s}^T\mathbf{Y}}] = E[e^{i\mathbf{t}^T\mathbf{X} + i\mathbf{s}^T\mathbf{Y}}] = \varphi_{X,Y}(\mathbf{t}, \mathbf{s})$$


Substituting into Equation 2.4 gives

$$\varphi_{X,Y}(\mathbf{t}, \mathbf{s}) = \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s})$$

Thus, the equivalent null hypothesis of independence is $\varphi_{X,Y}(\mathbf{t},\mathbf{s}) = \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s})$ for all $\mathbf{t} \in \mathbb{R}^p$, $\mathbf{s} \in \mathbb{R}^q$.

2.2.2 Cramer-von Mises Type Statistic

The Cramer-von Mises criterion, named after Cramer and Von Mises [1964], was originally used to measure the discrepancy between a cumulative distribution function F(·) and an empirical cumulative distribution function $F_n(\cdot)$. The null hypothesis is $F_n(x) = F(x)$ for all x, and the corresponding statistic is

$$T^2 = \int [F_n(x) - F(x)]^2\, dF(x)$$

Later, a weighted Cramer-von Mises type statistic was introduced:

$$T^2(\omega(x)) = \int [F_n(x) - F(x)]^2\, \omega(x)\, dF(x)$$

where ω(x) is a positive weight function. Since the integrand is non-negative, $T^2(\omega(x)) = 0$ if and only if the null hypothesis holds. The statistics used by the methods in the subsequent sections are either of Cramer-von Mises type or closely related to it.

• Hoeffding’s Independence TestAccording to the equivalent null hypothesis of independence used in this method,Hoeffding [1948a] proposes a test statistic based on the equivalent null hypothesis ofindependence:

∆ =

∫ ∫[FX,Y (x, y)− FX (x)FY (y)]2 dFX,Y (x, y)

In fact, ∆ = 0 if and only if FX,Y (x, y) − FX (x)FY (y) = 0, which is shown in thesubsequent section. Although this statistic is not a weighted Cramer-von Mises typestatistic, they share the “if and only if” property.

• Test of Independence Based on the Empirical Copula. The test statistic used in this method is exactly a weighted Cramer-von Mises type statistic. The weight function is $\omega(u,v) = \frac{n}{uv}$, where n is the sample size and $u, v \in (0,1)$, and the equivalent null hypothesis is $C(u,v) = uv$:

$$T_n = \int_0^1\!\!\int_0^1 (C_n(u,v) - C(u,v))^2\, \frac{n}{uv}\, dC(u,v)$$

• Distance Correlation Coefficient. The following statistic is the so-called distance covariance, which is the distance between $\varphi_{X,Y}(\mathbf{t},\mathbf{s})$ and $\varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})$ in $L_2$ (Hilbert) space:

$$\mathcal{V}^2(X, Y; \omega) = \|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})\|^2_{L_2,\omega} = \int_{\mathbb{R}^{p+q}} |\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2\, \omega(\mathbf{t},\mathbf{s})\, d\mathbf{t}\, d\mathbf{s} \ge 0$$

Equality holds if and only if $\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s}) = 0$, which implies independence of the random variables X and Y.

2.3 Hoeffding’s Independence Test

When we think about the independence of two random variables X and Y, by definition they are independent if and only if the joint cumulative distribution function equals the product of the two marginal cumulative distribution functions. The following notation is used in this thesis:

• FX (x): marginal cumulative distribution function of X

• FY (y): marginal cumulative distribution function of Y

• FX,Y (x, y): joint cumulative distribution function of X and Y

Let $D(x,y) = F_{X,Y}(x,y) - F_X(x)F_Y(y)$; by definition, $D(x,y) = 0$ for all x, y if and only if X and Y are independent. To test independence, it is natural to use the statistic $\Delta = \int\!\int D^2(x,y)\, dF_{X,Y}(x,y)$, because of the property shown by Hoeffding [1948a].

Theorem 2.3.1. If $F_{X,Y}(x,y)$ belongs to the class of distributions with continuous joint and marginal probability functions, then $\Delta = \Delta(F) = 0$ if and only if $D(x,y) = 0$, which implies independence of the random variables X and Y.


Proof. "If" part: $D(x,y) = 0$ immediately implies $\Delta = \int\!\int D^2(x,y)\, dF_{X,Y}(x,y) = 0$.

"Only if" part: Since $F_{X,Y}(x,y)$ is in that class, $d(x,y) = D'(x,y) = f_{X,Y}(x,y) - f_X(x)f_Y(y)$ is continuous as well. $D(x,y) \ne 0$ implies $d(x,y) \ne 0$, but we have

$$\int\!\!\int d(x,y)\, dx\, dy = \int\!\!\int f_{X,Y}(x,y)\, dx\, dy - \int f_X(x)\, dx \int f_Y(y)\, dy = 1 - 1\cdot 1 = 0$$

Hence there exists a rectangle $Q \subset \mathbb{R}^2$ such that $d(x,y) > 0\ \forall (x,y) \in Q$. In Q, there is at most one point such that $D(x,y) = 0$. Then we have

$$\Delta = \int\!\!\int D^2(x,y)\, dF_{X,Y}(x,y) \ge \int\!\!\int_Q D^2(x,y)\, f_{X,Y}(x,y)\, dx\, dy > 0$$

Therefore, the equivalent null hypothesis of independence is $H_0: \Delta = 0$. Let

$$\psi(a, b, c) = I(b \le a) - I(c \le a) \tag{2.5}$$

where I(·) is an indicator function, and

$$\phi(x_1, y_1; \ldots; x_5, y_5) = \frac{1}{4}\psi(x_1, x_2, x_3)\psi(x_1, x_4, x_5)\psi(y_1, y_2, y_3)\psi(y_1, y_4, y_5) \tag{2.6}$$

Then we can write ∆ as

$$\Delta = \int\cdots\int \phi(x_1, y_1; \ldots; x_5, y_5)\, dF_{X,Y}(x_1, y_1)\cdots dF_{X,Y}(x_5, y_5) \tag{2.7}$$

The detailed derivation of ∆ is given in the Appendix. Let $\{(X_i, Y_i);\ i = 1, \ldots, n\}$ be random variables from $F_{X,Y}(x,y)$ where $n \ge 5$, and let

$$D_n = \frac{1}{n(n-1)\cdots(n-4)} \sum_{\alpha_1=1}^{n}\sum_{\alpha_2=1}^{n}\sum_{\alpha_3=1}^{n}\sum_{\alpha_4=1}^{n}\sum_{\alpha_5=1}^{n} \phi(X_{\alpha_1}, Y_{\alpha_1}; \ldots; X_{\alpha_5}, Y_{\alpha_5}) \tag{2.8}$$


where $\alpha_i = 1, \ldots, n$ and $\alpha_i \ne \alpha_j$ if $i \ne j$ ($i, j = 1, \ldots, 5$). Notice that the number of terms in this summation is $n(n-1)\cdots(n-4)$; thus $D_n$ is the average of $\phi(X_{\alpha_1}, Y_{\alpha_1}; \ldots; X_{\alpha_5}, Y_{\alpha_5})$. By Equation 2.7, $E[D_n] = \Delta$ for any positive integer n, since $E[\phi(X_{\alpha_1}, Y_{\alpha_1}; \ldots; X_{\alpha_5}, Y_{\alpha_5})] = \Delta$. In addition, by another paper of Hoeffding [1948b], the random variable $D_n$ belongs to the class of U-statistics. Following the results and theorems in that paper, we know that

$$\mathrm{Var}(D_n) = \binom{n}{5}^{-1} \sum_{k=1}^{5} \binom{n-5}{5-k}\binom{5}{k}\xi_k \tag{2.9}$$

The detailed derivation is in the Appendix. Therefore, $25\,\xi_1 \le n\,\mathrm{Var}(D_n) \le 5\,\xi_5$. The lower bound of $n\,\mathrm{Var}(D_n)$ is achieved as n approaches infinity, i.e. $\lim_{n\to\infty} n\,\mathrm{Var}(D_n) = 25\,\xi_1$. In addition, $\sqrt{n}(D_n - \Delta) \xrightarrow{d} N(0, 25\,\xi_1)$. From Equation 2.8, we obtain a computational form of $D_n$:

$$D_n = \frac{A - 2(n-2)B + (n-2)(n-3)C}{n(n-1)(n-2)(n-3)(n-4)} \tag{2.10}$$

where

$$A = \sum_{\alpha=1}^{n} a_\alpha(a_\alpha - 1)\, b_\alpha(b_\alpha - 1), \quad B = \sum_{\alpha=1}^{n} (a_\alpha - 1)(b_\alpha - 1)\, c_\alpha, \quad C = \sum_{\alpha=1}^{n} c_\alpha(c_\alpha - 1)$$

and

$$a_\alpha = \sum_{\beta=1}^{n} I(X_\beta \le X_\alpha) - 1, \quad b_\alpha = \sum_{\beta=1}^{n} I(Y_\beta \le Y_\alpha) - 1, \quad c_\alpha = \sum_{\beta=1}^{n} I(X_\beta \le X_\alpha)\, I(Y_\beta \le Y_\alpha) - 1$$


In fact, $a_\alpha + 1$ and $b_\alpha + 1$ are the ranks of $X_\alpha$ and $Y_\alpha$, denoted by $R_{X,\alpha}$ and $R_{Y,\alpha}$ respectively. Then we can write $a_\alpha$, $b_\alpha$ and $c_\alpha$ as

$$a_\alpha = R_{X,\alpha} - 1, \quad b_\alpha = R_{Y,\alpha} - 1, \quad c_\alpha = \sum_{\beta=1}^{n} I(X_\beta \le X_\alpha)\, I(Y_\beta \le Y_\alpha) - 1 = \min(R_{X,\alpha}, R_{Y,\alpha}) - 1$$

Then,

$$A = \sum_{\alpha=1}^{n} (R_{X,\alpha} - 1)(R_{X,\alpha} - 2)(R_{Y,\alpha} - 1)(R_{Y,\alpha} - 2)$$
$$B = \sum_{\alpha=1}^{n} (R_{X,\alpha} - 2)(R_{Y,\alpha} - 2)(\min(R_{X,\alpha}, R_{Y,\alpha}) - 1)$$
$$C = \sum_{\alpha=1}^{n} (\min(R_{X,\alpha}, R_{Y,\alpha}) - 1)(\min(R_{X,\alpha}, R_{Y,\alpha}) - 2)$$

Thus, the test statistic $D_n$ is a function of $R_{X,\alpha}$ and $R_{Y,\alpha}$. As a next step, we want to know how to calculate the corresponding p-value of this test:

$$p\text{-value} = P(D_n > d_n \mid H_0) \tag{2.11}$$

Since $F_X(x)$, $F_Y(y)$ and $F_{X,Y}(x,y)$ are assumed to be continuous, $F_X(x_1) < F_X(x_2)$ if and only if $x_1 < x_2$, and $F_X(x_1) = F_X(x_2)$ if and only if $x_1 = x_2$; the same holds for Y. This implies that if we replace $x_i$ and $y_i$ by $F_X(x_i)$ and $F_Y(y_i)$ respectively, Equation 2.5, $\psi(x_1, x_2, x_3) = I(x_2 \le x_1) - I(x_3 \le x_1)$, keeps the same value, and so does Equation 2.6. In other words,

$$\psi(x_1, x_2, x_3) = \psi(F_X(x_1), F_X(x_2), F_X(x_3))$$

and

$$\phi(x_1, y_1; \ldots; x_5, y_5) = \phi(F_X(x_1), F_Y(y_1); \ldots; F_X(x_5), F_Y(y_5))$$

In addition, ∆ (Equation 2.7) and $D_n$ (Equation 2.8) are invariant under the substitutions $u_i = F_X(x_i)$ and $v_i = F_Y(y_i)$. Recall the definitions used in the derivation of $\mathrm{Var}(D_n)$: a symmetric function $\Phi(X_1, Y_1; \cdots; X_5, Y_5)$ is defined as

$$\Phi(X_1, Y_1; \cdots; X_5, Y_5) = D_5 = \frac{1}{5!}\sum_{\alpha_1=1}^{5}\sum_{\alpha_2=1}^{5}\sum_{\alpha_3=1}^{5}\sum_{\alpha_4=1}^{5}\sum_{\alpha_5=1}^{5} \phi(X_{\alpha_1}, Y_{\alpha_1}; \ldots; X_{\alpha_5}, Y_{\alpha_5})$$


where $\alpha_i = 1, \ldots, 5$ and $\alpha_i \ne \alpha_j$ if $i \ne j$ ($i, j = 1, \ldots, 5$). Under the null hypothesis of independence, the following results are obtained:

1. $F_{X,Y}(x,y) = F_X(x)F_Y(y) = uv$

2. $\Delta = 0$

3. Since φ(·) is invariant under this substitution, $\Phi(x_1, y_1; \cdots; x_5, y_5) = \Phi(u_1, v_1; \cdots; u_5, v_5)$.

$\Phi_k(x_1, y_1; \cdots; x_k, y_k)$ is defined as

$$\Phi_k(x_1, y_1; \cdots; x_k, y_k) = E_{(k+1),\ldots,5}\big[\Phi(x_1, y_1; \cdots; x_k, y_k; X_{k+1}, Y_{k+1}; \cdots; X_5, Y_5)\big], \quad k = 1, \ldots, 4$$
$$\Phi_5(x_1, y_1; \cdots; x_5, y_5) = \Phi(x_1, y_1; \cdots; x_5, y_5) \tag{2.12}$$

where $(x_1, y_1), \ldots, (x_k, y_k)$ are arbitrary fixed vectors, $(X_{k+1}, Y_{k+1}), \ldots, (X_5, Y_5)$ are random vectors, and $E_{(k+1),\ldots,5}$ denotes the expectation taken over $(X_{k+1}, Y_{k+1}), \ldots, (X_5, Y_5)$. We also know that

$$E_{1,\ldots,k}[\Phi_k(X_1, Y_1; \cdots; X_k, Y_k)] = E_{1,\ldots,k}\big[E_{(k+1),\ldots,5}[\Phi(X_1, Y_1; \cdots; X_5, Y_5)]\big] = E_{1,\ldots,5}[\Phi(X_1, Y_1; \cdots; X_5, Y_5)] = E[D_5] = \Delta, \quad k = 1, \ldots, 5$$

Suppose the variance of $\Phi_k(X_1, Y_1; \cdots; X_k, Y_k)$ exists and is denoted by $\xi_k$:

$$\xi_k = \mathrm{Var}_{1,\ldots,k}(\Phi_k(X_1, Y_1; \cdots; X_k, Y_k)) = E_{1,\ldots,k}\big[(\Phi_k(X_1, Y_1; \cdots; X_k, Y_k) - \Delta)^2\big] = \int\cdots\int [\Phi_k(x_1, y_1; \ldots; x_k, y_k) - \Delta]^2\, dF(x_1, y_1)\cdots dF(x_k, y_k)$$

Under the null hypothesis of independence, it can be written as

$$\xi_{k,\text{indep}} = \int_0^1\cdots\int_0^1 [\Phi_k(u_1, v_1; \ldots; u_k, v_k)]^2\, du_1 dv_1 \cdots du_k dv_k, \quad k = 1, \ldots, 5$$


In this setting, an indefinite integral is transformed into a definite integral, and the evaluations are:

$$\xi_{1,\text{indep}} = 0, \quad \xi_{2,\text{indep}} = \frac{1}{810000}, \quad \xi_{3,\text{indep}} = \frac{7}{810000}, \quad \xi_{4,\text{indep}} = \frac{41}{1215000}, \quad \xi_{5,\text{indep}} = \frac{1}{9000}$$

Recall Equation 2.9:

$$\mathrm{Var}(D_n \mid H_0) = \frac{5!}{n(n-1)\cdots(n-4)}\sum_{k=1}^{5}\binom{5}{k}\binom{n-5}{5-k}\xi_{k,\text{indep}} = \frac{n^2 + 5n - 32}{4050\, n(n-1)(n-3)(n-4)} \tag{2.13}$$

Under the null hypothesis of independence, $E[D_n \mid H_0] = \Delta = 0$. Although the expectation and variance of $D_n \mid H_0$ are known, the exact distribution of $D_n \mid H_0$ is not. To carry out the test, an upper bound on the significance level is used. By Chebyshev's inequality,

$$P(|D_n - E[D_n]| \ge k \cdot SD[D_n] \mid H_0) \le \frac{1}{k^2}$$

$$P\left(|D_n| \ge k\sqrt{\frac{n^2 + 5n - 32}{4050\, n(n-1)(n-3)(n-4)}}\ \middle|\ H_0\right) \le \frac{1}{k^2}$$

Replacing $1/k^2$ by the significance level α gives $k = 1/\sqrt{\alpha}$:

$$P\left(|D_n| \ge \sqrt{\frac{n^2 + 5n - 32}{4050\, n(n-1)(n-3)(n-4)\,\alpha}}\ \middle|\ H_0\right) \le \alpha \tag{2.14}$$

Hoeffding [1948a] shows that, since $P(D_n \le 0)$ approaches 0 when $\Delta > 0$, it is safe not to reject $H_0$ whenever the computed $D_n \le 0$. Therefore, the null hypothesis of independence is rejected if and only if the computed $D_n > d_n$, where

$$d_n = \sqrt{\frac{n^2 + 5n - 32}{4050\, n(n-1)(n-3)(n-4)\,\alpha}}$$
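In code, this conservative decision rule amounts to a few lines; a sketch (mine), under the same assumptions as above:

```r
# Reject H0 at level alpha iff D_n exceeds the Chebyshev-based cutoff d_n
# from Equation (2.14); requires n >= 5.
hoeffding_reject <- function(Dn, n, alpha = 0.05) {
  dn <- sqrt((n^2 + 5 * n - 32) /
             (4050 * n * (n - 1) * (n - 3) * (n - 4) * alpha))
  Dn > dn
}
```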

2.4 Test of Independence Based on the Empirical Copula Process

When studying the dependence structure between random variables, the ideal situation is that the joint distribution function, or the joint cumulative distribution function, of those random variables is known. However, that is rarely the case in practice. The next best thing is to have the marginal distribution of each random variable and connect them by some function; that function is the copula function. Starting with two random variables X and Y, we have the marginal cumulative distribution functions $F_X(x) = P(X \le x)$ and $F_Y(y) = P(Y \le y)$, assumed continuous. Let $(U, V) = (F_X(X), F_Y(Y))$; one easily sees that (U, V) has uniform margins. To link them together, the copula of (X, Y) is defined as the joint cumulative distribution function of (U, V): $C(u, v) = P(U \le u, V \le v)$. To uncover the relationship between a copula function and the joint cumulative distribution function of X and Y, we apply the inverse functions of $F_X$ and $F_Y$ to (U, V); these inverses exist because we assume the continuity of the two cumulative distribution functions. Thus $(X, Y) = (F_X^{-1}(U), F_Y^{-1}(V))$, and the copula function above can be re-written as

$$C(u, v) = P(U \le u, V \le v) = P(X \le F_X^{-1}(u),\ Y \le F_Y^{-1}(v)) \tag{2.15}$$

Therefore, the copula is still able to describe the dependence structure of X and Y. Under this setting, some properties of a copula function are inherited directly from the cumulative distribution function: its arguments u and v lie in [0, 1], and C is non-decreasing in each argument. Nelsen [2006] provides an introduction to copulas and their properties.

Definition 2.4.1. A 2-dimensional copula is a function $C: [0,1]^2 \to [0,1]$ with the following properties:

1. For every u and v in [0, 1], since $U \sim \mathrm{UNIF}[0,1]$ and $V \sim \mathrm{UNIF}[0,1]$:

• $C(u, 0) = P(U \le u, V \le 0) = 0$
• $C(0, v) = P(U \le 0, V \le v) = 0$
• $C(u, 1) = P(U \le u, V \le 1) = P(U \le u) = u$
• $C(1, v) = P(U \le 1, V \le v) = P(V \le v) = v$

2. For every $0 \le u_1 \le u_2 \le 1$ and $0 \le v_1 \le v_2 \le 1$,

$$C(u_2, v_2) - C(u_2, v_1) \ge C(u_1, v_2) - C(u_1, v_1)$$


From the definition alone, it is not easy to see the relationship between a copula function and the joint cumulative distribution function. Suppose we have two random variables X and Y with marginal cumulative distribution functions $F_X(x) = P(X \le x)$ and $F_Y(y) = P(Y \le y)$ respectively. By Sklar's theorem, the joint cumulative distribution function $F_{X,Y}(x,y) = P(X \le x, Y \le y)$, $x, y \in \mathbb{R}$, can be written as

$$F_{X,Y}(x, y) = C(F_X(x), F_Y(y)) = C(u, v) \quad \text{where } u = F_X(x),\ v = F_Y(y)$$

for some copula function C. In addition, C is unique if $F_X$ and $F_Y$ are continuous. X and Y are independent if and only if $C(u,v) = uv$, because $F_{X,Y}(x,y) = F_X(x)F_Y(y)\ \forall x, y$ if and only if X and Y are independent. The empirical copula is also of interest.

Definition 2.4.2. A 2-dimensional empirical copula $C_n(u,v)$ is uniquely defined on $[0,1]^2$ by

$$C_n(u, v) = \frac{1}{n}\sum_{i=1}^{n} I\big(\hat{F}_X(x_i) \le u\big)\, I\big(\hat{F}_Y(y_i) \le v\big)$$

where I(·) is the indicator function, and $\hat{F}_X(\cdot)$ and $\hat{F}_Y(\cdot)$ are the empirical cumulative distribution functions of X and Y respectively. Suppose we have observations $\{(x_i, y_i),\ i = 1, \ldots, n\}$. Notice that $C_n(u,v)$ is actually a function of the ranks of those observations:

$$n\hat{F}_X(x_i) = \sum_{j=1}^{n} I(x_j \le x_i) = \mathrm{rank}(x_i), \text{ denoted } R_{x_i}, \quad \forall i = 1, \ldots, n$$

and similarly

$$n\hat{F}_Y(y_i) = \sum_{j=1}^{n} I(y_j \le y_i) = \mathrm{rank}(y_i), \text{ denoted } R_{y_i}, \quad \forall i = 1, \ldots, n$$

Then, the empirical copula $C_n(u,v)$ can be re-written as

$$C_n(u, v) = \frac{1}{n}\sum_{i=1}^{n} I\!\left(\frac{R_{x_i}}{n} \le u\right) I\!\left(\frac{R_{y_i}}{n} \le v\right) = \frac{1}{n}\sum_{i=1}^{n} I(R_{x_i} \le un)\, I(R_{y_i} \le vn)$$
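The rank form translates directly into R; a minimal sketch (mine, assuming no ties):

```r
# Build the empirical copula C_n(u, v) of a sample as a closure over the ranks.
emp_copula <- function(x, y) {
  n  <- length(x)
  Rx <- rank(x)
  Ry <- rank(y)
  function(u, v) mean(Rx <= u * n & Ry <= v * n)
}
# Usage: Cn <- emp_copula(x, y); under independence Cn(0.5, 0.5) is near 0.25.
```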


Therefore, Genest and Remillard [2004] observe that any rank test statistic of independence can be written as a function of the copula. For example, by Kruskal [1958], Spearman's correlation coefficient $\rho_s$ can be written as

$$\rho_s = 12\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (F_{X,Y}(x,y) - F_X(x)F_Y(y))\, dF_X(x)\, dF_Y(y)$$

With the substitutions $u = F_X(x)$, $v = F_Y(y)$ and $C(u,v) = F_{X,Y}(x,y)$,

$$r_s = 12\int_0^1\!\!\int_0^1 (C(u,v) - uv)\, du\, dv = 12\int_0^1\!\!\int_0^1 C(u,v)\, du\, dv - 12\int_0^1\!\!\int_0^1 uv\, du\, dv = 12\int_0^1\!\!\int_0^1 C(u,v)\, du\, dv - 3$$

Another popular rank correlation coefficient, Kendall's correlation coefficient τ, is also a function of the copula:

$$\tau = 4\int_0^1\!\!\int_0^1 C(u,v)\, dC(u,v) - 1$$

The task here is to test the dependence of two random variables X and Y. In the copula setting, the equivalent null hypothesis is $H_0: C(u,v) = uv$. Based on the empirical copula, Ghoudi et al. [2001] introduce a Cramer-von Mises type statistic:

$$T_n = \int_0^1\!\!\int_0^1 (C_n(u,v) - C(u,v))^2\, \frac{n}{uv}\, dC(u,v) = n\int_0^1\!\!\int_0^1 (C_n(u,v) - uv)^2\, \frac{uv}{uv}\, du\, dv = n\int_0^1\!\!\int_0^1 (C_n(u,v) - uv)^2\, du\, dv$$

$$= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{3} + \frac{1}{2}\Big(\frac{R_{x_i}}{n}\Big)^2 + \frac{1}{2}\Big(\frac{R_{x_j}}{n}\Big)^2 - \frac{\max(R_{x_i}, R_{x_j})}{n}\right)\left(\frac{1}{3} + \frac{1}{2}\Big(\frac{R_{y_i}}{n}\Big)^2 + \frac{1}{2}\Big(\frac{R_{y_j}}{n}\Big)^2 - \frac{\max(R_{y_i}, R_{y_j})}{n}\right)$$


The derivation of $T_n$ is given in the Appendix. Under the null hypothesis of independence, the asymptotic distribution of $T_n$ is $\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} \frac{1}{\pi^4 (ij)^2} X_{i,j}$ where $X_{i,j} \overset{i.i.d.}{\sim} \chi^2(1)$. The corresponding p-value is $P(T_n \ge t \mid H_0) = 1 - F_n(T_n)$, where $F_n$ is the cumulative distribution function of $T_n$ under the null hypothesis of independence. In practice, however, it is very difficult to know this distribution exactly, so Genest and Remillard [2004] introduce an algorithm to carry out the test. Since we focus on the 2-dimensional case, there is a simpler version. Suppose we have observations $\{(x_i, y_i);\ i = 1, \ldots, n\}$ from a joint distribution.

Algorithm:

1. Compute $T_n$ based on the sample, and call it $T_{n,0}$.

2. Generate N pseudo-random samples of size n from a uniform distribution on $[0,1]^2$, and let $T_{n,i}$, $i = 1, \ldots, N$, denote the value of $T_n$ calculated from the ith sample.

3. Define

$$\hat{F}_n(t) = \frac{1}{N}\sum_{i=1}^{N} I(T_{n,i} \le t), \quad t \ge 0$$

and compute

$$p\text{-value} = 1 - \hat{F}_n(T_{n,0})$$

2.5 Distance Correlation Coefficient

Szekely et al. [2007] introduce a new measure of dependence between random variables. Recall that Pearson's correlation coefficient ρ = 0 if the random variables X and Y are independent, but the reverse is not true. Their distance correlation coefficient satisfies $\mathcal{R}(\mathbf{X}, \mathbf{Y}) = 0$ if and only if $\mathbf{X}$ and $\mathbf{Y}$ are independent, for all distributions with finite first moments; this is the primary contribution of their work. The task is to test the dependence of two random variables, and the equivalent null hypothesis of independence in their work is based on characteristic functions. The merit of characteristic functions is that they always exist for real-valued random variables, since the integrand $e^{i\mathbf{t}^T\mathbf{x}}$ is bounded. Suppose that $\mathbf{X}$ and $\mathbf{t}$ are in $\mathbb{R}^p$, and $\mathbf{Y}$ and $\mathbf{s}$ are in $\mathbb{R}^q$.


The marginal characteristic functions of $\mathbf{X}$ and $\mathbf{Y}$ and the joint characteristic function of $(\mathbf{X}, \mathbf{Y})$ are defined as

$$\varphi_X(\mathbf{t}) = E[e^{i\mathbf{t}^T\mathbf{X}}] = \int_{\mathbb{R}^p} e^{i\mathbf{t}^T\mathbf{x}}\, dF_X(\mathbf{x})$$
$$\varphi_Y(\mathbf{s}) = E[e^{i\mathbf{s}^T\mathbf{Y}}] = \int_{\mathbb{R}^q} e^{i\mathbf{s}^T\mathbf{y}}\, dF_Y(\mathbf{y})$$

and

$$\varphi_{X,Y}(\mathbf{t}, \mathbf{s}) = E[e^{i\mathbf{t}^T\mathbf{X} + i\mathbf{s}^T\mathbf{Y}}] = \int_{\mathbb{R}^{p+q}} e^{i\mathbf{t}^T\mathbf{x} + i\mathbf{s}^T\mathbf{y}}\, dF_{X,Y}(\mathbf{x}, \mathbf{y})$$

Then, the equivalent null hypothesis is $H_0: \varphi_{X,Y}(\mathbf{t},\mathbf{s}) = \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s})$ for all $\mathbf{t} \in \mathbb{R}^p$, $\mathbf{s} \in \mathbb{R}^q$. Szekely et al. [2007] define the distance covariance, a new measure of dependence between random variables, as the distance between the joint characteristic function and the product of the marginal characteristic functions in $L_2$ (Hilbert) space: $\|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s})\|_{L_2} \ge 0$, with equality if and only if $\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s}) = 0$, which is an equivalent null hypothesis of independence. We then have the following definition.

Definition 2.5.1. A measure of dependence of random vectors $\mathbf{X}$ and $\mathbf{Y}$ is the weighted squared distance, with a suitable choice of weight function ω(t, s), defined by

$$\mathcal{V}^2(\mathbf{X}, \mathbf{Y}; \omega) = \|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\varphi_Y(\mathbf{s})\|^2_{L_2,\omega} = \int_{\mathbb{R}^{p+q}} |\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2\, \omega(\mathbf{t},\mathbf{s})\, d\mathbf{t}\, d\mathbf{s} \tag{2.16}$$

such that $\mathcal{V}^2(\mathbf{X}, \mathbf{Y}; \omega) = 0$ if and only if $\mathbf{X}$ and $\mathbf{Y}$ are independent, where $|f|^2 = f\bar{f}$, $\bar{f}$ denotes the complex conjugate of a complex-valued function f, and ω(t, s) is an arbitrary positive function for which the integral above exists.

Many weight functions satisfy these conditions, but not every one results in an interesting measure of dependence with a usable empirical form. In what follows, it will become clear that the weight function selected by Szekely et al. [2007] is a reasonable one. The lemma below, shown in the paper of Szekely and Rizzo [2005], is an essential component of the following derivations.

Lemma 2.5.1. If $0 < \alpha < 2$, then $\forall \mathbf{t}, \mathbf{x} \in \mathbb{R}^d$

$$\int_{\mathbb{R}^d} \frac{1 - \cos(\mathbf{t}^T\mathbf{x})}{\|\mathbf{t}\|^{d+\alpha}}\, d\mathbf{t} = C(d, \alpha)\, \|\mathbf{x}\|^\alpha \tag{2.17}$$

where

$$C(d, \alpha) = \frac{2\pi^{d/2}\,\Gamma(1 - \alpha/2)}{\alpha 2^\alpha\, \Gamma((d+\alpha)/2)}$$

Γ(·) is the gamma function and ‖·‖ is the Euclidean norm in $\mathbb{R}^d$.

Let $c_d = C(d, 1) = \frac{\pi^{(1+d)/2}}{\Gamma((1+d)/2)}$; then for all $\mathbf{t}$ and $\mathbf{x}$ in $\mathbb{R}^d$,

$$\|\mathbf{x}\| = \int_{\mathbb{R}^d} \frac{1 - \cos(\mathbf{t}^T\mathbf{x})}{c_d\, \|\mathbf{t}\|^{d+1}}\, d\mathbf{t}$$

To ensure the existence of the integral in the distance definition, $\|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})\|^2_{L_2,\omega}$ has to be finite. By the Cauchy–Bunyakovsky–Schwarz inequality,

$$|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2 = \left|E\big[(e^{i\mathbf{t}^T\mathbf{X}} - \varphi_X(\mathbf{t}))(e^{i\mathbf{s}^T\mathbf{Y}} - \varphi_Y(\mathbf{s}))\big]\right|^2$$
$$\le E\big[|e^{i\mathbf{t}^T\mathbf{X}} - \varphi_X(\mathbf{t})|^2\big]\, E\big[|e^{i\mathbf{s}^T\mathbf{Y}} - \varphi_Y(\mathbf{s})|^2\big]$$
$$= E\big[|(\cos(\mathbf{t}^T\mathbf{X}) + i\sin(\mathbf{t}^T\mathbf{X})) - \varphi_X(\mathbf{t})|^2\big]\, E\big[|(\cos(\mathbf{s}^T\mathbf{Y}) + i\sin(\mathbf{s}^T\mathbf{Y})) - \varphi_Y(\mathbf{s})|^2\big] \quad \text{(by Euler's formula)}$$
$$\le (1 - |\varphi_X(\mathbf{t})|^2)(1 - |\varphi_Y(\mathbf{s})|^2) = \big(1 - E[\cos(\mathbf{t}^T(\mathbf{X} - \mathbf{X}'))]\big)\big(1 - E[\cos(\mathbf{s}^T(\mathbf{Y} - \mathbf{Y}'))]\big)$$

Details are given in the Appendix. If $E[\|\mathbf{X}\| + \|\mathbf{Y}\|] < \infty$, then by Definition 2.16 and Fubini's theorem,

$$\mathcal{V}^2(\mathbf{X}, \mathbf{Y}; \omega) = \int_{\mathbb{R}^{p+q}} |\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2\, \omega(\mathbf{t},\mathbf{s})\, d\mathbf{t}\, d\mathbf{s}$$
$$\le \int_{\mathbb{R}^{p+q}} (1 - |\varphi_X(\mathbf{t})|^2)(1 - |\varphi_Y(\mathbf{s})|^2)\, \omega(\mathbf{t},\mathbf{s})\, d\mathbf{t}\, d\mathbf{s} = \int_{\mathbb{R}^{p+q}} \frac{(1 - |\varphi_X(\mathbf{t})|^2)(1 - |\varphi_Y(\mathbf{s})|^2)}{c_p c_q\, \|\mathbf{t}\|^{1+p}\|\mathbf{s}\|^{1+q}}\, d\mathbf{t}\, d\mathbf{s}$$
$$= \int_{\mathbb{R}^p} \frac{1 - |\varphi_X(\mathbf{t})|^2}{c_p\|\mathbf{t}\|^{1+p}}\, d\mathbf{t} \int_{\mathbb{R}^q} \frac{1 - |\varphi_Y(\mathbf{s})|^2}{c_q\|\mathbf{s}\|^{1+q}}\, d\mathbf{s} = \int_{\mathbb{R}^p} \frac{1 - E[\cos(\mathbf{t}^T(\mathbf{X}-\mathbf{X}'))]}{c_p\|\mathbf{t}\|^{1+p}}\, d\mathbf{t} \int_{\mathbb{R}^q} \frac{1 - E[\cos(\mathbf{s}^T(\mathbf{Y}-\mathbf{Y}'))]}{c_q\|\mathbf{s}\|^{1+q}}\, d\mathbf{s}$$
$$= E\left[\int_{\mathbb{R}^p} \frac{1 - \cos(\mathbf{t}^T(\mathbf{X}-\mathbf{X}'))}{c_p\|\mathbf{t}\|^{1+p}}\, d\mathbf{t}\right] E\left[\int_{\mathbb{R}^q} \frac{1 - \cos(\mathbf{s}^T(\mathbf{Y}-\mathbf{Y}'))}{c_q\|\mathbf{s}\|^{1+q}}\, d\mathbf{s}\right] = E[\|\mathbf{X}-\mathbf{X}'\|]\, E[\|\mathbf{Y}-\mathbf{Y}'\|] < \infty \quad \text{(by Lemma 2.5.1)}$$

where $\mathbf{X}'$ and $\mathbf{Y}'$ are independent copies of $\mathbf{X}$ and $\mathbf{Y}$ respectively. Thus $\mathcal{V}^2(\mathbf{X},\mathbf{Y};\omega) < \infty$ whenever $E[\|\mathbf{X}\|]$ and $E[\|\mathbf{Y}\|]$ are finite, and the weight function they choose is

$$\omega(\mathbf{t}, \mathbf{s}) = \frac{1}{c_p c_q\, \|\mathbf{t}\|^{1+p}\|\mathbf{s}\|^{1+q}}$$

where $c_d = \frac{\pi^{(1+d)/2}}{\Gamma((1+d)/2)}$ and ‖·‖ is the Euclidean norm. Existence of $\|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})\|^2_{L_2,\omega}$ is guaranteed by this choice of weight function. Plugging it into Equation 2.16 gives the official definitions of the distance covariance and distance correlation coefficient; since the measure of dependence is based on a distance in $L_2$ (Hilbert) space, it is named for its nature and its analogy to the ordinary covariance and correlation.

Definition 2.5.2. The distance covariance (dCov) between random vectors $\mathbf{X}$ and $\mathbf{Y}$ with finite first moments is the non-negative number $\mathcal{V}(\mathbf{X},\mathbf{Y})$ defined by

$$\mathcal{V}^2(\mathbf{X}, \mathbf{Y}) = \frac{1}{c_p c_q}\int_{\mathbb{R}^{p+q}} \frac{|\varphi_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi_X(\mathbf{t})\cdot\varphi_Y(\mathbf{s})|^2}{\|\mathbf{t}\|_p^{1+p}\,\|\mathbf{s}\|_q^{1+q}}\, d\mathbf{t}\, d\mathbf{s} \tag{2.18}$$

Similarly, for $\mathbf{t}_1, \mathbf{t}_2 \in \mathbb{R}^p$, the distance variance (dVar) of $\mathbf{X}$ is the square root of

$$\mathcal{V}^2(\mathbf{X}) = \mathcal{V}^2(\mathbf{X}, \mathbf{X}) = \frac{1}{c_p^2}\int_{\mathbb{R}^{2p}} \frac{|\varphi_{X,X}(\mathbf{t}_1,\mathbf{t}_2) - \varphi_X(\mathbf{t}_1)\cdot\varphi_X(\mathbf{t}_2)|^2}{\|\mathbf{t}_1\|^{1+p}\,\|\mathbf{t}_2\|^{1+p}}\, d\mathbf{t}_1\, d\mathbf{t}_2 \tag{2.19}$$


And for $\mathbf{s}_1, \mathbf{s}_2 \in \mathbb{R}^q$, the distance variance (dVar) of $\mathbf{Y}$ is the square root of

$$\mathcal{V}^2(\mathbf{Y}) = \mathcal{V}^2(\mathbf{Y}, \mathbf{Y}) = \frac{1}{c_q^2}\int_{\mathbb{R}^{2q}} \frac{|\varphi_{Y,Y}(\mathbf{s}_1,\mathbf{s}_2) - \varphi_Y(\mathbf{s}_1)\cdot\varphi_Y(\mathbf{s}_2)|^2}{\|\mathbf{s}_1\|^{1+q}\,\|\mathbf{s}_2\|^{1+q}}\, d\mathbf{s}_1\, d\mathbf{s}_2 \tag{2.20}$$

Definition 2.5.3. The distance correlation coefficient (dCor) between random vectors $\mathbf{X}$ and $\mathbf{Y}$ with finite first moments is the non-negative number $\mathcal{R}(\mathbf{X},\mathbf{Y})$ defined by

$$\mathcal{R}^2(\mathbf{X}, \mathbf{Y}) = \begin{cases} \dfrac{\mathcal{V}^2(\mathbf{X},\mathbf{Y})}{\sqrt{\mathcal{V}^2(\mathbf{X})\cdot\mathcal{V}^2(\mathbf{Y})}} & \mathcal{V}^2(\mathbf{X})\mathcal{V}^2(\mathbf{Y}) > 0 \\[2mm] 0 & \mathcal{V}^2(\mathbf{X})\mathcal{V}^2(\mathbf{Y}) = 0 \end{cases} \tag{2.21}$$

The distance covariance (dCov) and distance correlation coefficient (dCor) are analogues of the product-moment covariance and correlation coefficient. Suppose we have observations $(\mathbf{X}, \mathbf{Y}) = \{(\mathbf{X}_k, \mathbf{Y}_k),\ k = 1, \ldots, n\}$ where $\mathbf{X}_k \in \mathbb{R}^p$ and $\mathbf{Y}_k \in \mathbb{R}^q$; the empirical joint and marginal characteristic functions of $\mathbf{X}$ and $\mathbf{Y}$ can then be defined as follows.

Definition 2.5.4. The empirical joint characteristic function of $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{Y} \in \mathbb{R}^q$ is defined as

$$\varphi^n_{X,Y}(\mathbf{t}, \mathbf{s}) = \frac{1}{n}\sum_{k=1}^{n} e^{i\mathbf{t}^T\mathbf{X}_k + i\mathbf{s}^T\mathbf{Y}_k} = \frac{1}{n}\sum_{k=1}^{n}\big(\cos(\mathbf{t}^T\mathbf{X}_k + \mathbf{s}^T\mathbf{Y}_k) + i\sin(\mathbf{t}^T\mathbf{X}_k + \mathbf{s}^T\mathbf{Y}_k)\big)$$

Definition 2.5.5. The empirical marginal characteristic functions of $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{Y} \in \mathbb{R}^q$ are respectively defined as

$$\varphi^n_X(\mathbf{t}) = \frac{1}{n}\sum_{k=1}^{n} e^{i\mathbf{t}^T\mathbf{X}_k} = \frac{1}{n}\sum_{k=1}^{n}\big(\cos(\mathbf{t}^T\mathbf{X}_k) + i\sin(\mathbf{t}^T\mathbf{X}_k)\big)$$

and

$$\varphi^n_Y(\mathbf{s}) = \frac{1}{n}\sum_{k=1}^{n} e^{i\mathbf{s}^T\mathbf{Y}_k} = \frac{1}{n}\sum_{k=1}^{n}\big(\cos(\mathbf{s}^T\mathbf{Y}_k) + i\sin(\mathbf{s}^T\mathbf{Y}_k)\big)$$


It is then natural to define the empirical distance covariance and variance as follows; for convenience, their computational forms are provided as well. Let $a_{kl} = \|\mathbf{X}_k - \mathbf{X}_l\|$ and $b_{kl} = \|\mathbf{Y}_k - \mathbf{Y}_l\|$.

Definition 2.5.6. For $\mathbf{t} \in \mathbb{R}^p$, $\mathbf{s} \in \mathbb{R}^q$, the empirical distance covariance of two random vectors $\mathbf{X}$ and $\mathbf{Y}$ with finite first moments is defined as

$$\mathcal{V}^2_n(\mathbf{X}, \mathbf{Y}) = \|\varphi^n_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi^n_X(\mathbf{t})\varphi^n_Y(\mathbf{s})\|^2_{L_2,\omega} = \frac{1}{c_p c_q}\int_{\mathbb{R}^{p+q}} \frac{|\varphi^n_{X,Y}(\mathbf{t},\mathbf{s}) - \varphi^n_X(\mathbf{t})\varphi^n_Y(\mathbf{s})|^2}{\|\mathbf{t}\|^{1+p}\|\mathbf{s}\|^{1+q}}\, d\mathbf{t}\, d\mathbf{s} = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} A_{kl}B_{kl}$$

For $\mathbf{t}_1, \mathbf{t}_2 \in \mathbb{R}^p$, the empirical distance variance of the random vector $\mathbf{X}$ with finite first moment is defined as

$$\mathcal{V}^2_n(\mathbf{X}) = \mathcal{V}^2_n(\mathbf{X}, \mathbf{X}) = \|\varphi^n_{X,X}(\mathbf{t}_1,\mathbf{t}_2) - \varphi^n_X(\mathbf{t}_1)\varphi^n_X(\mathbf{t}_2)\|^2_{L_2,\omega} = \frac{1}{c_p^2}\int_{\mathbb{R}^{2p}} \frac{|\varphi^n_{X,X}(\mathbf{t}_1,\mathbf{t}_2) - \varphi^n_X(\mathbf{t}_1)\cdot\varphi^n_X(\mathbf{t}_2)|^2}{\|\mathbf{t}_1\|^{1+p}\|\mathbf{t}_2\|^{1+p}}\, d\mathbf{t}_1\, d\mathbf{t}_2 = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} A^2_{kl}$$

And for $\mathbf{s}_1, \mathbf{s}_2 \in \mathbb{R}^q$, the empirical distance variance of the random vector $\mathbf{Y}$ with finite first moment is defined as

$$\mathcal{V}^2_n(\mathbf{Y}) = \mathcal{V}^2_n(\mathbf{Y}, \mathbf{Y}) = \|\varphi^n_{Y,Y}(\mathbf{s}_1,\mathbf{s}_2) - \varphi^n_Y(\mathbf{s}_1)\varphi^n_Y(\mathbf{s}_2)\|^2_{L_2,\omega} = \frac{1}{c_q^2}\int_{\mathbb{R}^{2q}} \frac{|\varphi^n_{Y,Y}(\mathbf{s}_1,\mathbf{s}_2) - \varphi^n_Y(\mathbf{s}_1)\cdot\varphi^n_Y(\mathbf{s}_2)|^2}{\|\mathbf{s}_1\|^{1+q}\|\mathbf{s}_2\|^{1+q}}\, d\mathbf{s}_1\, d\mathbf{s}_2 = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} B^2_{kl}$$

where $A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}$ and $B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot}$.


The derivation of the computational forms of the empirical distance covariance and variance is in the Appendix.
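Those computational forms are easy to code directly; a minimal R sketch (mine, not the thesis's code) using double-centred distance matrices, which the dcov and dcor functions of the energy package can cross-check:

```r
# V_n^2(X, Y) via the double-centred distance matrices A_kl and B_kl.
dcov2_emp <- function(X, Y) {
  A <- as.matrix(dist(X))          # a_kl = ||X_k - X_l||
  B <- as.matrix(dist(Y))          # b_kl = ||Y_k - Y_l||
  dc <- function(M) M - rowMeans(M) -
    matrix(colMeans(M), nrow(M), ncol(M), byrow = TRUE) + mean(M)
  mean(dc(A) * dc(B))              # (1/n^2) * sum_kl A_kl * B_kl
}
# R_n^2(X, Y) from Equation (2.22).
dcor2_emp <- function(X, Y) {
  v  <- dcov2_emp(X, Y)
  vx <- dcov2_emp(X, X)
  vy <- dcov2_emp(Y, Y)
  if (vx * vy > 0) v / sqrt(vx * vy) else 0
}
```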

Definition 2.5.7. The empirical distance correlation coefficient (dCor) between random vectors $\mathbf{X}$ and $\mathbf{Y}$ with finite first moments is the non-negative number $\mathcal{R}_n(\mathbf{X},\mathbf{Y})$ defined by

$$\mathcal{R}^2_n(\mathbf{X}, \mathbf{Y}) = \begin{cases} \dfrac{\mathcal{V}^2_n(\mathbf{X},\mathbf{Y})}{\sqrt{\mathcal{V}^2_n(\mathbf{X})\cdot\mathcal{V}^2_n(\mathbf{Y})}} & \mathcal{V}^2_n(\mathbf{X})\mathcal{V}^2_n(\mathbf{Y}) > 0 \\[2mm] 0 & \mathcal{V}^2_n(\mathbf{X})\mathcal{V}^2_n(\mathbf{Y}) = 0 \end{cases} \tag{2.22}$$

Notice that

$$\mathcal{V}^2_n(\mathbf{X}) = \mathcal{V}^2_n(\mathbf{X}, \mathbf{X}) = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} A^2_{kl} = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n} (a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot})^2 = 0$$

if and only if $\mathbf{X}_k = \mathbf{X}_l\ \forall k, l = 1, \ldots, n$. If all observations are equal, it immediately follows that $\mathcal{V}^2_n(\mathbf{X}) = 0$. Conversely, if $\mathcal{V}^2_n(\mathbf{X}) = 0$, then $a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot} = 0\ \forall k, l$. Hence $A_{kk} = a_{kk} - \bar{a}_{k\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot} = -\bar{a}_{k\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot} = 0$. Since the matrix $[a_{kl}]$ is symmetric, $\bar{a}_{k\cdot} = \bar{a}_{\cdot k} = \frac{\bar{a}_{\cdot\cdot}}{2}$. Thus, for any k and l, $A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot} = a_{kl} = \|\mathbf{X}_k - \mathbf{X}_l\| = 0$.

In Theorem 1, Theorem 2 and Corollary 1 of Szekely et al. [2007], the asymptotic properties of $\mathcal{V}^2_n(\mathbf{X},\mathbf{Y})$ and $\mathcal{R}^2_n(\mathbf{X},\mathbf{Y})$ are also proved: both statistics converge to their theoretical values. Last but not least, the test of dependence by correlation of distances rejects $H_0$: $\mathbf{X}$ and $\mathbf{Y}$ are independent if

$$\frac{n\,\mathcal{V}^2_n(\mathbf{X},\mathbf{Y})}{\bar{a}_{\cdot\cdot}\,\bar{b}_{\cdot\cdot}} > \big(\Phi^{-1}(1 - \alpha/2)\big)^2$$

which has an asymptotic significance level of at most α for all $0 < \alpha \le 0.215$.
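A sketch (mine, reusing dcov2_emp from the earlier sketch) of this rejection rule:

```r
# Reject H0 when n * V_n^2 / (a_bar * b_bar) exceeds the squared normal
# quantile; valid for 0 < alpha <= 0.215.
dcov_reject <- function(X, Y, alpha = 0.05) {
  n     <- NROW(as.matrix(X))
  a_bar <- mean(as.matrix(dist(X)))   # a_bar_.. (zero diagonal included)
  b_bar <- mean(as.matrix(dist(Y)))
  n * dcov2_emp(X, Y) / (a_bar * b_bar) > qnorm(1 - alpha / 2)^2
}
```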

Remarks:

• As the authors claim, this test is sensitive to all types of dependence structure, including non-linear and non-monotonic ones.

• This test is a multivariate non-parametric test: the random vectors X and Y need not be one-dimensional, nor have the same dimension.


• Distance covariance and the distance correlation coefficient are analogous to the product-moment covariance and correlation coefficient, which makes the material easier to understand.

2.6 Summary

The test methods discussed above have many properties in common. All of their equivalent null hypotheses of independence are special cases of

$$E[H(X)K(Y)] = E[H(X)]\,E[K(Y)]$$

for all functions H(·) and K(·) such that both individual expectations exist. Of the first two methods, one is based on cumulative distribution functions and the other on copula functions, and the empirical versions of both are functions of the ranks of the data. The test statistic of Hoeffding's independence test is

$$D_n = \frac{A - 2(n-2)B + (n-2)(n-3)C}{n(n-1)(n-2)(n-3)(n-4)}$$

where

$$A = \sum_{\alpha=1}^{n} (R_{X,\alpha} - 1)(R_{X,\alpha} - 2)(R_{Y,\alpha} - 1)(R_{Y,\alpha} - 2)$$
$$B = \sum_{\alpha=1}^{n} (R_{X,\alpha} - 2)(R_{Y,\alpha} - 2)(\min(R_{X,\alpha}, R_{Y,\alpha}) - 1)$$
$$C = \sum_{\alpha=1}^{n} (\min(R_{X,\alpha}, R_{Y,\alpha}) - 1)(\min(R_{X,\alpha}, R_{Y,\alpha}) - 2)$$

and the test statistic in the test of independence based on the empirical copula process is

$$T_n = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{3} + \frac{1}{2}\Big(\frac{R_{x_i}}{n}\Big)^2 + \frac{1}{2}\Big(\frac{R_{x_j}}{n}\Big)^2 - \frac{\max(R_{x_i}, R_{x_j})}{n}\right)\left(\frac{1}{3} + \frac{1}{2}\Big(\frac{R_{y_i}}{n}\Big)^2 + \frac{1}{2}\Big(\frac{R_{y_j}}{n}\Big)^2 - \frac{\max(R_{y_i}, R_{y_j})}{n}\right)$$

From Pearson's correlation coefficient to Spearman's, the ranks of the data prove efficient at detecting the dependence structure of the data.


Last but not least, the distance correlation coefficient provides a new point of view. From its test statistic, we can see that we are calculating distances between data points. For a linear relationship between two random variables X and Y, the distance between any two data points in X is linearly related to the distance between the corresponding data points in Y. For a monotonic relationship between X and Y, the difference in X between any two data points varies with the difference in Y. Motivated by this idea, the method proposed in this thesis extends these measures to a visual test of independence.


Chapter 3

Visual Display of Dependence between Random Variables

Human beings have developed a complicated and powerful visual system that helps them hunt for prey, avoid danger and discover the world. Even the most advanced camera we have invented has less resolution than our eyes, which can catch every tiny detail around us. Moreover, together with the brain, the human visual system can fill in the part of a scene blocked by an obstacle, quickly spot the differences between two pictures with very slight modifications, and accurately recognize the pattern shared by several pictures. In machine learning, a very active and attractive task is pattern recognition, and many algorithms, both supervised and unsupervised, have been introduced to accomplish it. None of them does a better job than our visual system, although no model is more complex than our visual system either. One may object that the human visual system has been developing for millions of years while these models were introduced only decades ago; eventually, we realize that we already possess a powerful and well-trained data-analytic tool, and humans are obviously smart enough to take advantage of it. In human history, simple symbols may have been the very first visual expressions of objects, later evolving into scripts such as Babylonian cuneiform, Egyptian hieroglyphs and ancient Chinese writing; even the traditional and simplified Chinese characters in use today developed from ancient Chinese writing. In a recent review, Friendly [2008] discusses milestones of thematic cartography, statistical graphics and data visualization throughout history. A map is a visual representation of the region surrounding its makers, and people began drawing maps about 8,000 years ago. Not until the 17th century, however, thanks to the development of physical measurement tools and analytic geometry, did visualization make great progress toward presenting real data in a meaningful and precise way. Friendly therefore claims that the 17th century may be the beginning of visual thinking. Data visualization then grew rapidly, but in the first half of the 20th century it suffered a "dark age": with the rise of formal and exact statistical models, attention moved away from data visualization. Its rebirth came in the second half of the last century, when the advent of computers and other innovations in visual tools let data visualization take a big leap forward. The identification of data visualization as a branch of statistics was initiated in a landmark paper, "The Future of Data Analysis", by Tukey [1962]. Besides, I would also like to quote a statement from Albert Einstein:

Logic will get you from A to B, but imagination will take you everywhere.

Mathematical tests of independence follow mathematical logic perfectly, but they confine us to the quantitative world. Data visualization provides another tool and another perspective, allowing deeper insights into data than mathematical tools alone. Moreover, projection, rotation, inversion and other movements or transformations of the data take us to a new stage of viewing data. The first two sections discuss two data visualization tools specializing in categorical data: the mosaic plot and the eikosogram. Validation of visual statistical inference is studied in the third section. Last but not least, the proposed methods are introduced.

3.1 Mosaic Display and Its Extended Version

A contingency table displays the frequency distribution of variables. The association of the two variables in a two-way contingency table can be assessed by many traditional tests, such as Pearson's chi-squared test and Barnard's test (Barnard [1945]). Beyond testing the association quantitatively, a way to visualize the contingency table was later invented: the mosaic display for two-way contingency tables, introduced by Hartigan and Kleiner [1981]. Each cell of the table is represented by a rectangle whose area is proportional to the cell frequency. This helps people make visual comparisons and spot unusual cells more easily than scanning through numbers, and it extends naturally to multi-way contingency tables. However, defects appear in an example given in their paper, a mosaic plot of a four-way contingency table of size 3 × 6 × 7 × 12: with 1512 cells in total, the display is no longer easy to read. Friendly [1994] extends the mosaic display into a data-analytic tool in two ways. First, fitting a baseline model of independence to a given display, color and shading of the rectangles are used to reveal departures from the model. Second, he shows that reordering enhances the perception of patterns of association; a general scheme for reordering the categories of a variable is based on a singular value decomposition (SVD) of the residuals from independence. There are four subsections. The first briefly describes the construction of the (two-way) mosaic display. How coloring, shading and reordering enhance the perception of patterns is discussed in the second subsection. A multi-way mosaic plot is introduced in subsection 3, and several examples are provided in the last subsection.

3.1.1 Construction of Mosaic Display

Let $n_{i,j}$ denote the cell count in the $i$th row and $j$th column of the frequency table of two variables $X$ and $Y$, where $i = 1, \dots, I$ and $j = 1, \dots, J$. As usual, $n_{i\cdot} = \sum_{j=1}^{J} n_{i,j}$, $n_{\cdot j} = \sum_{i=1}^{I} n_{i,j}$ and $n_{\cdot\cdot} = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{i,j}$. Under the null hypothesis of independence, the estimated frequency is denoted by $m_{i,j} = n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}$. The following table shows an example, with data collected by Snee [1974].

Table 3.1: Hair Color vs Eye Color Data

                      Hair Color
Eye Color    Black   Brown   Red   Blond   Total
Brown           68     119    26       7     220
Blue            20      84    17      94     215
Hazel           15      54    14      10      93
Green            5      29    14      16      64
Total          108     286    71     127     592

The following is the mosaic display for the above contingency table:


Figure 3.1: Mosaic Plot of Hair Color vs Eye Color Data

One way to look at a mosaic display is to view it as a divided bar chart. The width of each column is proportional to the marginal column sum, and the height of each tile is proportional to the conditional frequency of its row given the column. For example, taking the mosaic to be a unit square, the width of the $n_{i,j}$ tile is $n_{\cdot j}/n_{\cdot\cdot}$ and its height is $n_{i,j}/n_{\cdot j}$. Thus, the area is
$$\text{width} \times \text{height} = \frac{n_{\cdot j}}{n_{\cdot\cdot}} \cdot \frac{n_{i,j}}{n_{\cdot j}} = \frac{n_{i,j}}{n_{\cdot\cdot}},$$
which is the cell's proportion of the total frequency.

From the mosaic display, some interesting findings can be detected easily, especially by comparing to an independent mosaic display, the so-called baseline model of independence. In an independent mosaic display, the area of each rectangle is proportional to $m_{i,j} = n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}$: the width of the $ij$th tile is proportional to $n_{\cdot j}$ and its height is proportional to $n_{i\cdot}$. The independent mosaic display as a reference, and the mosaic display of the data, are shown below:

Figure 3.2: Independent Mosaic Plot Figure 3.3: Mosaic Plot

Compared to the reference mosaic display, it is much easier to detect unusual cells than by looking at the mosaic plot alone. The findings are:

• People with black hair are more likely to have brown eyes than under independence of hair color and eye color.

• There are more people with blond hair and blue eyes, and fewer people with blond hair and brown eyes, than under independence of hair color and eye color.

• Otherwise, there is no obvious difference.

For Pearson's Chi-square test, we denote the standardized deviation from independence as
$$d_{i,j} = \frac{n_{i,j} - m_{i,j}}{\sqrt{m_{i,j}}} = \frac{n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}{\sqrt{n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}} \qquad (3.1)$$
and the test statistic as
$$x^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} d_{i,j}^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}{\sqrt{n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}}\right)^2 \sim \chi^2\big((I-1)(J-1)\big) \qquad (3.2)$$


In this example, the test statistic is $x^2 = 138.29$ with 9 degrees of freedom, and the corresponding p-value is less than $2.2 \times 10^{-16}$ according to the R output. Thus, the null hypothesis of independence is rejected. The mosaic display, however, helps us discover the dependence of hair color and eye color much more easily and quickly than Pearson's Chi-square test does.
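As a quick check, both the test statistic and a shaded mosaic display can be reproduced in R. The following is a minimal sketch; it assumes that R's built-in HairEyeColor table, collapsed over Sex, matches Table 3.1:

    # Eye color (rows) by hair color (columns), summed over Sex
    tab <- margin.table(HairEyeColor, margin = c(2, 1))
    chisq.test(tab)    # X-squared = 138.29, df = 9, p-value < 2.2e-16
    # shade = TRUE colors the tiles by the standardized deviations d_ij
    mosaicplot(t(tab), main = "Hair Color vs Eye Color", shade = TRUE)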

3.1.2 Extension of Mosaic Display

Although from the above example we may conclude that the mosaic is a good and accessible tool, a messy row and column order and a single-color design reduce its utility considerably. Friendly [1994] enhances the mosaic plot by coloring the tiles and putting them in a coherent order. The following figure is an enhanced mosaic display of the above example:

Figure 3.4: Colored Mosaic Plot of Hair Color vs Eye Color Data

Cells are colored according to the standardized deviations from independence $d_{i,j}$: cells with $|d_{i,j}| < 2$ are not colored, cells with positive deviations are drawn blue, and cells with negative deviations are drawn red. Among the colored cells, those with $|d_{i,j}| > 4$ are filled with a darker color. These standardized deviations approximately follow a standard Gaussian distribution under the null hypothesis of independence, so the values 2 and 4 correspond to two-tailed probabilities $p < 0.05$ and $p < 0.0001$. With this coloring technique, it is much easier to detect unusual tiles than in the original mosaic display.

In addition, to help us detect patterns in a mosaic display, Friendly [1994] also proposes an approach to reorder the categories, based on the idea of correspondence analysis (CA). Following the previous setting, $n_{i,j}$ represents the cell in the $i$th row and $j$th column of the frequency table, collected into the matrix $\mathbf{N} = [n_{i,j}]$. The row profile of $\mathbf{N}$ is defined as $a_i = \left[\frac{n_{i,1}}{n_{i\cdot}}, \dots, \frac{n_{i,J}}{n_{i\cdot}}\right]^T$. The distance between row profiles is of particular interest. Instead of the Euclidean metric, the distance is based on the so-called chi-squared metric, defined as
$$d_c(a_i, a_{i'}) = (a_i - a_{i'})^T \mathbf{D}_c^{-1}(a_i - a_{i'}) = \sum_{j=1}^{J}\frac{(n_{i,j}/n_{i\cdot} - n_{i',j}/n_{i'\cdot})^2}{n_{\cdot j}/n_{\cdot\cdot}}$$
where $\mathbf{D}_c$ is a diagonal matrix with elements $c_j = n_{\cdot j}/n_{\cdot\cdot}$, $j = 1, \dots, J$. The vector $c = [c_1, \dots, c_J]^T = \left[\frac{n_{\cdot 1}}{n_{\cdot\cdot}}, \dots, \frac{n_{\cdot J}}{n_{\cdot\cdot}}\right]^T$ is called the row centroid, since $c = \sum_{i=1}^{I} r_i a_i$ where $r_i = n_{i\cdot}/n_{\cdot\cdot}$; it can be viewed as a weighted average of the row profiles, in which a row carries more weight if it has a larger proportion than the other rows. We call $r_i$ the row mass. Recall the test statistic for Pearson's Chi-square test in Equation 3.2:
$$\begin{aligned}
x^2 &= \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}{\sqrt{n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}}\right)^2 \\
&= \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot})^2}{n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}} \\
&= \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot})^2}{n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}} \cdot \frac{r_i n_{\cdot\cdot}}{n_{i\cdot}} \qquad \left(\text{since } \frac{r_i n_{\cdot\cdot}}{n_{i\cdot}} = 1\right) \\
&= n_{\cdot\cdot}\sum_{i=1}^{I} r_i \sum_{j=1}^{J}\frac{(n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot})^2}{n_{i\cdot}^2\, c_j} \\
&= n_{\cdot\cdot}\sum_{i=1}^{I} r_i \sum_{j=1}^{J}\left(\frac{n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}{n_{i\cdot}}\right)^2 \frac{1}{c_j} \\
&= n_{\cdot\cdot}\sum_{i=1}^{I} r_i \sum_{j=1}^{J}\left(\frac{n_{i,j}}{n_{i\cdot}} - \frac{n_{\cdot j}}{n_{\cdot\cdot}}\right)^2 \frac{1}{c_j} \qquad (3.3) \\
&= n_{\cdot\cdot}\sum_{i=1}^{I} r_i\, (a_i - c)^T \mathbf{D}_c^{-1}(a_i - c) \\
&= n_{\cdot\cdot}\sum_{i=1}^{I} r_i\, d_c(a_i, c)
\end{aligned}$$

Therefore, $x^2/n_{\cdot\cdot}$ is a weighted average of the chi-squared distances between the row profiles and their centroid. For Pearson's Chi-square test, the null hypothesis of independence is $n_{i,j} = n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}$ for all $i, j$. In this setting, Equation 3.3 yields an equivalent formulation: $\frac{n_{1,j}}{n_{1\cdot}} = \frac{n_{2,j}}{n_{2\cdot}} = \cdots = \frac{n_{I,j}}{n_{I\cdot}}$ for all $j$. Inspired by a review of the geometric interpretation of correspondence analysis by Greenacre and Hastie [1987], Friendly [1994] rearranges the categories by assigning them scores such that the Pearson correlation of the optimally scaled variables is maximized. Consider the matrix $\mathbf{D} = \left[\frac{d_{i,j}}{\sqrt{n_{\cdot\cdot}}}\right] = \left[\frac{n_{i,j} - n_{i\cdot}n_{\cdot j}/n_{\cdot\cdot}}{\sqrt{n_{i\cdot}n_{\cdot j}}}\right]$ and its singular value decomposition $\mathbf{D} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$. Here $\boldsymbol{\Sigma}$ is an $I \times J$ rectangular diagonal matrix with $\lambda_m$ ($m = 1, \dots, M$, where $M = \min(I, J)$) on its diagonal, $\lambda_1 \geq \cdots \geq \lambda_M$, and the elements of $\mathbf{U}$ and $\mathbf{V}$ are $u_{i,i'}$ ($i, i' = 1, \dots, I$) and $v_{j,j'}$ ($j, j' = 1, \dots, J$) respectively. If we only keep the first principal component, then $\frac{d_{i,j}}{\sqrt{n_{\cdot\cdot}}} \approx \lambda_1 u_{i,1} v_{j,1}$. Then $u_{i,1}$, $i = 1, \dots, I$, and $v_{j,1}$, $j = 1, \dots, J$, are defined


as the scores of the row categories and the scores of the column categories, respectively. The author also mentions that the pattern is best revealed if the categories are reordered according to these row and column scores.
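As an illustration of this reordering, the following R sketch computes the matrix $\mathbf{D}$ and the first-component scores for the hair/eye data; the variable names are mine, and the plot call simply reuses base R's mosaicplot:

    tab <- as.matrix(margin.table(HairEyeColor, margin = c(2, 1)))
    expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    # D = [d_ij / sqrt(n..)] = (n_ij - n_i. n_.j / n..) / sqrt(n_i. n_.j)
    D <- (tab - expected) / sqrt(outer(rowSums(tab), colSums(tab)))
    s <- svd(D)
    row.scores <- s$u[, 1]    # scores of the row categories
    col.scores <- s$v[, 1]    # scores of the column categories
    mosaicplot(t(tab[order(row.scores), order(col.scores)]), shade = TRUE)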

3.2 Picturing Probability: Eikosogram

The eikosogram is introduced by Cherry and Oldford [2003]. Prof. G.W. Bennett of the University of Waterloo suggested the name eikosogram, which consists of two classical Greek words: eikos (probability) and gramma (drawing or writing). An eikosogram is a unit square whose area represents probability 1. The square is horizontally divided into non-overlapping rectangles; the number of rectangles depends on how many distinct values the random variable takes. Figure 3.5 below shows a univariate example with a binary random variable $Y$, where $Y = y$ or $n$. Coloring is used to distinguish the values taken by the random variable.

Figure 3.5: A Univariate Example of Eikosogram

In Figure 3.5, we can clearly see that $P(Y = y) = 1/3$ and $P(Y = n) = 2/3$. It is natural to extend the eikosogram to multivariate cases. To keep the same definition as in the one-dimensional eikosogram, in which the area of a rectangle represents the corresponding probability, conditional probability is applied to the eikosogram. The eikosogram below extends the one above by introducing a conditioning binary random variable. Consider the following contingency table:


               Y
            yes    no   Total
X    yes     10    10      20
     no      30    70     100
Total        40    80     120

And the corresponding eikosogram is

Figure 3.6: A Bivariate Eikosogram Conditional on X

The eikosogram is first vertically divided into two strips according to $P(X = \text{yes})$ and $P(X = \text{no})$, and then each strip is horizontally separated according to the conditional probabilities $P(Y = \text{yes} \mid X = x)$ and $P(Y = \text{no} \mid X = x)$. Each resulting rectangle has height $P(Y = y \mid X = x)$ and width $P(X = x)$, so its area is
$$\text{height} \times \text{width} = P(Y = y \mid X = x)\,P(X = x) = P(X = x, Y = y).$$
By this representation, $\sum_{x}\sum_{y} P(X = x, Y = y) = 1$, the area of the unit square. A marginal probability is given by the total area of a strip: $P(X = x) = \sum_{y} P(X = x, Y = y)$ in mathematical notation. All conditional and marginal probabilities of $X$ and $Y$ are contained in this plot. Thus, in a bivariate eikosogram, the area always represents the joint probability of the corresponding values of the two variables, and the conditional probability is always conditional on the variable on the vertical axis. From Figure 3.6, we can quickly tell that $P(X = \text{yes}) = 1/6$ and $P(X = \text{no}) = 5/6$. What if we want to know $P(Y = \text{yes})$ and $P(Y = \text{no})$, or are more interested in probabilities conditional on $Y$? Thanks to the symmetry of the unit square and the contingency table, we can switch the variables $X$ and $Y$, obtaining the following:


               X
            yes    no   Total
Y    yes     10    30      40
     no      10    70      80
Total        20   100     120

and the corresponding eikosogram is

Figure 3.7: A Bivariate Eikosogram Conditional on Y

Notice that the area of each cell is unchanged; only the shape and position differ. Thus, probability is invariant under a change of perspective. Like the one conditional on $X$, the above eikosogram contains all conditional and marginal probabilities of $X$ and $Y$. In addition, the eikosogram can be applied to reveal association between variables. Because in a binary eikosogram the height of each region is a conditional probability, association can easily be seen from the display. The following examples show typical eikosograms of perfect association, part perfect association, association and independence for the binary case:

• Perfect Association

– Perfectly Positive Association


Figure 3.8: Eikosogram Conditional on X of Perfectly Positive Association

Figure 3.9: Eikosogram Conditional on Y of Perfectly Positive Association

– Perfectly Negative Association

Figure 3.10: Eikosogram Conditional on X of Perfectly Negative Association

Figure 3.11: Eikosogram Conditional on Y of Perfectly Negative Association

• Part Perfect Association

– Part Perfectly Positive Association


Figure 3.12: Eikosogram Conditional on X of Part Perfectly Positive Association

Figure 3.13: Eikosogram Conditional on Y of Part Perfectly Positive Association

– Part Perfectly Negative Association

Figure 3.14: Eikosogram Conditional on X of Part Perfectly Negative Association

Figure 3.15: Eikosogram Conditional on Y of Part Perfectly Negative Association

• Association

– Positive Association


Figure 3.16: Eikosogram Conditional on X of Positive Association

Figure 3.17: Eikosogram Conditional on Y of Positive Association

– Negative Association

Figure 3.18: Eikosogram Conditional on X of Negative Association

Figure 3.19: Eikosogram Conditional on Y of Negative Association

• Independence


Figure 3.20: Eikosogram Conditional on X of Independence

Figure 3.21: Eikosogram Conditional on Y of Independence

In the case of a test of independence, if the eikosogram drawn from the data is similar or close to the form of independence shown above, then the null hypothesis of independence can be accepted. Moreover, as Cherry and Oldford [2003] point out, the concept of independence is easily confounded with the concept of mutual exclusivity. A Venn diagram represents mutually exclusive events by two separated circles. By definition, two events are mutually exclusive if they cannot occur at the same time. Thus, two mutually exclusive events are entirely dependent, because the occurrence of one event implies that the other does not occur. The following is an example of a binary eikosogram of mutually exclusive events:


Figure 3.22: Eikosogram Conditional on X of Mutually Exclusive Events

Figure 3.23: Eikosogram Conditional on Y of Mutually Exclusive Events

Therefore, the eikosogram is a very rich diagram: it contains conditional, marginal and joint probabilities as well as associations between variables. Most examples listed above are the simplest case, in which the eikosogram has two binary variables. Not surprisingly, there are multivariate eikosograms with multiple values. Consider the following contingency table and the corresponding eikosogram:

               X
             a      b     c   Total
Y    a      10     30    50      90
     b      70     10    20     100
     c      20     30    10      60
Total      100     70    80     250


Figure 3.24: Eikosogram Conditional on X

Figure 3.25: Eikosogram Conditional on Y

In an eikosogram with multiple values, coloring and shading techniques are used to distinguish the different values of the variables. The association between variables can also be revealed by the eikosogram.

In addition, a multivariate eikosogram can be used to discover the conditional dependence structure of the data itself. A three-dimensional eikosogram is a good example here. The definition of the eikosogram remains the same; the key idea is that the area of each region is always a joint probability. I will explain through the following example:

Figure 3.26: A Three-Variable Eikosogram

The height and the width of the bottom-left green region in this example are the conditional probability $P(X = \text{yes} \mid Y = \text{yes}, Z = \text{yes})$ and the joint probability $P(Y = \text{yes}, Z = \text{yes})$ respectively. Marginal probabilities are obtained by summation: $P(Z = z) = \sum_{y} P(Y = y, Z = z)$, and thus $P(Z = \text{yes}) = P(Y = \text{yes}, Z = \text{yes}) + P(Y = \text{no}, Z = \text{yes}) = 6/25 + 9/25 = 3/5$, the sum of the lengths of the first two segments. The sum of the lengths of the first and third segments gives $P(Y = \text{yes}) = P(Y = \text{yes}, Z = \text{yes}) + P(Y = \text{yes}, Z = \text{no}) = 6/25 + 6/25 = 12/25$. The marginal probability of the variable on the vertical axis, which is $X$ in this example, is always obtained by summing the areas of the regions distinguished by color; for instance, $P(X = \text{yes})$ is the total area of the green regions. Owing to these properties, the multivariate eikosogram, like the binary one, is good at revealing the dependence structure as well as the conditional dependence structure of the data.
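These marginal sums are easy to verify numerically. A small R sketch, where the joint probabilities of $(Y, Z)$ are the ones read off the display above; the unlisted cell $P(Y = \text{no}, Z = \text{no}) = 4/25$ is my inference from the probabilities summing to 1:

    # joint probabilities of (Y, Z); the 4/25 cell is inferred, not given
    jointYZ <- matrix(c(6, 9, 6, 4) / 25, nrow = 2,
                      dimnames = list(Y = c("yes", "no"), Z = c("yes", "no")))
    colSums(jointYZ)    # P(Z = yes) = 15/25 = 3/5
    rowSums(jointYZ)    # P(Y = yes) = 12/25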

For the case of conditional independence of three variables, there are four types of independence structures:

• X is independent of Y Conditional on Z (X⊥⊥Y |Z)

Figure 3.27: 3-D Eikosogram Conditional on X and Z

Figure 3.28: 3-D Eikosogram Conditional on Y and Z

Figure 3.29: 3-D Eikosogram Conditional on X and Y

• X is independent of Y Conditional on a Value of Z (X⊥⊥Y | [Z = yes] but not X⊥⊥Y | [Z = no])


Figure 3.30: 3-D Eikosogram Conditional on X and Z

Figure 3.31: 3-D Eikosogram Conditional on Y and Z

Figure 3.32: 3-D Eikosogram Conditional on X and Y

• Two-way Independence (X⊥⊥Y |Z and Y⊥⊥Z|X, but not X⊥⊥Z)

Figure 3.33: 3-D Eikosogram Conditional on X and Z

Figure 3.34: 3-D Eikosogram Conditional on Y and Z

Figure 3.35: 3-D Eikosogram Conditional on X and Y

In fact, there is more information we can obtain from the above three displays. From the first display, we can see that X⊥⊥Y |Z and Y⊥⊥Z|X; both statements together imply X⊥⊥Y and Y⊥⊥Z, which can also be seen from the second display. Again, the eikosogram is a rich visual tool.

• Three-way Independence (⊥⊥ (X, Y, Z))


Figure 3.36: 3-D Eikosogram Conditional on X and Z

Figure 3.37: 3-D Eikosogram Conditional on Y and Z

Figure 3.38: 3-D Eikosogram Conditional on X and Y

Of course, more complicated examples exist, since one can add as many variables, and as many values per variable, to an eikosogram as desired. The eikosogram is much easier to read than a contingency table, especially for detecting dependence structure between random variables, and it has a complete mathematical interpretation.

3.3 Validation of Visual Statistical Inference

Many statistical displays and data visualization methods have been introduced. Following their instructions, a viewer or tester obtains the outcome of a test quickly. In general, most data visualization tools are easy to pick up and require much less mathematical knowledge than quantitative methods. However, all data visualization tools are empirical: a data expert certainly does better than a novice, and no data visualization method by itself yields a statistical significance level. A significance level furnishes the strength of, or confidence in, a test outcome, which is what makes the so-called formal and exact statistical tests. A primary reason why data visualization fell out of favor in the first half of the 20th century is precisely this lack of a significance level. Nevertheless, exploratory data analysis and model diagnostics depend primarily on plots, so researchers have put a lot of effort into this topic. The situation eventually changed with the work of Buja et al. [2009]. In their paper, a framework and two protocols are incorporated into visual displays. The parallelism below briefly reveals how an analogy to quantitative testing is established:


As we know, in quantitative tests the test statistics are random variables. The null hypothesis is rejected when the probability of observing the statistic, assuming the null hypothesis is true, is smaller than a pre-set significance level. In the visual setting, we need to answer two questions: what does "discovery" mean, and when do we reject the null hypothesis? "Discovery" here does not mean Columbus discovering a new continent; it is more like Holmes finding a clue at the scene. A cigarette butt found at the scene is very suspicious if the victim never smoked; by contrast, butts found in a bar are quite normal. Therefore, in different settings, or under different null hypotheses, the definition of discovery varies. For example, in the model diagnostics of a linear regression model, when we check the normal probability plot, we focus on how well the residuals fit the reference line; departures from the reference line lead to rejection of the normality assumption. Two points are worth mentioning in this example: the null hypothesis sharpens the understanding of "discovery", and the reference line provides a clear guide for where to look. However, not all null hypotheses furnish such good instruction. Another plot we frequently examine in model diagnostics is fitted values versus residuals, used to check the independence of the residuals. Many "discoveries" can fail the assumption of independence, such as linear, monotonic or other types of association. This latter example is the more general case, and it results in a problem of uncertainty: the response of a human viewer is based on his or her past experience, yet even a data expert with rich experience has not seen all types of data or surely identified the suspect. Buja et al. [2009] point out a critical divergence


between quantitative testing and visual discovery:

quantitative testing requires the explicit prior specification of the intended "discovery"; by contrast, the range of visual discoveries is not pre-specified explicitly.

A natural way to address pre-specification is to list all discoverable features and formalize them in terms of test statistics, as with the coloring technique in the extended mosaic plot. However, as mentioned above, a list of discoverable features is never complete. To solve this puzzle, Buja et al. [2009] propose that pre-specification of discoverable features is not necessary. Prior experience is not always an asset; it can also lead to errors of empiricism. The training provided to viewers can instead be done on-site, meaning that plots generated under the null hypothesis are given as references. These plots are called "null plots". The simplest re-sampling method for generating null plots follows the theory of the permutation test, and applies to plots under a null hypothesis of independence. For more complicated null hypotheses, other methods are also available: conditional sampling, parametric bootstrap sampling, or Bayesian posterior predictive sampling. The null plots, as references, give a rough idea of the null hypothesis. A natural question is how to use them efficiently. Buja et al. [2009] introduce two protocols as guides for the procedure. There are two essential roles in the protocols below: the data administrator and the data analyst. The data administrator generates the null plots and provides contextual prior information. To avoid bias, the data analyst should remain blind until a decision whether to declare a discovery has been made. Moreover, the data administrator should be blinded as well, because their actions and the way they ask questions may bias the decisions made by the data analyst. With careful handling, the procedure can be monitored in a self-administering protocol.

Protocol I: Rorschach
Because it is a test of a null hypothesis, when data analysts view the plots they have a tendency to declare discovery even when there is no structure, or only spurious structure, in the plots. Thus, the visual test is subjective and biased by the analyst's prior experience. The Rorschach test is named after the Swiss psychologist Hermann Rorschach, and is originally a general personality test. Here, the idea is extended to study the factors affecting the tendencies of a human viewer and to avoid over-interpretation. A human viewer is shown a number of null plots one by one, and is asked each time whether to declare a discovery. The proportion of declared discoveries is then an estimate of the Type I error rate. Thus, this protocol produces an estimate of the Type I error, but it is not able to control the test at a desired level, because every data analyst has their own rate of discovery.

Protocol II: Line-up
Although the last protocol produces a Type I error rate for each tester, the quality of the tester's judgement is out of our control; what if we have a non-professional data analyst? The line-up protocol can produce a significance level. It is named after the police line-up. For example, 19 null plots are generated and the plot of the real data is inserted among them. The data analyst is then asked to pick out the real data by detecting its specifics. If the real data is also from the null hypothesis, it is indistinguishable from the other null plots, and the probability of choosing the real data gives a p-value of 1/20 = 0.05. The more null plots are generated, the smaller the p-value and the stronger the test. However, a large number of plots burdens a human viewer: people gradually lose their patience and carefulness, and eventually the accuracy and quality of the analyst's judgement drops. From another point of view, the whole procedure is like a classification problem, in which the null plots furnish much information about the null hypothesis. For example, given a picture of an Asian elephant and no knowledge about elephants, it is very difficult to tell whether the picture shows an Asian or an African elephant. Given 19 pictures of African elephants, however, you quickly realize the difference between the two groups. It becomes easier still if the Asian elephant in the picture is female, since female Asian elephants do not have tusks. This protocol has several advantages:

1. We mentioned that a list of display features is never complete; thus, instead of comparing listed features, we look for any feature or specific the real data possesses. Sometimes even intuition works, thanks to our unconscious mind. This is also a merit of a visual analytic tool, since the human being is included in the test mechanism.

2. With careful handling, this protocol can be self-administered as well: randomization, and recording of the location of the real-data plot by a third party (a computer), lets you carry out the procedure by yourself. However, every coin has two sides. The protocol is non-renewable: the display collection of a second run of the procedure would include the real-data plot already known from the first run. Therefore, the analyst's choice would not be independent, since auxiliary information has been provided.

3. Although the procedure cannot be repeated by one person, multiple independent people are not influenced by memory of the real-data plot. The whole process can then be viewed as a Bernoulli trial. Suppose we recruit $k$ people and generate $n - 1$ null plots per line-up, and let $M$ be the number of people who successfully select the plot of the real data. Then $M \mid H_0 \sim \text{Binomial}\left(k, \frac{1}{n}\right)$, and for an observed count $m$ the p-value is $P(M \geq m \mid H_0)$, as sketched below.
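A minimal sketch of this p-value computation in R (the function name is mine; m is the observed number of successful viewers):

    lineup_pvalue <- function(m, k, n = 20) {
      # P(M >= m | H0), where M ~ Binomial(k, 1/n) under the null
      pbinom(m - 1, size = k, prob = 1 / n, lower.tail = FALSE)
    }
    lineup_pvalue(m = 3, k = 10)    # e.g. 3 of 10 viewers pick the real plot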

The above merits make the line-up protocol practical. The following examples show how the protocol works and how a visual test is carried out. Solutions and explanations for each display are revealed after all three examples.

Example 3.3.1. This example concerns housing prices, together with other potential factors affecting them, for the Boston area in the 1970s. In each plot, the x-axis and y-axis represent the per-capita crime rate by town and the median value of owner-occupied homes in $1000s, respectively. The null hypothesis for this example is that housing price is independent of the per-capita crime rate by town. To break the dependence structure of the two variables and generate null plots, the housing price is permuted. The viewer's task is to pick the most "special" plot and state the reason explicitly.

Figure 3.39: Line-up of Example 3.3.1
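A sketch of how such a permutation line-up could be generated, assuming the Boston data from the MASS package (columns crim and medv):

    library(MASS)
    set.seed(1)
    pos <- sample(20, 1)    # slot of the real data, hidden from the viewer
    par(mfrow = c(4, 5), mar = c(2, 2, 1, 1))
    for (i in 1:20) {
      # permuting medv breaks its dependence on crim, as required under H0
      y <- if (i == pos) Boston$medv else sample(Boston$medv)
      plot(Boston$crim, y, pch = 20, cex = 0.4, xlab = "", ylab = "")
    }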

Example 3.3.2. This example also uses the Boston dataset, but the null hypothesis and the display are changed. The new null hypothesis is that the model fits the data well. Each plot is a histogram of the response data; the response data for the null plots are generated from the fitted model $y = f(X) + \epsilon$ where $\epsilon \sim N(0, 38.89838)$.

Figure 3.40: Line-up of Example 3.3.2

Example 3.3.3. This example is a simulated sample. The null hypothesis is that the data follow a standard Gaussian distribution. The visual display used is the box plot. All data for the null plots are generated from a standard Gaussian distribution.


Figure 3.41: Line-up of Example 3.3.3

In none of these examples is it hard to detect some discovery. Hopefully, you got all the answers right. Solutions and reasons are given below.

• For Example 3.3.1, the real data is in the 17th plot. As we expect, housing price drops as the per-capita crime rate increases. Another discovery is the variability of housing price. In low-crime areas, prices vary because many other factors affect them. Although those factors are still in effect in high-crime areas, the variability of price there is very small and prices stay low, since community safety is one of people's top concerns when buying a house.

• For Example 3.3.2, the real data is in the 10th plot. The variables used were selected by bidirectional stepwise selection according to AIC. In fact, the model fits the data very poorly; some housing prices from the fitted model are ridiculously negative.

• For Example 3.3.3, the real data is in the 13th plot. In fact, the real data is generated from a $G(0, 2)$. In the collection of box plots, the box plot of the real data shows bigger variation than the null box plots.

Remarks:

• The line-up protocol can be applied to many tests under very different null hypotheses, and there may be more than one visual display expressing the null hypothesis. Thus, the selection of an appropriate and representative visual display for a specific null hypothesis can be very tricky. The same issue is common in quantitative testing: for the null hypothesis of independence of two random variables $X$ and $Y$, the following equivalent null hypotheses are also good candidates: $F_{X,Y}(x, y) = F_X(x)F_Y(y)$, $C(u, v) = uv$ and $\varphi_{X,Y}(t, s) = \varphi_X(t)\varphi_Y(s)$. In Example 3.3.3, the box plot is not the only choice of visual display; a histogram or even a scatter plot could be used.

• The above figure is constructed from the same data, same model and same re-sampling method as Example 3.3.2. However, this figure makes it harder to determine the plot of the real data, because the number of bins and the width of each bin are not fixed. Scale, in fact, plays an important role.

• Suppose we have 19 null plots and 1 original plot; if the real-data plot is picked out, then the null hypothesis is rejected at the 0.05 level. In fact, if most human viewers can quickly find the discovery, the original plot may carry stronger evidence than when the collection takes viewers longer to declare a discovery. However, how exploration time relates to the p-value is very difficult to determine, because exploration time is also affected by other factors, such as the quality and the type of the plots.

• Because the testers in all visual tests are human beings, subjectivity in judging a collection of plots is inevitable. Thus, training is essential before carrying out a visual test. The Rorschach protocol is used to determine a baseline for each tester, and the line-up protocol provides on-site training by furnishing null plots.

Buja et al. [2009] not only establish the framework of statistical inference for graphics, but also provide a blueprint for practicing the protocols. Amazon's Mechanical Turk can gather a group of people online to accomplish tasks that currently cannot be performed by a computer. A number of collections of null plots and the original plot, following the line-up protocol, can be evaluated by those human viewers; the empirical power of a visual test, and a confidence bound for it, can then be estimated. Moreover, Amazon's Mechanical Turk allows us to record the time each viewer spends on each collection as well as viewers' demographic information such as gender, age and educational level.

Hofmann et al. [2012] compare the power of competing designs for graphical tests. As mentioned in the first remark, more than one visual display method can be used for a given null hypothesis. In their experiment, the null hypothesis is that the centers of two groups are the same, and there are four candidate visual displays: histogram, box plot, density estimate and scatter plot.


Both groups of data are drawn from exponential distributions with different means, $\lambda_1 = 1$ and $\lambda_2 = 1/(1 + d)$ where $d = 0.4, 0.6, 0.8, 1.0, 1.2$; they also vary the sample size of each group, with $n_1 = 15, 45, 135$ and $n_2 = \frac{1}{3}n_1$, $\frac{2}{3}n_1$ or $n_1$. Based on 2513 line-up evaluations by 208 participants, they conclude that the box plot does the best job uniformly in terms of empirical power, i.e. the proportion of correct answers. The protocol proposed by Buja et al. [2009] can thus be applied to compare the empirical power of competing designs with different visual displays. Moreover, in more recent work, Majumder et al. [2012] apply the protocol to test null hypotheses about variable selection in linear regression models, taking the leap of challenging the traditional t-test for variable selection. By simulation experiments they show that, although visual tests should not be seen as a competitor to quantitative tests, they are less demanding than quantitative tests in terms of model assumptions; in fact, visual tests are very powerful when the conditions for traditional statistical tests do not hold.
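One null/alternative data pair from this design could be generated as in the sketch below; since $\lambda_2$ denotes the mean, the exponential rate parameter is $1 + d$:

    d <- 0.8; n1 <- 45; n2 <- n1 / 3
    g1 <- rexp(n1, rate = 1)        # mean 1
    g2 <- rexp(n2, rate = 1 + d)    # mean 1 / (1 + d)
    boxplot(list(group1 = g1, group2 = g2))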


3.4 Visual Test of Independence

3.4.1 Visual Test of Independence via Contour

Two random variables $X$ and $Y$, with continuous cumulative distribution functions $F_X(x)$ and $F_Y(y)$ and continuous joint cumulative distribution function $F_{X,Y}(x, y)$, are said to be independent if and only if $F_{X,Y}(x, y) = F_X(x)F_Y(y)$. Because those functions are continuous, taking partial derivatives with respect to $x$ and $y$ gives $f_{X,Y}(x, y) = f_X(x)f_Y(y)$, where $f_X(x)$ and $f_Y(y)$ are the probability density functions of $X$ and $Y$ respectively and $f_{X,Y}(x, y)$ is the joint probability density function. For a set of observations, the joint density estimate should be quite close to the product of the marginal density estimates if the random variables are independent. To visually measure the difference between the joint density estimate and the product of marginal density estimates, we first need a good visual display for them. Motivated by the histogram and by kernel density estimation, the extended scatter plot below left is a good visual display of the joint probability density function. As a comparison, a heat map of a 2-dimensional kernel density estimate is placed beside the extended scatter plot.

Figure 3.42: An Extended Scatter Plot Allowing Overlap of Data Points

Figure 3.43: A Heat Map of 2-D Kernel Density Estimation

The overlapping-data-points display can be a good alternative to 2-dimensional kernel density estimation. Moreover, the quality of 2-dimensional kernel density estimation depends on the choice of kernel and of the smoothing parameter, the bandwidth; there is no such problem for the overlapping-data-points display. The example below shows what happens when the kernel is not well chosen.

Figure 3.44: An Extended Scatter Plot Allowing Overlap of Data Points

Figure 3.45: A Heat Map of 2-D Kernel Density Estimation

Therefore, the overlapping-data-points display is in general better than 2-dimensional kernel density estimation, and we use it as a representative of $f_{X,Y}(x, y)$. If this extended scatter plot shows a clear pattern, the null hypothesis of independence is immediately rejected; however, this is not the case most of the time. Therefore, to display the product of marginal density estimates, a contour is put on top of the data points as a guide for finding a discovery. Under the null hypothesis of independence, $f_{X,Y}(x, y) = f_X(x)f_Y(y)$. Each marginal density estimate is obtained by kernel density estimation, and the contour is drawn according to the product of the two marginal density estimates. Thus, if the data points do not follow the contour, the null hypothesis of independence is rejected. For example, below left plot
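A minimal sketch of this contour guide in R, with each margin estimated by density() and the product evaluated on a grid; x and y are assumed to be the observed samples, and the function name is mine:

    visual_indep_plot <- function(x, y, grid = 100) {
      dx <- density(x, n = grid)
      dy <- density(y, n = grid)
      z <- outer(dx$y, dy$y)    # product f_X(x) * f_Y(y) on the grid
      plot(x, y, pch = 20, col = rgb(0, 0, 0, 0.3))
      contour(dx$x, dy$x, z, add = TRUE, col = "blue")
    }
    visual_indep_plot(rnorm(500), rnorm(500))    # independent example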


Chapter 4

Simulation Study


Chapter 5

Summary


Appendix A

Derivations

A.1 Derivation of Spearman’s Correlation Coefficient

Suppose we have $n$ observations $(x_i, y_i)$, $i = 1, \dots, n$. The ranks of $x_i$ and $y_i$ are denoted by $R_{x_i}$ and $R_{y_i}$ respectively. Let $\bar{R}_x = \frac{1}{n}\sum_{i=1}^{n} R_{x_i} = \frac{1}{n}(1 + \cdots + n) = \frac{1}{n}\frac{n(n+1)}{2} = \frac{n+1}{2}$ and $\bar{R}_y = \frac{n+1}{2}$, so $\bar{R}_x = \bar{R}_y$. Then, the sample Pearson's correlation coefficient of the ranks is
$$r_{R_x, R_y} = \frac{\sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)\left(R_{y_i} - \bar{R}_y\right)}{\sqrt{\sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)^2\sum_{i=1}^{n}\left(R_{y_i} - \bar{R}_y\right)^2}}$$
Since both $(R_{x_1}, \dots, R_{x_n})$ and $(R_{y_1}, \dots, R_{y_n})$ are always permutations of $(1, \dots, n)$, we have $\sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)^2 = \sum_{i=1}^{n}\left(R_{y_i} - \bar{R}_y\right)^2$. Thus, the denominator of $r_{R_x, R_y}$ can be simplified as
$$\begin{aligned}
\sqrt{\sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)^2\sum_{i=1}^{n}\left(R_{y_i} - \bar{R}_y\right)^2}
&= \sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)^2 \\
&= \sum_{i=1}^{n}\left(R_{x_i} - \frac{n+1}{2}\right)^2 \\
&= \sum_{i=1}^{n}\left(R_{x_i}^2 - (n+1)R_{x_i} + \left(\frac{n+1}{2}\right)^2\right) \\
&= \sum_{i=1}^{n} R_{x_i}^2 - (n+1)\frac{n(n+1)}{2} + n\left(\frac{n+1}{2}\right)^2 \\
&= \sum_{i=1}^{n} R_{x_i}^2 - \frac{n(n+1)^2}{2} + \frac{n(n+1)^2}{4} \\
&= 1^2 + 2^2 + \cdots + n^2 - \frac{n(n+1)^2}{4} \\
&= \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} \\
&= \frac{n(n^2 - 1)}{12}
\end{aligned}$$
The numerator can also be simplified:
$$\begin{aligned}
\sum_{i=1}^{n}\left(R_{x_i} - \bar{R}_x\right)\left(R_{y_i} - \bar{R}_y\right)
&= \sum_{i=1}^{n}\left(R_{x_i}R_{y_i} - R_{x_i}\bar{R}_y - \bar{R}_x R_{y_i} + \bar{R}_x\bar{R}_y\right) \\
&= \sum_{i=1}^{n}\left(R_{x_i}R_{y_i} - \frac{n+1}{2}R_{x_i} - \frac{n+1}{2}R_{y_i} + \left(\frac{n+1}{2}\right)^2\right) \\
&= \sum_{i=1}^{n} R_{x_i}R_{y_i} - \frac{n+1}{2}\frac{n(n+1)}{2} - \frac{n+1}{2}\frac{n(n+1)}{2} + n\left(\frac{n+1}{2}\right)^2 \\
&= \sum_{i=1}^{n} R_{x_i}R_{y_i} - \frac{n(n+1)^2}{4} \\
&= \frac{1}{2}\left(\sum_{i=1}^{n} R_{x_i}^2 + \sum_{i=1}^{n} R_{y_i}^2 - \sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2\right) - \frac{n(n+1)^2}{4} \\
&= \frac{1}{2}\left(\frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)(2n+1)}{6} - \sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2\right) - \frac{n(n+1)^2}{4} \\
&= \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2 - \frac{n(n+1)^2}{4} \\
&= \frac{n(n^2 - 1)}{12} - \frac{1}{2}\sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2
\end{aligned}$$
Therefore,
$$r_{R_x, R_y} = \frac{\dfrac{n(n^2-1)}{12} - \dfrac{1}{2}\displaystyle\sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2}{\dfrac{n(n^2-1)}{12}} = 1 - \frac{6\displaystyle\sum_{i=1}^{n}\left(R_{x_i} - R_{y_i}\right)^2}{n(n^2 - 1)}$$
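A quick numerical check of this identity in R (a sketch, assuming no ties in the data so the ranks are exact permutations):

    set.seed(1)
    x <- rnorm(10); y <- rnorm(10)
    rx <- rank(x); ry <- rank(y); n <- length(x)
    c(cor(rx, ry),                                  # Pearson correlation of the ranks
      1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1)),   # closed form derived above
      cor(x, y, method = "spearman"))               # built-in Spearman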

A.2 Derivation of ∆ in Hoeffding’s Independence Test

Let
$$\psi(a, b, c) = I(a \geq b) - I(a \geq c) = I(b \leq a) - I(c \leq a)$$
where $I(\cdot)$ is an indicator function, and
$$\begin{aligned}
\phi(x_1, y_1; \dots; x_5, y_5) &= \tfrac{1}{4}\,\psi(x_1, x_2, x_3)\,\psi(x_1, x_4, x_5)\,\psi(y_1, y_2, y_3)\,\psi(y_1, y_4, y_5) \\
&= \tfrac{1}{4}\,\big[I(x_2 \leq x_1) - I(x_3 \leq x_1)\big]\big[I(x_4 \leq x_1) - I(x_5 \leq x_1)\big] \\
&\qquad \big[I(y_2 \leq y_1) - I(y_3 \leq y_1)\big]\big[I(y_4 \leq y_1) - I(y_5 \leq y_1)\big] \\
&= \tfrac{1}{4}\,\big[I(x_2 \leq x_1)I(x_4 \leq x_1) - I(x_2 \leq x_1)I(x_5 \leq x_1) \\
&\qquad\quad - I(x_3 \leq x_1)I(x_4 \leq x_1) + I(x_3 \leq x_1)I(x_5 \leq x_1)\big] \\
&\qquad \big[I(y_2 \leq y_1)I(y_4 \leq y_1) - I(y_2 \leq y_1)I(y_5 \leq y_1) \\
&\qquad\quad - I(y_3 \leq y_1)I(y_4 \leq y_1) + I(y_3 \leq y_1)I(y_5 \leq y_1)\big]
\end{aligned}$$
If we integrate a so-called quasi-square term $I(x_2 \leq x_1)I(x_4 \leq x_1)I(y_2 \leq y_1)I(y_4 \leq y_1)$ over $F_{X,Y}(x_1, y_1)\cdots F_{X,Y}(x_5, y_5)$, then we have
$$\begin{aligned}
&\int\!\!\cdots\!\!\int I(x_2 \leq x_1)I(x_4 \leq x_1)I(y_2 \leq y_1)I(y_4 \leq y_1)\, dF_{X,Y}(x_1, y_1)\cdots dF_{X,Y}(x_5, y_5) \\
&= \left(\iint 1\, dF_{X,Y}(x_3, y_3)\right)\left(\iint 1\, dF_{X,Y}(x_5, y_5)\right) \\
&\quad\ \iint\left[\left(\iint I(x_2 \leq x_1)I(y_2 \leq y_1)\, dF_{X,Y}(x_2, y_2)\right)\left(\iint I(x_4 \leq x_1)I(y_4 \leq y_1)\, dF_{X,Y}(x_4, y_4)\right)\right] dF_{X,Y}(x_1, y_1) \\
&= \iint\left[\left(\int_{-\infty}^{x_1}\!\int_{-\infty}^{y_1} dF_{X,Y}(x_2, y_2)\right)\left(\int_{-\infty}^{x_1}\!\int_{-\infty}^{y_1} dF_{X,Y}(x_4, y_4)\right)\right] dF_{X,Y}(x_1, y_1) \\
&= \iint F_{X,Y}^2(x_1, y_1)\, dF_{X,Y}(x_1, y_1)
\end{aligned}$$
If we integrate a so-called quasi-cross term $I(x_2 \leq x_1)I(x_4 \leq x_1)I(y_2 \leq y_1)I(y_5 \leq y_1)$ over $F_{X,Y}(x_1, y_1)\cdots F_{X,Y}(x_5, y_5)$, then we have
$$\begin{aligned}
&\int\!\!\cdots\!\!\int I(x_2 \leq x_1)I(x_4 \leq x_1)I(y_2 \leq y_1)I(y_5 \leq y_1)\, dF_{X,Y}(x_1, y_1)\cdots dF_{X,Y}(x_5, y_5) \\
&= \left(\iint 1\, dF_{X,Y}(x_3, y_3)\right)\iint\bigg[\left(\iint I(x_4 \leq x_1)\, dF_{X,Y}(x_4, y_4)\right)\left(\iint I(y_5 \leq y_1)\, dF_{X,Y}(x_5, y_5)\right) \\
&\qquad\qquad \left(\iint I(x_2 \leq x_1)I(y_2 \leq y_1)\, dF_{X,Y}(x_2, y_2)\right)\bigg]\, dF_{X,Y}(x_1, y_1) \\
&= \iint F_X(x_1)F_Y(y_1)F_{X,Y}(x_1, y_1)\, dF_{X,Y}(x_1, y_1)
\end{aligned}$$
Then we have
$$\begin{aligned}
&\int\!\!\cdots\!\!\int \phi(x_1, y_1; \dots; x_5, y_5)\, dF_{X,Y}(x_1, y_1)\cdots dF_{X,Y}(x_5, y_5) \\
&= \iint \tfrac{1}{4}\Big[4\big(F_{X,Y}^2(x_1, y_1) - F_{X,Y}(x_1, y_1)F_X(x_1)F_Y(y_1) \\
&\qquad\qquad - F_{X,Y}(x_1, y_1)F_X(x_1)F_Y(y_1) + F_X^2(x_1)F_Y^2(y_1)\big)\Big]\, dF_{X,Y}(x_1, y_1) \\
&= \iint \big(F_{X,Y}(x_1, y_1) - F_X(x_1)F_Y(y_1)\big)^2\, dF_{X,Y}(x_1, y_1) \\
&= \iint \big(F_{X,Y}(x, y) - F_X(x)F_Y(y)\big)^2\, dF_{X,Y}(x, y) = \Delta
\end{aligned}$$

A.3 Derivation of $\mathrm{Var}(D_n)$ in Hoeffding's Independence Test

Let a symmetric function $\Phi(X_1, Y_1; \dots; X_5, Y_5)$ be
$$\Phi(X_1, Y_1; \dots; X_5, Y_5) = D_5 = \frac{1}{5!}\sum_{\alpha_1=1}^{5}\sum_{\alpha_2=1}^{5}\sum_{\alpha_3=1}^{5}\sum_{\alpha_4=1}^{5}\sum_{\alpha_5=1}^{5}\phi\left(X_{\alpha_1}, Y_{\alpha_1}; \dots; X_{\alpha_5}, Y_{\alpha_5}\right)$$
where $\alpha_i \neq \alpha_j$ if $i \neq j$ ($i, j = 1, \dots, 5$). Thus, $D_n$ can be re-written as
$$D_n = \binom{n}{5}^{-1}\sum_{\alpha_1=1}^{n}\sum_{\alpha_2 > \alpha_1}^{n}\sum_{\alpha_3 > \alpha_2}^{n}\sum_{\alpha_4 > \alpha_3}^{n}\sum_{\alpha_5 > \alpha_4}^{n}\Phi\left(X_{\alpha_1}, Y_{\alpha_1}; \dots; X_{\alpha_5}, Y_{\alpha_5}\right)$$
Let
$$\begin{aligned}
\Phi_k(x_1, y_1; \dots; x_k, y_k) &= E_{(k+1),\dots,5}\left[\Phi(x_1, y_1; \dots; x_k, y_k; X_{k+1}, Y_{k+1}; \dots; X_5, Y_5)\right], \quad k = 1, \dots, 4 \\
\Phi_5(x_1, y_1; \dots; x_5, y_5) &= \Phi(x_1, y_1; \dots; x_5, y_5)
\end{aligned}$$
where $(x_1, y_1), \dots, (x_k, y_k)$ are arbitrary fixed vectors, $(X_{k+1}, Y_{k+1}), \dots, (X_5, Y_5)$ are random vectors, and $E_{(k+1),\dots,5}$ denotes the expectation taken over $(X_{k+1}, Y_{k+1}), \dots, (X_5, Y_5)$. Then, we have
$$\begin{aligned}
E_{1,\dots,k}\left[\Phi_k(X_1, Y_1; \dots; X_k, Y_k)\right]
&= E_{1,\dots,k}\left[E_{(k+1),\dots,5}\left[\Phi(X_1, Y_1; \dots; X_5, Y_5)\right]\right] \\
&= E_{1,\dots,5}\left[\Phi(X_1, Y_1; \dots; X_5, Y_5)\right] = E[D_5] = \Delta, \quad k = 1, \dots, 5
\end{aligned}$$
Suppose the variance of $\Phi_k(X_1, Y_1; \dots; X_k, Y_k)$ exists, denoted by $\xi_k$:
$$\xi_k = \mathrm{Var}_{1,\dots,k}\left(\Phi_k(X_1, Y_1; \dots; X_k, Y_k)\right) = E_{1,\dots,k}\left[\left(\Phi_k(X_1, Y_1; \dots; X_k, Y_k) - \Delta\right)^2\right]$$
Suppose $(\alpha_1, \dots, \alpha_5)$ and $(\beta_1, \dots, \beta_5)$ are any two sets of 5 distinct indices from $(1, \dots, n)$ with exactly $k$ integers in common. Then
$$\xi_k = E\left[\left(\Phi(X_{\alpha_1}, Y_{\alpha_1}; \dots; X_{\alpha_5}, Y_{\alpha_5}) - \Delta\right)\left(\Phi(X_{\beta_1}, Y_{\beta_1}; \dots; X_{\beta_5}, Y_{\beta_5}) - \Delta\right)\right]$$
Therefore, we obtain
$$\begin{aligned}
\mathrm{Var}(D_n) &= E\left[(D_n - \Delta)^2\right] \\
&= E\left[\left(\binom{n}{5}^{-1}\sum_{\alpha_1 < \alpha_2 < \alpha_3 < \alpha_4 < \alpha_5}\left[\Phi(X_{\alpha_1}, Y_{\alpha_1}; \dots; X_{\alpha_5}, Y_{\alpha_5}) - \Delta\right]\right)^2\right] \\
&= \binom{n}{5}^{-2}\sum_{k=1}^{5}\sum_{(k)} E\left[\left(\Phi(X_{\alpha_1}, Y_{\alpha_1}; \dots) - \Delta\right)\left(\Phi(X_{\beta_1}, Y_{\beta_1}; \dots) - \Delta\right)\right] \\
&= \binom{n}{5}^{-2}\sum_{k=1}^{5}\sum_{(k)} \xi_k
\end{aligned}$$
where $\sum_{(k)}$ is the sum over all $\alpha_i$ and $\beta_j$ such that there are exactly $k$ equalities $\alpha_i = \beta_j$. The number of terms in $\sum_{(k)}$ is $\binom{n}{5}\binom{n-5}{5-k}\binom{5}{k}$. Thus,
$$\mathrm{Var}(D_n) = \binom{n}{5}^{-2}\sum_{k=1}^{5}\binom{n}{5}\binom{n-5}{5-k}\binom{5}{k}\xi_k = \binom{n}{5}^{-1}\sum_{k=1}^{5}\binom{n-5}{5-k}\binom{5}{k}\xi_k$$


A.4 Derivation of the Computational Form of $T_n$ in the Test of Independence Based on the Empirical Copula Process

$$\begin{aligned}
T_n &= n\int_0^1\!\!\int_0^1 \big(C_n(u, v) - C(u, v)\big)^2\, \frac{1}{uv}\, dC(u, v) \\
&= n\int_0^1\!\!\int_0^1 \big(C_n(u, v) - uv\big)^2\, \frac{uv}{uv}\, du\, dv \\
&= n\int_0^1\!\!\int_0^1 \big(C_n(u, v) - uv\big)^2\, du\, dv \\
&= n\int_0^1\!\!\int_0^1 \left(\frac{1}{n}\sum_{i=1}^{n} I(R_{x_i} \leq un)\,I(R_{y_i} \leq vn) - uv\right)^2 du\, dv \\
&= n\int_0^1\!\!\int_0^1 \left(\frac{1}{n}\sum_{i=1}^{n}\left[I(R_{x_i} \leq un) - u\right]\left[I(R_{y_i} \leq vn) - v\right]\right)^2 du\, dv \\
&= \frac{1}{n}\int_0^1\!\!\int_0^1 \sum_{i=1}^{n}\sum_{j=1}^{n}\left[I(R_{x_i} \leq un) - u\right]\left[I(R_{x_j} \leq un) - u\right]\left[I(R_{y_i} \leq vn) - v\right]\left[I(R_{y_j} \leq vn) - v\right] du\, dv \\
&= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\int_0^1\left[I(R_{x_i} \leq un) - u\right]\left[I(R_{x_j} \leq un) - u\right] du\right)\left(\int_0^1\left[I(R_{y_i} \leq vn) - v\right]\left[I(R_{y_j} \leq vn) - v\right] dv\right)
\end{aligned}$$
Aside:
$$\begin{aligned}
&\int_0^1\left[I(R_{x_i} \leq un) - u\right]\left[I(R_{x_j} \leq un) - u\right] du \\
&= \int_0^1 I(R_{x_i} \leq un)I(R_{x_j} \leq un)\, du - \int_0^1 u\,I(R_{x_j} \leq un)\, du - \int_0^1 u\,I(R_{x_i} \leq un)\, du + \int_0^1 u^2\, du \\
&= \int_0^1 I\!\left(u \geq \frac{R_{x_i}}{n}\right) I\!\left(u \geq \frac{R_{x_j}}{n}\right) du - \int_0^1 u\,I\!\left(u \geq \frac{R_{x_j}}{n}\right) du - \int_0^1 u\,I\!\left(u \geq \frac{R_{x_i}}{n}\right) du + \int_0^1 u^2\, du \\
&= \int_{\max\left(\frac{R_{x_i}}{n}, \frac{R_{x_j}}{n}\right)}^1 1\, du - \int_{\frac{R_{x_j}}{n}}^1 u\, du - \int_{\frac{R_{x_i}}{n}}^1 u\, du + \int_0^1 u^2\, du \\
&= 1 - \max\!\left(\frac{R_{x_i}}{n}, \frac{R_{x_j}}{n}\right) - \frac{1}{2}\left(1 - \left(\frac{R_{x_j}}{n}\right)^2\right) - \frac{1}{2}\left(1 - \left(\frac{R_{x_i}}{n}\right)^2\right) + \frac{1}{3} \\
&= \frac{1}{3} + \frac{1}{2}\left(\frac{R_{x_i}}{n}\right)^2 + \frac{1}{2}\left(\frac{R_{x_j}}{n}\right)^2 - \frac{\max\left(R_{x_i}, R_{x_j}\right)}{n}
\end{aligned}$$
Therefore,
$$T_n = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{3} + \frac{1}{2}\left(\frac{R_{x_i}}{n}\right)^2 + \frac{1}{2}\left(\frac{R_{x_j}}{n}\right)^2 - \frac{\max\left(R_{x_i}, R_{x_j}\right)}{n}\right)\left(\frac{1}{3} + \frac{1}{2}\left(\frac{R_{y_i}}{n}\right)^2 + \frac{1}{2}\left(\frac{R_{y_j}}{n}\right)^2 - \frac{\max\left(R_{y_i}, R_{y_j}\right)}{n}\right)$$
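This closed form lends itself directly to computation. A sketch in R, assuming no ties so that the ranks are a permutation of $1, \dots, n$ (the function name is mine):

    Tn <- function(x, y) {
      n <- length(x)
      rx <- rank(x) / n
      ry <- rank(y) / n
      # term_ij = 1/3 + (r_i^2 + r_j^2)/2 - max(r_i, r_j), for each margin
      Ax <- 1/3 + outer(rx^2, rx^2, "+") / 2 - outer(rx, rx, pmax)
      Ay <- 1/3 + outer(ry^2, ry^2, "+") / 2 - outer(ry, ry, pmax)
      sum(Ax * Ay) / n    # (1/n) sum over i and j of the two factors
    }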


A.5 Simplification of $|\varphi_{X,Y}(t, s) - \varphi_X(t)\varphi_Y(s)|^2$ in the Distance Correlation Coefficient

$$\begin{aligned}
\left|\varphi_{X,Y}(t, s) - \varphi_X(t)\cdot\varphi_Y(s)\right|^2
&= \left|E\left[\left(e^{it^TX} - \varphi_X(t)\right)\left(e^{is^TY} - \varphi_Y(s)\right)\right]\right|^2 \\
&\leq E\left[\left|e^{it^TX} - \varphi_X(t)\right|^2\right] E\left[\left|e^{is^TY} - \varphi_Y(s)\right|^2\right] \\
&= E\left[\left|\big(\cos(t^TX) + i\sin(t^TX)\big) - \varphi_X(t)\right|^2\right] E\left[\left|\big(\cos(s^TY) + i\sin(s^TY)\big) - \varphi_Y(s)\right|^2\right]
\end{aligned}$$
by Euler's formula. Since $\varphi_X(t) = E[e^{it^TX}] = E[\cos(t^TX)] + iE[\sin(t^TX)]$, we can let $a = E[\cos(t^TX)]$ and $b = E[\sin(t^TX)]$, so that $\varphi_X(t) = a + ib$. To simplify $E\big[\big|(\cos(t^TX) + i\sin(t^TX)) - \varphi_X(t)\big|^2\big]$, we have an aside:
$$\begin{aligned}
&E\left[\left|\big(\cos(t^TX) + i\sin(t^TX)\big) - \varphi_X(t)\right|^2\right] \\
&= E\left[\big(\cos(t^TX) + i\sin(t^TX) - (a + ib)\big)\big(\cos(t^TX) - i\sin(t^TX) - (a - ib)\big)\right] \\
&= E\left[1 - 2a\cos(t^TX) - 2b\sin(t^TX) + a^2 + b^2\right] \\
&= 1 - 2aE\left[\cos(t^TX)\right] - 2bE\left[\sin(t^TX)\right] + a^2 + b^2 \\
&= 1 - 2a^2 - 2b^2 + a^2 + b^2 = 1 - (a^2 + b^2) = 1 - (a + ib)(a - ib) = 1 - \varphi_X(t)\overline{\varphi_X(t)} = 1 - |\varphi_X(t)|^2
\end{aligned}$$
Similarly, $E\big[\big|(\cos(s^TY) + i\sin(s^TY)) - \varphi_Y(s)\big|^2\big] = 1 - |\varphi_Y(s)|^2$. In addition, if $X'$ and $Y'$ are independent copies of $X$ and $Y$ respectively, then we have
$$\begin{aligned}
|\varphi_X(t)|^2 &= \varphi_X(t)\overline{\varphi_{X'}(t)} \\
&= E\left[\cos(t^TX) + i\sin(t^TX)\right] E\left[\cos(t^TX') - i\sin(t^TX')\right] \\
&= E\left[\cos(t^TX)\cos(t^TX') + \sin(t^TX)\sin(t^TX')\right] \\
&= E\left[\cos\big(t^T(X - X')\big)\right]
\end{aligned}$$
and $|\varphi_Y(s)|^2 = E\left[\cos\big(s^T(Y - Y')\big)\right]$. Therefore,
$$\left|\varphi_{X,Y}(t, s) - \varphi_X(t)\cdot\varphi_Y(s)\right|^2 \leq \left(1 - |\varphi_X(t)|^2\right)\left(1 - |\varphi_Y(s)|^2\right)$$

A.6 Derivation of the Computational Form of $\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})$, $\mathcal{V}_n^2(\mathbf{X})$ and $\mathcal{V}_n^2(\mathbf{Y})$ in the Distance Correlation Coefficient

cient

To obtain the computational form of V2n (X,Y), let us start with

∣∣ϕnX,Y (t, s)− ϕnX (t)ϕnY (s)∣∣2.∣∣ϕnX,Y (t, s)− ϕnX (t)ϕnY (s)

∣∣2=

(ϕnX,Y (t, s)− ϕnX (t)ϕnY (s)

)(ϕnX,Y (t, s)− ϕnX (t)ϕnY (s)

)= ϕnX,Y (t, s)ϕnX,Y (t, s)− ϕnX,Y (t, s)ϕnX (t)ϕnY (s)

−ϕnX (t)ϕnY (s)ϕnX,Y (t, s) + ϕnX (t)ϕnY (s)ϕnX (t)ϕnY (s)

73

Page 83: A Visualizing Tool for Conditional Independencerwoldfor/students/DerekWang/Thesis.pdf · i=1 R x i R X R y i R q Y P n i=1 R x i R X 2 P n i=1 R y i R Y 2 = 1 6 P n i=1 (R x i R y)

The first part of the above equation is

ϕnX,Y (t, s)ϕnX,Y (t, s)

=1

n

n∑k=1

(cos(tTXk + sTYk

)+ isin

(tTXk + sTYk

))1

n

n∑l=1

(cos (tTXl + sTYl) + isin (tTXl + sTYl))

=1

n

n∑k=1

(cos(tTXk + sTYk

)+ isin

(tTXk + sTYk

))1

n

n∑l=1

(cos(tTXl + sTYl

)− isin

(tTXl + sTYl

))=

1

n2

n∑k=1

n∑l=1

(cos(tTXk + sTYk

)+ isin

(tTXk + sTYk

))(cos(tTXl + sTYl

)− isin

(tTXl + sTYl

))=

1

n2

n∑k=1

n∑l=1

[cos((

tTXk + sTYk

)−(tTXl + sTYl

))+isin

((tTXk + sTYk

)−(tTXl + sTYl

)) ]=

1

n2

n∑k=1

n∑l=1

[cos(tT (Xk −Xl) + sT (Yk −Yl)

)+isin

(tT (Xk −Xl) + sT (Yk −Yl)

) ]=

1

n2

n∑k=1

n∑l=1

ei(tT (Xk−Xl)+sT (Yk−Yl))

=1

n2

n∑k=1

n∑l=1

eitT (Xk−Xl)eis

T (Yk−Yl)

74

Page 84: A Visualizing Tool for Conditional Independencerwoldfor/students/DerekWang/Thesis.pdf · i=1 R x i R X R y i R q Y P n i=1 R x i R X 2 P n i=1 R y i R Y 2 = 1 6 P n i=1 (R x i R y)

And we have the following fact

1

n2

n∑k=1

n∑l=1

∫Rp+q

eitT (Xk−Xl)eis

T (Yk−Yl)

cpcq‖t‖1+p‖s‖1+qdt ds

=1

n2

n∑k=1

n∑l=1

∫Rp

eitT (Xk−Xl)

cp‖t‖1+pdt

∫Rq

eisT (Yk−Yl)

cq‖s‖1+qds

=1

n2

n∑k=1

n∑l=1

‖Xk −Xl‖ ‖Yk − Yl‖

Then, V2n (X,Y) can be simplified to

V2n (X,Y) =

1

n2

n∑k=1

n∑l=1

‖Xk −Xl‖ ‖Yk − Yl‖ −2

n3

n∑k=1

n∑l=1

n∑m=1

‖Xk −Xl‖ ‖Yk − Ym‖

+1

n2

n∑k=1

n∑l=1

‖Xk −Xl‖1

n2

n∑k=1

n∑l=1

‖Yk − Yl‖

Let akl = ‖Xk −Xl‖ and bkl = ‖Yk − Yl‖, and then

V2n (X,Y)

=1

n2

n∑k=1

n∑l=1

aklbkl +1

n2

n∑k=1

n∑l=1

akl1

n2

n∑k=1

n∑l=1

bkl −2

n3

n∑k=1

n∑l=1

n∑m=1

aklbkm

=1

n2

n∑k=1

n∑l=1

aklbkl +1

n

n∑k=1

ak·1

n

n∑l=1

b·l −2

n

n∑k=1

ak·bk·

=1

n2

n∑k=1

n∑l=1

aklbkl +1

n2

n∑k=1

n∑l=1

ak·b·l −2

n

n∑k=1

ak·bk·

=1

n2

n∑k=1

n∑l=1

(akl − ak· − a·l + a··)(bkl − bk· − b·l + b··

)=

1

n2

n∑k=1

n∑l=1

AklBkl where

Similarly,

V2n (X) = V2

n (X,X) =1

n2

n∑k=1

n∑l=1

A2kl

75

Page 85: A Visualizing Tool for Conditional Independencerwoldfor/students/DerekWang/Thesis.pdf · i=1 R x i R X R y i R q Y P n i=1 R x i R X 2 P n i=1 R y i R Y 2 = 1 6 P n i=1 (R x i R y)

and

V2n (Y) = V2

n (Y,Y) =1

n2

n∑k=1

n∑l=1

B2kl

where Akl = akl − ak· − a·l + a·· and Bkl = bkl − bk· − b·l + b··.
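A numerical sketch of this computational form for the univariate case ($p = q = 1$); A and B are the doubly-centered distance matrices, and the function name is mine:

    dcov2 <- function(x, y) {
      a <- as.matrix(dist(x))
      b <- as.matrix(dist(y))
      # A_kl = a_kl - row mean_k - column mean_l + grand mean; likewise for B
      A <- a - rowMeans(a) - matrix(colMeans(a), nrow(a), ncol(a), byrow = TRUE) + mean(a)
      B <- b - rowMeans(b) - matrix(colMeans(b), nrow(b), ncol(b), byrow = TRUE) + mean(b)
      c(VnXY = mean(A * B), VnX = mean(A * A), VnY = mean(B * B))
    }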


Appendix B

R Code


References

T.W. Anderson. On the distribution of the two-sample Cramér-von Mises criterion. The Annals of Mathematical Statistics, pages 1148–1159, 1962.

G.A. Barnard. A new test for 2×2 tables. Nature, 156, 1945.

J. Bertin. Semiology of Graphics: Diagrams, Networks, Maps. University of Wisconsin Press, 1983.

A. Buja, D. Cook, H. Hofmann, M. Lawrence, E.K. Lee, D.F. Swayne, and H. Wickham. Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4361–4383, 2009.

W.H. Cherry and R.W. Oldford. Picturing probability: the poverty of Venn diagrams, the richness of eikosograms. Submitted for publication, 2003.

P. Deheuvels. An asymptotic decomposition for multivariate distribution-free tests of independence. Journal of Multivariate Analysis, 11(1):102–113, 1981.

M. Friendly. Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89(425):190–200, 1994.

M. Friendly. A brief history of data visualization. Handbook of Data Visualization, pages 15–56, 2008.

C. Genest and B. Remillard. Tests of independence and randomness based on the empirical copula process. Sociedad de Estadística e Investigación Operativa, 13(2):335–369, 2004.

K. Ghoudi, R.J. Kulperger, and B. Remillard. A non-parametric test of serial independence for time series and residuals. Journal of Multivariate Analysis, 79(3):191–218, 2001.

M.J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press, 1984.

M.J. Greenacre and T. Hastie. The geometric interpretation of correspondence analysis. Journal of the American Statistical Association, 82(398):437–447, 1987.

J.A. Hartigan and B. Kleiner. Mosaics for contingency tables. Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, 22, 1981.

W. Hoeffding. A non-parametric test of independence. The Annals of Mathematical Statistics, 19(4):546–557, 1948a.

W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948b.

H. Hofmann, L. Follett, M. Majumder, and D. Cook. Graphical tests for power comparison of competing designs. Visualization and Computer Graphics, IEEE Transactions on, 18(12):2441–2448, 2012.

D.E. Holmes. Innovations in Bayesian Networks: Theory and Applications, volume 156. Springer, 2008.

W.H. Kruskal. Ordinal measures of association. Journal of the American Statistical Association, 53(284):814–861, 1958.

M. Majumder, H. Hofmann, and D. Cook. Validation of visual statistical inference, with application to linear models. http://www.stat.iastate.edu/preprint/articles/2012-04.pdf, 2012.

D. Meyer, A. Zeileis, and K. Hornik. Visualizing independence using extended association plots. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), 2003.

R.B. Nelsen. An Introduction to Copulas. Springer, second edition, 2006.

R.D. Snee. Graphical display of two-way contingency tables. The American Statistician, 28(1):9–12, 1974.

G.J. Szekely and M.L. Rizzo. Hierarchical clustering via joint between-within distances: Extending Ward's minimum variance method. Journal of Classification, 22(2):151–183, 2005.

G.J. Szekely, M.L. Rizzo, and N.K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

J.W. Tukey. The future of data analysis. The Annals of Mathematical Statistics, 33(1):1–67, 1962.

R. Von Mises. Mathematical Theory of Probability and Statistics. Academic Press, 1964.

P. Whittle. Probability. Probability via Expectation, pages 39–50, 2000.