
Page 1: Distance Correlation  E-Statistics

Distance Correlation E-Statistics

Gábor J. Székely Rényi Institute of the Hungarian Academy of Sciences

Columbia University, April 28-April 30, 2014

Page 2: Distance Correlation  E-Statistics

Topics

• Lecture 1. Distance Correlation. From correlation (Galton/Pearson, 1895) to distance correlation (Székely, 2005). Important measures of dependence and how to classify them via invariances. The distance correlation t-test of independence. Open problems for big data.

• Lecture 2. Energy statistics (E-statistics) and their applications. Testing for symmetry, testing for normality, DISCO analysis, energy clustering, etc. A simple inequality on energy statistics and a beautiful theorem of Fourier transforms. What makes a statistic U (or V)?

• Lecture 3. Brownian correlation. Correlation with respect to stochastic processes. Distances and negative definite functions. Physics principles in statistics (the uncertainty principle of statistics, symmetries/invariances, equilibrium estimates). CLT for dependent variables via Brownian correlation. What if the sample is not iid, what if the sample comes from a stochastic process?

• Colloquium talk. Partial distance correlation. Distance correlation and dissimilarities via unbiased distance covariance estimates. What is wrong with the Mantel test? Variable selection via pdCor. What is a good measure of dependence? My Erlangen program in Statistics.

Page 3: Distance Correlation  E-Statistics

Lecture 1. Distance Correlation

Dependence Measures and Tests for Independence

Kolmogorov: “Independence is the most important notion of probability theory”

• Correlation (Galton 1885–1888; Natural Inheritance, 1889; Pearson, 1895)
• Chi-square (Pearson, 1900)
• Spearman's rank correlation (1904), Amer. J. Psychol. 15, 72–101
• Fisher, R. (1922) and Fisher's exact test
• Kendall's tau (1938), "A New Measure of Rank Correlation", Biometrika 30 (1–2), 81–89
• Maximal correlation: Hirschfeld, Gebelein (1941), Lancaster (1957), Sarmanov (1958), Rényi (1959), Buja (1990), Dembo (2001)
• Hoeffding's independence test (1948), Annals of Mathematical Statistics 19, 293–325
• Blum–Kiefer–Wolfowitz (1961)
• Mantel test (1967)
• RKHS: Baker (1973), Fukumizu, Gretton, Póczos, …
• RV coefficient (1976): Robert, P. and Escoufier, Y., "A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient", Applied Statistics 25 (3), 257–265
• A Stack Exchange question/answer compares dCor favorably with the RV coefficient: http://math.stackexchange.com/questions/690972/distance-or-similarity-between-matrices-that-are-not-the-same-size

• Distance correlation (dCor): Székely (2005); Székely, Rizzo and Bakirov (2007). A nice free Matlab/Octave implementation: http://mastrave.org/doc/mtv_m/dist_corr (I should perhaps add a link to our energy page.)

• Brownian correlation: Székely and Rizzo (2009)

dCor generalizes and improves on Pearson's correlation, the RV coefficient, the Mantel test and chi-square (note the denominator!)

MIC, 2010

Valhalla --- GÖTTERDÄMMERUNG

Page 4: Distance Correlation  E-Statistics

Kolmogorov: “Independence is the most important notion of probability theory”

What is Pearson's correlation?

Sample: (Xk, Yk), k = 1, 2, …, n

Centered sample: Ak := Xk − X̄, Bk := Yk − Ȳ

cov(X,Y) = (1/n) Σk Ak Bk

r := cor(X,Y) = cov(X,Y) / [cov(X,X) cov(Y,Y)]^(1/2)
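A minimal R check of these formulas (the data here are simulated, purely for illustration):

# Pearson's r from the centered-sample formulas above
set.seed(1)
x <- rnorm(20); y <- x + rnorm(20)
A <- x - mean(x); B <- y - mean(y)             # centered sample
cov_xy <- mean(A * B)                          # (1/n) sum_k A_k B_k
r <- cov_xy / sqrt(mean(A * A) * mean(B * B))  # cov(x,y)/[cov(x,x)cov(y,y)]^(1/2)
all.equal(r, cor(x, y))                        # TRUE: the 1/n factors cancel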

Prehistory: (i) Gauss (1823): the normal surface with n correlated variables; for Gauss the correlation was just one of several parameters. (ii) Auguste Bravais (1846) referred to one of the parameters of the bivariate normal distribution as « une corrélation », but like Gauss he did not recognize the importance of correlation as a measure of dependence between variables. [Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie royale des sciences de l'Institut de France, 9, 255–332.] (iii) Francis Galton (1885–1888). (iv) Karl Pearson (1895): the product-moment r.

Pearson, "LIII. On lines and planes of closest fit to systems of points in space", Philosophical Magazine, Series 6, 1901 (cited about 1,700 times).

Pearson had no unpublished thoughts

Why do we (NOT) like Pearson’s correlation? What is the remedy?

Page 5: Distance Correlation  E-Statistics

Apples and Oranges

If we want to study the dependence between oranges and apples, it is hard to add or multiply them, but it is always easy to do the same with their distances.

Page 6: Distance Correlation  E-Statistics

ak,l := |Xk − Xl|, bk,l := |Yk − Yl|, for k, l = 1, 2, …, n

Ak,l := ak,l − ak· − a·l + a··, Bk,l := bk,l − bk· − b·l + b·· (row mean, column mean, grand mean of the distance matrix)

dCov²(X,Y) := V²(X,Y) := (1/n²) Σk,l Ak,l Bk,l

≥ 0 (!?!); see Székely, Rizzo and Bakirov (2007), Ann. Statist. 35/6.

Page 7: Distance Correlation  E-Statistics

Distance covariance: V²(X,Y) := (1/n²) Σk,l Ak,l Bk,l

Distance standard deviation: V(X) := V(X,X), V(Y) := V(Y,Y)

Distance correlation: dCor²(X,Y) := R²(X,Y) := V²(X,Y) / [V(X) V(Y)] (set to 0 when V(X) V(Y) = 0)

This should be introduced in our teaching at the undergraduate level.
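A short R sketch of these definitions, with the double centering done by hand (the energy package's dcov/dcor compute the same quantities):

dcenter <- function(d)                    # A_{k,l} = a_{k,l} - a_k. - a.l + a..
  d - rowMeans(d) - rep(colMeans(d), each = nrow(d)) + mean(d)
dcov2 <- function(x, y) {                 # V^2(X,Y) = (1/n^2) sum A_{k,l} B_{k,l}
  A <- dcenter(as.matrix(dist(x)))
  B <- dcenter(as.matrix(dist(y)))
  mean(A * B)
}
dcor2 <- function(x, y)                   # R^2 = V^2(X,Y) / (V(X) V(Y))
  dcov2(x, y) / sqrt(dcov2(x, x) * dcov2(y, y))
set.seed(2)
x <- rnorm(100); y <- x^2 + rnorm(100)    # dependent but nearly uncorrelated
c(pearson = cor(x, y), dcor = sqrt(dcor2(x, y)))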

Page 8: Distance Correlation  E-Statistics

The population values are a.s. limits of the empirical ones as n → ∞.

Thm: dCov²n = ||fn(s,t) − fn(s) fn(t)||², where fn(s,t) is the empirical joint characteristic function, fn(s), fn(t) are the empirical marginal characteristic functions, and ||·|| is the L²-norm with the singular kernel

w(s,t) := c/(st)².

This kernel is unique if we require the invariance dCov²(a1 + b1 O1 X, a2 + b2 O2 Y) = b1 b2 dCov²(X,Y).

Page 9: Distance Correlation  E-Statistics

A beautiful theorem on Fourier transforms

∫ (1 − cos tx)/t² dt = c|x|

The Fourier transform of any power of |t| is a constant times a power of |x|.

Gel'fand, I. M. and Shilov, G. E. (1958, 1964), Generalized Functions

Page 10: Distance Correlation  E-Statistics

Thm.

V(X) = 0 iff X is constant

V(a + bCX) = |b| V(X) (C orthonormal)

V(X+Y) ≤ V(X) + V(Y) for independent rvs, with equality iff X or Y is constant

Page 11: Distance Correlation  E-Statistics

0 ≤ dCor(X,Y) ≤ 1

dCor(X,Y) = 0 iff X, Y are independent

dCor(X,Y) = 1 iff Y = a + bCX for some scalars a, b and orthonormal C

Page 12: Distance Correlation  E-Statistics

ak,l := |Xk − Xl|^α, bk,l := |Yk − Yl|^α

These define Rα for 0 < α < 2 [R1 = R].

R2(X,Y) = |Cor(X,Y)| = |Pearson's correlation|

See the energy R package: E-statistics (energy statistics), R package version 1.1-0.

Page 13: Distance Correlation  E-Statistics

Thm. Under independence of X and Y,

n dCov²n(X,Y) → Q = Σk λk Z²k in distribution (Zk iid standard normal);

otherwise the limit is ∞.

Thus we have a universally consistent test of independence.
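In practice the critical value of Q is replaced by a permutation test. A sketch, reusing dcov2() from the earlier slide (energy::dcov.test packages the same test):

perm_test <- function(x, y, R = 199) {          # statistic: n * dCov_n^2
  n    <- length(x)
  stat <- n * dcov2(x, y)
  null <- replicate(R, n * dcov2(x, sample(y))) # permuting y breaks dependence
  mean(c(null, stat) >= stat)                   # permutation p-value
}
set.seed(3)
x <- rnorm(50); y <- x^2 + rnorm(50)
perm_test(x, y)                                 # small p-value: reject independence
# library(energy); dcov.test(x, y, R = 199)     # packaged equivalent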

Page 14: Distance Correlation  E-Statistics

What if (X,Y) is bivariate normal?

In this case

0.89 |cor| ≤ dCor ≤ |cor|

Page 15: Distance Correlation  E-Statistics

Unbiased Distance Correlation

Page 16: Distance Correlation  E-Statistics

Unbiased distance correlation

The unbiased estimator of dCov²(X,Y) is

dCov*n := ⟨A*, B*⟩ := (1/[n(n−3)]) Σi≠j A*ij B*ij, where A*, B* are the U-centered distance matrices.

This is an inner product on the linear space Hn of n×n matrices generated by n×n distance matrices. The population Hilbert space is denoted by H, where the inner product is (generated by) dCov*(X,Y).
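A sketch of the U-centering behind this estimator, assuming the standard form from the partial distance correlation paper (recent versions of the energy package expose it as dcovU):

ucenter <- function(d) {            # U-centered distance matrix A*
  n <- nrow(d)
  u <- d - rowSums(d)/(n - 2) - rep(colSums(d)/(n - 2), each = n) +
       sum(d)/((n - 1) * (n - 2))
  diag(u) <- 0                      # the diagonal is defined to be 0
  u
}
dcov2U <- function(x, y) {          # <A*, B*> = (1/[n(n-3)]) sum_{i!=j} A*_ij B*_ij
  A <- ucenter(as.matrix(dist(x)))
  B <- ucenter(as.matrix(dist(y)))
  sum(A * B) / (nrow(A) * (nrow(A) - 3))
}

Unlike the V-statistic version, this inner product can be slightly negative for independent samples; that is the price of unbiasedness.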

Page 17: Distance Correlation  E-Statistics

The power of the dCor test for independence is very good, especially in high dimensions p, q.

Denote the unbiased version by dCov*n. The corresponding bias-corrected distance correlation is R*n.

This is the correlation for the 21st century.

Theorem. In high dimension, if the CLT holds for the coordinates, then

Tn := (M−1)^(1/2) R*n / [1 − (R*n)²]^(1/2), where M = n(n−3)/2,

is t-distributed with d.f. M−1.
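A sketch of this t-test, built from dcov2U() on the previous slide (the energy package provides it as dcor.ttest, renamed dcorT.test in newer releases):

dcorT <- function(x, y) {
  Rs <- dcov2U(x, y) / sqrt(dcov2U(x, x) * dcov2U(y, y))  # bias-corrected R*_n
  M  <- NROW(x) * (NROW(x) - 3) / 2
  Tn <- sqrt(M - 1) * Rs / sqrt(1 - Rs^2)                 # T_n from the theorem
  c(T = Tn, p.value = pt(Tn, df = M - 1, lower.tail = FALSE))
}
set.seed(4)
x <- matrix(rnorm(50 * 30), 50)   # n = 50 observations, dimension p = 30
y <- matrix(rnorm(50 * 30), 50)   # independent of x, dimension q = 30
dcorT(x, y)                       # under independence, T is roughly t with M-1 d.f.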

Page 18: Distance Correlation  E-Statistics

Why?

R*n = Σij Uij Vij / [Σ Uij² · Σ Vij²]^(1/2) with iid standard normal variables.

Put Zij = Uij / [Σ Uij²]^(1/2); then Σ Zij² = 1.

Under the null (independence of Uij and Vij), Zij does not depend on Vij.

Given Z, by Cochran's thm (the quadratic form Σij Zij Vij has rank 1), Tn is t-distributed; since this holds for every given Z, it holds even without conditioning.

Page 19: Distance Correlation  E-Statistics

Under the alternative?

We need to show that if U, V are standard normal with zero expected value and correlation ρ > 0, then P(UV > c) is a monotone increasing function of ρ.

For the proof notice that if X, Y are iid standard normal and a² + b² = 1, 2ab = ρ, then for

U := aX + bY and V := bX + aY

we have Var(U) = Var(V) = 1 and E(UV) = 2ab = ρ. Thus

UV = ab(X² + Y²) + (a² + b²)XY = ρ(X² + Y²)/2 + XY. Q.E.D.

(I do not need it, but I do not know what happens if the expectations are not zero.)

Page 20: Distance Correlation  E-Statistics

The number of operations is O(n²), independently of the dimension, which can even be infinite (X and Y can live in two different metric spaces, e.g. Hilbert spaces).

The storage complexity can be reduced to O(n) via a recursive formula.

Parallel processing for big n?

Page 21: Distance Correlation  E-Statistics

A characteristic measure of dependence (population value):

dCov²(X,Y) = E|X−X′||Y−Y′| + E|X−X′| E|Y−Y′| − 2 E|X−X′||Y−Y″|,

where (X,Y), (X′,Y′), (X″,Y″) are iid copies.
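This formula can be checked by plain Monte Carlo (a sketch; the joint law of (X,Y) below is only an example):

set.seed(5)
N <- 1e5
x  <- rnorm(N); y  <- x + rnorm(N)     # a draw of (X, Y)
x1 <- rnorm(N); y1 <- x1 + rnorm(N)    # iid copy (X', Y')
x2 <- rnorm(N); y2 <- x2 + rnorm(N)    # iid copy (X'', Y'')
mean(abs(x - x1) * abs(y - y1)) +
  mean(abs(x - x1)) * mean(abs(y - y1)) -
  2 * mean(abs(x - x1) * abs(y - y2))  # Monte-Carlo estimate of dCov^2(X,Y)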

Page 22: Distance Correlation  E-Statistics

dCov = cov of distances?

(X,Y), (X′,Y′), (X″,Y″) are iid

dCov²(X,Y) = E[|X−X′||Y−Y′|] + E|X−X′| E|Y−Y′| − E[|X−X′||Y−Y″|] − E[|X−X″||Y−Y′|]
= cov(|X−X′|, |Y−Y′|) − 2 cov(|X−X′|, |Y−Y″|)

(i) Does cov(|X−X′|, |Y−Y′|) = 0 imply that X and Y are independent?

(ii) Does the independence of |X−X′| and |Y−Y′| imply the independence of X and Y?

Counterexample for (i): q(x) = −c/2 for −1 < x < 0, 1/2 for 0 < x < c, 0 otherwise; p(x,y) := 1/4 − q(x)q(y).

Page 23: Distance Correlation  E-Statistics

Max correlation?

sup over f, g of Cor(f(X), g(Y)), for all Borel functions f, g with 0 < Var f(X), Var g(Y) < ∞.

Why should we (not) like maxcor?

If maxcor(X, Y) = 0 then X, Y are independent.

For the bivariate normal, maxcor = |cor|.

For partial sums of iid random variables, maxcor²(Sm, Sn) = m/n for m ≤ n. Sarmanov (1958), Dokl. Akad. Nauk SSSR.

What is wrong with maxcor?

Page 24: Distance Correlation  E-Statistics

What is the meaning of max cor = 1?

Page 25: Distance Correlation  E-Statistics

Trigonometric coins

Sn := sin U + sin 2U + … + sin nU

tends in distribution to a Cauchy law (note that we did not divide by √n!).
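A quick simulation of this "trigonometric coin" sum, illustrative only; it shows the heavy, Cauchy-like tails that survive without any √n normalization:

set.seed(6)
U <- runif(20000, 0, 2 * pi)
S <- vapply(U, function(u) sum(sin((1:1000) * u)), numeric(1))  # S_1000
quantile(S, c(0.05, 0.25, 0.5, 0.75, 0.95))  # compare with a Cauchy scale family
mean(abs(S) > 20)                            # tail mass decays like c/x, not exponentially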

Open problem: What is the sup of dCor for uncorrelated X and Y? Can it be > 0.85?

Page 26: Distance Correlation  E-Statistics

Lecture 2. Energy Statistics (E-statistics)

Newton’s gravitational potential energy can be generalized for statistical applications.

Statistical observations are heavenly bodies (in a metric space) governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis holds.

Potential energy statistics are symmetric functions of distances between statistical observations in metric spaces.

EXAMPLE Testing Independence

Page 27: Distance Correlation  E-Statistics

Potential Energy Statistics

Potential energy statistics, or energy statistics (E-statistics) for short, are U-statistics or V-statistics that are functions of distances between sample elements.

The idea is to consider statistical observations as heavenly bodies governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis is true.

Page 28: Distance Correlation  E-Statistics

Distances and Energy: the next level of abstraction (Prelude)

In the beginning Man created integers. The accountants of Uruk in Mesopotamia, about five thousand years ago, invented the first numerals: signs encoding the concept of oneness, twoness, threeness, etc., abstracted from any particular entity. Before that, for about another 5000 years, jars of oil were counted with ovoids, measures of grain were counted with cones, etc.; numbers were indicated by one-to-one correspondence. Numerals revolutionized our civilization: they expressed abstract thoughts; after all, "two" does not exist in nature, only two fingers, two people, two sheep, two apples, two oranges. After this abstraction we could not tell from the numerals what the objects were; seeing the signs 1, 2, 3, … we could not see or smell oranges, apples, etc., but we could do comparisons, we could do "statistics", "statistical inference".

In this lecture, instead of working with statistical observations (data taking integer or real values, or taking values in Euclidean spaces, Hilbert spaces or more general metric spaces), we make inferences from their distances. Distances and angles work wonders in science (see e.g. Thales, 600 BC; G. J. Székely: Thales and the Ten Commandments). Here we will exploit this in statistics. Instead of working with numbers, vectors, functions, etc., we first compute their distances, and all our inferences are based on these distances. This is the next level of abstraction: not only can we not tell what the objects are, we cannot even tell how big they are; we can just tell how far they are from each other. At this level of abstraction we of course lose even more information; we cannot sense many properties of the data. For example, if we add the same constant to all data then their distances do not change: no rigid motion of the space changes the distances. On the other hand we gain a lot: distances are always easy to add, multiply, etc., even when it is not so natural to add or multiply vectors and more abstract observations, especially if they are not from the same space.

The next level of abstraction is energy statistics: invariance with respect to ratios of distances, so that angles are invariant. Distance correlation depends on angles.

Page 29: Distance Correlation  E-Statistics

Goodness-of-fit

Page 30: Distance Correlation  E-Statistics

Dual space

Page 31: Distance Correlation  E-Statistics

Application in statistics

Construct a U (or V) statistic with kernel

h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|

Vn = (1/n²) Σi,k h(Xi, Xk)

Under the null, Eh(X,Y) = 0, and h is also a rank-1 degenerate kernel because Eh(x,Y′) = 0 a.s. under the null. Thus:

Under the null, the limit distribution of nVn is Q := Σk λk Z²k, where the λk are the eigenvalues of the Hilbert–Schmidt equation

∫ h(x,y) ψ(y) dF(y) = λψ(x),

and under the alternative (X and Y have different distributions),

nVn → ∞ a.s.

Page 32: Distance Correlation  E-Statistics

What to do with Hilbert-Schmidt?

∫ h(x,y) ψ(y) dF(y) = λψ(x), Q := Σk λk Z²k

(i) Approximate the eigenvalues: solve the n×n eigenproblem (1/n) Σi h(Xi, Xj) ψ = λψ (a sketch follows below).

(ii) If Σi λi = 1 then P(Q ≥ c) ≤ P(Z² ≥ c) whenever this probability is at most 0.215 (a conservative, consistent test).

(iii) t-test (see later)
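A sketch of step (i) for the normality kernel used a few slides below, with the exact null expectations E|y − Z| = √(2/π) e^(−y²/2) + 2yΦ(y) − y and E|Z − Z′| = 2/√π (both standard facts for the 1-dimensional standard normal):

EyZ <- function(y) sqrt(2/pi) * exp(-y^2/2) + 2 * y * pnorm(y) - y  # E|y - Z|
h   <- function(x, y) EyZ(x) + EyZ(y) - 2/sqrt(pi) - abs(x - y)     # the kernel
set.seed(7)
x  <- rnorm(200)                                 # sample from the null F
H  <- outer(x, x, h) / length(x)                 # the matrix (1/n) h(X_i, X_j)
ev <- eigen(H, symmetric = TRUE, only.values = TRUE)$values
head(ev)                                         # approximate lambda_k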

Page 33: Distance Correlation  E-Statistics

Simple vs Good

Eα(X,Y) := 2E|X−Y|^α − E|X−X′|^α − E|Y−Y′|^α ≥ 0 for 0 < α < 2,

= 0 iff X and Y are identically distributed.

For α = 2 we have

E2(X,Y) = 2[E(X) − E(Y)]².

In "classical statistics" (α = 2) life is simple but not always good.

In "energy statistics" (0 < α < 2) life is not so simple, but good.

Page 34: Distance Correlation  E-Statistics

Testing for (multivariate) normality

Standardized sample: y1, y2, …, yn

E|Z − Z′| = ? (For U(a,b): E|U − U′| = |b−a|/3; for the exponential: E|e − e′| = 1/λ.)

E|y − Z| = ? (For U(a,b): E|x − U| is a quadratic polynomial in x.)

(Hint: if Z is a d-variate standard normal, then |y−Z|² has a noncentral chi-square distribution with noncentrality parameter |y|²/2 and d.f. d + p, where p is a Poisson r.v. with mean |y|²/2; see Zacks (1981), p. 55.)

In 1 dimension: E|y − Z| = (2/π)^(1/2) exp{−y²/2} + 2y Φ(y) − y

For implementation see the energy package in R and Székely and Rizzo (2004).
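A univariate sketch using these exact expectations (energy::mvnorm.etest is the packaged multivariate test):

EyZ <- function(y) sqrt(2/pi) * exp(-y^2/2) + 2 * y * pnorm(y) - y  # E|y - Z|
nE  <- function(x) {                      # n * E_n: the energy statistic
  y <- (x - mean(x)) / sd(x)              # standardized sample
  n <- length(y)
  n * (2 * mean(EyZ(y)) - 2/sqrt(pi) - mean(as.matrix(dist(y))))
}
set.seed(8)
nE(rnorm(100))                            # small under normality
nE(rexp(100))                             # large under a skewed alternative
# library(energy); mvnorm.etest(rnorm(100), R = 199)   # packaged version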

Page 35: Distance Correlation  E-Statistics

Why is energy a very good test for normality?

1. It is affine invariant

2. Consistent against general alternatives

3. Powerful omnibus test

In the univariate case our energy test is "almost" the same as the Anderson–Darling EDF test based on

∫ (Fn(x) − F(x))² dF(x) / [F(x)(1 − F(x))].

But here dF(x)/[F(x)(1 − F(x))] is close to constant for the standard normal F, and thus almost the same as "energy"; hence our energy test is essentially a multivariate extension of the powerful Anderson–Darling test.

Page 36: Distance Correlation  E-Statistics

Distance skewness

Advantages:

Skew(X) := E[((X − E(X))/σ)³] = 0 does NOT characterize symmetry, but distance skewness

dSkew(X) := 1 − E|X−X′| / E|X+X′| = 0

iff X is centrally symmetric.

Sample version: 1 − Σi,k |Xi − Xk| / Σi,k |Xi + Xk|
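A two-line R sketch of this sample statistic (data simulated for illustration):

dskew <- function(x)                  # 1 - sum|Xi - Xk| / sum|Xi + Xk|
  1 - sum(abs(outer(x, x, "-"))) / sum(abs(outer(x, x, "+")))
set.seed(9)
dskew(rnorm(500))        # ~ 0: distribution symmetric about 0
dskew(rexp(500) - 1)     # > 0: centered at 0 but not symmetric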

Page 37: Distance Correlation  E-Statistics

DISCO: a nonparametric extension of ANOVA

DISCO is a multi-sample test of equal distributions, generalizing the hypothesis of equal means, which is ANOVA.

Put A = (X1, X2, …, Xn), B = (Y1, Y2, …, Ym), and d(A,B) := (1/(nm)) Σi,k |Xi − Yk|.

Within-sample dispersion: W := Σj (nj/2) d(Aj, Aj)

Put N = n1 + n2 + … + nK and A := {A1, A2, …, AK}.

Total dispersion: T := (N/2) d(A,A)

Thm. T = B + W,

where B, the between-sample dispersion, is the energy distance, i.e. a weighted sum of

E(Aj, Ak) = 2 d(Aj, Ak) − d(Aj, Aj) − d(Ak, Ak).

The same decomposition with exponent α = 2 in d(A,B) is ANOVA. (A numerical check follows below.)
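Checking T = B + W numerically for two samples; the two-group weight n1·n2/(2N) on E(A1, A2) follows from the definitions above (energy::disco gives the full multi-sample test):

dbar <- function(a, b) mean(abs(outer(a, b, "-")))  # d(A,B): mean cross-distance
set.seed(10)
a <- rnorm(30); b <- rnorm(40, mean = 1)
N <- length(a) + length(b)
Tot <- (N / 2) * dbar(c(a, b), c(a, b))                            # total T
Wit <- (length(a) / 2) * dbar(a, a) + (length(b) / 2) * dbar(b, b) # within W
Edist <- 2 * dbar(a, b) - dbar(a, a) - dbar(b, b)                  # energy distance
Btw <- (length(a) * length(b) / (2 * N)) * Edist                   # between B
all.equal(Tot, Btw + Wit)                                          # TRUE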

Page 38: Distance Correlation  E-Statistics

E-clustering

Hierarchical clustering: we merge the clusters with minimum energy distance, updating by

E(Ci ∪ Cj, Ck) = [(ni+nk) E(Ci,Ck) + (nj+nk) E(Cj,Ck) − nk E(Ci,Cj)] / (ni+nj+nk).

In E-clustering not only the cluster centers matter but the whole cluster distributions. If the exponent in d is α = 2, we get Ward's minimum variance method, a geometric method that separates and identifies clusters by their centers. Thus Ward is not consistent, but E-clustering is consistent. The ability of E-clustering to separate and identify clusters with equal or nearly equal centers has important practical applications.

For details see Székely and Rizzo (2005), Hierarchical clustering via joint between-within distances, Journal of Classification, 22(2), 151–183.
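A usage sketch, assuming the energy package's energy.hclust (which consumes a dist object), side by side with Ward on toy data with equal centers and unequal spread:

library(energy)
set.seed(11)
x <- c(rnorm(50, 0, 0.3), rnorm(50, 0, 3))   # two groups, same center, different scale
d <- dist(x)
e_cl <- cutree(energy.hclust(d), k = 2)      # E-clustering
w_cl <- cutree(hclust(d, method = "ward.D2"), k = 2)
table(truth = rep(1:2, each = 50), energy = e_cl)
table(truth = rep(1:2, each = 50), ward = w_cl)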

Page 39: Distance Correlation  E-Statistics

Under the null the limit distribution of nVn is Q := Σk λk Z²k, where the λk are eigenvalues of the Hilbert–Schmidt equation

∫ h(x,y) ψ(y) dF(y) = λψ(x), with h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|.

Differentiate twice:

−ψ″/(2f) = Eψ, with boundary conditions ψ′(a) = ψ′(b) = 0

(the second derivative w.r.t. x of (1/2)|x−y| is −δ(x−y), where δ is the Dirac delta).

Thus in 1 dimension E = 1/λ.

Thus we transformed the potential energy (Hilbert–Schmidt) equation into a kinetic energy (Schrödinger) equation.

Schrödinger equation (1926): −ψ″(x)/(2m) + V(x)ψ(x) = Eψ(x)

Energy conservation law?

Kinetic Energy (E)

Page 40: Distance Correlation  E-Statistics

My Erlangen Program in Statistics

Klein, Felix (1872), "A comparative review of recent researches in geometry": a classification of geometries via invariances (Euclidean, similarity, affine, projective, …). Klein was then at Erlangen.

Energy statistics are always rigid-motion invariant; their ratios, e.g. dCor, are also invariant with respect to scaling (angles remain invariant, as in Thales's geometry of similarities).

Can we have more invariance? In the univariate case we have monotone-invariant rank statistics. But in the multivariate case, if a statistic is invariant under all 1-1 affine/projective transformations and continuous, then it is constant. (A projection is affine but not 1-1; still, by continuity the statistics are invariant under all projections onto (coordinate) lines, hence constant.)

Page 41: Distance Correlation  E-Statistics

Affine invariant energy statistics

They cannot be continuous, but for testing normality affine invariance is natural (it is not natural for testing independence, because it changes angles).

BUT dCor = 0 is invariant with respect to all 1-1 Borel functions, and maxcor is also invariant with respect to all 1-1 Borel functions; these are population values, however.

Maximal correlation is too invariant. Why? Max correlation can easily be 1 for uncorrelated rvs, but the max of dCor for uncorrelated variables is < 0.85.

Page 42: Distance Correlation  E-Statistics

Unsolved dCor problems

• Using subsampling, construct a confidence interval for dCor². Why (not) bootstrap?

• Define the complexity of a function f via dCor(X, f(X)).

• Find sup dCor(X,Y) for uncorrelated X and Y.

Page 43: Distance Correlation  E-Statistics

Energy and U, V

Page 44: Distance Correlation  E-Statistics

Lecture 3. Brownian Correlation / Covariance

X_id := X − E(X) = id(X) − E(id(X) | id(·))

Cov²(X,Y) = E(X_id X′_id Y_id′ Y′_id′)

X_W := W(X) − E(W(X) | W(·))

Cov²W(X,Y) := E(X_W X′_W Y_W′ Y′_W′)

Remark: Cov_id(X,Y) = |Cov(X,Y)|

Theorem: dCov(X,Y) = Cov_W(X,Y) (!!)

Székely (2009), Ann. Appl. Statist. 3/4, discussion paper. What if Brownian motion is replaced by another stochastic process? What matters is the (positive definite) covariance function of the process.

Page 45: Distance Correlation  E-Statistics

Why Brownian?

We can replace BM by any two stochastic processes U = U(t) and V = V(t):

Cov²U,V(X,Y) := E(X_U X′_U Y_V Y′_V)

But why is this generalization good? How to compute it, how to apply it?

The covariance function of BM is

2 min(s,t) = |s| + |t| − |s−t|.

Page 46: Distance Correlation  E-Statistics

Fractional BM

The simplest extension is

|s|^α + |t|^α − |s−t|^α,

and a zero-mean Gaussian process with this covariance is the fractional BM, defined for 0 < α < 2. This process was mentioned for the first time in Kolmogorov (1940). Here α = 2H, where H is the Hurst exponent; the fractal dimension is D = 2 − H. H describes the raggedness of the resulting motion, with a higher value leading to a smoother motion. The value of H determines what kind of process the fBm is:

• if H = 1/2, the process is in fact a Brownian motion (Wiener process);
• if H > 1/2, the increments of the process are positively correlated;
• if H < 1/2, the increments of the process are negatively correlated.

The increment process, X(t) = BH(t+1) − BH(t), is known as fractional Gaussian noise.

Page 47: Distance Correlation  E-Statistics

Variogram

What properties of the (fractional) BM do we need to make sure that the covariance with respect to certain stochastic processes is of "energy" type, i.e. depends only on the distances of the observations? In spatial statistics the variogram 2γ(s,t) of a random field Z(t) is

2γ(s,t) := Var(Z(s) − Z(t)).

Suppose E(Z(t)) = 0. For stationary processes γ(s,t) = γ(s−t), and for stationary isotropic ones γ(s,t) = γ(|s−t|).

A function is a variogram of a zero-expected-value process/field iff it is conditionally negative definite (see later). If the covariance function C(s,t) of the process exists, then

2C(s,t) = 2E[Z(s)Z(t)] = E Z(s)² + E Z(t)² − E[Z(t) − Z(s)]² = 2γ(s,0) + 2γ(t,0) − 2γ(s−t).

For BM we had 2 min(s,t) = |s| + |t| − |s−t|.

We also have the converse: 2γ(s−t) = C(s,s) + C(t,t) − 2C(s,t).

Cov²U,V(X,Y) is of "energy type" if the increments of U, V are stationary and isotropic.

Page 48: Distance Correlation  E-Statistics

Special Gaussian processes

The negative log of the symmetric Laplace ch.f. is γ(t) := log(1 + t²); this defines a Laplace–Gaussian process with the corresponding C(s,t), because this γ is conditionally negative definite.

The negative log of the ch.f. of the difference of two iid Poisson variables is γ(t) := 1 − cos t (up to a constant factor). This defines a Poisson–Gaussian process.

Page 49: Distance Correlation  E-Statistics

Correlation wrt stochastic processes

When only the covariance matters, we can assume the processes are Gaussian.

Why do we need this generalization?

Conjecture. We need this generalization if the observations (Xt, Yt) are not iid but stationary ergodic. Then consider correlation with respect to zero-mean (Gaussian) processes with stationary increments having the same covariance as (Xt, Yt)?

Page 50: Distance Correlation  E-Statistics

A property of squared distances

What exactly are the properties of distances in Euclidean spaces (and Hilbert spaces) that we need for statistical inference? We need the following property of squared distances |x−y|² in Hilbert spaces.

Let H be a real Hilbert space and xi ∈ H. If ai ∈ R and Σi ai = 0, then

Σij ai aj |xi − xj|² = −2 |Σi ai xi|² ≤ 0.

Thus if yi ∈ H, i = 1, 2, …, n, is another set of elements of H, then

Σij ai aj |xi − yj|² = −2 Σij ai aj ⟨xi, yj⟩ ≤ −|Σi ai (xi + yi)|² ≤ 0.

This is what we call the (conditional) negative definite property of |x−y|².

Page 51: Distance Correlation  E-Statistics

Negative definite functions

Let the data come from an arbitrary set S. A function h(x,y): S×S → R is negative definite if

h(x,y) = h(y,x) (symmetric), h(x,x) = 0,

and for all real numbers ai with Σi ai = 0,

Σij ai aj h(xi, xj) ≤ 0. (*)

The function h is strongly negative definite if (*) is true and equality in (*) holds iff all ai = 0.

Theorem (I. J. Schoenberg, 1938). A metric space (S,d) embeds in a Hilbert space iff h = d² is negative definite.

Page 52: Distance Correlation  E-Statistics

Further examples

h(x,y) := |x−y|^α is negative definite if 0 < α ≤ 2, strictly negative definite if 0 < α < 2.

This is equivalent to the claim that exp{−|t|^α} is a characteristic function (of a symmetric stable distribution).

Classical statistics was built on α = 2. This makes classical formulae simpler, but because "strictness" does not hold there, the corresponding "quadratic theorems" apply only to "quadratic type" distributions, e.g. Gaussian distributions, whose densities are exp{quadratic polynomial}. See also least squares.

For α = 2 life is simple (~ multivariate normal) but not always good; for 0 < α < 2 life is not so simple, but good (nonparametric).

My "energy" inferences are based on strictly negative definite kernels.

Page 53: Distance Correlation  E-Statistics

Why do we need negative definite functions?

Let pi and qi be two probability distributions on the points xi, yi, respectively. Let X, Y be independent rvs: P(X = xi) = pi, P(Y = yi) = qi. Then the strong negative definiteness of h(x,y) implies that for ai = pi − qi,

Σij (pi − qi)(pj − qj) |xi − yj| ≤ 0,

i.e., if E denotes the expectation of a random variable, then the potential energy of (X,Y),

E(X,Y) := E|X−Y| + E|X′−Y′| − E|X−X′| − E|Y−Y′| ≥ 0, (*)

where X′ and Y′ are iid copies of X and Y, respectively. Strong negative definiteness implies that equality holds iff X and Y are identically distributed. What this means is that the double-centered expected distance between X and Y, i.e. the potential energy of (X,Y), is always nonnegative and equals zero iff X and Y are identically distributed.

Page 54: Distance Correlation  E-Statistics

High school example

n "red" cities xi lie on the two sides of a line L (a river): k of them on the left side, n−k on the right. Similarly, m of the n "green" cities yi are on the left side of the same line, n−m on the right. We connect two cities iff they are on different sides of the river: red-red pairs with red edges, green-green pairs with green edges, and mixed pairs with blue edges.

Claim: 2·#blue − #red − #green ≥ 0, with equality iff k = m.

Hint: k(n−m) + m(n−k) − k(n−k) − m(n−m) = (k−m)² ≥ 0.

Combine this with M. W. Crofton's (1868) integral-geometry formula on random lines to get Energy.

Page 55: Distance Correlation  E-Statistics

Newton’s potential energy – Statistical potential energy

Newton's potential energy in our 3-dimensional space is proportional to the reciprocal of the distance: if r := |x−y| denotes the distance of the points x, y, then the potential energy is proportional to 1/r. The mathematical significance of this function is that it is harmonic, i.e. 1/r is the fundamental solution of the Laplace equation. In 1 dimension, r itself is harmonic.

For statistical applications what is relevant is that r^α is strictly negative definite iff 0 < α < 2. The statistical potential energy is the double-centered version of E|X−Y|^α for 0 < α < 2:

E(X,Y) := 2E|X−Y|^α − E|X−X′|^α − E|Y−Y′|^α ≥ 0 for 0 < α < 2.

Page 56: Distance Correlation  E-Statistics

SEP

Suppose for simplicity that the kernel of a V-statistic has two arguments: h = h(x1, x2). This is the situation if we want to check that X has a given distribution. But what if the sample is SEP? Stationary = ? Ergodic = ?

Even in this case the SLLN holds, and thus

(1/n²) Σi,j h(Xi, Xj) → Eh(X, X′) a.s.,

so we have strongly consistent estimators.

If h has rank 1 and the sample is iid, then the limit distribution of (1/n) Σi,j h(Xi, Xj) is Q = Σk λk Zk², where the λk are eigenvalues of the Hilbert–Schmidt operator ∫ h(x1, x2) Ψ(x2) dF(x2) = λΨ(x1).

We know that in general this is not true for SEP, e.g. if h = x1 x2.

We can still compute the eigenvalues μk, k = 1, 2, …, n, of the random operator (the n×n random matrix)

(1/n) [h(Xi, Xj); i, j = 1, 2, …, n]

and consider the corresponding Gaussian quadratic form Q = Σk=1..n μk Zk².

Can the critical values for the corresponding null hypothesis be computed from Q if a CLT holds, e.g. if we have a martingale-difference structure, or mixing/weak dependence, or dCor → 0 (we need to approach the Gaussian distribution)?

What to do with kernels like h(x1, x2, x3, x4), etc., and how to test independence from SEP?

Page 57: Distance Correlation  E-Statistics

Testing independence of ergodic processes

If we have strongly stationary ergodic sequences, then by the SLLN for V-statistics the empirical dCor converges a.s. to the population dCor, and this limit is a.s. constant. So we have a consistent estimator of the population dCor. But how can we test whether dCor = 0, i.e. whether the X process is independent of the Y process?

Permutation tests won't work. Limit theorems for Q depend on the dependence structure, so this is complicated. How about the t-test? For that we need a kind of CLT.

Page 58: Distance Correlation  E-Statistics

What is the question?

Do we want to test whether Xt is independent of Yt, or whether the X sequence is independent of the Y sequence?

Example. Let Xt, t = 1, 2, …, be iid and Yt = Xt+1. Then Xt is independent of Yt, but the Y sequence is a shift of the X sequence, so the sequences are not independent. We can now test whether (Xt, Xt+1) is independent of (Yt, Yt+1), etc., using a permutation test.

Null: the X process is independent of the Y process.

Test whether p-tuples of consecutive observations with random starting points are independent, e.g. with p = √n.

Page 59: Distance Correlation  E-Statistics

Proof of a conjecture of Ibragimov-Linnik

How to avoid the mixing condition in the CLT?

Thm. Let Xn, n = 0, ±1, ±2, …, be a strictly stationary sequence with EXn = 0 and Sn = X1 + … + Xn. Suppose

(i) sn := [Var(Sn)]^(1/2) = n f(n), where f(n) is a slowly varying function,

(ii) CorW(Sm/sm, (Sr+m − Sm)/sm) → 0 as m, r → ∞ and r/m → 0, and

(iii) (Sm/sm)² is uniformly integrable.

Then the CLT holds. (Bakirov, N. K. and Székely, G. J., Brownian covariance and central limit theorem for stationary sequences, Theory of Probability and Its Applications, 55(3), 371–394, 2011.)

Page 60: Distance Correlation  E-Statistics

ER

ER in our case has two meanings: Emergency Room, and Energy in R, i.e. energy programs in the program package R.

Classical emergency toolkits of statisticians contain things like the t-test, the F-test, ANOVA, tests of independence, Pearson's correlation, etc. Most of them are based on the assumption that the underlying distribution is Gaussian. Our first-aid toolkit is a collection of programs based on the notion of energy; they do not assume that the underlying distribution is Gaussian.

ANOVA is replaced by DISCO, Ward's hierarchical clustering is replaced by energy clustering, Pearson's correlation by distance correlation, etc. We can also test whether a distribution is (multivariate) Gaussian. We suggest statisticians use our energy package in R, ER, as first aid for analyzing data in the Emergency Room. ER of course cannot replace further scrutiny by specialists.

Page 61: Distance Correlation  E-Statistics

References

Székely, G. J. (1985–2005) Technical reports on energy (E-)statistics and on distance correlation; potential and kinetic energy in statistics.

Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances. Ann. Statist. 35/6, 2769–2794.

Székely, G. J. and Rizzo, M. L. (2009) Brownian distance covariance (discussion paper). Ann. Appl. Statist. 3/4, 1236–1265.

Lyons, R. (2013) Distance covariance in metric spaces. Ann. Probab. 41/5, 3284–3305.

Székely, G. J. and Rizzo, M. L. (2013) Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference (invited paper).