
Page 1

Kernel Methods for Dependence and Causality

Kenji Fukumizu

Institute of Statistical Mathematics, Tokyo
Max-Planck Institute for Biological Cybernetics

http://www.ism.ac.jp/~fukumizu/

Machine Learning Summer School 2007, August 20–31, Tübingen, Germany

Page 2

Overview

Page 3

Outline of This Lecture

Kernel methodology of inference on probabilities

I. Introduction

II. Dependence with kernels

III. Covariance on RKHS

IV. Representing a probability

V. Statistical test

VI. Conditional independence

VII. Causal inference

Page 4

I. Introduction

Page 5

Dependence

Correlation
– The most elementary and popular indicator to measure the linear relation between two variables.

Correlation coefficient (aka Pearson correlation):
$$\rho_{XY} = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E\big[(X-E[X])(Y-E[Y])\big]}{\sqrt{E\big[(X-E[X])^2\big]\,E\big[(Y-E[Y])^2\big]}}$$

[Figure: two scatter plots of (X, Y) with ρ = 0.94 and ρ = 0.55.]

Page 6

Nonlinear dependence

[Figure: two scatter plots of nonlinearly dependent data.
One dataset: Corr(X, Y) = −0.06, Corr(X², Y) = 0.09, Corr(X³, Y) = −0.38, Corr(sin(πX), Y) = 0.93.
The other: Corr(X, Y) = 0.17, Corr(X², Y) = 0.96.]

Page 7

“Uncorrelated” does not mean “independent”

[Figure: three scatter plots of (X1, Y1), (X2, Y2), (X3, Y3) — one pair dependent, two independent.]
They are all uncorrelated!

Note: if $Z = \binom{X}{Y}$, $\tilde Z = \binom{\tilde X}{\tilde Y} = AZ$, and $V_{ZZ} = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, then
$$V_{\tilde Z\tilde Z} = E\big[(AZ - E[AZ])(AZ - E[AZ])^T\big] = A V_{ZZ} A^T = AA^T,$$
so any orthogonal (rotation) matrix A keeps $\tilde X$ and $\tilde Y$ uncorrelated.

Page 8

Nonlinear statistics with kernels
– Linear methods can capture only linear relations.
– A nonlinear transform of the original variable may help: X ↦ (X, X², X³, …).

But,
• It is not clear how to make a good transform, in particular if the data are high-dimensional.
• A transform may cause high dimensionality.
e.g.) dim X = 100: the quadratic features XᵢXⱼ (i < j) already give 4950 combinations.

Why not use the kernelization / feature map for the transform?

Page 9

Kernel methodology for statistical inference
– Transform the original data by the “feature map”.
– Is this simply “kernelization”? Yes, in the big picture. But in this methodology the methods have a clear statistical/probabilistic meaning in the original space, e.g. independence, conditional independence, two-sample tests, etc.
– From the statistics side, it is a new approach using positive definite kernels.

[Diagram: feature map Φ from the space of original data Ω into the RKHS (functional space) H_k; x_i ↦ Φ(x_i) = φ_{x_i}.]

Let’s do linear statistics in the feature space!

Goal: To understand how linear methods in RKHS solve classical inference problems on probabilities.

Page 10

Remarks on Terminology

– In this lecture, “kernel” means “positive definite kernel”.
– In statistics, “kernel” is traditionally used in a more general meaning, which does not impose positive definiteness.
e.g. kernel density estimation (Parzen window approach):
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^N k(x, X_i),$$
where k(x₁, x₂) is not necessarily positive definite.

– Statistical jargon
• “in population”: evaluated with the probability, e.g. $E[X] = \int x\,dP(x)$.
• “empirical”: evaluated with the sample, e.g. $\frac{1}{N}\sum_{i=1}^N X_i$.
• “asymptotic”: when the number of data goes to infinity, e.g. “$\frac{1}{N}\sum_{i=1}^N X_i$ asymptotically converges to $E[X]$.”

Page 11

II. Dependence with Kernels

Prologue to kernel methodology for inference on probabilities

Page 12

Independence of Variables

Definition
– Random vectors X on R^m and Y on R^n are independent (X ⫫ Y) if, by definition,
$$\Pr(X\in A,\ Y\in B) = \Pr(X\in A)\,\Pr(Y\in B) \quad \text{for any } A\in\mathcal{B}_m,\ B\in\mathcal{B}_n.$$

Basic properties
– If X and Y are independent,
$$E[f(X)g(Y)] = E[f(X)]\,E[g(Y)].$$
– If further (X, Y) has the joint p.d.f. $p_{XY}(x,y)$, and X and Y have the marginal p.d.f.s $p_X(x)$ and $p_Y(y)$, resp., then
$$X \perp\!\!\!\perp Y \ \Leftrightarrow\ p_{XY}(x,y) = p_X(x)\,p_Y(y).$$

Page 13

Review: Covariance Matrix

Covariance matrix
$X = (X_1,\ldots,X_m)^T$, $Y = (Y_1,\ldots,Y_n)^T$: m- and n-dimensional random vectors.
The covariance matrix V_XY of X and Y is defined by
$$V_{XY} \equiv E\big[(X - E[X])(Y - E[Y])^T\big] = E[XY^T] - E[X]\,E[Y]^T \quad (m\times n\ \text{matrix}).$$
In particular,
$$V_{XX} \equiv E[XX^T] - E[X]\,E[X]^T.$$
– V_XY = 0 if and only if X and Y are uncorrelated.

For a sample $(X^{(1)}, Y^{(1)}),\ldots,(X^{(N)}, Y^{(N)})$, the empirical covariance matrix is
$$\hat V_{XY} = \frac{1}{N}\sum_{i=1}^N X^{(i)}Y^{(i)T} - \Big(\frac{1}{N}\sum_{i=1}^N X^{(i)}\Big)\Big(\frac{1}{N}\sum_{i=1}^N Y^{(i)}\Big)^T \quad (m\times n\ \text{matrix}).$$
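The empirical covariance matrix above translates directly into a few lines of numpy. The following is a minimal sketch (the function name cov_xy is ours, not from the lecture); np.cov gives the same entries up to the 1/N vs 1/(N−1) convention.

```python
import numpy as np

def cov_xy(x, y):
    """Empirical cross-covariance: x: (N, m), y: (N, n); returns the m x n matrix V_hat_XY."""
    xc = x - x.mean(axis=0)      # X - empirical mean
    yc = y - y.mean(axis=0)      # Y - empirical mean
    return xc.T @ yc / len(x)    # (1/N) * sum_i (X_i - mean)(Y_i - mean)^T
```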

Page 14

Independence of Gaussian Variables

Multivariate Gaussian (normal) distribution
$X = (X_1,\ldots,X_m) \sim N(\mu, V)$: m-dimensional Gaussian random variable with mean μ and covariance matrix V.
Probability density function (p.d.f.):
$$\phi(x;\mu,V) = \frac{1}{(2\pi)^{m/2}|V|^{1/2}}\exp\Big(-\tfrac{1}{2}(x-\mu)^TV^{-1}(x-\mu)\Big).$$

Independence of Gaussian variables
– X, Y: Gaussian random vectors of dim p and q (resp.):
$$X \perp\!\!\!\perp Y \ \Leftrightarrow\ V_{XY} = O,$$
i.e. “independent” ⇔ “uncorrelated” ($E[XY^T] = E[X]E[Y]^T$).

∵) If $V_{XY} = O$,
$$p_{XY}(x,y) = \frac{1}{(2\pi)^{m/2}|V_{XX}|^{1/2}|V_{YY}|^{1/2}}\exp\Bigg(-\frac{1}{2}\begin{pmatrix}x-\mu_X\\ y-\mu_Y\end{pmatrix}^T\begin{pmatrix}V_{XX} & O\\ O & V_{YY}\end{pmatrix}^{-1}\begin{pmatrix}x-\mu_X\\ y-\mu_Y\end{pmatrix}\Bigg) = p_X(x)\,p_Y(y).$$

Page 15

Independence by Nonlinear Covariance

Independence and nonlinear covariance
– X and Y are independent ⇔ Cov[f(X), g(Y)] = 0 for all measurable functions f and g.

∵) Take f(x) = I_A(x) and g(y) = I_B(y), the indicator functions of measurable sets A and B. Then
$$E[I_A(X)I_B(Y)] - E[I_A(X)]\,E[I_B(Y)] = \Pr(X\in A,\ Y\in B) - \Pr(X\in A)\Pr(Y\in B) = 0.$$

Page 16

Measuring all the nonlinear covariance

$$\sup_{f,g}\ \mathrm{Cov}[f(X), g(Y)]$$
can be used as a dependence measure.

– Questions.
• How can we calculate the value? The space of measurable functions is large, containing non-continuous and weird functions.
• With a finite number of data, how can we estimate the value?

Page 17

Using Kernels: COCO

Restrict the functions to RKHSs:
X, Y: random variables on Ω_X and Ω_Y, resp. Prepare RKHS (H_X, k_X) and (H_Y, k_Y) defined on Ω_X and Ω_Y, resp.
$$COCO(X,Y) = \sup_{f\in H_X,\,g\in H_Y}\frac{\mathrm{Cov}[f(X), g(Y)]}{\|f\|_{H_X}\,\|g\|_{H_Y}}$$
COnstrained COvariance (COCO, Gretton et al. 05).

Estimation with data: for an i.i.d. sample $(X_1, Y_1),\ldots,(X_N, Y_N)$,
$$COCO_{emp} = \sup_{f\in H_X,\,g\in H_Y}\frac{\widehat{\mathrm{Cov}}[f(X), g(Y)]}{\|f\|_{H_X}\,\|g\|_{H_Y}},$$
where
$$\widehat{\mathrm{Cov}}[f(X), g(Y)] = \frac{1}{N}\sum_{i=1}^N f(X_i)\,g(Y_i) - \Big(\frac{1}{N}\sum_{i=1}^N f(X_i)\Big)\Big(\frac{1}{N}\sum_{i=1}^N g(Y_i)\Big).$$

Page 18

Solution to COCO
– The empirical COCO reduces to an eigenproblem:
$$\max\ \frac{1}{N}\,\alpha^T G_X G_Y\,\beta \quad \text{subj. to}\ \alpha^T G_X\alpha = \beta^T G_Y\beta = 1,$$
so
$$COCO_{emp} = \frac{1}{N}\times\text{the largest singular value of } G_X^{1/2}G_Y^{1/2}.$$

G_X and G_Y are the centered Gram matrices (N × N) defined by
$$G_X = Q_N K_X Q_N, \qquad (K_X)_{ij} = k_X(X_i, X_j), \qquad Q_N = I_N - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N^T,$$
where $\mathbf{1}_N = (1,\ldots,1)^T$ and Q_N is the projector onto the orthogonal complement of $\mathbf{1}_N$ (similarly for G_Y).

For a symmetric positive semidefinite matrix A, A^{1/2} is the symmetric positive semidefinite matrix such that (A^{1/2})² = A.
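As an illustration, the quantities above can be computed directly with numpy. The following is a minimal sketch (our code, not the lecturer's); Gaussian kernels are assumed, and the widths sigma_x, sigma_y are left as parameters (the median heuristic of Part III is one way to set them).

```python
import numpy as np

def gram_gauss(x, sigma):
    """Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); x: (N, d)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def centered_gram(x, sigma):
    """Centered Gram matrix G = Q_N K Q_N with Q_N = I - (1/N) 1 1^T."""
    n = len(x)
    q = np.eye(n) - np.ones((n, n)) / n
    return q @ gram_gauss(x, sigma) @ q

def psd_sqrt(g):
    """Symmetric PSD square root: (G^{1/2})^2 = G, via eigendecomposition."""
    w, v = np.linalg.eigh(g)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def coco_emp(x, y, sigma_x, sigma_y):
    """Empirical COCO = (1/N) * largest singular value of G_X^{1/2} G_Y^{1/2}."""
    gx, gy = centered_gram(x, sigma_x), centered_gram(y, sigma_y)
    s = np.linalg.svd(psd_sqrt(gx) @ psd_sqrt(gy), compute_uv=False)
    return s[0] / len(x)
```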

Page 19

Derivation
$$\widehat{\mathrm{Cov}}[f(X), g(Y)] = \frac{1}{N}\sum_{i=1}^N\Big\{f(X_i) - \frac{1}{N}\sum_j f(X_j)\Big\}\Big\{g(Y_i) - \frac{1}{N}\sum_j g(Y_j)\Big\}$$
$$= \frac{1}{N}\sum_{i=1}^N\Big\langle f,\ k_X(\cdot,X_i) - \tfrac{1}{N}\sum_j k_X(\cdot,X_j)\Big\rangle\Big\langle k_Y(\cdot,Y_i) - \tfrac{1}{N}\sum_j k_Y(\cdot,Y_j),\ g\Big\rangle.$$

It is sufficient to consider (representer theorem)
$$f = \sum_{j=1}^N\alpha_j\big\{k_X(\cdot,X_j) - \hat m_X\big\}, \qquad g = \sum_{\ell=1}^N\beta_\ell\big\{k_Y(\cdot,Y_\ell) - \hat m_Y\big\}.$$

Then
$$\widehat{\mathrm{Cov}}[f(X), g(Y)] = \frac{1}{N}\sum_{i,j,\ell=1}^N\alpha_j\beta_\ell\big\langle k_X(\cdot,X_j)-\hat m_X,\ k_X(\cdot,X_i)-\hat m_X\big\rangle\big\langle k_Y(\cdot,Y_\ell)-\hat m_Y,\ k_Y(\cdot,Y_i)-\hat m_Y\big\rangle = \frac{1}{N}\alpha^TG_XG_Y\beta.$$

Maximize it under the constraints
$$\|f\|_{H_X}^2 = \alpha^TG_X\alpha = 1, \qquad \|g\|_{H_Y}^2 = \beta^TG_Y\beta = 1.$$

By using $u = G_X^{1/2}\alpha$ and $v = G_Y^{1/2}\beta$:
$$\max\ \frac{1}{N}\,u^TG_X^{1/2}G_Y^{1/2}v \quad \text{subj. to}\ \|u\| = \|v\| = 1.$$

Page 20

Quick Review on RKHS

Reproducing kernel Hilbert space (RKHS)
Ω: set; $k: \Omega\times\Omega\to\mathbb{R}$: positive definite kernel.
There uniquely exists an RKHS H such that k is the reproducing kernel of H, i.e.
1) $k(\cdot, x)\in H$ for all $x\in\Omega$.
2) $\mathrm{Span}\{k(\cdot,x) \mid x\in\Omega\}$ is dense in H.
3) $\langle f, k(\cdot,x)\rangle_H = f(x)$ (reproducing property).

– Feature map
$$\Phi: \Omega\to H, \quad x\mapsto k(\cdot,x), \quad \text{i.e. } \Phi(x) = k(\cdot,x),$$
with
$$\langle\Phi(x), f\rangle = f(x) \quad \text{(reproducing property)}.$$

Page 21

Example with COCO

[Figure: three scatter plots of (X, Y) obtained by rotating independent data (dependent / independent), and COCO_emp plotted against the rotation angle from 0 to π/2.]

Gaussian kernels are used:
$$k_G(x,y) = \exp\Big(-\frac{\|x-y\|^2}{2\sigma^2}\Big).$$

Page 22

COCO and Independence

Characterization of independence:
X and Y are independent ⇔
$$\sup_{f\in H_X,\,g\in H_Y}\frac{\mathrm{Cov}[f(X), g(Y)]}{\|f\|_{H_X}\,\|g\|_{H_Y}} = 0.$$

This equivalence holds if the RKHSs are “rich enough” to express all the dependence between X and Y (discussed later in Part IV).
For the moment, Gaussian kernels
$$k_G(x,y) = \exp\Big(-\frac{\|x-y\|^2}{2\sigma^2}\Big)$$
are used to guarantee this equivalence.

Page 23

HSIC (Gretton et al. 05)

How about using the other singular values?

[Figure: the 1st and 2nd singular values of $G_X^{1/2}G_Y^{1/2}$ plotted against the rotation angle from 0 to π/2.]

Smaller singular values also represent dependence.

HSIC:
$$HSIC \equiv \frac{1}{N^2}\sum_i\gamma_i^2 = \frac{1}{N^2}\big\|G_X^{1/2}G_Y^{1/2}\big\|_F^2 = \frac{1}{N^2}\mathrm{Tr}\big[G_XG_Y\big]$$
(γᵢ: the i-th singular value of $G_X^{1/2}G_Y^{1/2}$; $\|\cdot\|_F$: Frobenius norm, $\|M\|_F^2 = \sum_{i,j}M_{ij}^2 = \mathrm{Tr}[MM^T]$).

Page 24

Example with HSIC

[Figure: three scatter plots of (X, Y) (dependent / independent), and HSIC and COCO plotted against the rotation angle θ from 0 to π/2.]

Page 25

Summary of Part II

COCO
– Linear (finite dim.): population = $\max_{\|a\|=\|b\|=1}\mathrm{Cov}[a^TX, b^TY]$ = 1st singular value of $V_{XY}$; empirical = 1st singular value of $\hat V_{XY}$.
– Kernel: population = $\sup_{\|f\|=\|g\|=1}\mathrm{Cov}[f(X), g(Y)]$; empirical = $\frac{1}{N}\times$ 1st singular value of $G_X^{1/2}G_Y^{1/2}$.

HSIC
– Linear (finite dim.): population = $\|V_{XY}\|_F^2$ (sum of squared singular values of the covariance matrix); empirical = $\|\hat V_{XY}\|_F^2$.
– Kernel: empirical = $\frac{1}{N^2}\|G_X^{1/2}G_Y^{1/2}\|_F^2$; population = ? (What is the population version?)

Page 26

III. Covariance on RKHS

Page 27

Two Views on Kernel Methods

As a good class of nonlinear functions
– Objective functional for a nonlinear method: $\max_f \Psi\big(f(X_1),\ldots,f(X_N)\big)$, f: nonlinear function.
– Find the solution within an RKHS: reproducing property / kernel trick, representer theorem.
c.f. COCO in the previous section.

Kernelization of linear methods
– Map the data into an RKHS, and apply a linear method: $X_i \mapsto \Phi(X_i)$.
– Map the random variable into an RKHS, and do linear statistics! $X \mapsto \Phi(X)$: a random variable on the RKHS.

Page 28

Covariance on RKHS
– Linear case (Gaussian): Cov[X, Y] = E[YXᵀ] − E[Y]E[X]ᵀ: covariance matrix.
– On RKHS:
X, Y: random variables on Ω_X and Ω_Y, resp. Prepare RKHS (H_X, k_X) and (H_Y, k_Y) defined on Ω_X and Ω_Y, resp. Define random variables on the RKHS H_X and H_Y by
$$\Phi_X(X) = k_X(\cdot, X), \qquad \Phi_Y(Y) = k_Y(\cdot, Y).$$
Define the big (possibly infinite-dimensional) covariance matrix Σ_YX on the RKHS.

[Diagram: feature maps Φ_X: Ω_X → H_X and Φ_Y: Ω_Y → H_Y, with Σ_YX relating Φ_X(X) and Φ_Y(Y).]

Page 29

Cross-covariance operator
– Definition
There uniquely exists an operator $\Sigma_{YX}: H_X\to H_Y$ such that
$$\langle g, \Sigma_{YX}f\rangle = E[g(Y)f(X)] - E[g(Y)]\,E[f(X)] \ \big(= \mathrm{Cov}[f(X), g(Y)]\big)$$
for all $f\in H_X$, $g\in H_Y$. Σ_YX: cross-covariance operator.

– A bit loose expression:
$$\Sigma_{YX} = E\big[\Phi_Y(Y)\,\langle\Phi_X(X), \cdot\,\rangle\big] - E[\Phi_Y(Y)]\,E\big[\langle\Phi_X(X), \cdot\,\rangle\big].$$

c.f. the Euclidean case: $V_{YX} = E[YX^T] - E[Y]E[X]^T$ (covariance matrix), with $b^TV_{YX}a = \mathrm{Cov}[(b, Y), (a, X)]$.

Page 30

Intuition
Suppose X and Y are R-valued, and k(x, u) admits the expansion
$$k(x,u) = 1 + c_1xu + c_2x^2u^2 + c_3x^3u^3 + \cdots$$
(e.g. $k(x,u) = \exp(xu)$).

With respect to the basis 1, u, u², u³, …, the random variables on the RKHS are expressed by
$$\Phi_X(X) = k(X,u) \sim (1,\ c_1X,\ c_2X^2,\ c_3X^3,\ldots)^T, \qquad \Phi_Y(Y) = k(Y,u) \sim (1,\ c_1Y,\ c_2Y^2,\ c_3Y^3,\ldots)^T,$$
so that
$$\Sigma_{YX} \sim \begin{pmatrix} 0 & 0 & 0 & \cdots\\ 0 & c_1^2\,\mathrm{Cov}[Y,X] & c_1c_2\,\mathrm{Cov}[Y,X^2] & \cdots\\ 0 & c_2c_1\,\mathrm{Cov}[Y^2,X] & c_2^2\,\mathrm{Cov}[Y^2,X^2] & \cdots\\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}.$$

The operator Σ_YX contains the information on all the higher-order correlations.

Page 31

Addendum on “operator”
– “Operator” is often used for a linear map defined on a functional space, in particular of infinite dimension.

– ΣYX is a linear map from HX to HY, as the covariance matrix VYX is a linear map from Rm to Rn.

– If you are not familiar with the word “operator”, simply replace it with “linear map” or “big matrix”.

– If you are very familiar with the operator terminology, you can easily prove ΣYX is a bounded operator. (Exercise)

Page 32

Characterization of Independence

Independence and the cross-covariance operator
If the RKHSs are “rich enough” to express all the moments:
X and Y are independent ⇔ Σ_XY = O,
i.e. Cov[f(X), g(Y)] = 0, or E[g(Y)f(X)] = E[g(Y)]E[f(X)], for all $f\in H_X$, $g\in H_Y$.
(“⇒” is always true; “⇐” requires the richness assumption — Part IV.)

– c.f. for Gaussian variables:
X and Y are independent, i.e. uncorrelated ⇔ V_XY = O.

Page 33

Measures for Dependence

Kernel measures for dependence/independence: measure the “norm” of Σ_YX.

– Kernel generalized variance (KGV, Bach & Jordan 02, FBJ 04):
$$KGV(X,Y) = \frac{\det\Sigma_{[XY][XY]}}{\det\Sigma_{XX}\,\det\Sigma_{YY}}$$

– COCO:
$$COCO(X,Y) = \|\Sigma_{YX}\| = \sup_{f\neq 0,\,g\neq 0}\frac{\langle g, \Sigma_{YX}f\rangle_{H_Y}}{\|f\|_{H_X}\,\|g\|_{H_Y}}$$

– HSIC:
$$HSIC(X,Y) = \|\Sigma_{YX}\|_{HS}^2$$

– HSNIC:
$$HSNIC(X,Y) = \big\|\Sigma_{YY}^{-1/2}\Sigma_{YX}\Sigma_{XX}^{-1/2}\big\|_{HS}^2 \quad \text{(explained later)}$$

Page 34

Norms of Operators

$A: H_1\to H_2$: operator on Hilbert spaces.

– Operator norm:
$$\|A\| = \sup_{\|f\|=1}\|Af\| = \sup_{\|f\|=\|g\|=1}\langle g, Af\rangle$$
c.f. the largest singular value of a matrix.

– Hilbert–Schmidt norm:
A is called Hilbert–Schmidt if, for complete orthonormal systems $\{\varphi_i\}$ of H₁ and $\{\psi_j\}$ of H₂,
$$\sum_j\sum_i\langle\psi_j, A\varphi_i\rangle^2 < \infty.$$
The Hilbert–Schmidt norm is defined by
$$\|A\|_{HS}^2 = \sum_j\sum_i\langle\psi_j, A\varphi_i\rangle^2$$
c.f. the Frobenius norm of a matrix.

Page 35

Empirical Estimation

Estimation of the covariance operator
i.i.d. sample $(X_1, Y_1),\ldots,(X_N, Y_N)$. An estimator of Σ_YX is given by
$$\hat\Sigma_{YX}^{(N)} = \frac{1}{N}\sum_{i=1}^N\big(k_Y(\cdot,Y_i) - \hat m_Y\big)\big\langle k_X(\cdot,X_i) - \hat m_X,\ \cdot\ \big\rangle,$$
where
$$\hat m_X = \frac{1}{N}\sum_{i=1}^N k_X(\cdot,X_i), \qquad \hat m_Y = \frac{1}{N}\sum_{i=1}^N k_Y(\cdot,Y_i).$$

– Note
• This is again an operator.
• But it operates essentially on the finite-dimensional space spanned by the data Φ_X(X₁), …, Φ_X(X_N) and Φ_Y(Y₁), …, Φ_Y(Y_N).

Page 36

Empirical cross-covariance operator

Proposition (Empirical mean)
$\hat m_X = \frac{1}{N}\sum_{i=1}^N k(\cdot,X_i)$ gives the empirical mean:
$$\langle\hat m_X, f\rangle = \frac{1}{N}\sum_{i=1}^N f(X_i) \equiv \hat E[f(X)] \qquad (\forall f\in H_X).$$
$\hat m_X$: empirical mean element (in RKHS).

Proposition (Empirical covariance)
$\hat\Sigma_{YX}^{(N)}$ gives the empirical covariance:
$$\langle g, \hat\Sigma_{YX}^{(N)}f\rangle = \frac{1}{N}\sum_{i=1}^N\big\{g(Y_i) - \hat E[g(Y)]\big\}\big\{f(X_i) - \hat E[f(X)]\big\} \qquad (\forall f\in H_X,\ \forall g\in H_Y).$$
$\hat\Sigma_{YX}^{(N)}$: empirical cross-covariance operator (on RKHS).

Page 37

COCO Revisited

COCO = operator norm:
$$COCO(X,Y) = \|\Sigma_{YX}\| = \sup_{\|f\|=\|g\|=1}\langle g, \Sigma_{YX}f\rangle.$$

With data:
$$COCO_{emp} = \big\|\hat\Sigma_{YX}^{(N)}\big\| = \sup_{\|f\|=\|g\|=1}\langle g, \hat\Sigma_{YX}^{(N)}f\rangle = \sup_{\|f\|=\|g\|=1}\widehat{\mathrm{Cov}}[f(X), g(Y)]$$
$$= \frac{1}{N}\times\text{the largest singular value of } G_X^{1/2}G_Y^{1/2}$$
= the previous definition.

Page 38

HSIC Revisited

HSIC = Hilbert–Schmidt Independence Criterion:
$$HSIC(X,Y) = \|\Sigma_{YX}\|_{HS}^2.$$

With data:
$$HSIC_{emp} = \big\|\hat\Sigma_{YX}^{(N)}\big\|_{HS}^2 = \frac{1}{N^2}\mathrm{Tr}\big[G_XG_Y\big].$$

∵)
$$\big\|\hat\Sigma_{YX}^{(N)}\big\|_{HS}^2 = \mathrm{Tr}\big[\hat\Sigma_{YX}^{(N)}\hat\Sigma_{XY}^{(N)}\big]$$
$$= \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\big\langle k_X(\cdot,X_i)-\hat m_X,\ k_X(\cdot,X_j)-\hat m_X\big\rangle\big\langle k_Y(\cdot,Y_i)-\hat m_Y,\ k_Y(\cdot,Y_j)-\hat m_Y\big\rangle = \frac{1}{N^2}\mathrm{Tr}\big[G_XG_Y\big].$$

Page 39

Application of HSIC to ICA

Independent Component Analysis (ICA)
– Assumption: m independent source signals S(t); m observations of linearly mixed signals:
$$X(t) = A\,S(t), \qquad A:\ m\times m\ \text{invertible matrix}.$$
– Problem: restore the independent signals S from the observations X:
$$\hat S = BX, \qquad B:\ m\times m\ \text{orthogonal matrix}.$$

[Diagram: sources s₁(t), s₂(t), s₃(t) mixed by A into observations x₁(t), x₂(t), x₃(t).]

Page 40

ICA with HSIC

$X^{(1)},\ldots,X^{(N)}$: i.i.d. observations (m-dimensional).
Minimize
$$L(B) = \sum_{a=1}^m\sum_{b>a}HSIC(Y_a, Y_b), \qquad Y = BX.$$
The pairwise-independence criterion is applicable.

The objective function is non-convex, so optimization is not easy. An approximate Newton method has been proposed: Fast Kernel ICA (FastKICA, Shen et al. 07). (Software downloadable at Arthur Gretton’s homepage.)

Other methods for ICA: see, for example, Hyvärinen et al. (2001).
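For illustration, the objective L(B) can be written with the earlier HSIC sketch. This only evaluates the objective naively (it does not reproduce FastKICA's approximate Newton optimization), and the kernel width sigma is a hypothetical choice; centered_gram and hsic_emp are the functions from the Part II sketches.

```python
import numpy as np

def ica_objective(b, x, sigma=1.0):
    """L(B) = sum over pairs a > c of HSIC_emp(Y_a, Y_c), with Y = B X.

    b: (m, m) orthogonal demixing matrix; x: (N, m) whitened observations.
    Reuses centered_gram and hsic_emp from the earlier sketches.
    """
    y = x @ b.T                                            # rows: Y^(i) = B X^(i)
    m = y.shape[1]
    g = [centered_gram(y[:, a:a + 1], sigma) for a in range(m)]
    return sum(hsic_emp(g[a], g[c]) for a in range(m) for c in range(a))
```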

Page 41

Experiments (speech signals)

[Diagram: three speech signals s₁(t), s₂(t), s₃(t) mixed by a randomly generated A into x₁(t), x₂(t), x₃(t); FastKICA estimates the demixing matrix B.]

Page 42

Normalized Covariance

Correlation: covariance normalized by the variances
Covariance is not well normalized: it depends on the variances of X and Y. Correlation is better normalized; in the linear case,
$$V_{YY}^{-1/2}\,V_{YX}\,V_{XX}^{-1/2}.$$

NOrmalized Cross-Covariance Operator (NOCCO, FBG07):
$$W_{YX} = \Sigma_{YY}^{-1/2}\,\Sigma_{YX}\,\Sigma_{XX}^{-1/2}.$$
Definition: rigorously, there is a factorization of Σ_YX such that
$$\Sigma_{YX} = \Sigma_{YY}^{1/2}\,W_{YX}\,\Sigma_{XX}^{1/2}.$$
– The operator norm is less than or equal to 1, i.e. $\|W_{YX}\|\leq 1$.

Page 43

Empirical estimation of NOCCO
Sample: $(X_1, Y_1),\ldots,(X_N, Y_N)$.
$$\hat W_{YX}^{(N)} = \big(\hat\Sigma_{YY}^{(N)} + \varepsilon_N I\big)^{-1/2}\,\hat\Sigma_{YX}^{(N)}\,\big(\hat\Sigma_{XX}^{(N)} + \varepsilon_N I\big)^{-1/2}$$
ε_N: regularization coefficient.
Note: $\hat\Sigma_{XX}^{(N)}$ is of finite rank, thus not invertible.

Relation to Kernel CCA
– See Bach & Jordan 02, Fukumizu, Bach & Gretton 07.

Page 44

Normalized Independence Measure

HS Normalized Independence Criterion (HSNIC)
Assume $W_{YX} = \Sigma_{YY}^{-1/2}\Sigma_{YX}\Sigma_{XX}^{-1/2}$ is Hilbert–Schmidt.
$$HSNIC = \|W_{YX}\|_{HS}^2 = \big\|\Sigma_{YY}^{-1/2}\Sigma_{YX}\Sigma_{XX}^{-1/2}\big\|_{HS}^2$$
$$HSNIC_{emp} = \big\|\hat W_{YX}^{(N)}\big\|_{HS}^2 = \mathrm{Tr}\big[G_X(G_X + N\varepsilon_N I)^{-1}\,G_Y(G_Y + N\varepsilon_N I)^{-1}\big]$$

Characterizing independence
Theorem: under some “richness” assumptions on kernels (see Part IV),
HSNIC = 0 if and only if X and Y are independent.

(Confirm this — exercise)
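A direct transcription of HSNIC_emp as a sketch; the regularization value eps is a hypothetical choice, and gx, gy are the centered Gram matrices defined in Part II.

```python
import numpy as np

def hsnic_emp(gx, gy, eps=1e-4):
    """Tr[G_X (G_X + N eps I)^{-1} G_Y (G_Y + N eps I)^{-1}]."""
    n = len(gx)
    rx = np.linalg.solve(gx + n * eps * np.eye(n), gx)   # (G_X + N eps I)^{-1} G_X
    ry = np.linalg.solve(gy + n * eps * np.eye(n), gy)   # (G_Y + N eps I)^{-1} G_Y
    return float(np.trace(rx @ ry))
```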

Page 45

Kernel-free Expression

Integral expression of HSNIC without kernels
Theorem (FGSS07)
Assume that $H_X\otimes H_Y + \mathbb{R}$ is dense in $L^2(P_X\otimes P_Y)$, and that the laws P_X and P_Y have p.d.f.s w.r.t. measures μ₁ and μ₂, resp. Then
$$HSNIC = \|W_{YX}\|_{HS}^2 = \int\!\!\int\Big(\frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)} - 1\Big)^2 p_X(x)\,p_Y(y)\,d\mu_1(x)\,d\mu_2(y)$$
= the mean square contingency.

– HSNIC is defined by kernels, but it does not depend on the kernels: free from the choice of kernels!
– HSNIC_emp gives a kernel estimator of the mean square contingency.

Page 46

HSIC vs HSNIC

HSIC
• PROS: simple to compute; the asymptotic distribution for the independence test is known (Part V).
• CONS: the value depends on the choice of kernels.

HSNIC
• PROS: does not depend on the kernels in population.
• CONS: a regularization coefficient is needed; matrix inversion is needed; the asymptotic distribution for the independence test is not known.

(Some experimental comparisons are given in Part V.)

Page 47

Choice of Kernel

How to choose a kernel?
– Recall: in supervised learning (e.g. SVM), cross-validation (CV) is reasonable and popular.
– For unsupervised problems, such as independence measures, there are no theoretically reasonable methods.
– Some heuristic methods which work:
• Heuristic for Gaussian kernels: set σ = median{ ‖Xᵢ − Xⱼ‖ : i ≠ j } (a sketch follows below).
• Make a related supervised problem, if possible, and use CV.
– More studies are required.
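A sketch of the median heuristic above (our implementation of the formula):

```python
import numpy as np

def median_heuristic(x):
    """sigma = median of the pairwise distances ||X_i - X_j||, i != j; x: (N, d)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    return float(np.sqrt(np.median(d2[np.triu_indices(len(x), k=1)])))
```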

Page 48

Relation with Other Measures

Mutual Information
$$MI(X,Y) = \int\!\!\int p_{XY}(x,y)\,\log\frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\,d\mu_1(x)\,d\mu_2(y)$$

MI and HSNIC
$$HSNIC(X,Y) = \int\!\!\int\Big(\frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)} - 1\Big)\,p_{XY}(x,y)\,d\mu_1(x)\,d\mu_2(y) \ \geq\ MI(X,Y)$$
(the inequality direction was corrected to ≥ by the author, June 2014).

∵) Since $\log z\leq z - 1$,
$$MI = \int\!\!\int p_{XY}\,\log\frac{p_{XY}}{p_X\,p_Y} \ \leq\ \int\!\!\int p_{XY}\Big(\frac{p_{XY}}{p_X\,p_Y} - 1\Big) = HSNIC.$$
Page 49

– Mutual Information:
• Information-theoretic meaning.
• Estimation is not straightforward for continuous variables: explicit estimation of the p.d.f. is difficult for high-dimensional data — the Parzen window is sensitive to the bandwidth, and partitioning may cause a large number of bins.
• Some advanced methods exist, e.g. the k-NN approach (Kraskov et al.).

– Kernel method:
• Explicit estimation of the p.d.f. is not required; the dimension of the data does not appear explicitly, but it is influential in practice.
• The kernel / kernel parameters must be chosen.

– Experimental comparison: see Section V (Statistical Tests).

Page 50

Summary of Part III

Cross-covariance operator
– Covariance on RKHS: extension of the covariance matrix.
– If the kernel defines a rich RKHS: X ⫫ Y ⇔ Σ_XY = O.

Kernel-based dependence measures
– COCO: operator norm of Σ_XY.
– HSIC: Hilbert–Schmidt norm of Σ_XY.
– HSNIC: Hilbert–Schmidt norm of the normalized cross-covariance operator $W_{YX} = \Sigma_{YY}^{-1/2}\Sigma_{YX}\Sigma_{XX}^{-1/2}$.
HSNIC = mean square contingency (in population): kernel-free!
– Application to ICA.

Page 51

IV. Representing a Probability

Page 52

Statistics on RKHS

Linear statistics on RKHS
– Basic statistics on Euclidean space → basic statistics on RKHS:
• Mean → mean element
• Covariance → cross-covariance operator
• Conditional covariance → conditional-covariance operator (Part VI)

– Plan: define the basic statistics on RKHS and derive nonlinear/nonparametric statistical methods in the original space.

[Diagram: feature map Φ from the original space Ω into the RKHS H; X ↦ Φ(X) = k(·, X).]

Page 53

Mean on RKHS
– Empirical mean on RKHS
$X^{(1)},\ldots,X^{(N)}$: i.i.d. sample; $\Phi(X^{(1)}),\ldots,\Phi(X^{(N)})$: sample on RKHS.
Empirical mean:
$$\hat m_X = \frac{1}{N}\sum_{i=1}^N\Phi(X_i) = \frac{1}{N}\sum_{i=1}^N k(\cdot,X_i),$$
$$\langle\hat m_X, f\rangle = \frac{1}{N}\sum_{i=1}^N f(X_i) \equiv \hat E[f(X)] \qquad (\forall f\in H_X).$$

– Mean element on RKHS
X: random variable on Ω; Φ(X): random variable on RKHS. Define
$$m_X = E[\Phi(X)], \qquad \langle m_X, f\rangle = E[f(X)] \qquad (\forall f\in H).$$
Page 54

Representation of Probability

Moments by a kernel: example with one variable
$$k(x,u) = \exp(xu) = 1 + c_1xu + c_2x^2u^2 + c_3x^3u^3 + \cdots$$
$$m_X(u) = E[k(X,u)] = 1 + c_1E[X]\,u + c_2E[X^2]\,u^2 + c_3E[X^3]\,u^3 + \cdots$$

• As a function of u, the mean element m_X contains the information on all the moments — the “richness” of the RKHS.
• It is natural to expect that m_X “represents” or “characterizes” a probability under a “richness” assumption on the kernel.

[Figure: two p.d.f.s p_X and p_Y with the same first three moments (E[X] = E[Y] = 0, E[X²] = E[Y²] = 1, E[X³] = E[Y³] = 0) but different fourth moments (E[X⁴] = 3, E[Y⁴] = 9/5).]
Page 55

Characteristic Kernel

Richness assumption on kernels
P: family of all the probabilities on a measurable space (Ω, B).
H: RKHS on Ω with measurable kernel k.
m_P: mean element on H for a probability P ∈ P.

– Definition
The kernel k is called characteristic if the mapping
$$\mathcal{P}\to H, \quad P\mapsto m_P$$
is one-to-one.

– The mean element of a characteristic kernel uniquely determines a probability:
$$m_X = m_Y \ \Leftrightarrow\ P_X = P_Y.$$
Page 56

– The “richness” assumption in the previous sections should be replaced by “the kernel is characteristic” or the following denseness assumption.

– Sufficient condition
Theorem: k: kernel on a measurable space (Ω, B); H: associated RKHS. If H + R is dense in $L^q(P)$ for any probability P on (Ω, B), for some q ≥ 1, then k is characteristic.

– Examples of characteristic kernels
• Gaussian kernel on the entire R^m: $k_G(x,y) = \exp\big(-\|x-y\|^2/(2\sigma^2)\big)$ (σ > 0).
• Laplacian kernel on the entire R^m: $k_L(x,y) = \exp\big(-\lambda\sum_{i=1}^m|x_i - y_i|\big)$ (λ > 0).
Page 57

Universal kernel (Steinwart 02)
A continuous kernel k on a compact metric space Ω is called universal if the associated RKHS is dense in C(Ω), the space of continuous functions on Ω with the sup norm.
Example: the Gaussian kernel on a compact subset of R^m.

Proposition
A universal kernel is characteristic.

– Characteristic kernels are a wider class, and suitable for discussing statistical inference on probabilities.
– Universal kernels are defined only on compact sets.
– Gaussian kernels are characteristic both on compact subsets and on the entire Euclidean space.
Page 58

Two-Sample Problem

Two i.i.d. samples are given:
$$X^{(1)},\ldots,X^{(N_X)} \quad \text{and} \quad Y^{(1)},\ldots,Y^{(N_Y)}.$$
Are they sampled from the same distribution?

– Practically important; we often wish to distinguish two things:
• Are the experimental results of treatment and control significantly different?
• Were the plays “Henry VI” and “Henry II” written by the same author?

– Kernel solution: use the difference $m_X - m_Y$ with a characteristic kernel such as the Gaussian.
Page 59

– Example: do they have the same distribution?

[Figure: scatter plots of samples (N = 100) from N(0, I), from 0.5·N(0, I) + 0.5·Unif, and from Unif, compared pairwise.]
Page 60

Kernel Method for the Two-sample Problem

Maximum Mean Discrepancy (Gretton et al. 07, NIPS 19)
– In population:
$$MMD^2 = \|m_X - m_Y\|_H^2.$$
– Empirically:
$$MMD_{emp}^2 = \|\hat m_X - \hat m_Y\|_H^2 = \frac{1}{N_X^2}\sum_{i,j=1}^{N_X}k(X_i,X_j) - \frac{2}{N_XN_Y}\sum_{i=1}^{N_X}\sum_{a=1}^{N_Y}k(X_i,Y_a) + \frac{1}{N_Y^2}\sum_{a,b=1}^{N_Y}k(Y_a,Y_b).$$
– With a characteristic kernel, MMD = 0 if and only if P_X = P_Y.
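The empirical MMD² is just three kernel averages; a minimal self-contained sketch with a Gaussian kernel (function names ours):

```python
import numpy as np

def cross_gram_gauss(x, y, sigma):
    """k(x_i, y_a) for all pairs; x: (Nx, d), y: (Ny, d)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_emp(x, y, sigma):
    """(1/Nx^2) sum k(Xi,Xj) - (2/(Nx Ny)) sum k(Xi,Ya) + (1/Ny^2) sum k(Ya,Yb)."""
    return (cross_gram_gauss(x, x, sigma).mean()
            - 2 * cross_gram_gauss(x, y, sigma).mean()
            + cross_gram_gauss(y, y, sigma).mean())
```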

Page 61

Experiment with MMD

[Figure: means of MMD² over 100 samples for N(0,1) vs N(0,1) and for N(0,1) vs c·Unif + (1−c)·N(0,1), plotted against c, with N_X = N_Y = 100, 200, 500.]
Page 62

Characteristic Function
– Definition
X: random vector on R^m with law P_X. The characteristic function of X is the complex-valued function
$$\xi_X(u) \equiv E\big[e^{\sqrt{-1}\,u^TX}\big] = \int e^{\sqrt{-1}\,u^Tx}\,dP_X(x) \qquad (u\in\mathbb{R}^m).$$
If P_X has a p.d.f. p_X(x), the characteristic function is the Fourier transform of p_X(x).

– Moment generating property:
$$\frac{1}{(\sqrt{-1})^r}\frac{d^r}{du^r}\,\xi_X(u)\Big|_{u=0} = E[X^r].$$

– The characteristic function is very popular in probability and statistics for characterizing a probability.
Page 63

Characterizing property

Theorem
X, Y: random vectors on R^m with probability laws P_X, P_Y (resp.). Then
$$\xi_X = \xi_Y \ \Leftrightarrow\ P_X = P_Y.$$
Page 64

Kernel and Characteristic Function

The Fourier kernel is positive definite:
$$k_F(x,y) = \exp\big(\sqrt{-1}\,x^Ty\big)$$ is a (complex-valued) positive definite kernel.
– The characteristic function is a special case of the mean element:
$$\xi_X(u) = E[k_F(X,u)]$$ = the mean element with k_F(x,y)!

Generalization of the characteristic-function approach
– There are many “characteristic function” methods in the statistical literature (independence tests, homogeneity tests, etc.).
– The kernel methodology discussed here generalizes this approach:
• The data may not be Euclidean, but can be structured.
Page 65

Re: Representation of Probability

Various ways of representing a probability
– Probability density function p(x)
– Cumulative distribution function F_X(t) = Prob(X < t)
– All the moments E[X], E[X²], E[X³], …
– Characteristic function $\xi_X(u) \equiv E[e^{\sqrt{-1}u^TX}] = \int e^{\sqrt{-1}u^Tx}\,dP_X(x)$
– Mean element on RKHS m_X(u) = E[k(X, u)]

Each representation provides methods for statistical inference.

Page 66

Summary of Part IV

Statistics on RKHS → inference on probabilities
– Mean element → characterization of a probability; two-sample problem.
– Covariance operator → dependence of two variables; independence tests, dependence measures.
– Conditional covariance operator → conditional independence (Section VI).

Characteristic kernel
– A characteristic kernel gives a “rich” RKHS.
– A characteristic kernel characterizes a probability.
– The kernel methodology is a generalization of characteristic-function methods.
Page 67

V. Statistical Test

Page 68

Statistical Test

How should we set the threshold?
Example) Based on a dependence measure, we wish to decide whether the variables are independent or not.

Simple-minded idea: set a small value like t = 0.001 and declare
I(X,Y) < t → independent, I(X,Y) ≥ t → dependent.
But the threshold should depend on the properties of X and Y.

Statistical hypothesis test
– A statistical way of deciding whether a hypothesis is true or not.
– The decision is based on a sample, so we cannot be 100% certain.

Page 69

Procedure of a hypothesis test
• Null hypothesis H₀ = the hypothesis assumed to be true: “X and Y are independent”.
• Prepare a test statistic T_N, e.g. T_N = HSIC_emp.
• Null distribution: the distribution of T_N under the null hypothesis (this must be computed for HSIC_emp).
• Set the significance level α, typically α = 0.05 or 0.01.
• Compute the critical region: α = Prob. of T_N > t_α under H₀.
• Reject the null hypothesis if T_N > t_α — the probability that HSIC_emp > t_α under independence is very small; otherwise, accept the null hypothesis (a negative conclusion: the test merely fails to find dependence).

Page 70

[Figure: p.d.f. of the null distribution for a one-sided test; the critical region is the right tail beyond the threshold t_α, with area = α (5%, 1%, etc.); the p-value is the tail area beyond the observed T_N, so T_N > t_α ⇔ p-value < α.]

- If the null hypothesis is the truth, the value of T_N should follow the above distribution.
- If the alternative is the truth, the value of T_N should be very large.
- Set the threshold with risk α (the significance level).
- The threshold depends on the distribution of the data.

Page 71

Type I and Type II errors
– Type I error = false positive (e.g. dependence = positive)
– Type II error = false negative

Test result \ Truth | H₀ true | Alternative true
Reject H₀ | Type I error (false positive) | true positive
Accept H₀ | true negative | Type II error (false negative)

The significance level controls the Type I error. Under a fixed Type I error, the Type II error should be as small as possible.

Page 72

Independence Test with HSIC

Independence test
– Null hypothesis H₀: X and Y are independent.
Alternative H₁: X and Y are not independent (dependent).

– Test statistic:
$$T_N = N\times HSIC_{emp}.$$

– Null distribution: under H₀ (where $HSIC_{emp} = O_p(1/N)$),
$$T_N \Rightarrow \sum_{a=1}^\infty\lambda_a Z_a^2 \qquad \text{(convergence in distribution)},$$
where $Z_a \sim N(0,1)$ i.i.d. and λ_a are the eigenvalues of an integral equation (not shown here).

– Under the alternative:
$$T_N = O_p(N) \qquad (N\to\infty).$$
Page 73

Example of Independence Test

Synthesized data
– Data: two d-dimensional samples, $X^{(i)} = (X^{(i)}_1,\ldots,X^{(i)}_d)$ and $Y^{(i)} = (Y^{(i)}_1,\ldots,Y^{(i)}_d)$, $i = 1,\ldots,N$, with varying strength of dependence.

[Figure: test results plotted against the strength of dependence.]

Page 74

Traditional Independence Tests

P.d.f.-based
– Factorization of the p.d.f. is used: $p(x_1,\ldots,x_m) = p_1(x_1)\cdots p_m(x_m)$.
– Parzen window approach.
– Estimation accuracy is low for high-dimensional data.

Cumulative distribution-based
– Factorization of the c.d.f. is used: $F_X(t_1,\ldots,t_m) = F_{X_1}(t_1)\cdots F_{X_m}(t_m)$.

Characteristic function-based
– Factorization of the characteristic function is used.

Contingency table-based
– The domain of each variable is partitioned into a finite number of parts.
– A contingency table (number of counts) is used.

And many others.

Page 75

Power Divergence (Ku & Fine 05, Read & Cressie)
– Make a partition $\{A_j\}_{j\in J}$: each dimension is divided into q parts so that each bin contains almost the same number of data.
– Power divergence:
$$T_N = \frac{2}{\lambda(\lambda+1)}\,N\sum_{j\in J}\hat p_j\Big\{\Big(\frac{\hat p_j}{\prod_{k=1}^m\hat p^{(k)}_{j_k}}\Big)^{\lambda} - 1\Big\},$$
where $\hat p_j$ is the relative frequency in A_j and $\hat p^{(k)}_r$ is the marginal relative frequency in the r-th interval of the k-th variable.
(The family includes the mean square contingency (I²) and MI (I⁰) as special cases.)

– Null distribution under independence:
$$T_N \Rightarrow \chi^2_{q^m - m(q-1) - 1}.$$

Limitations
– All the standard tests assume vector (numerical / discrete) data.
– They are often weak for high-dimensional data.

Page 76

Independence Test on Text
– Data: official records of the Canadian Parliament in English and French.
• Dependent data: 5-line-long parts from English texts and their French translations.
• Independent data: 5-line-long parts from English texts and random 5-line parts from French texts.
– Kernels: bag-of-words (BOW) and spectral (Spec) kernels.

Acceptance rate of independence (α = 5%; Gretton et al. 07). Each cell shows HSICg / HSICp:

Topic | Match | BOW(N=10) | Spec(N=10) | BOW(N=50) | Spec(N=50)
Agriculture | Random | 1.00 / 0.94 | 1.00 / 0.95 | 1.00 / 0.93 | 1.00 / 0.95
Agriculture | Same | 0.99 / 0.18 | 1.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00
Fishery | Random | 1.00 / 0.94 | 1.00 / 0.94 | 1.00 / 0.93 | 1.00 / 0.95
Fishery | Same | 1.00 / 0.20 | 1.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00
Immigration | Random | 1.00 / 0.96 | 1.00 / 0.91 | 0.99 / 0.94 | 1.00 / 0.95
Immigration | Same | 1.00 / 0.09 | 1.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00

Page 77

Permutation Test
– The theoretical derivation of the null distribution is often difficult, even asymptotically.
– The convergence to the asymptotic distribution may be very slow.

– Permutation test: simulation of the null distribution.
• Make many samples consistent with the null hypothesis by random permutations of the original sample.
• Compute the values of the test statistic for these samples.

Independence test: keep X₁, …, X₇ and randomly permute Y₁, …, Y₇ (e.g. Y₅, Y₁, Y₇, Y₄, Y₂, Y₆, Y₃); the permuted pairs are independent.
Two-sample test: randomly re-partition the pooled sample X₁, …, X₅, Y₆, …, Y₁₀ into two groups (e.g. X₄, Y₈, X₂, Y₉, Y₆ and X₁, X₃, Y₇, Y₁₀, X₅); the two groups are homogeneous.

• It can be computationally expensive.
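As an illustration, a permutation test of independence with HSIC takes only a few lines (a sketch under our naming; gx, gy are the centered Gram matrices of Part II). Permuting the indices of the Y sample destroys the pairing with X while keeping both marginals, which is exactly the null hypothesis:

```python
import numpy as np

def hsic_permutation_pvalue(gx, gy, n_perm=1000, seed=0):
    """p-value of HSIC_emp under the permutation null distribution."""
    rng = np.random.default_rng(seed)
    n = len(gx)
    stat = np.trace(gx @ gy) / n ** 2
    null = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(n)                       # permute the Y sample
        null[b] = np.trace(gx @ gy[np.ix_(p, p)]) / n ** 2
    return float(np.mean(null >= stat))              # reject if p-value < alpha
```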

Page 78

Independence test for a 2 × 2 contingency table
– Contingency table of X ∈ {0, 1} and Y ∈ {0, 1} (counts):

X\Y | Y = 0 | Y = 1
X = 0 | 175 | 93
X = 1 | 71 | 161

– Test statistic:
$$T_N = N\sum_{i,j=0,1}\frac{(\hat p_{ij} - \hat p_{X,i}\,\hat p_{Y,j})^2}{\hat p_{X,i}\,\hat p_{Y,j}} \ \Rightarrow\ \chi^2 \qquad (N\to\infty,\ \text{under } H_0).$$

– Example: for the table with counts 120, 102 / 134, 144:
P-value by the true χ² distribution = 0.193; P-value by permutation (1000 random permutations) = 0.175.
Independence is accepted with α = 5%.

[Figure: histogram of the statistic over 1000 random permutations, with the true χ² density overlaid.]
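A sketch of the statistic T_N above (scipy.stats.chi2_contingency with correction=False computes an equivalent quantity):

```python
import numpy as np

def chi2_independence_stat(table):
    """T_N = N sum_ij (p_ij - p_i. p_.j)^2 / (p_i. p_.j) for a contingency table."""
    n = table.sum()
    p = table / n
    px = p.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = p.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    return float(n * np.sum((p - px * py) ** 2 / (px * py)))

# e.g. chi2_independence_stat(np.array([[120.0, 102.0], [134.0, 144.0]]))
```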

Page 79

Independence test with various measures
– Data 1: dependent and uncorrelated by rotation (Part I). X and Y: one-dimensional, N = 200.

Number of acceptances of independence out of 100 tests (α = 5%); angle 0.0 = independent, larger angle = more dependent:

Angle | 0.0 | 4.5 | 9.0 | 13.5 | 18.0 | 22.5
HSIC (Median) | 93 | 92 | 63 | 5 | 0 | 0
HSIC (Asymp. Var.) | 93 | 44 | 1 | 0 | 0 | 0
HSNIC (ε = 10⁻⁴, Median) | 94 | 23 | 0 | 0 | 0 | 0
HSNIC (ε = 10⁻⁶, Median) | 92 | 20 | 1 | 0 | 0 | 0
HSNIC (ε = 10⁻⁸, Median) | 93 | 15 | 0 | 0 | 0 | 0
HSNIC (Asymp. Var.) | 94 | 11 | 0 | 0 | 0 | 0
MI (#NN = 1) | 93 | 62 | 11 | 0 | 0 | 0
MI (#NN = 3) | 96 | 43 | 0 | 0 | 0 | 0
MI (#NN = 5) | 97 | 49 | 0 | 0 | 0 | 0
Conting. Table (#Bins = 3) | 100 | 96 | 46 | 9 | 1 | 0
Conting. Table (#Bins = 4) | 98 | 29 | 0 | 0 | 0 | 0
Conting. Table (#Bins = 5) | 98 | 82 | 5 | 0 | 0 | 0

Page 80

– Data 2: two coupled chaotic time series (coupled Hénon map). X and Y: 4-dimensional, N = 100.

Number of acceptances of independence out of 100 tests (α = 5%); coupling 0.0 = independent, larger coupling = more dependent:

Coupling | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6
HSIC | 75 | 70 | 58 | 52 | 13 | 1 | 0
HSNIC | 97 | 66 | 21 | 1 | 0 | 1 | 0
MI (#NN=3) | 87 | 91 | 83 | 73 | 23 | 6 | 0
MI (#NN=5) | 87 | 88 | 75 | 67 | 23 | 5 | 0
MI (#NN=7) | 87 | 86 | 75 | 64 | 21 | 5 | 0

Page 81

Two-sample test

Problem
Two i.i.d. samples $X_1,\ldots,X_N$ and $Y_1,\ldots,Y_N$.
Null hypothesis H₀: $P_X = P_Y$.
Alternative H₁: $P_X\neq P_Y$.

Homogeneity test with MMD (Gretton et al., NIPS 20):
$$T_N = N\times MMD_{emp}^2 = \frac{1}{N}\sum_{i,j=1}^N\big\{k(X_i,X_j) - 2k(X_i,Y_j) + k(Y_i,Y_j)\big\}.$$

Null distribution: similar to the independence test with HSIC (not shown here).

Page 82

Experiment
– Data integration: we wish to integrate two datasets A and B into one; the homogeneity should be tested!

% acceptance of homogeneity (Gretton et al. NIPS 20, 2007):

Dataset | Attribute | MMD² | t-test | FR-WW | FR-KS
Neural I, w/wo spike (N=4000, dim=63) | Same | 96.5 | 100.0 | 97.0 | 95.0
Neural I | Diff. | 0.0 | 42.0 | 0.0 | 10.0
Neural II, w/wo spike (N=1000, dim=100) | Same | 95.2 | 100.0 | 95.0 | 94.5
Neural II | Diff. | 3.4 | 100.0 | 0.8 | 31.8
Microarray, health/tumor (N=25, dim=12000) | Same | 94.4 | 100.0 | 94.7 | 96.1
Microarray, health/tumor | Diff. | 0.8 | 100.0 | 2.8 | 44.0
Microarray, subtype (N=25, dim=2118) | Same | 96.4 | 100.0 | 94.6 | 97.3
Microarray, subtype | Diff. | 0.0 | 100.0 | 0.0 | 28.4

Page 83

Traditional Nonparametric Tests

Kolmogorov–Smirnov (K–S) test for two samples
One-dimensional variables.
– Empirical distribution function:
$$F_N(t) = \frac{1}{N}\sum_{i=1}^N I(X_i\leq t).$$
– K–S test statistic:
$$D_N = \sup_{t\in\mathbb{R}}\big|F_{N_1}(t) - F_{N_2}(t)\big|.$$
– The asymptotic null distribution is known (not shown here).

[Figure: two empirical distribution functions F_N1(t), F_N2(t) and the maximum gap D_N.]
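The K–S statistic takes a few lines (a sketch; scipy.stats.ks_2samp returns the statistic together with a p-value):

```python
import numpy as np

def ks_statistic(x, y):
    """D_N = sup_t |F_N1(t) - F_N2(t)|; the sup is attained at sample points."""
    grid = np.sort(np.concatenate([x, y]))
    f1 = np.searchsorted(np.sort(x), grid, side='right') / len(x)  # ECDF of x
    f2 = np.searchsorted(np.sort(y), grid, side='right') / len(y)  # ECDF of y
    return float(np.max(np.abs(f1 - f2)))
```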

Page 84

Wald–Wolfowitz run test
One-dimensional samples.
– Combine the samples and plot the points in ascending order.
– Label the points based on the original two groups.
– Count the number of “runs”, i.e. consecutive sequences of the same label: R = number of runs.
– Test statistic:
$$T_N = \frac{R - E[R]}{\sqrt{\mathrm{Var}[R]}} \Rightarrow N(0,1).$$
– In the one-dimensional case, less powerful than the K–S test.

[Example: a labeled sequence with R = 10 runs.]

Multidimensional extension of the K–S and W–W tests
– A minimum spanning tree is used (Friedman & Rafsky 1979).

Page 85

Summary of Part V

Statistical test
– A statistical method for judging the significance of a value.
– It determines a “threshold” with some risk.

Statistical tests with kernels
– Independence test with HSIC.
– Two-sample test with MMD².
– Competitive with the state-of-the-art methods of nonparametric testing.
– Kernel-based statistical tests work for structured data, to which conventional methods cannot be directly applied.

Permutation test
– It works well, if applicable.
– Computationally expensive.

Page 86

VI. Conditional Independence

Page 87

Re: Statistics on RKHS

Linear statistics on RKHS
– Basic statistics on Euclidean space → basic statistics on RKHS:
• Mean → mean element
• Covariance → cross-covariance operator
• Conditional covariance → conditional cross-covariance operator

– Plan: define the basic statistics on RKHS and derive nonlinear/nonparametric statistical methods in the original space.

[Diagram: feature map Φ from the original space Ω into the RKHS H; X ↦ Φ(X) = k(·, X).]

Page 88

Conditional Independence

Definition
X, Y, Z: random variables with joint p.d.f. $p_{XYZ}(x,y,z)$.
X and Y are conditionally independent given Z (X ⫫ Y | Z), if
$$p_{Y|XZ}(y\,|\,x,z) = p_{Y|Z}(y\,|\,z) \qquad (A)$$
or
$$p_{XY|Z}(x,y\,|\,z) = p_{X|Z}(x\,|\,z)\,p_{Y|Z}(y\,|\,z). \qquad (B)$$

(A): with Z known, the information of X is unnecessary for the inference on Y.

[Diagram: graphical-model views of (A) and (B).]

Page 89

Review: Conditional Covariance

Conditional covariance of Gaussian variables
– Jointly Gaussian variable: $Z = (X, Y)$ with $X = (X_1,\ldots,X_p)$, $Y = (Y_1,\ldots,Y_q)$: m (= p + q) dimensional Gaussian variable,
$$Z \sim N(\mu, V), \qquad \mu = \begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix}, \qquad V = \begin{pmatrix}V_{XX} & V_{XY}\\ V_{YX} & V_{YY}\end{pmatrix}.$$

– The conditional probability of Y given X is again Gaussian:
$$Y \mid X = x \ \sim\ N\big(\mu_{Y|X},\ V_{YY|X}\big),$$
with conditional mean
$$\mu_{Y|X} \equiv E[Y\,|\,X=x] = \mu_Y + V_{YX}V_{XX}^{-1}(x - \mu_X)$$
and conditional covariance
$$V_{YY|X} \equiv \mathrm{Cov}[Y\,|\,X=x] = V_{YY} - V_{YX}V_{XX}^{-1}V_{XY}$$
(the Schur complement of V_XX in V).
Note: V_YY|X does not depend on x.
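The conditional covariance is just a Schur complement; a one-line numpy sketch (our naming):

```python
import numpy as np

def gaussian_cond_cov(v_yy, v_yx, v_xx):
    """V_{YY|X} = V_YY - V_YX V_XX^{-1} V_XY (does not depend on x)."""
    return v_yy - v_yx @ np.linalg.solve(v_xx, v_yx.T)
```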

Page 90

Conditional Independence for Gaussian Variables

Two characterizations (X, Y, Z jointly Gaussian):

– Conditional covariance:
$$X \perp\!\!\!\perp Y \mid Z \ \Leftrightarrow\ V_{XY|Z} = O, \quad \text{i.e. } V_{XY} - V_{XZ}V_{ZZ}^{-1}V_{ZY} = O.$$

– Comparison of conditional variances:
$$X \perp\!\!\!\perp Y \mid Z \ \Leftrightarrow\ V_{YY|[X,Z]} = V_{YY|Z}.$$

∵) Writing
$$V_{YY|[X,Z]} = V_{YY} - \big(V_{YX}\ \ V_{YZ}\big)\begin{pmatrix}V_{XX} & V_{XZ}\\ V_{ZX} & V_{ZZ}\end{pmatrix}^{-1}\begin{pmatrix}V_{XY}\\ V_{ZY}\end{pmatrix}$$
and using the block-matrix inversion formula gives
$$V_{YY|[X,Z]} = V_{YY|Z} - V_{YX|Z}\,V_{XX|Z}^{-1}\,V_{XY|Z},$$
so the two characterizations are equivalent.

Page 91

Linear Regression and Conditional Covariance

Review: linear regression
– X, Y: random vectors (not necessarily Gaussian) of dim p and q (resp.); write $\tilde X = X - E[X]$, $\tilde Y = Y - E[Y]$.
– Linear regression: predict Y by a linear combination of X, minimizing the mean square error
$$\min_{A:\ q\times p\ \text{matrix}} E\big\|\tilde Y - A\tilde X\big\|^2.$$
– The residual error is given by the conditional covariance matrix:
$$\min_{A} E\big\|\tilde Y - A\tilde X\big\|^2 = \mathrm{Tr}\big[V_{YY|X}\big] \ \big(= \mathrm{Tr}\,\mathrm{Cov}[Y|X]\ \text{in the Gaussian case}\big).$$

Page 92

– Derivation
$$E\big\|\tilde Y - A\tilde X\big\|^2 = \mathrm{Tr}\big[E[\tilde Y\tilde Y^T] - E[\tilde Y\tilde X^T]A^T - AE[\tilde X\tilde Y^T] + AE[\tilde X\tilde X^T]A^T\big]$$
$$= \mathrm{Tr}\big[V_{YY} - V_{YX}A^T - AV_{XY} + AV_{XX}A^T\big]$$
$$= \mathrm{Tr}\big[(A - V_{YX}V_{XX}^{-1})V_{XX}(A - V_{YX}V_{XX}^{-1})^T\big] + \mathrm{Tr}\big[V_{YY} - V_{YX}V_{XX}^{-1}V_{XY}\big],$$
so
$$A_{opt} = V_{YX}V_{XX}^{-1} \quad \text{and} \quad \min_A E\big\|\tilde Y - A\tilde X\big\|^2 = \mathrm{Tr}\big[V_{YY} - V_{YX}V_{XX}^{-1}V_{XY}\big].$$

– For Gaussian variables,
$$X \perp\!\!\!\perp Y \mid Z \ \big(\Leftrightarrow V_{YY|[X,Z]} = V_{YY|Z}\big)$$
can be interpreted as: “If Z is known, X is not necessary for the linear prediction of Y.”

Page 93

Conditional Covariance on RKHS

Conditional cross-covariance operator
X, Y, Z: random variables on Ω_X, Ω_Y, Ω_Z (resp.).
(H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHS defined on Ω_X, Ω_Y, Ω_Z (resp.).

– Conditional cross-covariance operator ($H_X\to H_Y$):
$$\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}.$$
Note: $\Sigma_{ZZ}^{-1}$ may not exist. But we have the decomposition $\Sigma_{YX} = \Sigma_{YY}^{1/2}W_{YX}\Sigma_{XX}^{1/2}$; rigorously, define
$$\Sigma_{YX|Z} \equiv \Sigma_{YY}^{1/2}\big(W_{YX} - W_{YZ}W_{ZX}\big)\Sigma_{XX}^{1/2}.$$

– Conditional covariance operator:
$$\Sigma_{YY|Z} \equiv \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}.$$

Page 94

Two Characterizations of Conditional Independence with Kernels

(1) Conditional covariance operator (FBJ04, 06)
Under some “richness” assumptions on the RKHS (e.g. Gaussian kernels):
– Conditional variance:
$$\langle g, \Sigma_{YY|Z}\,g\rangle = E\big[\mathrm{Var}[g(Y)\,|\,Z]\big] = \inf_{f\in H_Z}E\big|\tilde g(Y) - \tilde f(Z)\big|^2$$
c.f. Gaussian: $b^TV_{YY|Z}b = \mathrm{Var}[b^TY\,|\,Z] = \min_a E|b^T\tilde Y - a^T\tilde Z|^2$.

– Conditional independence:
$$X \perp\!\!\!\perp Y \mid Z \ \Leftrightarrow\ \Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}$$
“X is not necessary for predicting g(Y).”
c.f. Gaussian variables: $X \perp\!\!\!\perp Y \mid Z \Leftrightarrow V_{YY|[X,Z]} = V_{YY|Z}$.


(2) Conditional cross-covariance operator (FBJ04, Sun et al. 07)
Under some "richness" assumptions on the RKHS (e.g., Gaussian kernel):

– Conditional covariance:
  $\langle g, \Sigma_{YX|Z}\, f \rangle = E\big[\mathrm{Cov}[g(Y), f(X) \mid Z]\big]$

– Conditional independence:
  X ⊥ Y | Z  ⇔  $\Sigma_{\ddot Y \ddot X|Z} = O$,
  where $\ddot X = (X, Z)$ and $\ddot Y = (Y, Z)$ are the extended variables.

– c.f. Gaussian variables:
  $a^T V_{XY|Z}\, b = \mathrm{Cov}[a^T X, b^T Y \mid Z]$
  X ⊥ Y | Z  ⇔  $V_{XY|Z} = O$


– Why is the "extended variable" needed?

  $\langle g, \Sigma_{YX|Z}\, f \rangle = E\big[\mathrm{Cov}[g(Y), f(X) \mid Z]\big]$

  The l.h.s. is not a function of z:
  $\langle g, \Sigma_{YX|Z}\, f \rangle \neq \mathrm{Cov}[g(Y), f(X) \mid Z = z]$
  (c.f. the Gaussian case, where the conditional covariance does not depend on z).

  Hence
  $\Sigma_{YX|Z} = O \;\Rightarrow\; p(x, y) = \int p(x|z)\, p(y|z)\, p(z)\, dz$,
  whereas conditional independence means $p(x, y|z) = p(x|z)\, p(y|z)$.

  However, if X is replaced by [X, Z]:
  $\Sigma_{[X,Z]Y|Z} = O \;\Rightarrow\; p(x, y, z') = \int p(x, z'|z)\, p(y|z)\, p(z)\, dz$,
  where $p(x, z'|z) = p(x|z)\, \delta(z - z')$, so that
  $p(x, y, z') = p(x|z')\, p(y|z')\, p(z')$,  i.e.  $p(x, y|z') = p(x|z')\, p(y|z')$.


Application to Dimension Reduction for Regression

Dimension reduction
Input: X = (X1, ..., Xm). Output: Y (either continuous or discrete).
Goal: find an effective subspace spanned by an m x d matrix B such that

  $p_{Y|X}(y \mid x) = p_{Y|B^T X}(y \mid B^T x)$,

where $B^T X = (b_1^T X, \ldots, b_d^T X)$ is the linear feature vector. No further assumptions are made on the conditional p.d.f. p.

Conditional independence formulation: writing X = (U, V) with U = $B^T X$, Y depends on X only through U, i.e., B spans the effective subspace:

  X ⊥ Y | $B^T X$


Kernel Dimension Reduction (Fukumizu, Bach, Jordan 2004, 2006)

Use a d-dimensional Gaussian kernel $k_d(z_1, z_2)$ for $B^T X$, and a characteristic kernel for Y. Then

  $\Sigma_{YY|B^T X} \geq \Sigma_{YY|X}$   ($\geq$: the partial order of self-adjoint operators),

  $\Sigma_{YY|B^T X} = \Sigma_{YY|X}$  ⇔  X ⊥ Y | $B^T X$.

KDR objective:
  $\min_{B:\, B^T B = I_d} \mathrm{Tr}\big[\hat\Sigma_{YY|B^T X}\big]$

See FBJ 04, 06 for further details. (Extension: Nilsson et al. ICML07.)

A very general method for dimension reduction: no model for the regression, no strong assumption on the distributions. Optimization, however, is not easy. A sketch of the empirical objective follows.
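A minimal sketch of the empirical objective. With centered Gram matrices one can show $\mathrm{Tr}[\hat\Sigma_{YY|Z}] = \varepsilon\, \mathrm{Tr}[G_Y (G_Z + N\varepsilon I_N)^{-1}]$, so minimizing over B reduces to the trace below; the kernel widths, the regularization ε, and the function names are illustrative choices, not the authors' exact implementation:

```python
import numpy as np

def centered_gram(Z, sigma=1.0):
    """Centered Gaussian Gram matrix HKH, k(z,z') = exp(-||z-z'||^2/sigma^2)."""
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    N = len(Z)
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ np.exp(-sq / sigma**2) @ H

def kdr_objective(B, X, Gy, sigma=1.0, eps=1e-3):
    """Surrogate for Tr[Sigma_{YY|B^T X}]: smaller means B^T X explains Y better."""
    N = X.shape[0]
    Gz = centered_gram(X @ B, sigma)           # Gram matrix of B^T X
    return np.trace(Gy @ np.linalg.inv(Gz + N * eps * np.eye(N)))
```

In practice this objective is minimized over B subject to $B^T B = I_d$, e.g., by gradient steps on the Stiefel manifold with annealing of the kernel width; this is where the optimization difficulty lies.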


Experiments with KDR: Wine Data

Data: 13 dimensions, 178 samples, 3 classes; projected to 2 dimensions. KDR uses the Gaussian kernel
$k(z_1, z_2) = \exp\big( -\|z_1 - z_2\|^2 / \sigma^2 \big)$ with $\sigma = 30$.

[Figure: 2-dimensional projections of the Wine data found by KDR, CCA, Partial Least Squares, and Sliced Inverse Regression.]


Measure of Conditional Independence

HS norm of the conditional cross-covariance operator
– Measure for conditional dependence:
  $HSCIC = \|\Sigma_{\ddot Y \ddot X|Z}\|_{HS}^2$,  where $\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$.

– Conditional independence: under some "richness" assumptions (e.g., Gaussian kernel),
  $HSCIC = \|\Sigma_{\ddot Y \ddot X|Z}\|_{HS}^2$ is zero if and only if X ⊥ Y | Z.

– Empirical measure:
  $HSCIC_{emp} = \mathrm{Tr}\big[ G_{\ddot X} G_{\ddot Y} - 2\, G_{\ddot X} G_{\ddot Y} G_Z (G_Z + N\varepsilon I_N)^{-1} + G_{\ddot X} G_Z (G_Z + N\varepsilon I_N)^{-1} G_{\ddot Y} G_Z (G_Z + N\varepsilon I_N)^{-1} \big]$
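A sketch of the empirical HSCIC, transcribing the trace formula above with centered Gram matrices; the kernel, its width, ε, and the 1/N² normalization (the slide leaves the scaling implicit) are illustrative assumptions:

```python
import numpy as np

def centered_gram(U, sigma=1.0):
    """Centered Gaussian Gram matrix HKH, H = I - 11^T/N; U is (N, d)."""
    sq = np.sum(U**2, 1)[:, None] + np.sum(U**2, 1)[None, :] - 2 * U @ U.T
    N = len(U)
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ np.exp(-sq / sigma**2) @ H

def hscic(X, Y, Z, eps=1e-3):
    """Empirical HSCIC with extended variables Xdd = (X,Z), Ydd = (Y,Z)."""
    N = len(Z)
    Gx = centered_gram(np.hstack([X, Z]))      # Gram matrix of (X, Z)
    Gy = centered_gram(np.hstack([Y, Z]))      # Gram matrix of (Y, Z)
    Gz = centered_gram(Z)
    P = Gz @ np.linalg.inv(Gz + N * eps * np.eye(N))   # symmetric smoother
    return np.trace(Gx @ Gy - 2 * Gx @ Gy @ P + Gx @ P @ Gy @ P) / N**2
```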


Normalized Conditional Covariance

Normalized conditional cross-covariance operator:
  $W_{YX|Z} \equiv W_{YX} - W_{YZ} W_{ZX}$   (recall $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$)

  $W_{YX|Z} = \Sigma_{YY}^{-1/2}\, \Sigma_{YX|Z}\, \Sigma_{XX}^{-1/2} = \Sigma_{YY}^{-1/2} \big( \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX} \big) \Sigma_{XX}^{-1/2}$

– Conditional independence: under some "richness" assumptions (e.g., Gaussian kernel),
  $W_{\ddot Y \ddot X|Z} = O$  ⇔  X ⊥ Y | Z

– HS Normalized Conditional Independence Criterion:
  $HSNCIC = \|W_{\ddot Y \ddot X|Z}\|_{HS}^2$,   and   $HSNCIC = 0$  ⇔  X ⊥ Y | Z


– Kernel-free expression: under some "richness" assumptions,
  $\|W_{\ddot Y \ddot X|Z}\|_{HS}^2 = \iint \left( \dfrac{p_{XYZ}(x, y, z)}{p(x|z)\, p(y|z)\, p_Z(z)} - 1 \right)^2 \dfrac{p_{XZ}(x, z)\, p_{YZ}(y, z)}{p_Z(z)}\, dx\, dy\, dz$
  (the "conditional" mean square contingency).

– Empirical estimator of HSNCIC:
  $HSNCIC_{emp} = \mathrm{Tr}\big[ R_{\ddot X} R_{\ddot Y} - 2 R_{\ddot X} R_{\ddot Y} R_Z + R_{\ddot X} R_Z R_{\ddot Y} R_Z \big]$,
  where $R_{\ddot X} \equiv G_{\ddot X} (G_{\ddot X} + N\varepsilon I_N)^{-1}$, etc.
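The empirical HSNCIC transcribes directly into code; a sketch, reusing centered_gram from the HSCIC example above (ε is again an illustrative choice):

```python
import numpy as np

def normalized_gram(G, eps=1e-3):
    """R = G (G + N*eps*I_N)^{-1} for a centered Gram matrix G."""
    N = len(G)
    return G @ np.linalg.inv(G + N * eps * np.eye(N))

def hsncic(X, Y, Z, eps=1e-3):
    """Empirical HSNCIC: Tr[Rx Ry - 2 Rx Ry Rz + Rx Rz Ry Rz].
    centered_gram is as defined in the HSCIC sketch above."""
    Rx = normalized_gram(centered_gram(np.hstack([X, Z])), eps)  # R for (X,Z)
    Ry = normalized_gram(centered_gram(np.hstack([Y, Z])), eps)  # R for (Y,Z)
    Rz = normalized_gram(centered_gram(Z), eps)
    return np.trace(Rx @ Ry - 2 * Rx @ Ry @ Rz + Rx @ Rz @ Ry @ Rz)
```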


Conditional Independence Test

Permutation test with the kernel measure
  $T_N = \big\| \hat\Sigma^{(N)}_{\ddot Y \ddot X|Z} \big\|_{HS}^2$  or  $T_N = \big\| \hat W^{(N)}_{\ddot Y \ddot X|Z} \big\|_{HS}^2$

– If Z takes values in a finite set {1, ..., L}, set
  $A_\ell = \{\, i \mid Z_i = \ell \,\}$  (ℓ = 1, ..., L);
  otherwise, partition the values of Z into L subsets C_1, ..., C_L, and set
  $A_\ell = \{\, i \mid Z_i \in C_\ell \,\}$  (ℓ = 1, ..., L).

– Repeat the following process B times (b = 1, ..., B):
  1. Generate pseudo conditionally independent data D(b) by permuting the X data within each $A_\ell$.
  2. Compute T_N(b) for the data D(b).
  This approximates the null distribution under the conditional independence assumption.

– Set the threshold at the (1-α)-percentile of the empirical distribution of T_N(b).

[Figure: the (X_i, Y_i) pairs are grouped by the clusters C_1, C_2, ..., C_L of Z values, and the X values are permuted within each cluster.]
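A sketch of the whole test, assuming a statistic such as hscic or hsncic from the earlier sketches and a precomputed clustering of the Z values (both are placeholders for whatever the user supplies):

```python
import numpy as np

def permutation_ci_test(X, Y, Z, clusters, stat, B=500, alpha=0.05, seed=0):
    """Permutation test of H0: X _||_ Y | Z.
    clusters: length-N integer labels assigning each sample to a cell A_l;
    stat(X, Y, Z): a conditional dependence measure, e.g. hscic or hsncic."""
    rng = np.random.default_rng(seed)
    T = stat(X, Y, Z)
    null = np.empty(B)
    for b in range(B):
        Xp = X.copy()
        for c in np.unique(clusters):          # permute X within each cell
            idx = np.where(clusters == c)[0]
            Xp[idx] = X[rng.permutation(idx)]
        null[b] = stat(Xp, Y, Z)
    threshold = np.quantile(null, 1 - alpha)   # (1 - alpha)-percentile
    return T, threshold, T > threshold         # True = reject H0
```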


Application to Graphical Modeling
– Three continuous variables of medical measurements, N = 35 (Edwards 2000, Sec. 3.1.4):
  Creatinine clearance (C), Digoxin clearance (D), Urine flow (U).

– Kernel method (permutation test) vs. linear method:

  Hypothesis  | HSN(C)IC | P-val  | (partial) cor.         | P-val
  C ⊥ D       | 0.776    | <0.001 | Cor(C,D) = 0.7754      | 0.0000
  C ⊥ U       | 0.194    | 0.117  | Cor(C,U) = 0.3092      | 0.0707
  D ⊥ U       | 0.343    | 0.023  | Cor(D,U) = 0.5309      | 0.0010
  D ⊥ U | C   | 1.458    | 0.924  | Parcor(D,U|C) = 0.4847 | 0.0037

– Suggested undirected graphical model by the kernel method: D — C — U, with no edge D — U since D ⊥ U | C is accepted. This conditional independence coincides with the medical knowledge.


Statistical Consistency

Consistency of the conditional covariance operator

Theorem (FBJ06, Sun et al. 07)
Assume $\varepsilon_N \to 0$ and $N \varepsilon_N \to \infty$. Then
  $\big\| \hat\Sigma^{(N)}_{YX|Z} - \Sigma_{YX|Z} \big\|_{HS} \to 0 \quad (N \to \infty)$.

In particular,
  $\big\| \hat\Sigma^{(N)}_{YX|Z} \big\|_{HS} \to \big\| \Sigma_{YX|Z} \big\|_{HS} \quad (N \to \infty)$,

i.e., $HSCIC_{emp}$ converges to the population value HSCIC.


Consistency of the normalized conditional covariance operator

Theorem (FGSS07)
Assume that $W_{YX|Z}$ is Hilbert-Schmidt, and that the regularization coefficient satisfies $\varepsilon_N \to 0$ and $N^{1/3} \varepsilon_N \to \infty$. Then
  $\big\| \hat W^{(N)}_{YX|Z} - W_{YX|Z} \big\|_{HS} \to 0 \quad (N \to \infty)$.

In particular,
  $\big\| \hat W^{(N)}_{YX|Z} \big\|_{HS} \to \big\| W_{YX|Z} \big\|_{HS} \quad (N \to \infty)$,

i.e., $HSNCIC_{emp}$ converges to the population value HSNCIC.

– Note: convergence in HS-norm is stronger than convergence in operator norm.


Summary of Part VI

Conditional independence by kernels
– Conditional independence is characterized in two ways:
  • Conditional covariance operator:
    X ⊥ Y | Z  ⇔  $\Sigma_{YY|[X,Z]} = \Sigma_{YY|Z}$
  • Conditional cross-covariance operator:
    X ⊥ Y | Z  ⇔  $\Sigma_{\ddot Y \ddot X|Z} = O$  or  $W_{\ddot Y \ddot X|Z} = O$

Kernel Dimension Reduction
– A very general method for dimension reduction for regression.

Measures for conditional independence
– HS norm of the conditional cross-covariance operator
– HS norm of the normalized conditional cross-covariance operator
– Kernel-free in population.


VII. Causal Inference


Causal Inference

With manipulation (intervention)
– Manipulate X and observe Y: is X a cause of Y?
– Easier (do-calculus, Pearl 1995).

No manipulation / with temporal information
– X(t), Y(t): observed time series. Are X(1), ..., X(t) a cause of Y(t+1)?

No manipulation / no temporal information
– Causal inference is harder: is X a cause of Y, or Y a cause of X?


Difficulty of causal inference from non-experimental data
– Widely accepted view until the 1980s:
  causal inference is impossible without manipulating some variables.
  e.g., "No causation without manipulation" (Holland 1986, JASA).

– Temporal information is very helpful, but not decisive.
  e.g., the barometer falls before it rains, but it does not cause the rain.

– There are many philosophical discussions, but they are not covered here. See Pearl (2000) and the references therein.


Correlation (dependence) and causality
Do not confuse causality with dependence (or correlation)!

Example: a study showed that young children who sleep with the light on are much more likely to develop myopia in later life (Nature 1999).

  Naive reading:  light on → short-sight

A later study pointed to a hidden common cause (Nature 2000):

  parental myopia → light on,  parental myopia → short-sight


Causality of Time Series

Granger causality (Granger 1969)
X(t), Y(t): two time series, t = 1, 2, 3, ...
– Problem: is {X(1), ..., X(t)} a cause of Y(t+1)?

– Granger causality. Model: AR (assuming no inverse causal relation),
  $Y(t) = c + \sum_{i=1}^p a_i\, Y(t-i) + \sum_{j=1}^p b_j\, X(t-j) + U_t$

– Test
  H0: $b_1 = b_2 = \cdots = b_p = 0$

  X is called a Granger cause of Y if H0 is rejected.


– F-test
  • Linear estimation: fit the full model
    $Y(t) = c + \sum_{i=1}^p a_i\, Y(t-i) + \sum_{j=1}^p b_j\, X(t-j) + U_t$   (estimates $\hat c, \hat a_i, \hat b_j$)
    and the restricted model
    $Y(t) = c + \sum_{i=1}^p a_i\, Y(t-i) + W_t$   (estimates $\hat{\hat c}, \hat{\hat a}_i$),
    with residual sums of squares
    $ERR_1 = \sum_{t=p+1}^N \big( Y(t) - \hat Y(t) \big)^2$,   $ERR_0 = \sum_{t=p+1}^N \big( Y(t) - \hat{\hat Y}(t) \big)^2$.

  • Test statistic:
    $T \equiv \dfrac{(ERR_0 - ERR_1)/p}{ERR_1/(N - 2p - 1)} \;\Rightarrow\; F_{p,\, N-2p-1} \quad (N \to \infty)$  under H0,
    where the p.d.f. of $F_{d_1, d_2}$ is
    $\dfrac{1}{x\, B(d_1/2,\, d_2/2)} \left( \dfrac{d_1 x}{d_1 x + d_2} \right)^{d_1/2} \left( 1 - \dfrac{d_1 x}{d_1 x + d_2} \right)^{d_2/2}$.

– Software
  • Matlab: Econometrics toolbox (www.spatial-econometrics.com)
  • R: lmtest package
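A self-contained sketch of the test (scipy is used only for the F distribution; the degrees of freedom follow the statistic above):

```python
import numpy as np
from scipy import stats

def granger_ftest(x, y, p):
    """F-test of H0: b_1 = ... = b_p = 0 in
    Y(t) = c + sum_i a_i Y(t-i) + sum_j b_j X(t-j) + U_t."""
    N = len(y)
    Yt = y[p:]
    Ylags = np.column_stack([y[p - i:N - i] for i in range(1, p + 1)])
    Xlags = np.column_stack([x[p - j:N - j] for j in range(1, p + 1)])
    ones = np.ones((N - p, 1))

    D1 = np.hstack([ones, Ylags, Xlags])       # full model
    D0 = np.hstack([ones, Ylags])              # restricted model (no X lags)
    err1 = np.sum((Yt - D1 @ np.linalg.lstsq(D1, Yt, rcond=None)[0]) ** 2)
    err0 = np.sum((Yt - D0 @ np.linalg.lstsq(D0, Yt, rcond=None)[0]) ** 2)

    T = ((err0 - err1) / p) / (err1 / (N - 2 * p - 1))
    pval = stats.f.sf(T, p, N - 2 * p - 1)     # P(F > T) under H0
    return T, pval
```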


– Granger causality is widely used and influential in econometrics. Clive Granger received the Nobel Prize in 2003.

– Limitations
  • Linearity: a linear AR model is used; no nonlinear dependence is considered.
  • Stationarity: stationary time series are assumed.
  • Hidden cause: hidden common causes (other time series) cannot be considered.

  "Granger causality" is not necessarily "causality" in the general sense.

– There are many extensions.

– With kernel dependence measures, it is easily extended to incorporate nonlinear dependence.
  Remark: there are few good conditional independence tests for continuous variables.


Kernel Method for Causality of Time Series

Causality by conditional independence
– Extended notion of Granger causality: X is NOT a cause of Y if
  $p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p}, X_{t-1}, \ldots, X_{t-p}) = p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})$,
  i.e.,  $\{X_{t-1}, \ldots, X_{t-p}\}$ ⊥ $Y_t$ | $\{Y_{t-1}, \ldots, Y_{t-p}\}$.

– Kernel measures for causality: apply HSCIC and HSNCIC with $X = \mathbf X_p$, $Y = Y_t$, $Z = \mathbf Y_p$, estimated from the lagged samples

  $\mathbf X_p = \{ (X_{t-1}, \ldots, X_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N \}$,
  $\mathbf Y_p = \{ (Y_{t-1}, \ldots, Y_{t-p}) \in \mathbb{R}^p \mid t = p+1, \ldots, N \}$.
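A small helper for building the lagged blocks, so that the permutation test sketched earlier can be applied with X = X_p, Y = Y_t, Z = Y_p (all names are hypothetical):

```python
import numpy as np

def lagged_blocks(x, y, p):
    """Return X_p, Y_p (rows (.(t-1), ..., .(t-p))) and the targets Y_t."""
    N = len(y)
    Xp = np.column_stack([x[p - j:N - j] for j in range(1, p + 1)])
    Yp = np.column_stack([y[p - j:N - j] for j in range(1, p + 1)])
    return Xp, Yp, y[p:].reshape(-1, 1)

# "X is not a cause of Y" is then tested as  X_p _||_ Y_t | Y_p, e.g.
# Xp, Yp, Yt = lagged_blocks(x, y, p)
# permutation_ci_test(Xp, Yt, Yp, clusters, stat=hsncic)
```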


Example: Coupled Hénon Map
– X = (x_1, x_2), Y = (y_1, y_2):

  $x_1(t+1) = 1.4 - x_1(t)^2 + 0.3\, x_2(t)$
  $x_2(t+1) = x_1(t)$

  $y_1(t+1) = 1.4 - \big\{ \gamma\, x_1(t)\, y_1(t) + (1 - \gamma)\, y_1(t)^2 \big\} + 0.1\, y_2(t)$
  $y_2(t+1) = y_1(t)$

  X drives Y with coupling strength γ.

[Figure: scatter plots of x_1 vs. y_1 for γ = 0, 0.25, and 0.8, and of x_1 vs. x_2 (the Hénon attractor).]


Causality of the coupled Hénon map
– X is a cause of Y if γ > 0.
– Y is not a cause of X for all γ.

– Permutation tests for non-causality with $HSNCIC = \|\hat W_{\ddot Y \ddot X|Z}\|_{HS}^2$:
  H0: $Y_t$ is not a cause of $X_{t+1}$, i.e. $\{Y_{t-1}, \ldots, Y_{t-p}\}$ ⊥ $X_t$ | $\{X_{t-1}, \ldots, X_{t-p}\}$;
  H0: $X_t$ is not a cause of $Y_{t+1}$, i.e. $\{X_{t-1}, \ldots, X_{t-p}\}$ ⊥ $Y_t$ | $\{Y_{t-1}, \ldots, Y_{t-p}\}$.
  N = 100; 1-dimensional independent noise is added to X(t) and Y(t).

[Table: number of times H0 is accepted among 100 datasets (α = 5%), for γ = 0.0, ..., 0.6 and both null hypotheses, comparing the HSNCIC permutation test with the Granger test.]


Causal Inference from Non-experimental Data

Why is it possible?
– DAG of a chain X – Z – Y:
  For the structures X → Z → Y, X ← Z ← Y, and X ← Z → Y, we have X ⊥̸ Y and X ⊥ Y | Z. They cannot be distinguished from the probability alone, since
  p(x,y,z) = p(x|z)p(y|z)p(z) = p(x|z)p(z|y)p(y) = p(y|z)p(z|x)p(x).

– V-structure X → Z ← Y:
  X ⊥ Y and X ⊥̸ Y | Z.
  This is the only detectable directed graph of three variables.


Causal Learning Methods

Constraint-based methods (discussed in this lecture)
– Determine the (conditional) independences of the underlying probability.
– Relatively efficient in handling hidden variables.

Score-based methods
– Structure learning of Bayesian networks (Ghahramani's lecture).
– Able to use informative priors.
– Optimization over a huge search space.
– Many methods assume discrete variables (discretization) or a parametric model.

Common hidden causes
– For simplicity, algorithms assuming no hidden variables are explained in this lecture.


Fundamental Assumptions

Markov assumption on a DAG
– The causal relation is expressed by a DAG, and the probability generating the data is consistent with the graph, e.g.
  $p(X) = p(X_a)\, p(X_b)\, p(X_c \mid X_a, X_b)\, p(X_d \mid X_c)$
  for the graph a → c ← b, c → d.

Faithfulness (stability)
– The inferred DAG (causal structure) must express all the independence relations.
  [Figure: a denser graph over {a, b, c, d} includes the true probability as a special case, but its structure does not express a ⊥ b; such a graph is unfaithful, while the true graph is faithful.]


Inductive Causation

IC algorithm (Verma & Pearl 90)
Input: V: set of variables, D: dataset of the variables.
Output: DAG (specifies an equivalence class; partially directed).

1. For each $(a, b) \in V \times V$ ($a \neq b$), search for $S_{ab} \subset V \setminus \{a, b\}$ such that
   $X_a$ ⊥ $X_b$ | $S_{ab}$.
   Construct an undirected graph (skeleton) by connecting a and b if and only if no set $S_{ab}$ can be found.

2. For each nonadjacent pair (a, b) with a – c – b, direct the edges by a → c ← b if $c \notin S_{ab}$.

3. Orient as many of the undirected edges as possible, on the condition that neither new v-structures nor directed cycles are created. (See the next slide for the precise implementation.)


Step 3 of the IC algorithm
– The following rules are necessary and sufficient to direct all the possible inferred causal directions (Verma & Pearl 92, Meek 95):

1. If there is a triplet a → b – c with a and c nonadjacent, orient b – c into b → c.

2. If for a – b there is a chain a → c → b, orient a – b into a → b.

3. If for a – b there are two chains a – c → b and a – d → b such that c and d are nonadjacent, orient a – b into a → b.


Example

[Figure: a true DAG over {a, b, c, d, e} and the output after each step of the IC algorithm.]

1) Skeleton: $S_{bc} = \{a\}$, $S_{ad} = \{b, c\}$, $S_{ae} = \{d\}$, $S_{be} = S_{ce} = \{d\}$; for the other pairs, no separating set S exists.
2) For (b, c): $d \notin S_{bc}$, so the edges are directed b → d ← c.
3) The remaining edges are oriented where possible.

Direction of some edges may be left undetermined.


PC Algorithm (Peter Spirtes & Clark Glymour 91)

– Linear method: partial correlation with a χ² test is used in Step 1.
– Efficient computation for Step 1: start with the complete graph, check $X_a$ ⊥ $X_b$ | S only for $S \subset N_a \setminus \{b\}$ ($N_a$: the current neighbours of a), and keep the edge a — b only if no such S is found.

  i = 0; G = complete graph
  repeat
    for each a in V
      for each b in N_a
        check X_a ⊥ X_b | S for S ⊂ N_a \ {b} with |S| = i
        if such an S exists, set S_ab = S and delete the edge a — b from G
    i = i + 1
  until |N_a| < i for all a

– Implemented in TETRAD (http://www.phil.cmu.edu/projects/tetrad/). A Python sketch of the skeleton phase follows.
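A sketch of the skeleton phase, with the conditional-independence test injected as a callback (e.g., the kernel permutation test above); data handling is left out:

```python
import itertools

def pc_skeleton(V, ci_test):
    """PC skeleton search. V: variable names; ci_test(a, b, S) -> True
    when X_a _||_ X_b | S is accepted. Returns adjacency and separating sets."""
    adj = {a: set(V) - {a} for a in V}          # start from the complete graph
    sepset = {}
    i = 0
    while any(len(adj[a]) > i for a in V):
        for a in V:
            for b in list(adj[a]):
                # conditioning sets of size i from the neighbours of a, minus b
                for S in itertools.combinations(sorted(adj[a] - {b}), i):
                    if ci_test(a, b, set(S)):
                        sepset[(a, b)] = sepset[(b, a)] = set(S)
                        adj[a].discard(b)
                        adj[b].discard(a)
                        break
        i += 1
    return adj, sepset
```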


Kernel-based Causal Learning

Limitations of the previous implementations of IC
– Linear / discrete assumptions in Step 1: testing conditional independence for continuous variables is difficult.
  → kernel method!

– Errors in the skeleton of Step 1 cannot be recovered in the later steps.
  → voting method


KCL algorithm (Sun et al. ICML07, Sun et al. 2007)

– Dependence measure:
  $\hat{\mathcal H}^{(N)}_{YX} = HSIC = \big\| \hat\Sigma^{(N)}_{YX} \big\|_{HS}^2$

– Conditional dependence measure:
  $\hat{\mathcal H}^{(N)}_{YX|Z} \equiv \dfrac{ \big\| \hat\Sigma^{(N)}_{\ddot Y \ddot X|Z} \big\|_{HS}^2 }{ \big\| \hat C_{ZZ} \big\|_{HS}^2 }$,

  where the operator $C_{ZZ}: H_Z \to H_Z$ is defined by $\langle f, C_{ZZ}\, g \rangle = E[f(Z)\, g(Z)]$.

  Motivation: make $\big\| \hat\Sigma^{(N)}_{YX} \big\|_{HS}^2$ and $\big\| \hat\Sigma^{(N)}_{\ddot Y \ddot X|Z} \big\|_{HS}^2$ comparable.

  Theorem: if (X, Y) ⊥ Z, then
  $\big\| \hat\Sigma^{(N)}_{\ddot Y \ddot X|Z} \big\|_{HS}^2 = \big\| \hat C_{ZZ} \big\|_{HS}^2\, \big\| \hat\Sigma^{(N)}_{YX} \big\|_{HS}^2$.


Outline of the KCL algorithm: the IC algorithm is modified as follows.

KCL-1: Skeleton by statistical tests
(1) Permutation tests of conditional independence
    X ⊥ Y | $S_{XY}$
    for all (X, Y, $S_{XY}$) ($S_{XY} \subset V \setminus \{X, Y\}$), with the measure $\hat{\mathcal H}^{(N)}_{YX|S_{XY}}$.
(2) Connect X and Y if no such $S_{XY}$ exists.

KCL-2: Majority votes for directing edges
For all triplets X – Z – Y (X and Y may be adjacent), give a vote to the directions X → Z and Y → Z if
  $M_{XY|Z} \equiv \dfrac{ \hat{\mathcal H}^{(N)}_{YX|Z} }{ \hat{\mathcal H}^{(N)}_{YX} } > \lambda$.

Repeat this for (a) $\lambda \gg 1$ (rigorous v-structure) and (b) $\lambda = \max\{ M_{XZ|Y},\, M_{YZ|X} \}$ (relative v-structure).

Make an arrow on each edge if a vote is given ("↔" is allowed).

KCL-3: Same as IC-3.


Illustration of KCL

[Figure: a true DAG and the graphs after KCL-1, KCL-2 (a), KCL-2 (b), and KCL-3.]

Heuristic assumption: $M_{XY|Z} > M_{XZ|Y},\, M_{YZ|X}$ when Z is a common effect of X and Y; conditioning on the common effect strengthens the dependence between the causes.


Hidden common cause

– FCI (Fast Causal Inference, Spirtes et al. 93) extends PC to allow hidden variables.

– A bi-directional arrow (↔) given by KCL may be interpreted as a hidden common cause. This is empirically confirmed, but has no theoretical justification yet (Sun et al. 2007).


Experiments with KCL

Smoking and Cancer
– Data (N = 44):
  CIGARET: cigarette sales in 43 US states and the District of Columbia.
  BLADDER, LUNG, KIDNEY, LEUKEMIA: death rates from the corresponding cancers.

– Results
  [Figure: graphs over {CIGARET, BLADDER, LUNG, KIDNEY, LEUKEMIA} inferred by FCI and by KCL.]


Montana Economic Outlook Poll (1992)
– Data: 7 discrete variables, N = 209.
  AGE (3), SEX (2), INCOME (3), POLITICAL (3), AREA (3), FINANCIAL status (3: better/same/worse than a year ago), OUTLOOK (2).

[Figure: graphs over the seven variables inferred by FCI, BN-PC, and KCL.]

BN-PC is a constraint-based method using mutual information (Cheng et al. 2002).


Summary of Part VII

Causality of time series
– Kernel-based measures: a nonlinear extension of Granger causality.

Causal inference from non-experimental data
– Kernel-based Causal Learning (KCL) algorithm
  • Constraint-based method: a variant of Inductive Causation.
    – Conditional independence tests with kernel measures.
    – Voting method for directions.
  • More reasonable results are obtained than with existing methods. See Sun et al. (2007) for more detailed comparisons.


Bibliography

Papers

Cheng, J., R. Greiner, J. Kelly, D. A. Bell, and W. Liu. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence Journal, 137:43–90. (2002).

Friedman, J. and Rafsky, L. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Annals of Statistics, 7:697-717 (1979).

Fukumizu, K., F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361-383 (2007).

Fukumizu, K., F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99 (2004).

Fukumizu, K., F. Bach, and M. Jordan. Kernel dimension reduction in regression. Tech Report 715, Dept. Statistics, University of California, Berkeley, 2006.

Fukumizu, K., A. Gretton, X. Sun., and B. Schölkopf. Kernel Measures of Conditional Dependence. Submitted (2007)

Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37:424-438 (1969).

Gretton, A., A. J. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf and N. K. Logothetis. Kernel Constrained Covariance for Dependence Measurement. Proc. 10th Intern. Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pp.112-119 (2005)


Gretton, A., O. Bousquet, A. Smola and B. Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. Algorithmic Learning Theory: 16th International Conference, ALT 2005, pp.63-78 (2005)

Gretton, A., K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19. MIT Press (2007).

Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A Kernel Statistical Test of Independence. Submitted (2007).

Ku, C. and Fine, T. Testing for Stochastic Independence: Application to Blind Source Separation. IEEE Trans. Signal Processing, 53(5):1815-1826 (2005).

Kraskov, A., H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69, 066138-1–16 (2004).

Meek, C. Causal inference and causal explanation with background knowledge. In P. Besnard and S. Hanks (Eds.), Uncertainty in Artificial Intelligence, vol. II, pp.403-410. Morgan Kaufmann (1995).

Nilsson, J., F. Sha, and M. Jordan. Regression on manifolds using kernel dimension reduction. Proc. 24th Intern. Conf. Machine Learning (ICML2007), pp.697-704 (2007).

Pearl, J. Causal diagrams for empirical research. Biometrika 82:669-710 (1995).

Shen, H., S. Jegelka, and A. Gretton. Fast Kernel ICA using an Approximate Newton Method. Proc. 11th Intern. Workshop on Artificial Intelligence and Statistics (AISTATS 2007).

Spirtes, P. and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review 9:62-72 (1991).


Spirtes, P., C. Meek and T. Richardson. Causal inference in the presence of latent variables and selection bias. Proc. 11th Conf. Uncertainty in Artificial Intelligence. pp 499-506 (1995).

Steinwart, I. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67-93 (2002).

Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Intern. Conf. Machine Learning (ICML2007), pp.855-862. (2007)

Sun, X., D. Janzing, B. Schölkopf, K. Fukumizu, and A. Gretton. Learning Causal Structures via Kernel-based Statistical Dependence Measures. Submitted (2007)

Verma, T., J. Pearl. Equivalence and synthesis of causal models. Proc. 6th Conf. Uncertainty in Artificial Intelligence (UAI1990) pp.220-227 (1990)

Verma, T., J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. Proc. 8th Conf. Uncertainty in Artificial Intelligence (UAI1992) pp.323-330 (1992)

Holland, P.W. Statistics and causal inference. J. American Statistical Association 81: 945-960 (1986).

Quinn, G.E., C.H. Shin, M.G. Maguire, and R.A. Stone. Myopia and ambient lighting at night. Nature 399: 113 (May 13, 1999).

Zadnik, K., L.A. Jones, B.C. Irvin, R.N. Kleinstein, R.E. Manny, J.A. Shin, and D.O. Mutti. Myopia and ambient night-time lighting. Nature 404: 143-144 (9 March 2000)


Books

Hyvärinen, A., J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience (2001).

Pearl, J. Causality. Cambridge University Press (2000).

Edwards, D. Introduction to Graphical Modelling. Springer-Verlag, New York (2000).

Read, T. and Cressie, N. Goodness-of-Fit Statistics for Discrete Multivariate Analysis. Wiley, New York (1995).

Spirtes, P., C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York (1993). (2nd ed. 2000)


Many thanks to my collaborators,

Bernhard Schölkopf, Arthur Gretton, Xiaohai Sun, Dominik Janzing,

and to many active students.