
Lecture 2

Hilbert Space Embedding of Probability Measures

Bharath K. Sriperumbudur

Department of Statistics, Pennsylvania State University

Machine Learning Summer School

Tübingen, 2017

Recap of Lecture 1

The kernel method provides an elegant approach to obtain non-linear algorithms from linear algorithms.

• Input space, X: the space of observed data on which learning is performed.

• Feature map, Φ: defined through a positive definite kernel function k : X × X → R,

  x ↦ Φ(x), x ∈ X.

• Constructing linear algorithms in the feature space Φ(X) translates into non-linear algorithms in X.

• Elegance: no explicit construction of Φ is needed, since ⟨Φ(x), Φ(y)⟩ = k(x, y).

• Function space view: RKHS; smoothness and generalization.

Examples

• Ridge regression. In fact many more (kernel SVM / PCA / FDA / CCA / perceptron / logistic regression, ...).

Outline

• Motivating example: comparing distributions
• Hilbert space embedding of measures
  – Mean element
  – Distance on probabilities (MMD)
  – Characteristic kernels
• Cross-covariance operator and measure of independence
• Applications
  – Two-sample testing
  – Choice of kernel

Co-authors

• Sivaraman Balakrishnan (Statistics, Carnegie Mellon University)
• Kenji Fukumizu (Institute for Statistical Mathematics, Tokyo)
• Arthur Gretton (Gatsby Unit, University College London)
• Gert Lanckriet (Electrical Engineering, University of California, San Diego)
• Krikamol Muandet (Mathematics, Mahidol University, Bangkok)
• Massimiliano Pontil (Computer Science, University College London)
• Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen)
• Dino Sejdinovic (Statistics, University of Oxford)
• Heiko Strathmann (Gatsby Unit, University College London)
• Ilya Tolstikhin (Max Planck Institute for Intelligent Systems, Tübingen)

Motivating Example: Coin Toss

• Toss 1: T H H H T T H T T H H T H
• Toss 2: H T T H T H T T H H H T T

Are the coins/tosses statistically similar?

Toss 1 is a sample from P := Bernoulli(p) and Toss 2 is a sample from Q := Bernoulli(q).

Is p = q or not? That is, compare

  E_P[X] = ∫_{{0,1}} x dP(x)   and   E_Q[X] = ∫_{{0,1}} x dQ(x).

Coin Toss Example

In other words, we compare

  ∫_R Φ(x) dP(x)   and   ∫_R Φ(x) dQ(x),

where Φ is the identity map, Φ(x) = x.

A positive definite kernel corresponding to Φ is

  k(x, y) = ⟨Φ(x), Φ(y)⟩ = xy,

which is the linear kernel on {0, 1}. Therefore, comparing two Bernoulli distributions is equivalent to checking

  ∫_{{0,1}} k(y, x) dP(x)  ?=  ∫_{{0,1}} k(y, x) dQ(x)   for all y ∈ {0, 1},

i.e., comparing the expectations of the kernel.

Comparing two Gaussians

P = N(µ₁, σ₁²) and Q = N(µ₂, σ₂²)

Comparing P and Q is equivalent to comparing µ₁, µ₂ and σ₁², σ₂², i.e.,

  E_P[X] = ∫_R x dP(x)  ?=  ∫_R x dQ(x) = E_Q[X]

and

  E_P[X²] = ∫_R x² dP(x)  ?=  ∫_R x² dQ(x) = E_Q[X²].

Concisely,

  ∫_R Φ(x) dP(x)  ?=  ∫_R Φ(x) dQ(x),

where Φ(x) = (x, x²).

Compare the first moment of the feature map.

Comparing two Gaussians

Using the map Φ, we can construct a positive definite kernel as

  k(x, y) = ⟨Φ(x), Φ(y)⟩_{R²} = xy + x²y²,

which is a polynomial kernel of order 2.

Therefore, comparing two Gaussians is equivalent to checking

  ∫_R k(y, x) dP(x)  ?=  ∫_R k(y, x) dQ(x)   for all y ∈ R,

i.e., comparing the expectations of the kernel.

Comparing general P and Q

The moment generating function is defined as

  M_P(y) = ∫_R e^{xy} dP(x)

and (if it exists) captures the information about a distribution, i.e.,

  M_P = M_Q ⇔ P = Q.

Choosing

  Φ(x) = (1, x, x²/√(2!), ..., xⁱ/√(i!), ...) ∈ ℓ²(N),   ∀ x ∈ R,

it is easy to verify that

  k(x, y) = ⟨Φ(x), Φ(y)⟩_{ℓ²(N)} = e^{xy}

and so

  ∫_R k(x, y) dP(x) = ∫_R k(x, y) dQ(x), ∀ y ∈ R   ⇔   P = Q.
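As a quick numerical sanity check (not from the slides), the truncated feature map already reproduces e^{xy} to high precision. A minimal NumPy sketch; the truncation order and the test points are arbitrary illustrative choices:

import numpy as np
from math import factorial

def phi(x, order=30):
    # Truncation of Phi(x) = (1, x, x^2/sqrt(2!), ..., x^i/sqrt(i!), ...)
    return np.array([x**i / np.sqrt(float(factorial(i))) for i in range(order)])

x, y = 0.7, -1.3
approx = phi(x) @ phi(y)   # truncated <Phi(x), Phi(y)> in l2(N)
exact = np.exp(x * y)      # k(x, y) = e^{xy}
print(approx, exact)       # the two agree to high precision

Since the series Σ_i (xy)ⁱ/i! converges rapidly, a modest truncation order suffices.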

Two-Sample Problem

• Given random samples {X₁, ..., X_m} i.i.d.∼ P and {Y₁, ..., Y_n} i.i.d.∼ Q.
• Determine: P = Q or P ≠ Q?

Applications:

• Microarray data (aggregation problem)
• Speaker verification
• Independence testing: given random samples {(X₁, Y₁), ..., (X_n, Y_n)} i.i.d.∼ P_XY, does P_XY factorize into P_X P_Y?
• Feature selection (microarrays, image and text, ...)


Hilbert Space Embedding of Measures

Hilbert Space Embedding of Measures

• Canonical feature map:

  Φ(x) = k(·, x) ∈ H,   x ∈ X,

  where H is a reproducing kernel Hilbert space (RKHS).

• Generalization to probabilities:

  x ↦ δ_x (point mass at x) ↦ k(·, x) = ∫_X k(·, y) dδ_x(y) = E_{δ_x}[k(·, Y)].

Based on the above, the map is extended to probability measures as

  P ↦ µ_P := ∫_X Φ(x) dP(x) = ∫_X k(·, x) dP(x) = E_{X∼P}[k(·, X)].

(Smola et al., ALT 2007)
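Concretely, µ_P is approximated from a sample by the empirical mean embedding µ̂_P(·) = (1/m) Σ_{i=1}^m k(·, X_i). A minimal NumPy sketch (not from the slides); the Gaussian kernel, its bandwidth and the evaluation points are illustrative assumptions:

import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); A: (n, d), B: (m, d) -> (n, m)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def empirical_mean_embedding(X, query, sigma=1.0):
    # mu_P(y) ~ (1/m) sum_i k(y, X_i), evaluated at each query point y
    return gauss_kernel(query, X, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))               # sample from P = N(0, 1)
ys = np.linspace(-3, 3, 7).reshape(-1, 1)   # where to evaluate the embedding
print(empirical_mean_embedding(X, ys))

Evaluated on a grid, µ̂_P looks like an (unnormalised) kernel-smoothed version of the sample, which is exactly the "smoothed version of P" interpretation given a few slides below.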

Properties

• µ_P is the mean of the feature map and is called the kernel mean or mean element of P.

• When is µ_P well defined?

  ∫_X √(k(x, x)) dP(x) < ∞   ⇒   µ_P ∈ H.

  Proof:

  ‖µ_P‖_H = ‖ ∫_X k(·, x) dP(x) ‖_H  ≤  ∫_X ‖k(·, x)‖_H dP(x)  =  ∫_X √(k(x, x)) dP(x),

  where the inequality is Jensen's.

• We know that for any f ∈ H, f(x) = ⟨f, k(·, x)⟩_H. So, for any f ∈ H,

  ∫_X f(x) dP(x) = ∫_X ⟨f, k(·, x)⟩_H dP(x) (•)= ⟨f, ∫_X k(·, x) dP(x)⟩_H = ⟨f, µ_P⟩_H,

  where (•) uses the fact that µ_P exists as a Bochner integral, so the inner product and the integral can be interchanged.

Interpretation

Suppose k is translation invariant on R^d, i.e., k(x, y) = ψ(x − y), x, y ∈ R^d. Then

  µ_P = ∫_{R^d} ψ(· − x) dP(x) = ψ ⋆ P,

where ⋆ denotes the convolution of ψ and P.

• Convolution is a smoothing operation ⇒ µ_P is a smoothed version of P.

• Example: Suppose P = δ_y, a point mass at y. Then µ_P = ψ ⋆ P = ψ(· − y).

• Example: Suppose ψ ∝ N(0, σ²) and P = N(µ, τ²). Then µ_P = ψ ⋆ P ∝ N(µ, σ² + τ²).

  µ_P is a wider Gaussian than P.

Comparing Kernel Means

Define a distance (maximum mean discrepancy) on probabilities:

  MMD_H(P, Q) = ‖µ_P − µ_Q‖_H.

(Gretton et al., NIPS 2006; Smola et al., ALT 2007)

  MMD²_H(P, Q) = ⟨µ_P, µ_P⟩_H + ⟨µ_Q, µ_Q⟩_H − 2⟨µ_P, µ_Q⟩_H
               = ∫_X µ_P(x) dP(x) + ∫_X µ_Q(x) dQ(x) − 2 ∫_X µ_P(x) dQ(x)
               = ∫_X ∫_X k(x, y) dP(x) dP(y) + ∫_X ∫_X k(x, y) dQ(x) dQ(y) − 2 ∫_X ∫_X k(x, y) dP(x) dQ(y)
               = E_P k(X, X′) + E_Q k(Y, Y′) − 2 E_{P,Q} k(X, Y),

where E_P k(X, X′) is the average similarity between points from P, E_Q k(Y, Y′) is the average similarity between points from Q, and E_{P,Q} k(X, Y) is the average similarity between points from P and points from Q.

Comparing Kernel Means

In the motivating examples, we compare P and Q by comparing

  µ_P(y) = ∫_X k(y, x) dP(x)   and   µ_Q(y) = ∫_X k(y, x) dQ(x),   ∀ y ∈ X.

For any f ∈ H,

  ‖f‖_∞ = sup_{y∈X} |f(y)| = sup_{y∈X} |⟨f, k(·, y)⟩_H| ≤ sup_{y∈X} √(k(y, y)) ‖f‖_H.

In particular,

  ‖µ_P − µ_Q‖_∞ ≤ sup_{y∈X} √(k(y, y)) ‖µ_P − µ_Q‖_H.

Does ‖µ_P − µ_Q‖_H = 0 ⇒ P = Q? (More on this later.)

Integral Probability Metric

The integral probability metric between P and Q is defined as

  IPM(P, Q, F) := sup_{f∈F} | ∫_X f(x) dP(x) − ∫_X f(x) dQ(x) | = sup_{f∈F} |E_P f(X) − E_Q f(X)|.

(Müller, 1997)

• F controls the degree of distinguishability between P and Q.
• Related to the Bayes risk of a certain classification problem. (S et al., NIPS 2009; EJS 2012)
• Example: Suppose F = {x ↦ a·x : a ∈ [−1, 1]}. Then

  IPM(P, Q, F) = sup_{a∈[−1,1]} |a| · | ∫_R x dP(x) − ∫_R x dQ(x) | = | ∫_R x dP(x) − ∫_R x dQ(x) |.

Integral Probability Metric

Example: Suppose F = {x ↦ a·x + b·x² : a² + b² = 1}. Then

  IPM(P, Q, F) = sup_{a²+b²=1} | a ∫_R x d(P − Q) + b ∫_R x² d(P − Q) |
               = [ ( ∫_R x d(P − Q) )² + ( ∫_R x² d(P − Q) )² ]^{1/2}.

How? Exercise!

• The richer F is, the finer the resolvability of P and Q.

We will explore the relation of MMD_H(P, Q) to IPM(P, Q, F).

Integral Probability Metric

  IPM(P, Q, F) := sup_{f∈F} | ∫_X f(x) dP(x) − ∫_X f(x) dQ(x) |

Classical results:

• F = unit Lipschitz ball (Wasserstein distance) (Dudley, 2002)
• F = unit bounded-Lipschitz ball (Dudley metric) (Dudley, 2002)
• F = {1_{(−∞,t]} : t ∈ R^d} (Kolmogorov metric) (Müller, 1997)
• F = unit ball in bounded measurable functions (total variation distance) (Dudley, 2002)

For all these F, IPM(P, Q, F) = 0 ⇒ P = Q.

(Gretton et al., NIPS 2006, JMLR 2012; S et al., COLT 2008): Let F be the unit ball in an RKHS H with bounded kernel k. Then

  MMD_H(P, Q) = IPM(P, Q, F).

Proof: ∫_X f(x) d(P − Q)(x) = ⟨f, µ_P − µ_Q⟩_H ≤ ‖f‖_H ‖µ_P − µ_Q‖_H, with equality when f = (µ_P − µ_Q)/‖µ_P − µ_Q‖_H.

Two-Sample Problem

• Given random samples {X₁, ..., X_m} i.i.d.∼ P and {Y₁, ..., Y_n} i.i.d.∼ Q.
• Determine: P = Q or P ≠ Q?
• Approach: define ρ to be a distance on probabilities,

  H₀ : P = Q   ≡   H₀ : ρ(P, Q) = 0
  H₁ : P ≠ Q   ≡   H₁ : ρ(P, Q) > 0

• If the empirical ρ is
  – far from zero: reject H₀
  – close to zero: accept H₀

Why MMD_H?

• Related to the estimation of IPM(P, Q, F).
• Recall

  MMD²_H(P, Q) = ‖ ∫_X k(·, x) dP(x) − ∫_X k(·, x) dQ(x) ‖²_H.

• A trivial approximation: P_m := (1/m) Σ_{i=1}^m δ_{X_i} and Q_n := (1/n) Σ_{i=1}^n δ_{Y_i}, where δ_x denotes the Dirac measure at x. Then

  MMD²_H(P_m, Q_n) = ‖ (1/m) Σ_{i=1}^m k(·, X_i) − (1/n) Σ_{i=1}^n k(·, Y_i) ‖²_H
                   = (1/m²) Σ_{i,j=1}^m k(X_i, X_j) + (1/n²) Σ_{i,j=1}^n k(Y_i, Y_j) − (2/(mn)) Σ_{i,j} k(X_i, Y_j).

V-statistic; biased estimator of MMD²_H.
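A minimal NumPy sketch (not from the slides) of this plug-in V-statistic with a Gaussian kernel; the kernel choice, its bandwidth and the toy data are illustrative assumptions:

import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # V-statistic: (1/m^2) sum_ij k(Xi,Xj) + (1/n^2) sum_ij k(Yi,Yj) - (2/(mn)) sum_ij k(Xi,Yj)
    return (gauss_gram(X, X, sigma).mean()
            + gauss_gram(Y, Y, sigma).mean()
            - 2 * gauss_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))   # sample from Q (shifted mean)
print(mmd2_biased(X, Y))                  # clearly positive here; near zero when P = Q

The three double sums are just the means of the three Gram matrices, so the estimator costs O((m + n)²) kernel evaluations.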

Why MMD_H?

• IPM(P_m, Q_n, F) is obtained by solving a linear program for F = Lipschitz and bounded-Lipschitz balls. (S et al., EJS 2012)
• Quality of approximation (S et al., EJS 2012):
  – For F = Lipschitz and bounded-Lipschitz balls,

    |IPM(P_m, Q_m, F) − IPM(P, Q, F)| = O_p(m^{−1/(d+1)}),   d > 2.

  – For F = unit RKHS ball,

    |MMD_H(P_m, Q_m) − MMD_H(P, Q)| = O_p(m^{−1/2}).

• Are there any other estimators of MMD_H(P, Q) that are statistically better than MMD_H(P_m, Q_m)? NO!! (Tolstikhin et al., 2016)
• In practice? YES!! (Krikamol et al., JMLR 2016; S, Bernoulli 2016)

Beware of Pitfalls

• There are many other distances on probabilities:
  – Total variation distance
  – Hellinger distance
  – Kullback-Leibler divergence and its variants
  – Fisher divergence, ...
• Estimating these distances is both computationally and statistically difficult.
• MMD_H is computationally simpler and appears statistically powerful with no curse of dimensionality. In fact, it is NOT statistically powerful. (Ramdas et al., AAAI 2015; S, Bernoulli, 2016)
• Recall: MMD_H is based on µ_P, which is a smoothed version of P. Even though P and Q can be distinguished (coming up!!) based on µ_P and µ_Q, the distinguishability is weak compared to that of the above distances. (S et al., JMLR 2010; S, Bernoulli, 2016)

NO FREE LUNCH!!

So far . . .

  P ↦ µ_P := ∫_X k(·, x) dP(x)

  MMD_H(P, Q) = ‖µ_P − µ_Q‖_H

• Computation
• Estimation

When is P ↦ µ_P one-to-one? I.e., when does MMD_H(P, Q) = 0 imply P = Q?

Characteristic Kernel

k is said to be characteristic if

  MMD_H(P, Q) = 0 ⇔ P = Q

for any P and Q.

Not all kernels are characteristic.

• Example: If k(x, y) = c > 0, ∀ x, y ∈ X, then

  µ_P = ∫_X k(·, x) dP(x) = c,   µ_Q = c,

  and MMD_H(P, Q) = 0, ∀ P, Q.

• Example: Let k(x, y) = xy, x, y ∈ R. Then

  MMD_H(P, Q) = |E_P[X] − E_Q[X]|.

  Characteristic for the class of Bernoulli distributions but not for all P and Q.

• Example: Let k(x, y) = (1 + xy)², x, y ∈ R. Then

  MMD²_H(P, Q) = 2(E_P[X] − E_Q[X])² + (E_P[X²] − E_Q[X²])².

  Characteristic for the class of Gaussians but not for all P and Q.

Characteristic Kernels on R^d

• Translation invariant kernel: k(x, y) = ψ(x − y), x, y ∈ R^d; bounded and continuous.
• Bochner's theorem:

  ψ(x) = ∫_{R^d} e^{√−1 ⟨x, ω⟩} dΛ(ω),   x ∈ R^d,

  where Λ is a non-negative finite Borel measure on R^d.

  Then k is characteristic ⇔ supp(Λ) = R^d. (S et al., COLT 2008; JMLR, 2010)

• Corollary: Compactly supported ψ are characteristic. (S et al., COLT 2008; JMLR, 2010)

Key idea: Fourier representation of MMD_H.

Fourier Representation of MMD²_H

  MMD²_H(P, Q) = ∫_{R^d} |ϕ_P(ω) − ϕ_Q(ω)|² dΛ(ω),

where ϕ_P is the characteristic function of P.

Proof:

  MMD²_H(P, Q) = ∫_{R^d} ∫_{R^d} ψ(x − y) d(P − Q)(x) d(P − Q)(y)
           (∗) = ∫_{R^d} ∫_{R^d} ∫_{R^d} e^{−√−1 ⟨x−y, ω⟩} dΛ(ω) d(P − Q)(x) d(P − Q)(y)
           (†) = ∫_{R^d} [ ∫_{R^d} e^{−√−1 ⟨x, ω⟩} d(P − Q)(x) ] [ ∫_{R^d} e^{√−1 ⟨y, ω⟩} d(P − Q)(y) ] dΛ(ω)
               = ∫_{R^d} |ϕ_P(ω) − ϕ_Q(ω)|² dΛ(ω),

where Bochner's theorem is used in (∗) and Fubini's theorem in (†).

• Suppose dΛ(ω) = dω, i.e., Λ is the (non-finite!) uniform measure on R^d. Then MMD_H(P, Q) is the L₂ distance between the densities (if they exist) of P and Q.

Characteristic Kernels on R^d

Proof:

• Suppose supp(Λ) = R^d. Then

  MMD²_H(P, Q) = 0 ⇒ ∫_{R^d} |ϕ_P(ω) − ϕ_Q(ω)|² dΛ(ω) = 0 ⇒ ϕ_P = ϕ_Q  Λ-a.e.

  But characteristic functions are uniformly continuous, so ϕ_P = ϕ_Q everywhere, which implies P = Q.

• Suppose supp(Λ) ⊊ R^d. Then there exists an open set U ⊊ R^d such that Λ(U) = 0. Construct P and Q such that ϕ_P and ϕ_Q differ only on U; then MMD_H(P, Q) = 0 even though P ≠ Q, so k is not characteristic.

• If ψ is compactly supported, its Fourier transform is analytic and therefore cannot vanish on an open set; hence supp(Λ) = R^d.

Translation Invariant Kernels on R^d

  MMD_H(P, Q) = ‖ϕ_P − ϕ_Q‖_{L²(R^d, Λ)}

• Example: P differs from Q at (roughly) one frequency.

[Figure sequence: the densities of P and Q, their Fourier transforms, and the characteristic-function difference |ϕ_P − ϕ_Q| (concentrated around a single frequency ω), overlaid with the spectral measure Λ of each kernel. Gaussian kernel: characteristic. Sinc kernel: NOT characteristic. B-spline kernel: characteristic. Picture credit: A. Gretton]

Caution

The characteristic property relates a class of kernels to a class of probabilities.

[Figure: classes of kernels (via Σ := supp(Λ)) matched to the classes of probability measures they can distinguish.]

(S et al., COLT 2008; JMLR 2010)

Measuring (In)Dependence

• Let X and Y be jointly Gaussian random variables on R. Then

  X and Y are independent ⇔ Cov(X, Y) = E(XY) − E(X)E(Y) = 0.

• In general, Cov(X, Y) = 0 ⇏ X ⊥ Y.
• Covariance captures only the linear relationship between X and Y.
• Feature space view point: how about Cov(Φ(X), Ψ(Y))?
• Suppose

  Φ(X) = (1, X, X²) and Ψ(Y) = (1, Y, Y², Y³).

  Then Cov(Φ(X), Ψ(Y)) captures Cov(Xⁱ, Yʲ) for i ∈ {0, 1, 2} and j ∈ {0, 1, 2, 3}.

Measuring (In)Dependence

• Characterization of independence:

  X ⊥ Y ⇔ Cov(f(X), g(Y)) = 0 for all measurable functions f and g.

• Dependence measure:

  sup_{f,g} |Cov(f(X), g(Y))| = sup_{f,g} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.

  Similar to the IPM between P_XY and P_X P_Y.

• Restricting the functions to RKHSs (constrained covariance):

  COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.

  (Gretton et al., AISTATS 2005, JMLR 2005)

Covariance Operator

Let k_X and k_Y be the reproducing kernels of H_X and H_Y, respectively. Then

• E[f(X)] = ⟨f, µ_{P_X}⟩_{H_X} and E[g(Y)] = ⟨g, µ_{P_Y}⟩_{H_Y}.

• E[f(X)] E[g(Y)] = ⟨f, µ_{P_X}⟩_{H_X} ⟨g, µ_{P_Y}⟩_{H_Y}
                  = ⟨f ⊗ g, µ_{P_X} ⊗ µ_{P_Y}⟩_{H_X ⊗ H_Y}
                  = ⟨f, (µ_{P_X} ⊗ µ_{P_Y}) g⟩_{H_X}
                  = ⟨g, (µ_{P_Y} ⊗ µ_{P_X}) f⟩_{H_Y}

• E[f(X) g(Y)] = E[⟨f, k_X(·, X)⟩_{H_X} ⟨g, k_Y(·, Y)⟩_{H_Y}]
               = E[⟨f ⊗ g, k_X(·, X) ⊗ k_Y(·, Y)⟩_{H_X ⊗ H_Y}]
               = E[⟨f, (k_X(·, X) ⊗ k_Y(·, Y)) g⟩_{H_X}]
               = E[⟨g, (k_Y(·, Y) ⊗ k_X(·, X)) f⟩_{H_Y}]

Covariance Operator

• Assuming E√(k_X(X, X) k_Y(Y, Y)) < ∞, we obtain

  E[f(X) g(Y)] = ⟨f, E[k_X(·, X) ⊗ k_Y(·, Y)] g⟩_{H_X} = ⟨g, E[k_Y(·, Y) ⊗ k_X(·, X)] f⟩_{H_Y}.

• Cov(f(X), g(Y)) = ⟨f, C_XY g⟩_{H_X} = ⟨g, C_YX f⟩_{H_Y}, where

  C_XY := E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y}

  is the cross-covariance operator from H_Y to H_X, and C_YX = C*_XY.

Compare to the feature space view point with canonical feature maps.

Dependence Measures

• COCO(P_XY; H_X, H_Y) = sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |⟨f, C_XY g⟩_{H_X}| = ‖C_XY‖_op = ‖C_YX‖_op,

  which is the maximum singular value of C_XY.

• Choosing k_X(·, X) = ⟨·, X⟩₂ and k_Y(·, Y) = ⟨·, Y⟩₂, for Gaussian distributions,

  X ⊥ Y ⇔ C_YX = 0.

• In general,

  X ⊥ Y ?⇔ C_YX = 0.

Dependence Measures

• How about we consider the other singular values too?
• How about ‖C_YX‖²_HS, the sum of the squared singular values of C_YX?

  Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., ALT 2005, JMLR 2005)

• ‖C_YX‖_op ≤ ‖C_YX‖_HS

Dependence Measures

• COCO(P_XY; H_X, H_Y) := sup_{‖f‖_{H_X}=1, ‖g‖_{H_Y}=1} |E[f(X)g(Y)] − E[f(X)]E[g(Y)]|.

• How about using a different constraint, i.e., ‖f ⊗ g‖_{H_X ⊗ H_Y} ≤ 1?

  sup_{‖f⊗g‖ ≤ 1} Cov(f(X), g(Y)) = sup_{‖f⊗g‖ ≤ 1} ⟨f, C_XY g⟩_{H_X}
                                  = sup_{‖f⊗g‖ ≤ 1} ⟨f ⊗ g, C_XY⟩_{H_X ⊗ H_Y}
                                  = ‖C_XY‖_{H_X ⊗ H_Y} = ‖C_XY‖_HS

• ‖C_XY‖_{H_X ⊗ H_Y} = ‖E[k_X(·, X) ⊗ k_Y(·, Y)] − µ_{P_X} ⊗ µ_{P_Y}‖_{H_X ⊗ H_Y}
                     = ‖ ∫ k_X(·, x) ⊗ k_Y(·, y) d(P_XY − P_X × P_Y)(x, y) ‖_{H_X ⊗ H_Y}
                     = MMD_{H_X ⊗ H_Y}(P_XY, P_X × P_Y)

Dependence Measures

• H_X ⊗ H_Y is an RKHS with kernel k_X k_Y.
• If k_X k_Y is characteristic, then

  ‖C_XY‖_{H_X ⊗ H_Y} = 0 ⇔ P_XY = P_X × P_Y ⇔ X ⊥ Y.

• If k_X and k_Y are characteristic, then

  ‖C_XY‖_HS = 0 ⇔ X ⊥ Y.   (Gretton, 2015)

• Using the reproducing property,

  ‖C_XY‖²_HS = E_{XY} E_{X′Y′} k_X(X, X′) k_Y(Y, Y′) + E_{XX′} k_X(X, X′) E_{YY′} k_Y(Y, Y′) − 2 E_{X′Y′}[ E_X k_X(X, X′) E_Y k_Y(Y, Y′) ],

  where (X′, Y′) is an independent copy of (X, Y).

• Can be estimated using a V-statistic (empirical sums), as in the sketch below.
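A minimal NumPy sketch (not from the slides) of the standard plug-in estimate of ‖C_XY‖²_HS from centred Gram matrices, (1/n²) tr(K_X H K_Y H) with H = I − (1/n)11ᵀ; the Gaussian kernels, bandwidths and toy data are illustrative assumptions:

import numpy as np

def gauss_gram(A, sigma=1.0):
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Plug-in (V-statistic style) estimate of ||C_XY||_HS^2
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx, Ky = gauss_gram(X, sigma_x), gauss_gram(Y, sigma_y)
    return np.trace(Kx @ H @ Ky @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y_dep = X**2 + 0.1 * rng.normal(size=(300, 1))   # nonlinearly dependent on X
Y_ind = rng.normal(size=(300, 1))                # independent of X
print(hsic_biased(X, Y_dep), hsic_biased(X, Y_ind))

The dependent pair gives a clearly larger value than the independent one, even though the linear correlation between X and Y_dep is essentially zero.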

Applications

• Two-sample testing
• Independence testing
• Conditional independence testing
• Supervised dimensionality reduction
• Kernel Bayes rule (filtering, prediction and smoothing)
• Kernel CCA, ...

Review paper: (Muandet et al., 2016)


Application: Two-Sample Testing

Two-Sample Problem

• Given random samples {X₁, ..., X_m} i.i.d.∼ P and {Y₁, ..., Y_n} i.i.d.∼ Q.
• Determine: P = Q or P ≠ Q?
• Approach:

  H₀ : P = Q   ≡   H₀ : MMD_H(P, Q) = 0
  H₁ : P ≠ Q   ≡   H₁ : MMD_H(P, Q) > 0

• If MMD²_H(P_m, Q_n) is
  – far from zero: reject H₀
  – close to zero: accept H₀

Type-I and Type-II Errors

• Given P = Q, we want a threshold or critical value t_{1−α} such that Pr_{H₀}(MMD²_H(P_m, Q_n) > t_{1−α}) ≤ α.

Statistical Test: Large Deviation Bounds

• Given P = Q, we want a threshold t such that Pr_{H₀}(MMD²_H(P_m, Q_n) > t) ≤ α.
• We showed that (S et al., EJS 2012)

  Pr( |MMD²_H(P_m, Q_n) − MMD²_H(P, Q)| ≥ √(2(m+n)/(mn)) · (1 + √(2 log(1/α))) ) ≤ α.

• α-level test: Accept H₀ if

  MMD²_H(P_m, Q_n) < √(2(m+n)/(mn)) · (1 + √(2 log(1/α))).

  Otherwise reject. (A sketch of this test follows below.)

Too conservative!!
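A minimal sketch (not from the slides) of this conservative test, reusing the biased V-statistic from earlier. The deviation bound on the slide is for a bounded kernel; the Gaussian kernel used here is bounded by 1, and the toy data are an illustrative assumption:

import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    # biased V-statistic estimator of MMD^2, restated so the snippet is self-contained
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

def conservative_test(X, Y, alpha=0.05, sigma=1.0):
    # Accept H0 iff MMD^2(Pm, Qn) < sqrt(2(m+n)/(mn)) * (1 + sqrt(2 log(1/alpha)))
    m, n = len(X), len(Y)
    threshold = np.sqrt(2 * (m + n) / (m * n)) * (1 + np.sqrt(2 * np.log(1 / alpha)))
    stat = mmd2_biased(X, Y, sigma)
    return stat, threshold, ("reject H0" if stat >= threshold else "accept H0")

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(1.0, 1.0, size=(500, 1))
print(conservative_test(X, Y))

Even with a unit mean shift and m = n = 500, the statistic typically stays below this threshold, so the test accepts H₀ although P ≠ Q — exactly the "too conservative" behaviour noted above.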

Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

Unbiased estimator MMD²_u of MMD²_H(P, Q) (U-statistic, m = n):

  MMD²_u := (1/(m(m−1))) Σ_{i≠j} h((X_i, Y_i), (X_j, Y_j)),

  where h((X_i, Y_i), (X_j, Y_j)) := k(X_i, X_j) + k(Y_i, Y_j) − k(X_i, Y_j) − k(X_j, Y_i).

• Under H₀,

  m · MMD²_u  →_w  Σ_{i=1}^∞ λ_i (θ_i² − 2)   as m → ∞,

  where θ_i ∼ N(0, 2) i.i.d., and the λ_i are the solutions of

  ∫_X k̃(x, y) ψ_i(x) dP(x) = λ_i ψ_i(y),

  with k̃ the centered kernel.

• Consistent (Type-II error goes to zero): under H₁,

  √m (MMD²_u − MMD²_H(P, Q)) →_w N(0, σ²_h)   as m → ∞.

Statistical Test: Asymptotic Distribution (Gretton et al., NIPS 2006, JMLR 2012)

• α-level test: estimate the 1 − α quantile of the null distribution using the bootstrap.

  Computationally intensive!!

[Figure: null distribution of the test statistic with the 1 − α quantile marked. Picture credit: A. Gretton]

Statistical Test Without Bootstrap (Gretton et al., NIPS 2009)

• Estimate the eigenvalues λ_i from the combined samples:
  – Define Z := (X₁, ..., X_m, Y₁, ..., Y_m)
  – K_ij := k(Z_i, Z_j)
  – Compute the eigenvalues λ_i of the centered Gram matrix

    K̃ = HKH,   where H = I − (1/(2m)) 1_{2m} 1_{2m}ᵀ.

• α-level test: compute the 1 − α quantile of the distribution associated with

  Σ_{i=1}^{2m} λ_i (θ_i² − 2),   θ_i ∼ N(0, 2) i.i.d.

• The test is asymptotically α-level and consistent. (A sketch of the procedure follows below.)
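A minimal NumPy sketch (not from the slides) of this spectral approximation of the null. The Gaussian kernel and toy data are illustrative assumptions, and the 1/(2m) normalisation of the Gram-matrix eigenvalues (so that they estimate the integral-operator eigenvalues λ_i) is my reading of Gretton et al. (2009) rather than something stated on the slide:

import numpy as np

def gauss_gram(A, sigma=1.0):
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def spectral_null_quantile(X, Y, alpha=0.05, sigma=1.0, n_sim=5000, seed=1):
    # Approximate the H0 law of m * MMD^2 by sum_i lambda_i (theta_i^2 - 2), theta_i ~ N(0, 2),
    # with lambda_i taken from the centred Gram matrix of the pooled sample.
    Z = np.vstack([X, Y])
    n = Z.shape[0]                         # n = 2m for equal sample sizes
    H = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(H @ gauss_gram(Z, sigma) @ H) / n   # 1/(2m) scaling: an assumption
    lam = lam[lam > 1e-12]
    rng = np.random.default_rng(seed)
    theta2 = rng.normal(0.0, np.sqrt(2.0), size=(n_sim, lam.size)) ** 2
    return np.quantile((theta2 - 2.0) @ lam, 1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 1))
Y = rng.normal(0.0, 1.0, size=(100, 1))
print(spectral_null_quantile(X, Y))   # threshold for m * MMD^2 under H0

Comparing m times the U-statistic against this quantile gives the bootstrap-free test: no resampling is needed, only the eigenvalues of one 2m × 2m matrix.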

Experiments (Gretton et al., NIPS 2009)

• Comparison example: Canadian Hansard corpus (agriculture, fisheries and immigration)
• Samples: 5-line extracts
• Kernel: k-spectrum kernel with k = 10
• Sample size: 10
• Repetitions: 300
• Compute the MMD² statistic

k-spectrum kernel: average Type II error 0 (α = 0.05)
Bag-of-words kernel: average Type II error 0.18

First ever test on structured data.


Choice of Characteristic Kernel


Choice of Characteristic Kernels

Let $\mathcal{X} = \mathbb{R}^d$. Suppose $k$ is a Gaussian kernel, $k_\sigma(x,y) = e^{-\frac{\|x-y\|_2^2}{2\sigma^2}}$.

I $\mathrm{MMD}_{\mathcal{H}_\sigma}$ is a function of σ.

I So $\mathrm{MMD}_{\mathcal{H}_\sigma}$ is a family of metrics. Which one should we use in practice?

I Note that $\mathrm{MMD}_{\mathcal{H}_\sigma} \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$.

Therefore, the kernel choice is critical in applications.

Heuristics:

I Median: $\sigma = \mathrm{median}\{\|X^*_i - X^*_j\|_2 : i \neq j\}$, where $X^* = ((X_i)_i,(Y_i)_i)$ is the pooled sample (Gretton et al., NIPS 2006, NIPS 2009, JMLR 2012); see the sketch after this slide.

I Choose the test statistic to be $\mathrm{MMD}_{\mathcal{H}_{\sigma^*}}(P_m,Q_m)$, where $\sigma^* = \arg\max_{\sigma\in(0,\infty)} \mathrm{MMD}_{\mathcal{H}_\sigma}(P_m,Q_m)$ (S et al., NIPS 2009).
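A sketch of the median heuristic (assumes `numpy` imported as `np` as in the earlier sketches; the function name is mine):

```python
def median_heuristic_bandwidth(X, Y):
    """sigma = median of pairwise Euclidean distances over the pooled sample."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, 1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    d = np.sqrt(np.maximum(d2, 0.0))
    iu = np.triu_indices_from(d, k=1)     # distinct pairs i < j
    return np.median(d[iu])
```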


Classes of Characteristic Kernels (S et al., NIPS 2009)

More generally, we use
$$\mathrm{MMD}(P,Q) := \sup_{k\in\mathcal{K}} \mathrm{MMD}_{\mathcal{H}_k}(P,Q).$$

Examples for $\mathcal{K}$:

I $\mathcal{K}_g := \{e^{-\sigma\|x-y\|_2^2},\ x,y\in\mathbb{R}^d : \sigma\in\mathbb{R}_+\}$.

I $\mathcal{K}_{lin} := \{k_\lambda = \sum_{i=1}^{\ell}\lambda_i k_i \mid k_\lambda \text{ is pd},\ \sum_{i=1}^{\ell}\lambda_i = 1\}$.

I $\mathcal{K}_{con} := \{k_\lambda = \sum_{i=1}^{\ell}\lambda_i k_i \mid \lambda_i \ge 0,\ \sum_{i=1}^{\ell}\lambda_i = 1\}$.

Test:

I α-level test: Estimate the $1-\alpha$ quantile of the null distribution of $\mathrm{MMD}(P_m,Q_m)$ using the bootstrap.

I Test consistency: Based on the functional central limit theorem for U-processes indexed by a VC-subgraph class $\mathcal{K}$.

Computational disadvantage!! (A bandwidth-grid sketch follows below.)
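In practice the supremum over $\mathcal{K}_g$ is often approximated by a maximum over a finite bandwidth grid; a sketch under that assumption, reusing `mmd2_unbiased` and `median_heuristic_bandwidth` from above (using the squared statistic and a fixed grid are my simplifications).

```python
def sup_mmd_over_bandwidths(X, Y, sigmas):
    """Maximum of the unbiased MMD^2 estimate over a finite grid of Gaussian
    bandwidths: a practical stand-in for the supremum over K_g."""
    return max(mmd2_unbiased(X, Y, s) for s in sigmas)

# Example grid centered on the median heuristic:
# sigmas = np.logspace(-2, 2, 20) * median_heuristic_bandwidth(X, Y)
```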


Experiments

I $q = \mathcal{N}(0,\sigma_q^2)$.

I $p(x) = q(x)(1 + \sin\nu x)$.

[Figure: the density q (ν = 0) and the perturbed densities p for ν = 2 and ν = 7.5.]

I $k(x,y) = \exp(-(x-y)^2/\sigma)$.

I Test statistics: $\mathrm{MMD}(P_m,Q_m)$ and $\mathrm{MMD}_{\mathcal{H}_\sigma}(P_m,Q_m)$ for various σ.


Experiments

$\mathrm{MMD}(P,Q)$

[Figure: Type-I and Type-II error (in %) of the test based on $\mathrm{MMD}(P_m,Q_m)$, plotted against ν ∈ {0.5, 0.75, 1, 1.25, 1.5}.]


Experiments

$\mathrm{MMD}_{\mathcal{H}_\sigma}(P,Q)$

[Figure: for the test based on $\mathrm{MMD}_{\mathcal{H}_\sigma}(P_m,Q_m)$, Type-I error (in %) and Type-II error (in %) against log σ for ν ∈ {0.5, 0.75, 1.0, 1.25, 1.5}, together with the selected log σ and the median-heuristic σ plotted against ν.]


Choice of Characteristic Kernels (Gretton et al., NIPS 2012)

I Choose a kernel that minimizes the Type-II error for a given Type-I error:
$$k^* \in \arg\inf_{k\in\mathcal{K}:\,\mathrm{Type\,I}(k)\le\alpha} \mathrm{Type\,II}(k).$$

I This is not easy to do with the asymptotic distribution of the U-statistic $\widehat{\mathrm{MMD}^2_{\mathcal{H}_k}}(P_m,Q_m)$.

I Modified statistic: average of U-statistics computed on independent blocks of size 2,
$$\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}}(P_m,Q_m) = \frac{2}{m}\sum_{i=1}^{m/2}\underbrace{k(X_{2i-1},X_{2i}) + k(Y_{2i-1},Y_{2i}) - k(X_{2i-1},Y_{2i}) - k(Y_{2i-1},X_{2i})}_{h_k(Z_i)},$$
where $Z_i = (X_{2i-1},X_{2i},Y_{2i-1},Y_{2i})$. (A sketch of this linear-time statistic follows below.)

I Recall
$$\widehat{\mathrm{MMD}^2_{\mathcal{H}}} := \frac{1}{m(m-1)}\sum_{i\neq j}^{m}\underbrace{k(X_i,X_j) + k(Y_i,Y_j) - k(X_i,Y_j) - k(X_j,Y_i)}_{h((X_i,Y_i),(X_j,Y_j))}.$$
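A minimal numpy sketch of the block (linear-time) statistic with a Gaussian kernel; it also returns the per-block values $h_k(Z_i)$, which are handy for estimating $\sigma_{h_k}$ later. Names are mine, not from the slides.

```python
def mmd2_linear(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate: average of h_k over independent blocks of size 2."""
    m = X.shape[0] - (X.shape[0] % 2)    # use an even number of points
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma ** 2))
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(y1, x2)   # h_k(Z_i), one per block
    return h.mean(), h
```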


Modified Statistic

Advantages:

I $\widetilde{\mathrm{MMD}^2_{\mathcal{H}}}$ is computable in $O(m)$, while $\widehat{\mathrm{MMD}^2_{\mathcal{H}}}$ requires $O(m^2)$ computations.

I Under $H_0$,
$$\sqrt{m}\,\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}}(P_m,Q_m) \xrightarrow{w} \mathcal{N}(0,\,2\sigma^2_{h_k}),$$
where $\sigma^2_{h_k} = \mathbb{E}_Z h_k^2(Z) - (\mathbb{E}_Z h_k(Z))^2$, assuming $0 < \mathbb{E}_Z h_k^2(Z) < \infty$.

I The asymptotic distribution is normal, as opposed to a weighted sum of infinitely many $\chi^2$ variables. Therefore, the test threshold is easy to compute.

Disadvantages:

I Larger variance

I Smaller power


Type-I and Type-II Errors

I Test threshold: For a given k and α,
$$t_{k,1-\alpha} = \sqrt{2}\,\sigma_{h_k}\,\Phi_N^{-1}(1-\alpha),$$
where $\Phi_N$ is the cdf of $\mathcal{N}(0,1)$.

I Type-II error:
$$\Phi_N\!\left(\Phi_N^{-1}(1-\alpha) - \frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q)\,\sqrt{m}}{\sqrt{2}\,\sigma_{h_k}}\right)$$
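A brief sketch of where this expression comes from, assuming that $\sqrt{m}\,\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}}$ is compared with $t_{k,1-\alpha}$ and that the variance under $H_1$ is also approximated by $2\sigma^2_{h_k}$:

$$\mathrm{Type\,II} = \Pr_{H_1}\!\left(\sqrt{m}\,\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}} \le t_{k,1-\alpha}\right) \approx \Phi_N\!\left(\frac{t_{k,1-\alpha} - \sqrt{m}\,\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q)}{\sqrt{2}\,\sigma_{h_k}}\right) = \Phi_N\!\left(\Phi_N^{-1}(1-\alpha) - \frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q)\,\sqrt{m}}{\sqrt{2}\,\sigma_{h_k}}\right).$$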


Best Kernel: Minimizes Type-II Error

I Since $\Phi_N$ is a strictly increasing function, the Type-II error is minimized by maximizing
$$\frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q)}{\sigma_{h_k}}.$$

I Optimal kernel:
$$k^* \in \arg\sup_{k\in\mathcal{K}} \frac{\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q)}{\sigma_{h_k}}.$$

I Since $\mathrm{MMD}^2_{\mathcal{H}_k}$ and $\sigma_{h_k}$ depend on the unknown P and Q, we split the data into train and test sets: $k^*$ is estimated on the train data as $\hat{k}^*$, and the threshold $t_{\hat{k}^*,1-\alpha}$ is evaluated on the test data.


Data-Dependent Kernel

I Train data: estimate $\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}}$ and $\hat\sigma_{h_k}$.

I Define
$$\hat{k}^* \in \arg\sup_{k\in\mathcal{K}} \frac{\widetilde{\mathrm{MMD}^2_{\mathcal{H}_k}}}{\hat\sigma_{h_k} + \lambda_m}$$
for some $\lambda_m \to 0$ as $m\to\infty$.

I Test data: compute $\widetilde{\mathrm{MMD}^2_{\mathcal{H}_{\hat{k}^*}}}$, $\hat\sigma_{h_{\hat{k}^*}}$ and $t_{\hat{k}^*,1-\alpha}$.

I If $\widetilde{\mathrm{MMD}^2_{\mathcal{H}_{\hat{k}^*}}} > t_{\hat{k}^*,1-\alpha}$, reject $H_0$; otherwise accept. (A bandwidth-selection sketch follows below.)

Similar results were recently obtained for the quadratic-time statistic $\widehat{\mathrm{MMD}^2_{\mathcal{H}_k}}$ (Sutherland et al., ICLR 2017).
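A sketch of this train/test recipe for a Gaussian-bandwidth family, reusing `mmd2_linear` from above; the grid `sigmas`, the regularizer `lam`, and the 50/50 split are my choices, not from the slides.

```python
from scipy.stats import norm

def select_bandwidth_and_test(X, Y, sigmas, alpha=0.05, lam=1e-4):
    """Split each sample 50/50; on the training halves pick the bandwidth
    maximizing MMD~^2 / (sigma_h + lam); on the test halves compare
    sqrt(m) * MMD~^2 with the Gaussian threshold sqrt(2) * sigma_h * Phi^{-1}(1 - alpha)."""
    n = min(X.shape[0], Y.shape[0]) // 2
    Xtr, Xte, Ytr, Yte = X[:n], X[n:2 * n], Y[:n], Y[n:2 * n]

    def criterion(s):
        stat, h = mmd2_linear(Xtr, Ytr, s)
        return stat / (h.std() + lam)

    s_star = max(sigmas, key=criterion)

    stat, h = mmd2_linear(Xte, Yte, s_star)
    m_test = 2 * len(h)                                   # points per sample used
    threshold = np.sqrt(2.0) * h.std() * norm.ppf(1 - alpha)
    reject = np.sqrt(m_test) * stat > threshold
    return s_star, reject
```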


Learning the Kernel

Define the family of kernels as follows:
$$\mathcal{K} := \left\{k : k = \sum_{i=1}^{\ell}\beta_i k_i,\ \beta_i \ge 0\ \forall\, i\in[\ell]\right\}.$$

I If all the $k_i$ are characteristic and $\beta_i > 0$ for some $i\in[\ell]$, then $k$ is characteristic.

I $\mathrm{MMD}^2_{\mathcal{H}_k}(P,Q) = \sum_{i=1}^{\ell}\beta_i\,\mathrm{MMD}^2_{\mathcal{H}_{k_i}}(P,Q)$

I $\sigma^2_k = \sum_{i,j=1}^{\ell}\beta_i\beta_j\,\mathrm{cov}(h_{k_i},h_{k_j})$, where
$$h_{k_i}(x,x',y,y') = k_i(x,x') + k_i(y,y') - k_i(x,y') - k_i(x',y).$$

I Objective:
$$\beta^* = \arg\max_{\beta\succeq 0} \frac{\beta^{\top}\eta}{\sqrt{\beta^{\top}W\beta}},$$
where $\eta := (\mathrm{MMD}^2_{\mathcal{H}_{k_i}}(P,Q))_i$ and $W := (\mathrm{cov}(h_{k_i},h_{k_j}))_{i,j}$. (A sketch of estimating η and W from data follows below.)
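One way to estimate η and W in practice is from the per-block values $h_{k_i}(Z_b)$ of the linear-time statistic; a sketch for a family of Gaussian base kernels, reusing `mmd2_linear` (estimating the covariance empirically across blocks is my assumption).

```python
def estimate_eta_W(X, Y, sigmas):
    """eta_i = linear-time MMD~^2 for base kernel k_i;
    W_ij = empirical covariance of the per-block values h_{k_i} and h_{k_j}."""
    eta, H = [], []
    for s in sigmas:
        stat, h = mmd2_linear(X, Y, s)
        eta.append(stat)
        H.append(h)
    H = np.array(H)                       # shape: (num_kernels, num_blocks)
    return np.array(eta), np.cov(H)       # np.cov treats rows as variables
```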


Optimization

I
$$\beta^*_{\lambda} = \arg\max_{\beta\succeq 0} \frac{\beta^{\top}\eta}{\sqrt{\beta^{\top}(W+\lambda I)\beta}}$$

I If η has at least one positive element, the optimal value of the objective is strictly positive, and so
$$\beta^*_{\lambda} = \arg\min_{\beta}\left\{\beta^{\top}(W+\lambda I)\beta : \beta^{\top}\eta = 1,\ \beta\succeq 0\right\},$$
a convex quadratic program (see the sketch below).

I On the test data:

I Compute $\widetilde{\mathrm{MMD}^2_{\mathcal{H}_{k^*}}}$ using $k^* = \sum_{i=1}^{\ell}\beta^*_{\lambda,i}k_i$.

I Compute the test threshold $t_{k^*,1-\alpha}$ using $\sigma_{k^*}$.
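A minimal sketch of solving this quadratic program with `scipy.optimize.minimize` (SLSQP); the regularizer value and the starting point are arbitrary choices of mine, and `estimate_eta_W` above can supply η and W.

```python
from scipy.optimize import minimize

def optimal_beta(eta, W, lam=1e-4):
    """Solve min_beta beta^T (W + lam I) beta  s.t.  beta^T eta = 1, beta >= 0."""
    eta = np.asarray(eta, dtype=float)
    ell = len(eta)
    A = W + lam * np.eye(ell)
    obj = lambda b: b @ A @ b
    grad = lambda b: 2 * A @ b
    cons = [{"type": "eq", "fun": lambda b: b @ eta - 1.0}]
    bounds = [(0.0, None)] * ell
    b0 = np.full(ell, 1.0 / max(eta.sum(), 1e-12))   # satisfies b0^T eta = 1 if sum(eta) > 0
    res = minimize(obj, b0, jac=grad, bounds=bounds, constraints=cons, method="SLSQP")
    return res.x
```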


Experiments

I P and Q are mixtures of two-dimensional Gaussians. P has unit covariance in each component; Q has correlated Gaussian components, with ε the ratio of the largest to the smallest covariance eigenvalue.

I The testing problem becomes harder as ε → 1 and as the number of mixture components grows.


Competing Approaches

I Median heuristic

I Max. MMD: $\sup_{k\in\mathcal{K}} \widehat{\mathrm{MMD}^2_{\mathcal{H}_k}}(P_m,Q_m)$, i.e., choose the $k\in\mathcal{K}$ with the largest $\widehat{\mathrm{MMD}^2_{\mathcal{H}_k}}(P_m,Q_m)$.

I Same as maximizing $\beta^{\top}\eta$ subject to $\|\beta\|_1 \le 1$.

I $\ell_2$ statistic: maximize $\beta^{\top}\eta$ subject to $\|\beta\|_2 \le 1$.

I Cross-validation on the training set.


Results

m = 10,000 (for training and test). Results are averaged over 617 trials.


Results

[Figure: results when optimizing $\mathrm{MMD}^2_{\mathcal{H}_k}/\sigma_k$.]


Results

[Figure: results when maximizing $\mathrm{MMD}^2_{\mathcal{H}_k}$ with the β constraint.]


Results

[Figure: results with the median heuristic.]


References I

Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge, UK.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA. MIT Press.

Fukumizu, K., Sriperumbudur, B. K., Gretton, A., and Schölkopf, B. (2009). Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480.

Gretton, A. (2015). A simpler condition for consistency of a kernel independence test. arXiv:1501.06103.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two sample problem. In Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2012a). A kernel two-sample test. Journal of Machine Learning Research, 13:723–773.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In Jain, S., Simon, H. U., and Tomita, E., editors, Proceedings of Algorithmic Learning Theory, pages 63–77, Berlin. Springer-Verlag.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. K. (2010). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, Cambridge, MA. MIT Press.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005b). Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129.

Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. (2005c). Kernel constrained covariance for dependence measurement. In Ghahramani, Z. and Cowell, R., editors, Proc. 10th International Workshop on Artificial Intelligence and Statistics, pages 1–8.


References II

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. (2012b). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 24, Cambridge, MA. MIT Press.

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. (2016a). Kernel mean embedding of distributions: A review and beyond. arXiv:1605.09522.

Muandet, K., Sriperumbudur, B. K., Fukumizu, K., Gretton, A., and Schölkopf, B. (2016b). Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17(48):1–41.

Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.

Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. of 29th AAAI Conference on Artificial Intelligence, pages 3571–3577.

Simon-Gabriel, C. and Schölkopf, B. (2016). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv:1604.05251.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany.

Sriperumbudur, B. K. (2016). On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410.


References III

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In Servedio, R. and Zhang, T., editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561.

Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.

Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2017). Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations.

Tolstikhin, I., Sriperumbudur, B. K., and Muandet, K. (2016). Minimax estimation of kernel mean embeddings. arXiv:1602.04361.