
An Introduction to Hilbert Space Embedding of Probability Measures

Krikamol Muandet
Max Planck Institute for Intelligent Systems

Tübingen, Germany

Jeju, South Korea, February 22, 2019


Reference

Kernel Mean Embedding of Distributions: A Review and Beyond. Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Foundations and Trends in Machine Learning, 2017.


Outline

From Points to Measures

Embedding of Marginal Distributions

Embedding of Conditional Distributions

Future Directions


Classification Problem

[Figure: "Data in Input Space" — two classes (+1 and −1) plotted over coordinates x₁ and x₂; the classes are not linearly separable.]


Feature Map

φ : (x₁, x₂) ↦ (x₁², x₂², √2 x₁x₂)

[Figure: "Data in Input Space" (coordinates x₁, x₂) and "Data in Feature Space" (coordinates φ₁, φ₂, φ₃); after applying φ, the classes +1 and −1 become linearly separable.]

〈φ(x), φ(x′)〉_{R³} = (x · x′)²


Our recipe:

1. Construct a non-linear feature map φ : X → H.

2. Evaluate Dφ = {φ(x₁), φ(x₂), …, φ(xₙ)}.

3. Solve the learning problem in H using Dφ.
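A quick numerical check of steps 1–3 (a minimal sketch in NumPy using the polynomial feature map from the previous slide; the helper names are illustrative, not from the slides): the explicit map φ realizes the kernel k(x, x′) = (x · x′)² as an inner product in R³.

```python
import numpy as np

def feature_map(x):
    """Explicit feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def kernel(x, xp):
    """Polynomial kernel k(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([0.7, 0.5])
# The inner product in feature space equals the kernel in input space.
print(np.isclose(feature_map(x) @ feature_map(xp), kernel(x, xp)))  # True
```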


Kernels

Definition. A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have

k(x, x′) = 〈φ(x), φ(x′)〉H.

We call φ a feature map and H a feature space of k.

Example

1. k(x, x′) = (x · x′)² for x, x′ ∈ R²
   - φ(x) = (x₁², x₂², √2 x₁x₂)
   - H = R³

2. k(x, x′) = (x · x′ + c)ᵐ for c > 0, x, x′ ∈ Rᵈ
   - dim(H) = (d + m choose m)

3. k(x, x′) = exp(−γ‖x − x′‖₂²)
   - H is infinite-dimensional (H = R^∞)


Positive Definite Kernels

Definition (Positive definiteness). A function k : X × X → R is called positive definite if, for all n ∈ N, α₁, …, αₙ ∈ R, and all x₁, …, xₙ ∈ X, we have

∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_j, x_i) ≥ 0.

Equivalently, every Gram matrix K = [k(x_i, x_j)]_{ij} is positive semi-definite.

Example (Any kernel is positive definite). Let k be a kernel with feature map φ : X → H; then

∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_j, x_i) = 〈∑_{i=1}^n α_i φ(x_i), ∑_{j=1}^n α_j φ(x_j)〉_H ≥ 0.

Positive definiteness is a necessary and (by the Moore–Aronszajn theorem) sufficient condition for k to be a kernel.
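Numerically, positive definiteness shows up in the eigenvalues of a Gram matrix (a small sketch with a Gaussian kernel; the helper name is illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = gaussian_kernel(X, X)
# For a positive definite kernel, every Gram matrix is positive
# semi-definite: all eigenvalues are >= 0 up to numerical round-off.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```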


Reproducing Kernel Hilbert Spaces

Let H be a Hilbert space of functions mapping from X into R.

1. A function k : X × X → R is called a reproducing kernel of H if k(·, x) ∈ H for all x ∈ X and the reproducing property

f(x) = 〈f, k(·, x)〉_H

holds for all f ∈ H and all x ∈ X.

2. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δ_x : H → R defined by

δ_x(f) := f(x), f ∈ H,

is continuous.

Remark: If ‖f_n − f‖_H → 0 as n → ∞, then lim_{n→∞} f_n(x) = f(x) for all x ∈ X; that is, norm convergence in an RKHS implies pointwise convergence.
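The continuity of δ_x can be made quantitative: by the reproducing property and Cauchy–Schwarz, |f(x)| = |〈f, k(·, x)〉_H| ≤ ‖f‖_H √k(x, x). A small numerical illustration (a sketch with an arbitrary RKHS function; the setup is hypothetical):

```python
import numpy as np

def k(x, y, gamma=0.5):
    """Gaussian kernel on R."""
    return np.exp(-gamma * (x - y) ** 2)

rng = np.random.default_rng(1)
xs = rng.normal(size=10)       # expansion points x_1, ..., x_n
alpha = rng.normal(size=10)    # coefficients

# A function in the RKHS: f = sum_i alpha_i k(x_i, .), with RKHS norm
# ||f||_H^2 = alpha^T K alpha for the Gram matrix K.
f = lambda x: np.sum(alpha * k(xs, x))
K = k(xs[:, None], xs[None, :])
f_norm = np.sqrt(alpha @ K @ alpha)

x0 = 0.3  # evaluation bound: |delta_x0(f)| <= ||f||_H * sqrt(k(x0, x0))
print(abs(f(x0)) <= f_norm * np.sqrt(k(x0, x0)))  # True
```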


Reproducing Kernels

Lemma (Reproducing kernels are kernels). Let H be a Hilbert space over X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by

φ(x) = k(·, x), x ∈ X.

We call φ the canonical feature map.

Proof. Fix x′ ∈ X and write f := k(·, x′). Then, for x ∈ X, the reproducing property yields

〈φ(x′), φ(x)〉 = 〈k(·, x′), k(·, x)〉 = 〈f, k(·, x)〉 = f(x) = k(x, x′).


Kernels and RKHSs

Theorem (Every RKHS has a unique reproducing kernel). Let H be an RKHS over X. Then k : X × X → R defined by

k(x, x′) = 〈δ_x, δ_x′〉_H, x, x′ ∈ X,

is the only reproducing kernel of H. Furthermore, if (e_i)_{i∈I} is an orthonormal basis of H, then for all x, x′ ∈ X we have

k(x, x′) = ∑_{i∈I} e_i(x) e_i(x′).

Universal kernels. A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X), i.e., for every function g ∈ C(X) and every ε > 0 there exists an f ∈ H such that

‖f − g‖∞ ≤ ε.


From Points to Measures

[Diagram: points x, y in the input space X are mapped to the functions k(x, ·), k(y, ·) in the feature space H, where a hypothesis f lives.]

x ↦ k(·, x)        δ_x ↦ ∫ k(·, z) dδ_x(z)


Embedding of Marginal Distributions

[Diagram: distributions P and Q on X, with densities p(x), are mapped to their mean embeddings μ_P and μ_Q in the RKHS H, where a test function f lives.]

Definition. Let P be the space of all probability measures on a measurable space (X, Σ) and H an RKHS endowed with a reproducing kernel k : X × X → R. The kernel mean embedding is defined by

μ : P → H,    P ↦ ∫ k(·, x) dP(x).

Remark: For a Dirac measure δ_x, the embedding δ_x ↦ μ[δ_x] reduces to the feature map x ↦ k(·, x).
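As a concrete closed-form instance (a worked example for a one-dimensional Gaussian kernel and a Gaussian measure; this special case is not on the slides but follows from the convolution of two Gaussians):

```latex
% For k(x, y) = \exp(-(x - y)^2 / (2\theta^2)) and P = \mathcal{N}(m, \sigma^2),
% the mean embedding has the closed form
\mu_P(y) = \int k(x, y)\, dP(x)
         = \sqrt{\frac{\theta^2}{\theta^2 + \sigma^2}}\,
           \exp\!\left( -\frac{(y - m)^2}{2(\theta^2 + \sigma^2)} \right),
% i.e., a smoothed (wider, flatter) version of the density of P.
```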


- If E_{X∼P}[√k(X, X)] < ∞, then μ_P ∈ H and

E_{X∼P}[f(X)] = 〈f, μ_P〉_H for all f ∈ H.

- The kernel k is said to be characteristic if the map P ↦ μ_P is injective; that is, ‖μ_P − μ_Q‖_H = 0 if and only if P = Q.


Kernel Mean Estimation

- Given an i.i.d. sample x₁, x₂, …, xₙ from P, we can estimate μ_P by

μ̂_P := (1/n) ∑_{i=1}^n k(x_i, ·).

- For each f ∈ H, we then have (1/n) ∑_{i=1}^n f(x_i) = 〈f, μ̂_P〉.

- Assume that ‖f‖_∞ ≤ 1 for all f ∈ H with ‖f‖_H ≤ 1. With probability at least 1 − δ,

‖μ̂_P − μ_P‖_H ≤ 2 √(E_{x∼P}[k(x, x)] / n) + √(2 log(1/δ) / n).

- The convergence happens at the rate O_p(n^{−1/2}), which has been shown to be minimax optimal.¹

- In high-dimensional settings, the estimate can be improved by shrinkage estimators:²

μ̂_α := α f* + (1 − α) μ̂_P, f* ∈ H.

¹ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
² Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
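In code, μ̂_P is represented by the sample itself; evaluating it at a query point t is just a kernel average (a minimal sketch; gaussian_kernel is an illustrative helper):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))   # i.i.d. sample x_1, ..., x_n from P

def mu_hat(T):
    """Evaluate mu_hat_P(t) = (1/n) sum_i k(x_i, t) at query points T."""
    return gaussian_kernel(T, X).mean(axis=1)

# By the reproducing property, <k(t, .), mu_hat_P> = mu_hat_P(t).
print(mu_hat(np.zeros((1, 2))))
```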


Explicit Representation

What properties of P are captured by μ_P?

- k(x, x′) = 〈x, x′〉  →  the first moment of P
- k(x, x′) = (〈x, x′〉 + 1)^p  →  the moments of P up to order p ∈ N
- k(x, x′) universal/characteristic  →  all information about P

Moment-generating function. Consider k(x, x′) = exp(〈x, x′〉). Then μ_P = E_{X∼P}[e^{〈X, ·〉}], the moment-generating function of P.

Characteristic function. Consider k(x, y) = ψ(x − y) for x, y ∈ R^d, where ψ is a positive definite function. Then

μ_P(y) = ∫ ψ(x − y) dP(x) = (ψ ∗ P)(y),

and by Bochner's theorem ψ is the Fourier transform of a positive finite measure Λ, so μ_P encodes the characteristic function of P.


Application: High-Level Generalization

Learning from Distributions
  KM., Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.

Group Anomaly Detection
  [Diagram: groups of points in the input space embedded as points in the distribution space.]
  KM. and Schölkopf. UAI 2013.

Domain Adaptation/Generalization
  [Diagram: training data (X_k, Y_k) ∼ P^i_XY for domains i = 1, …, N; unseen test data X_k ∼ P_X.]
  KM. et al. ICML 2013; Zhang, KM., et al. ICML 2013.

Cause-Effect Inference (X → Y)
  Lopez-Paz, KM., et al. JMLR 2015, ICML 2015.


Support Measure Machine (SMM)

x ↦ k(·, x)        δ_x ↦ ∫ k(·, z) dδ_x(z)        P ↦ ∫ k(·, z) dP(z)

Theorem. Under technical assumptions on Ω : [0, +∞) → R and a loss function ℓ : (P × R²)^m → R ∪ {+∞}, any f ∈ H minimizing

ℓ(P₁, y₁, E_{P₁}[f], …, P_m, y_m, E_{P_m}[f]) + Ω(‖f‖_H)

admits a representation of the form

f = ∑_{i=1}^m α_i E_{x∼P_i}[k(x, ·)] = ∑_{i=1}^m α_i μ_{P_i}.


Kernel Discrepancy Measure for Distributions

- Maximum mean discrepancy (MMD):

MMD(P, Q; H) := sup_{h∈H, ‖h‖≤1} |∫ h(x) dP(x) − ∫ h(x) dQ(x)|

- MMD is an integral probability metric (IPM) and equals the RKHS distance between the mean embeddings:

MMD²(P, Q; H) = ‖μ_P − μ_Q‖²_H.

- If k is universal, then ‖μ_P − μ_Q‖_H = 0 if and only if P = Q.

- Given samples {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^m ∼ Q, the unbiased empirical MMD is

MMD²_u(P, Q; H) = 1/(n(n−1)) ∑_{i=1}^n ∑_{j≠i} k(x_i, x_j) + 1/(m(m−1)) ∑_{i=1}^m ∑_{j≠i} k(y_i, y_j) − 2/(nm) ∑_{i=1}^n ∑_{j=1}^m k(x_i, y_j).
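The empirical formula translates directly into code (a minimal sketch of the unbiased estimator; helper names are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = (gaussian_kernel(A, B, gamma)
                     for A, B in [(X, X), (Y, Y), (X, Y)])
    # Drop the diagonal terms in the within-sample averages.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X  = rng.normal(0.0, 1.0, size=(200, 1))
X2 = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as X
Y  = rng.normal(0.5, 1.0, size=(200, 1))   # shifted distribution
print(mmd2_unbiased(X, X2), mmd2_unbiased(X, Y))  # near 0 vs. clearly > 0
```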


Generative Adversarial Networks

Learn a deep generative model G via the minimax optimization

min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))],

where D is a discriminator and z ∼ N(0, σ²I).

[Diagram: random noise z passes through the generator G_θ; the discriminator D_φ judges whether a sample is real data x_i or synthetic data G_θ(z_i). Replacing the discriminator with an MMD test asks: is ‖μ_X − μ_{G_θ(Z)}‖_H zero?]

Page 51: An Introduction to Hilbert Space Embedding of Probability ...aisociety.kr/KJMLW2019/slides/kme-korea-22-02-2019.pdf · An Introduction to Hilbert Space Embedding of Probability Measures

23/34

Generative Moment Matching Network

- The GAN aims to match the two distributions P(X) and G_θ.

- The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), instead considers (a minimal training sketch follows below)

min_θ ‖μ_X − μ_{G_θ(Z)}‖²_H = min_θ ‖∫ φ(X) dP(X) − ∫ φ(X) dG_θ(X)‖²_H
                            = min_θ (sup_{h∈H, ‖h‖≤1} |∫ h dP − ∫ h dG_θ|)²

- Many tricks have been proposed to improve the GMMN:
  - optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  - gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  - repulsive loss (Wang et al., 2019)
  - optimized witness points (Mehrjou et al., 2019)
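A minimal GMMN-style training loop (a sketch in PyTorch under toy assumptions: a tiny fully-connected generator and a fixed Gaussian kernel; this does not reproduce any of the cited architectures):

```python
import torch
import torch.nn as nn

def mmd2_biased(X, Y, gamma=1.0):
    """Biased (V-statistic) MMD^2; differentiable in Y for training."""
    k = lambda A, B: torch.exp(-gamma * torch.cdist(A, B) ** 2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

# Toy generator G_theta mapping 2-D noise to 2-D samples.
G = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

real = torch.randn(256, 2) * 0.5 + 1.0   # stand-in for a real data batch
for step in range(1000):
    z = torch.randn(256, 2)              # z ~ N(0, I)
    loss = mmd2_biased(real, G(z))       # match mu_X and mu_{G_theta(Z)}
    opt.zero_grad(); loss.backward(); opt.step()
```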


Conditional Distribution P(Y|X)?

[Diagram: a joint sample over the input space X and output space Y.]

A collection of distributions P_Y := {P(Y | X = x) : x ∈ X}.

- For each x ∈ X, we can define an embedding of P(Y | X = x) as

μ_{Y|x} := ∫_Y ϕ(y) dP(y | X = x) = E_{Y|x}[ϕ(Y)],

where ϕ : Y → G is a feature map of Y.


Covariance Operators

- Let H, G be RKHSs on X, Y with feature maps φ(x) = k(x, ·) and ϕ(y) = ℓ(y, ·).

- Let C_XX : H → H and C_YX : H → G be the covariance operator on X and the cross-covariance operator from X to Y, i.e.,

C_XX = ∫ φ(X) ⊗ φ(X) dP(X),    C_YX = ∫ ϕ(Y) ⊗ φ(X) dP(Y, X).

- Alternatively, C_YX is the unique bounded operator satisfying

〈g, C_YX f〉_G = Cov[g(Y), f(X)].

- If E[g(Y) | X = ·] ∈ H for g ∈ G, then

C_XX E[g(Y) | X = ·] = C_XY g.


Embedding of Conditional Distributions

[Diagram: the feature k(x, ·) ∈ H is mapped by the operator C_YX C_XX⁻¹ to μ_{Y|X=x} ∈ G, the embedding of P(Y | X = x) with density p(y|x).]

The conditional mean embedding of P(Y|X) can be defined as

U_{Y|X} : H → G,    U_{Y|X} := C_YX C_XX⁻¹.


Conditional Mean Embedding

- To fully represent P(Y|X), we need to perform both conditioning and conditional expectation.

- To represent P(Y | X = x) for x ∈ X, it follows that

E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX⁻¹ k(x, ·) =: μ_{Y|x}.

- It follows from the reproducing property of G that

E_{Y|x}[g(Y) | X = x] = 〈μ_{Y|x}, g〉_G for all g ∈ G.

- In an infinite-dimensional RKHS, C_XX⁻¹ does not exist in general. Hence, we often use the regularized operator

U_{Y|X} := C_YX (C_XX + εI)⁻¹.


Conditional Mean Estimation

- Given a joint sample (x₁, y₁), …, (xₙ, yₙ) from P(X, Y), we have the empirical operators

Ĉ_XX = (1/n) ∑_{i=1}^n φ(x_i) ⊗ φ(x_i),    Ĉ_YX = (1/n) ∑_{i=1}^n ϕ(y_i) ⊗ φ(x_i).

- Then μ_{Y|x} for any x ∈ X can be estimated as (see the sketch below)

μ̂_{Y|x} = Ĉ_YX (Ĉ_XX + εI)⁻¹ k(x, ·) = Φ(K + nεI_n)⁻¹ k_x = ∑_{i=1}^n β_i ϕ(y_i),

where ε > 0 is a regularization parameter and

Φ = [ϕ(y₁), …, ϕ(yₙ)], K_ij = k(x_i, x_j), k_x = [k(x₁, x), …, k(xₙ, x)]ᵀ.

- Under some technical assumptions, μ̂_{Y|x} → μ_{Y|x} as n → ∞.
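A minimal sketch of this estimator on toy data (NumPy; the regression target and helper are illustrative, and pairing μ̂_{Y|x} with g(y) = y is only a heuristic since the identity is not exactly in a Gaussian RKHS):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(300, 1))   # toy joint sample

n, eps = len(X), 1e-3
K = gaussian_kernel(X, X)
k_x = gaussian_kernel(X, np.array([[1.0]]))       # k_x at the query x = 1.0
beta = np.linalg.solve(K + n * eps * np.eye(n), k_x)[:, 0]

# mu_hat_{Y|x} = sum_i beta_i phi(y_i); pairing it with g(y) = y gives a
# kernel-ridge-style estimate of E[Y | X = 1.0].
print(beta @ Y[:, 0], np.sin(1.0))   # the two values should be close
```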


Kernel Sum Rule: P(X) = ∑_Y P(X, Y)

- By the law of total expectation,

μ_X = E_X[φ(X)] = E_Y[E_{X|Y}[φ(X) | Y]] = E_Y[U_{X|Y} ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y)] = U_{X|Y} μ_Y.

- Let μ̂_Y = ∑_{i=1}^m α_i ϕ(ỹ_i) and Û_{X|Y} = Ĉ_XY Ĉ_YY⁻¹ be estimated from a sample {(x_j, y_j)}_{j=1}^n. Then

μ̂_X = Û_{X|Y} μ̂_Y = Ĉ_XY Ĉ_YY⁻¹ μ̂_Y = Υ(L + nλI)⁻¹ L̃ α,

where α = (α₁, …, α_m)ᵀ, Υ = [φ(x₁), …, φ(xₙ)], L_ij = ℓ(y_i, y_j), and L̃_ij = ℓ(y_i, ỹ_j).

- That is, we have (see the sketch below)

μ̂_X = ∑_{j=1}^n β_j φ(x_j)    with β = (L + nλI)⁻¹ L̃ α.
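A sketch of the kernel sum rule in code (NumPy; the prior sample ỹ_i, its uniform weights α, and the helper are illustrative):

```python
import numpy as np

def l(Y1, Y2, gamma=1.0):
    """Gram matrix of the kernel l on Y."""
    sq = np.sum(Y1**2, 1)[:, None] + np.sum(Y2**2, 1)[None, :] - 2 * Y1 @ Y2.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 1))                 # sample used to estimate U_{X|Y}
X = Y + 0.1 * rng.normal(size=(200, 1))       # paired x_j
Y_prior = rng.normal(0.5, 1.0, size=(50, 1))  # prior sample y~_1, ..., y~_m
alpha = np.full(50, 1.0 / 50)                 # uniform prior weights

n, lam = len(Y), 1e-3
L, L_tilde = l(Y, Y), l(Y, Y_prior)
beta = np.linalg.solve(L + n * lam * np.eye(n), L_tilde @ alpha)
# mu_hat_X = sum_j beta_j phi(x_j); evaluating it at a point t gives
# sum_j beta_j k(t, x_j).
print(beta[:5])
```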


Kernel Product Rule: P(X, Y) = P(Y|X) P(X)

- We can factorize μ_XY = E_XY[φ(X) ⊗ ϕ(Y)] as

E_Y[E_{X|Y}[φ(X) | Y] ⊗ ϕ(Y)] = U_{X|Y} E_Y[ϕ(Y) ⊗ ϕ(Y)]
E_X[E_{Y|X}[ϕ(Y) | X] ⊗ φ(X)] = U_{Y|X} E_X[φ(X) ⊗ φ(X)]

- Let μ⊗_X = E_X[φ(X) ⊗ φ(X)] and μ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)].

- Then the product rule becomes

μ_XY = U_{X|Y} μ⊗_Y = U_{Y|X} μ⊗_X.

- Alternatively, we may write the above formulation as

C_XY = U_{X|Y} C_YY    and    C_YX = U_{Y|X} C_XX.

- The kernel sum and product rules can be combined to obtain the kernel Bayes' rule.³

³ Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.


Future Directions

- Representation learning and embedding of distributions
- Kernel methods in deep learning:
  - MMD-GAN
  - Wasserstein autoencoder (WAE)
  - Invariant learning in deep neural networks
- Kernel mean estimation in high-dimensional settings
- Recovering (conditional) distributions from mean embeddings


Q & A