
Classification with imperfect training labels

Richard J. Samworth

University of Cambridge

39th Conference on Applied Statistics in Ireland (CASI 2019), Dundalk, Ireland

15 May 2019


Collaborators

Tim Cannings and Yingying Fan


Supervised classification


Classification and label noise

With perfect labels in the binary response setting, we observe

(X1, Y1), . . . , (Xn, Yn) i.i.d. ∼ P, taking values in Rd × {0, 1}.

Task: Predict the class Y of a new observation X, where (X, Y) ∼ P independently of the training data.

In many modern applications, however, it may be too expensive, difficult or time-consuming to determine class labels perfectly:

Uncorrupted: (X1, 1), (X2, 1), (X3, 0), (X4, 0), . . . , (Xn, 0)

Corrupted: (X1, 1), (X2, 0), (X3, 0), (X4, 0), . . . , (Xn, 1)
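As a minimal illustration (not part of the original slides), ρ-homogeneous noise can be simulated by flipping each training label independently with probability ρ. The helper below is a hypothetical function reused in later sketches.

```python
import numpy as np

def corrupt_labels(y, rho, rng=None):
    """Flip each binary label independently with probability rho
    (rho-homogeneous label noise)."""
    rng = np.random.default_rng(rng)
    flip = rng.random(len(y)) < rho          # True where a label is mislabelled
    return np.where(flip, 1 - y, y)

# Example: on average 30% of the labels are corrupted
y = np.array([1, 1, 0, 0, 0])
y_tilde = corrupt_labels(y, rho=0.3, rng=0)
```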


Existing work

The topic has been well-studied in the machine learning/computer science literature (Frénay and Kabán, 2014; Frénay and Verleysen, 2014).

• Lachenbruch (1966): LDA with zero intercept is consistent with ρ-homogeneous noise, where each observation is mislabelled independently with probability ρ ∈ (0, 1/2).

• Okamoto and Nobuhiro (1997) consider the k-nearest neighbour classifier with n = 32 and small k:

'. . . the predictive accuracy of 1-NN is strongly affected by . . . class noise'.

• Ghosh et al. (2015): 'Many standard algorithms such as SVM perform poorly in the presence of label noise'.

Other work seeks to identify mislabelled observations and flip or remove them.


Motivating example

[Figure: two scatter plots of the simulated training data, one per panel.]

Priors π0 = 0.9, π1 = 0.1. Class conditionals X | Y = 0 ∼ N2((−1, 0)ᵀ, I2), X | Y = 1 ∼ N2((1, 0)ᵀ, I2), n = 1000.

Left: no noise; right: ρ-homogeneous noise with ρ = 0.3.
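The simulation behind this example can be sketched as follows (my own code, under the stated model, reusing the hypothetical corrupt_labels helper from above).

```python
import numpy as np

rng = np.random.default_rng(2019)
n, rho = 1000, 0.3
means = np.array([[-1.0, 0.0], [1.0, 0.0]])        # mu_0 and mu_1
y = rng.binomial(1, 0.1, size=n)                   # priors pi_0 = 0.9, pi_1 = 0.1
X = means[y] + rng.standard_normal((n, 2))         # X | Y = r ~ N_2(mu_r, I_2)
y_tilde = corrupt_labels(y, rho, rng)              # 0.3-homogeneous label noise
```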


Risks in motivating example

[Plot: Error against log(n).]

Misclassification error for predicting the true label of the test point, for the knn (black), SVM (red) and LDA (blue) classifiers. Solid lines: no label noise; dashed lines: 0.3-homogeneous label noise.


Statistical setting

Let (X, Y, Ỹ), (X1, Y1, Ỹ1), . . . , (Xn, Yn, Ỹn) be i.i.d. triples taking values in X × {0, 1} × {0, 1}.

We observe (X1, Ỹ1), . . . , (Xn, Ỹn) and X. The task is to predict Y.

• For x ∈ X, define the regression function

η(x) := P(Y = 1 | X = x)

and its corrupted version

η̃(x) := P(Ỹ = 1 | X = x).

• For x ∈ X and r ∈ {0, 1}, the conditional noise probabilities are

ρr(x) := P(Ỹ ≠ Y | X = x, Y = r).

We also write PX for the marginal distribution of X .


Classifiers

A classifier C is a (measurable) function from X to {0, 1}.

The risk R(C) := P{C(X) ≠ Y} is minimised by the Bayes classifier

CBayes(x) := 1 if η(x) ≥ 1/2, and 0 otherwise.

A classifier Cn, depending on the training data, is said to be consistent if R(Cn) → R(CBayes) as n → ∞.

The corrupted risk R̃(C) := P{C(X) ≠ Ỹ} is minimised by the corrupted Bayes classifier

C̃Bayes(x) := 1 if η̃(x) ≥ 1/2, and 0 otherwise.
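For the motivating Gaussian model, the following sketch (mine, not from the talk) evaluates η and, under ρ-homogeneous noise, its corrupted version η̃(x) = ρ + (1 − 2ρ)η(x); for ρ < 1/2 the two Bayes classifiers make the same decision, consistent with the general result on the next slide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def eta(x, pi1=0.1, mu0=(-1.0, 0.0), mu1=(1.0, 0.0)):
    """eta(x) = P(Y = 1 | X = x) for the two-Gaussian motivating example."""
    f0 = multivariate_normal.pdf(x, mean=mu0)      # identity covariance by default
    f1 = multivariate_normal.pdf(x, mean=mu1)
    return pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)

def eta_tilde(x, rho=0.3, **model):
    """Corrupted regression function under rho-homogeneous noise."""
    return rho + (1 - 2 * rho) * eta(x, **model)

x0 = np.array([0.3, -0.2])
c_bayes = int(eta(x0) >= 0.5)                    # Bayes classifier at x0
c_bayes_corrupted = int(eta_tilde(x0) >= 0.5)    # same decision whenever rho < 1/2
```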


General finite-sample result

Let S := {x ∈ X : η(x) = 1/2}, let B := {x ∈ Sᶜ : ρ0(x) + ρ1(x) < 1} and let

A := {x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] < 1}.

Theorem.
(i) PX(A △ {x ∈ B : CBayes(x) = C̃Bayes(x)}) = 0.

(ii) Now suppose there exist ρ∗ < 1/2 and a∗ < 1 such that PX({x ∈ Sᶜ : ρ0(x) + ρ1(x) > 2ρ∗}) = 0 and

PX({x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] > a∗}) = 0.

Then, for any classifier C,

R(C) − R(CBayes) ≤ [R̃(C) − R̃(C̃Bayes)] / [(1 − 2ρ∗)(1 − a∗)].


Discussion

• This result is particularly useful when the classifier C is trained using the noisy labels, i.e. with (X1, Ỹ1), . . . , (Xn, Ỹn), since then the training and test data in R̃(C) have the same distribution.

• We can then find conditions under which a classifier trained with imperfect labels will remain consistent for classifying uncorrupted test data points.

For specific classifiers and under stronger conditions, we can provide further control of the excess risk R(C) − R(CBayes).


The k-nearest neighbour classifier

For x ∈ Rd, let (X(1), Ỹ(1)), . . . , (X(n), Ỹ(n)) be the reordering of the corrupted training data pairs such that

‖X(1) − x‖ ≤ . . . ≤ ‖X(n) − x‖.

Define

Cknn(x) := 1 if (1/k) Σ_{i=1}^{k} 1{Ỹ(i) = 1} ≥ 1/2, and 0 otherwise.

Corollary. Assume the conditions of part (ii) of the lemma. If k = kn → ∞ but k/n → 0, then

R(Cknn) − R(CBayes) → 0

as n → ∞.
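A sketch of this in practice (my own code, continuing the hypothetical simulation above): train scikit-learn's k-nearest neighbour classifier on the corrupted labels with a slowly growing k, here k ≈ √n, which satisfies k → ∞ and k/n → 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

k = max(1, int(np.sqrt(len(y_tilde))))                # one choice with k -> inf, k/n -> 0
knn = KNeighborsClassifier(n_neighbors=k).fit(X, y_tilde)

# Monte Carlo estimate of R(Cknn): error against the *uncorrupted* test labels
y_test = rng.binomial(1, 0.1, size=10_000)
X_test = means[y_test] + rng.standard_normal((10_000, 2))
risk_knn = np.mean(knn.predict(X_test) != y_test)
```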


Further assumptions

• Label noise: Assume the conditions of part (ii) of the lemma and that

ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)),

where g : (0, 1) → [0, 1) is twice differentiable. Assume that g′(1/2) > 2g(1/2) − 1 and that g′′ is uniformly continuous.

• Distribution (Cannings et al., 2018): Among other technical conditions, assume that PX has a density f, that η is twice continuously differentiable with inf_{x0∈S} ‖η′(x0)‖ > 0, and that

∫_{Rd} ‖x‖^α f(x) dx < ∞.

• For β ∈ (0, 1/2), let

Kβ := {⌈(n − 1)^β⌉, . . . , ⌊(n − 1)^{1−β}⌋}.


Asymptotic expansion

Theorem. Under our assumptions, we have two cases:

(i) Suppose that d ≥ 5 and α > 4d/(d − 4), and let νn,k := k⁻¹ + (k/n)^{4/d}. Then there exist B1 = B1(d, P) > 0 and B2 = B2(d, P) ≥ 0 such that, for each β ∈ (0, 1/2),

R(Cknn) − R(CBayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + B2 (k/n)^{4/d} + o(νn,k)

as n → ∞, uniformly for k ∈ Kβ.

(ii) Suppose that either d ≤ 4, or d ≥ 5 and α ≤ 4d/(d − 4). Then for each ε > 0 and β ∈ (0, 1/2), we have

R(Cknn) − R(CBayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + o(1/k + (k/n)^{α/(α+d) − ε})

as n → ∞, uniformly for k ∈ Kβ.


Relative asymptotic performance

Given k to be used by the knn classifier in the noiseless case, let

kg := ⌊{1 − 2g(1/2) + g′(1/2)}^{−2d/(d+4)} k⌋.

This coupling reflects the ratio of the optimal choices of k in the corrupted and uncorrupted settings.

Corollary. Under the assumptions of part (i) of the theorem, and provided B2 > 0, we have that for any β ∈ (0, 1/2),

[R(Ckgnn) − R(CBayes)] / [R(Cknn) − R(CBayes)] → 1 / {1 − 2g(1/2) + g′(1/2)}^{8/(d+4)}

as n → ∞, uniformly for k ∈ Kβ.

If g′(1/2) > 2g(1/2), then the label noise improves the asymptotic performance!


Intuition

For x ∈ Sᶜ, we have

η̃(x) − 1/2 = {1 − ρ1(x)}η(x) + ρ0(x){1 − η(x)} − 1/2
            = {η(x) − 1/2}[1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1}].

But, writing t := η(x) − 1/2,

1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1} = 1 − g(1/2 + t) − g(1/2 − t) + {g(1/2 + t) − g(1/2 − t)}/(2t) → 1 − 2g(1/2) + g′(1/2) as t → 0.


Estimated regret ratios

Model: X | Y = r ∼ N5(µr, I5), where µ1 = (3/2, 0, 0, 0, 0)ᵀ = −µ0, and π1 = 0.5.

Labels: Let g(1/2 + t) = 0 ∨ min{g0(1 + h0 t), 2g0}, then set ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)).
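A sketch of this noise mechanism and of the corollary's limiting regret ratio (my own code; for d = 5, g0 = 0.1, h0 = 1 it reproduces the 1.10 entry in the table below).

```python
import numpy as np

def g(u, g0=0.1, h0=1.0):
    """g(1/2 + t) = max(0, min(g0 * (1 + h0 * t), 2 * g0)), evaluated at u = 1/2 + t."""
    return np.clip(g0 * (1 + h0 * (u - 0.5)), 0.0, 2 * g0)

def noise_probs(eta_x, g0=0.1, h0=1.0):
    """Conditional mislabelling probabilities rho_0(x) = g(eta(x)), rho_1(x) = g(1 - eta(x))."""
    return g(eta_x, g0, h0), g(1.0 - eta_x, g0, h0)

d, g0, h0 = 5, 0.1, 1.0
factor = 1 - 2 * g(0.5, g0, h0) + g0 * h0     # 1 - 2 g(1/2) + g'(1/2)
regret_ratio = factor ** (-8 / (d + 4))       # asymptotic regret ratio, approx 1.10
```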

[Plot: estimated regret ratio against log(n) for each (g0, h0) setting, with the asymptotic values below.]

g0    h0    Asymptotic regret ratio
0.1   −1    1.37
0.1    0    1.22
0.1    1    1.10
0.1    2    1
0.1    3    0.92


Support Vector Machines

Let H denote an RKHS, and let L(y, t) := max{0, 1 − (2y − 1)t} denote the hinge loss function. The SVM classifier is given by

CSVM(x) := 1 if f(x) ≥ 0, and 0 otherwise,

where

f ∈ argmin_{f∈H} {(1/n) Σ_{i=1}^{n} L(Yi, f(Xi)) + λ‖f‖²_H}.

We focus on the case where H has the Gaussian radial basis reproducing kernel function K(x, x′) := exp(−σ²‖x − x′‖²), for σ > 0.
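A rough scikit-learn sketch of this classifier trained on the corrupted labels (my own code, reusing the simulated data above). SVC's RBF kernel is exp(−γ‖x − x′‖²), so γ plays the role of σ², and its C parameter corresponds roughly to 1/(2nλ) in the penalised formulation.

```python
import numpy as np
from sklearn.svm import SVC

n, sigma, lam = len(y_tilde), 1.0, 1e-3
svm = SVC(kernel="rbf", gamma=sigma ** 2, C=1.0 / (2 * n * lam)).fit(X, y_tilde)

# Error for predicting the true labels of the clean test points
risk_svm = np.mean(svm.predict(X_test) != y_test)
```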


SVM asymptotic analysis

If PX is compactly supported and λ = λn is chosen appropriately, then this SVM classifier is consistent in the uncorrupted labels case (Steinwart, 2005).

Corollary. Assume the conditions of our lemma, and suppose that PX is compactly supported. If λ = λn → 0 but nλn/|log λn|^{d+1} → ∞, then

R(CSVM) − R(CBayes) → 0

as n → ∞.


SVM assumptions

1. We say that the distribution P satisfies the margin assumption with parameter γ1 ∈ [0,∞) if there exists κ1 > 0 such that

PX({x ∈ Rd : 0 < |η(x) − 1/2| ≤ t}) ≤ κ1 t^{γ1}

for all t > 0.

2. Let S+ := {x ∈ Rd : η(x) > 1/2} and S− := {x ∈ Rd : η(x) < 1/2}, and for x ∈ Rd, let τx := inf_{x′∈S∪S+} ‖x − x′‖ + inf_{x′∈S∪S−} ‖x − x′‖. Say P has geometric noise exponent γ2 ∈ [0,∞) if there exists κ2 > 0 such that

∫_{Rd} |2η(x) − 1| exp(−τx²/t²) dPX(x) ≤ κ2 t^{γ2 d}

for all t > 0.


Rate of convergence

With perfect labels and when PX(B(0, 1)) = 1, the excess risk of the SVM classifier is O(n^{−Γ+ε}) for every ε > 0, where

Γ := γ2/(2γ2 + 1) if γ2 ≤ (γ1 + 2)/(2γ1), and Γ := 2γ2(γ1 + 1)/{2γ2(γ1 + 2) + 3γ1 + 4} otherwise

(Steinwart and Scovel, 2007).

Theorem. Suppose that P has margin parameter γ1 ∈ [0,∞], geometric noise exponent γ2 ∈ (0,∞) and PX(B(0, 1)) = 1. Assume the conditions of the lemma and that ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)), where g : (0, 1) → [0, 1) is differentiable at 1/2.

Let λ = λn := n^{−(γ2+1)Γ/γ2} and σ = σn := n^{Γ/(γ2 d)}. Then

R(CSVM) − R(CBayes) = O(n^{−Γ+ε})

as n → ∞, for every ε > 0.


Linear Discriminant Analysis

Suppose that Pr = Nd(µr, Σ) for r = 0, 1. Then

CBayes(x) = 1 if log(π1/π0) + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) ≥ 0, and 0 otherwise.

Define

CLDA(x) := 1 if log(π̂1/π̂0) + {x − (µ̂0 + µ̂1)/2}ᵀ Σ̂⁻¹ (µ̂1 − µ̂0) ≥ 0, and 0 otherwise,

where π̂r := n⁻¹ Σ_{i=1}^{n} 1{Ỹi = r}, µ̂r := Σ_{i=1}^{n} Xi 1{Ỹi = r} / Σ_{i=1}^{n} 1{Ỹi = r}, and

Σ̂ := {1/(n − 2)} Σ_{i=1}^{n} Σ_{r=0}^{1} (Xi − µ̂r)(Xi − µ̂r)ᵀ 1{Ỹi = r}.
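A sketch of these plug-in estimates computed from the noisy labels (my own code, reusing the simulated data above; scikit-learn's LinearDiscriminantAnalysis is a close analogue).

```python
import numpy as np

def lda_fit(X, y_tilde):
    """Plug-in LDA estimates based on the observed (possibly corrupted) labels."""
    n = len(y_tilde)
    pi_hat = np.array([np.mean(y_tilde == r) for r in (0, 1)])
    mu_hat = np.array([X[y_tilde == r].mean(axis=0) for r in (0, 1)])
    centred = X - mu_hat[y_tilde]                 # X_i minus the mean of its observed class
    Sigma_hat = centred.T @ centred / (n - 2)     # pooled covariance estimate
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(x, pi_hat, mu_hat, Sigma_hat):
    """Evaluate the fitted LDA rule at x (or at the rows of a matrix x)."""
    w = np.linalg.solve(Sigma_hat, mu_hat[1] - mu_hat[0])
    score = np.log(pi_hat[1] / pi_hat[0]) + (x - mu_hat.mean(axis=0)) @ w
    return (score >= 0).astype(int)

pi_hat, mu_hat, Sigma_hat = lda_fit(X, y_tilde)
risk_lda = np.mean(lda_predict(X_test, pi_hat, mu_hat, Sigma_hat) != y_test)
```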


LDA asymptotic analysis

Theorem. Assume we have ρ-homogeneous noise (ρ < 1/2) and suppose that Pr = Nd(µr, Σ), for r = 0, 1. Then

lim_{n→∞} CLDA(x) = 1 if c0 + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) > 0, and 0 if c0 + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) < 0,

where c0 can be expressed in terms of ∆² := (µ1 − µ0)ᵀ Σ⁻¹ (µ1 − µ0), ρ and π1. As a consequence,

lim_{n→∞} R(CLDA) = π0 Φ(c0/∆ − ∆/2) + π1 Φ(−c0/∆ − ∆/2) ≥ R(CBayes),   (1)

with equality if π0 = π1 = 1/2. Moreover, for each ρ ∈ (0, 1/2) and π0 ≠ π1, there is a unique value of ∆ > 0 for which we have equality in (1).


LDA with ρ-homogeneous noise

[Plot: Error against log(n) for LDA under increasing noise levels.]

Here, X | Y = r ∼ N5(µr, I5), where µ1 = (3/2, 0, . . . , 0)ᵀ = −µ0 ∈ R5, and π1 = 0.9.

No label noise (black); ρ-homogeneous noise for ρ = 0.1 (red), 0.2 (blue), 0.3 (green) and 0.4 (purple). The dotted lines show our asymptotic limit.


Summary

• The knn and SVM classifiers remain consistent with label noise under mild assumptions on the noise mechanism and data distribution.

• Under stronger conditions, the rate of convergence of the excess risk for these classifiers is preserved.

• However, the LDA classifier is typically not consistent, unless the class priors are equal (even with homogeneous noise).

Main reference:

• Cannings, T. I., Fan, Y. and Samworth, R. J. (2018) Classification with imperfect training labels. https://arxiv.org/abs/1805.11505.


Other references

• Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2018) Local nearest neighbour classification with applications to semi-supervised learning. https://arxiv.org/abs/1704.00642v2.

• Frénay, B. and Kabán, A. (2014) A comprehensive introduction to label noise. Proc. Euro. Symp. Artificial Neural Networks, 667–676.

• Frénay, B. and Verleysen, M. (2014) Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst., 25, 845–869.

• Ghosh, A., Manwani, N. and Sastry, P. S. (2015) Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.

• Lachenbruch, P. A. (1966) Discriminant analysis when the initial samples are misclassified. Technometrics, 8, 657–662.

• Okamoto, S. and Nobuhiro, Y. (1997) An average-case analysis of the k-nearest neighbor classifier for noisy domains. In Proc. 15th Int. Joint Conf. Artif. Intell., 1, 238–243.

• Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Theory, 51, 128–142.

• Steinwart, I. and Scovel, C. (2007) Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35, 575–607.
