
Classification with imperfect training labels

Richard J. Samworth

University of Cambridge

39th Conference on Applied Statistics in Ireland (CASI 2019), Dundalk, Ireland

15 May 2019


Collaborators

Tim Cannings and Yingying Fan


Supervised classification


Classification and label noise

With perfect labels in the binary response setting, we observe

(X1, Y1), . . . , (Xn, Yn) i.i.d. ∼ P, taking values in Rd × {0, 1}.

Task: Predict the class Y of a new observation X, where (X, Y) ∼ P independently of the training data.

In many modern applications, however, it may be too expensive, difficult or time-consuming to determine class labels perfectly:

Uncorrupted: (X1, 1), (X2, 1), (X3, 0), (X4, 0), . . . , (Xn, 0)

Corrupted: (X1, 1), (X2, 0), (X3, 0), (X4, 0), . . . , (Xn, 1)
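As a minimal illustration (not part of the original slides), ρ-homogeneous noise can be simulated by flipping each training label independently with probability ρ. The helper below is a hypothetical function reused in later sketches.

```python
import numpy as np

def corrupt_labels(y, rho, rng=None):
    """Flip each binary label independently with probability rho
    (rho-homogeneous label noise)."""
    rng = np.random.default_rng(rng)
    flip = rng.random(len(y)) < rho          # True where a label is mislabelled
    return np.where(flip, 1 - y, y)

# Example: on average 30% of the labels are corrupted
y = np.array([1, 1, 0, 0, 0])
y_tilde = corrupt_labels(y, rho=0.3, rng=0)
```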


Existing work

The topic has been well-studied in the machine learning/computer science literature (Frénay and Kabán, 2014; Frénay and Verleysen, 2014).

• Lachenbruch (1966): LDA with zero intercept is consistent with ρ-homogeneous noise, where each observation is mislabelled independently with probability ρ ∈ (0, 1/2).

• Okamoto and Nobuhiro (1997) consider the k-nearest neighbour classifier with n = 32 and small k:

'. . . the predictive accuracy of 1-NN is strongly affected by . . . class noise'.

• Ghosh et al. (2015): 'Many standard algorithms such as SVM perform poorly in the presence of label noise'.

Other work seeks to identify mislabelled observations and flip or remove them.


Motivating example

[Figure: two scatter plots of the simulated training data, one per panel.]

Priors π0 = 0.9, π1 = 0.1. Class conditionals X | Y = 0 ∼ N2((−1, 0)ᵀ, I2), X | Y = 1 ∼ N2((1, 0)ᵀ, I2), n = 1000.

Left: no noise; right: ρ-homogeneous noise with ρ = 0.3.
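The simulation behind this example can be sketched as follows (my own code, under the stated model, reusing the hypothetical corrupt_labels helper from above).

```python
import numpy as np

rng = np.random.default_rng(2019)
n, rho = 1000, 0.3
means = np.array([[-1.0, 0.0], [1.0, 0.0]])        # mu_0 and mu_1
y = rng.binomial(1, 0.1, size=n)                   # priors pi_0 = 0.9, pi_1 = 0.1
X = means[y] + rng.standard_normal((n, 2))         # X | Y = r ~ N_2(mu_r, I_2)
y_tilde = corrupt_labels(y, rho, rng)              # 0.3-homogeneous label noise
```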


Risks in motivating example

[Plot: Error against log(n).]

Misclassification error for predicting the true label of the test point, for the knn (black), SVM (red) and LDA (blue) classifiers. Solid lines: no label noise; dashed lines: 0.3-homogeneous label noise.


Statistical setting

Let (X, Y, Ỹ), (X1, Y1, Ỹ1), . . . , (Xn, Yn, Ỹn) be i.i.d. triples taking values in X × {0, 1} × {0, 1}.

We observe (X1, Ỹ1), . . . , (Xn, Ỹn) and X. The task is to predict Y.

• For x ∈ X, define the regression function

η(x) := P(Y = 1 | X = x)

and its corrupted version

η̃(x) := P(Ỹ = 1 | X = x).

• For x ∈ X and r ∈ {0, 1}, the conditional noise probabilities are

ρr(x) := P(Ỹ ≠ Y | X = x, Y = r).

We also write PX for the marginal distribution of X .


Classifiers

A classifier C is a (measurable) function from X to {0, 1}.

The risk R(C) := P{C(X) ≠ Y} is minimised by the Bayes classifier

CBayes(x) := 1 if η(x) ≥ 1/2, and 0 otherwise.

A classifier Cn, depending on the training data, is said to be consistent if R(Cn) → R(CBayes) as n → ∞.

The corrupted risk R̃(C) := P{C(X) ≠ Ỹ} is minimised by the corrupted Bayes classifier

C̃Bayes(x) := 1 if η̃(x) ≥ 1/2, and 0 otherwise.
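For the motivating Gaussian model, the following sketch (mine, not from the talk) evaluates η and, under ρ-homogeneous noise, its corrupted version η̃(x) = ρ + (1 − 2ρ)η(x); for ρ < 1/2 the two Bayes classifiers make the same decision, consistent with the general result on the next slide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def eta(x, pi1=0.1, mu0=(-1.0, 0.0), mu1=(1.0, 0.0)):
    """eta(x) = P(Y = 1 | X = x) for the two-Gaussian motivating example."""
    f0 = multivariate_normal.pdf(x, mean=mu0)      # identity covariance by default
    f1 = multivariate_normal.pdf(x, mean=mu1)
    return pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)

def eta_tilde(x, rho=0.3, **model):
    """Corrupted regression function under rho-homogeneous noise."""
    return rho + (1 - 2 * rho) * eta(x, **model)

x0 = np.array([0.3, -0.2])
c_bayes = int(eta(x0) >= 0.5)                    # Bayes classifier at x0
c_bayes_corrupted = int(eta_tilde(x0) >= 0.5)    # same decision whenever rho < 1/2
```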


General finite-sample result

Let S := {x ∈ X : η(x) = 1/2}, let B := {x ∈ Sᶜ : ρ0(x) + ρ1(x) < 1} and let

A := {x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] < 1}.

Theorem.
(i) PX(A △ {x ∈ B : CBayes(x) = C̃Bayes(x)}) = 0.

(ii) Now suppose there exist ρ∗ < 1/2 and a∗ < 1 such that PX({x ∈ Sᶜ : ρ0(x) + ρ1(x) > 2ρ∗}) = 0 and

PX({x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] > a∗}) = 0.

Then, for any classifier C,

R(C) − R(CBayes) ≤ [R̃(C) − R̃(C̃Bayes)] / [(1 − 2ρ∗)(1 − a∗)].


Discussion

• This result is particularly useful when the classifier C is trained using the noisy labels, i.e. with (X1, Ỹ1), . . . , (Xn, Ỹn), since then the training and test data in R̃(C) have the same distribution.

• We can then find conditions under which a classifier trained with imperfect labels will remain consistent for classifying uncorrupted test data points.

For specific classifiers and under stronger conditions, we can provide further control of the excess risk R(C) − R(CBayes).


The k-nearest neighbour classifier

For x ∈ Rd, let (X(1), Ỹ(1)), . . . , (X(n), Ỹ(n)) be the reordering of the corrupted training data pairs such that

‖X(1) − x‖ ≤ . . . ≤ ‖X(n) − x‖.

Define

Cknn(x) := 1 if (1/k) Σ_{i=1}^{k} 1{Ỹ(i) = 1} ≥ 1/2, and 0 otherwise.

Corollary. Assume the conditions of part (ii) of the lemma. If k = kn → ∞ but k/n → 0, then

R(Cknn) − R(CBayes) → 0

as n → ∞.
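A sketch of this in practice (my own code, continuing the hypothetical simulation above): train scikit-learn's k-nearest neighbour classifier on the corrupted labels with a slowly growing k, here k ≈ √n, which satisfies k → ∞ and k/n → 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

k = max(1, int(np.sqrt(len(y_tilde))))                # one choice with k -> inf, k/n -> 0
knn = KNeighborsClassifier(n_neighbors=k).fit(X, y_tilde)

# Monte Carlo estimate of R(Cknn): error against the *uncorrupted* test labels
y_test = rng.binomial(1, 0.1, size=10_000)
X_test = means[y_test] + rng.standard_normal((10_000, 2))
risk_knn = np.mean(knn.predict(X_test) != y_test)
```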


Further assumptions

• Label noise: Assume the conditions of part (ii) of the lemma and that

ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)),

where g : (0, 1) → [0, 1) is twice differentiable. Assume that g′(1/2) > 2g(1/2) − 1 and that g′′ is uniformly continuous.

• Distribution (Cannings et al., 2018): Among other technical conditions, assume that PX has a density f, that η is twice continuously differentiable with inf_{x0∈S} ‖η′(x0)‖ > 0, and that

∫_{Rd} ‖x‖^α f(x) dx < ∞.

• For β ∈ (0, 1/2), let

Kβ := {⌈(n − 1)^β⌉, . . . , ⌊(n − 1)^{1−β}⌋}.


Asymptotic expansion

Theorem. Under our assumptions, we have two cases:

(i) Suppose that d ≥ 5 and α > 4d/(d − 4), and let νn,k := k⁻¹ + (k/n)^{4/d}. Then there exist B1 = B1(d, P) > 0 and B2 = B2(d, P) ≥ 0 such that, for each β ∈ (0, 1/2),

R(Cknn) − R(CBayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + B2 (k/n)^{4/d} + o(νn,k)

as n → ∞, uniformly for k ∈ Kβ.

(ii) Suppose that either d ≤ 4, or d ≥ 5 and α ≤ 4d/(d − 4). Then for each ε > 0 and β ∈ (0, 1/2), we have

R(Cknn) − R(CBayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + o(1/k + (k/n)^{α/(α+d) − ε})

as n → ∞, uniformly for k ∈ Kβ.


Relative asymptotic performance

Given k to be used by the knn classifier in the noiseless case, let

kg := ⌊{1 − 2g(1/2) + g′(1/2)}^{−2d/(d+4)} k⌋.

This coupling reflects the ratio of the optimal choices of k in the corrupted and uncorrupted settings.

Corollary. Under the assumptions of part (i) of the theorem, and provided B2 > 0, we have that for any β ∈ (0, 1/2),

[R(Ckgnn) − R(CBayes)] / [R(Cknn) − R(CBayes)] → 1 / {1 − 2g(1/2) + g′(1/2)}^{8/(d+4)}

as n → ∞, uniformly for k ∈ Kβ.

If g′(1/2) > 2g(1/2), then the label noise improves the asymptotic performance!


Intuition

For x ∈ Sᶜ, we have

η̃(x) − 1/2 = {1 − ρ1(x)}η(x) + ρ0(x){1 − η(x)} − 1/2
            = {η(x) − 1/2}[1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1}].

But, writing t := η(x) − 1/2,

1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1} = 1 − g(1/2 + t) − g(1/2 − t) + {g(1/2 + t) − g(1/2 − t)}/(2t) → 1 − 2g(1/2) + g′(1/2) as t → 0.


Estimated regret ratios

Model: X | Y = r ∼ N5(µr, I5), where µ1 = (3/2, 0, 0, 0, 0)ᵀ = −µ0, and π1 = 0.5.

Labels: Let g(1/2 + t) = 0 ∨ min{g0(1 + h0 t), 2g0}, then set ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)).
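A sketch of this noise mechanism and of the corollary's limiting regret ratio (my own code; for d = 5, g0 = 0.1, h0 = 1 it reproduces the 1.10 entry in the table below).

```python
import numpy as np

def g(u, g0=0.1, h0=1.0):
    """g(1/2 + t) = max(0, min(g0 * (1 + h0 * t), 2 * g0)), evaluated at u = 1/2 + t."""
    return np.clip(g0 * (1 + h0 * (u - 0.5)), 0.0, 2 * g0)

def noise_probs(eta_x, g0=0.1, h0=1.0):
    """Conditional mislabelling probabilities rho_0(x) = g(eta(x)), rho_1(x) = g(1 - eta(x))."""
    return g(eta_x, g0, h0), g(1.0 - eta_x, g0, h0)

d, g0, h0 = 5, 0.1, 1.0
factor = 1 - 2 * g(0.5, g0, h0) + g0 * h0     # 1 - 2 g(1/2) + g'(1/2)
regret_ratio = factor ** (-8 / (d + 4))       # asymptotic regret ratio, approx 1.10
```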

[Plot: estimated regret ratio against log(n) for each (g0, h0) setting, with the asymptotic values below.]

g0    h0    Asymptotic regret ratio
0.1   −1    1.37
0.1    0    1.22
0.1    1    1.10
0.1    2    1
0.1    3    0.92


Support Vector Machines

Let H denote an RKHS, and let L(y, t) := max{0, 1 − (2y − 1)t} denote the hinge loss function. The SVM classifier is given by

CSVM(x) := 1 if f(x) ≥ 0, and 0 otherwise,

where

f ∈ argmin_{f∈H} {(1/n) Σ_{i=1}^{n} L(Yi, f(Xi)) + λ‖f‖²_H}.

We focus on the case where H has the Gaussian radial basis reproducing kernel function K(x, x′) := exp(−σ²‖x − x′‖²), for σ > 0.
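A rough scikit-learn sketch of this classifier trained on the corrupted labels (my own code, reusing the simulated data above). SVC's RBF kernel is exp(−γ‖x − x′‖²), so γ plays the role of σ², and its C parameter corresponds roughly to 1/(2nλ) in the penalised formulation.

```python
import numpy as np
from sklearn.svm import SVC

n, sigma, lam = len(y_tilde), 1.0, 1e-3
svm = SVC(kernel="rbf", gamma=sigma ** 2, C=1.0 / (2 * n * lam)).fit(X, y_tilde)

# Error for predicting the true labels of the clean test points
risk_svm = np.mean(svm.predict(X_test) != y_test)
```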


SVM asymptotic analysis

If PX is compactly supported and λ = λn is chosen appropriately, then this SVM classifier is consistent in the uncorrupted labels case (Steinwart, 2005).

Corollary. Assume the conditions of our lemma, and suppose that PX is compactly supported. If λ = λn → 0 but nλn/|log λn|^{d+1} → ∞, then

R(CSVM) − R(CBayes) → 0

as n → ∞.


SVM assumptions

1. We say that the distribution P satisfies the margin assumption with parameter γ1 ∈ [0,∞) if there exists κ1 > 0 such that

PX({x ∈ Rd : 0 < |η(x) − 1/2| ≤ t}) ≤ κ1 t^{γ1}

for all t > 0.

2. Let S+ := {x ∈ Rd : η(x) > 1/2} and S− := {x ∈ Rd : η(x) < 1/2}, and for x ∈ Rd, let τx := inf_{x′∈S∪S+} ‖x − x′‖ + inf_{x′∈S∪S−} ‖x − x′‖. Say P has geometric noise exponent γ2 ∈ [0,∞) if there exists κ2 > 0 such that

∫_{Rd} |2η(x) − 1| exp(−τx²/t²) dPX(x) ≤ κ2 t^{γ2 d}

for all t > 0.


Rate of convergence

With perfect labels and when PX(B(0, 1)) = 1, the excess risk of the SVM classifier is O(n^{−Γ+ε}) for every ε > 0, where

Γ := γ2/(2γ2 + 1) if γ2 ≤ (γ1 + 2)/(2γ1), and Γ := 2γ2(γ1 + 1)/{2γ2(γ1 + 2) + 3γ1 + 4} otherwise

(Steinwart and Scovel, 2007).

Theorem. Suppose that P has margin parameter γ1 ∈ [0,∞], geometric noise exponent γ2 ∈ (0,∞) and PX(B(0, 1)) = 1. Assume the conditions of the lemma and that ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)), where g : (0, 1) → [0, 1) is differentiable at 1/2.

Let λ = λn := n^{−(γ2+1)Γ/γ2} and σ = σn := n^{Γ/(γ2 d)}. Then

R(CSVM) − R(CBayes) = O(n^{−Γ+ε})

as n → ∞, for every ε > 0.


Linear Discriminant Analysis

Suppose that Pr = Nd(µr, Σ) for r = 0, 1. Then

CBayes(x) = 1 if log(π1/π0) + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) ≥ 0, and 0 otherwise.

Define

CLDA(x) := 1 if log(π̂1/π̂0) + {x − (µ̂0 + µ̂1)/2}ᵀ Σ̂⁻¹ (µ̂1 − µ̂0) ≥ 0, and 0 otherwise,

where π̂r := n⁻¹ Σ_{i=1}^{n} 1{Ỹi = r}, µ̂r := Σ_{i=1}^{n} Xi 1{Ỹi = r} / Σ_{i=1}^{n} 1{Ỹi = r}, and

Σ̂ := {1/(n − 2)} Σ_{i=1}^{n} Σ_{r=0}^{1} (Xi − µ̂r)(Xi − µ̂r)ᵀ 1{Ỹi = r}.
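A sketch of these plug-in estimates computed from the noisy labels (my own code, reusing the simulated data above; scikit-learn's LinearDiscriminantAnalysis is a close analogue).

```python
import numpy as np

def lda_fit(X, y_tilde):
    """Plug-in LDA estimates based on the observed (possibly corrupted) labels."""
    n = len(y_tilde)
    pi_hat = np.array([np.mean(y_tilde == r) for r in (0, 1)])
    mu_hat = np.array([X[y_tilde == r].mean(axis=0) for r in (0, 1)])
    centred = X - mu_hat[y_tilde]                 # X_i minus the mean of its observed class
    Sigma_hat = centred.T @ centred / (n - 2)     # pooled covariance estimate
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(x, pi_hat, mu_hat, Sigma_hat):
    """Evaluate the fitted LDA rule at x (or at the rows of a matrix x)."""
    w = np.linalg.solve(Sigma_hat, mu_hat[1] - mu_hat[0])
    score = np.log(pi_hat[1] / pi_hat[0]) + (x - mu_hat.mean(axis=0)) @ w
    return (score >= 0).astype(int)

pi_hat, mu_hat, Sigma_hat = lda_fit(X, y_tilde)
risk_lda = np.mean(lda_predict(X_test, pi_hat, mu_hat, Sigma_hat) != y_test)
```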


LDA asymptotic analysis

Theorem. Assume we have ρ-homogeneous noise (ρ < 1/2) and suppose that Pr = Nd(µr, Σ), for r = 0, 1. Then

lim_{n→∞} CLDA(x) = 1 if c0 + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) > 0, and 0 if c0 + {x − (µ0 + µ1)/2}ᵀ Σ⁻¹ (µ1 − µ0) < 0,

where c0 can be expressed in terms of ∆² := (µ1 − µ0)ᵀ Σ⁻¹ (µ1 − µ0), ρ and π1. As a consequence,

lim_{n→∞} R(CLDA) = π0 Φ(c0/∆ − ∆/2) + π1 Φ(−c0/∆ − ∆/2) ≥ R(CBayes),   (1)

with equality if π0 = π1 = 1/2. Moreover, for each ρ ∈ (0, 1/2) and π0 ≠ π1, there is a unique value of ∆ > 0 for which we have equality in (1).


LDA with ρ-homogeneous noise

[Plot: Error against log(n) for LDA under increasing noise levels.]

Here, X | Y = r ∼ N5(µr, I5), where µ1 = (3/2, 0, . . . , 0)ᵀ = −µ0 ∈ R5, and π1 = 0.9.

No label noise (black); ρ-homogeneous noise for ρ = 0.1 (red), 0.2 (blue), 0.3 (green) and 0.4 (purple). The dotted lines show our asymptotic limit.


Summary

• The knn and SVM classifiers remain consistent with label noise under mild assumptions on the noise mechanism and data distribution.

• Under stronger conditions, the rate of convergence of the excess risk for these classifiers is preserved.

• However, the LDA classifier is typically not consistent, unless the class priors are equal (even with homogeneous noise).

Main reference:

• Cannings, T. I., Fan, Y. and Samworth, R. J. (2018) Classification with imperfect training labels. https://arxiv.org/abs/1805.11505.


Other references

• Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2018) Local nearest neighbour classification with applications to semi-supervised learning. https://arxiv.org/abs/1704.00642v2.

• Frénay, B. and Kabán, A. (2014) A comprehensive introduction to label noise. Proc. Euro. Symp. Artificial Neural Networks, 667–676.

• Frénay, B. and Verleysen, M. (2014) Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst., 25, 845–869.

• Ghosh, A., Manwani, N. and Sastry, P. S. (2015) Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.

• Lachenbruch, P. A. (1966) Discriminant analysis when the initial samples are misclassified. Technometrics, 8, 657–662.

• Okamoto, S. and Nobuhiro, Y. (1997) An average-case analysis of the k-nearest neighbor classifier for noisy domains. In Proc. 15th Int. Joint Conf. Artif. Intell., 1, 238–243.

• Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Theory, 51, 128–142.

• Steinwart, I. and Scovel, C. (2007) Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35, 575–607.
