

Nearest Neighbors I: Regression and Classification

Kamalika Chaudhuri

University of California, San Diego


Talk Outline

• Part I: k-Nearest neighbors: Regression and Classification

• Part II: k-Nearest neighbors (and other non-parametrics): Adversarial examples

k Nearest Neighbors

Given: training data (x1, y1), …, (xn, yn) in X × {0, 1}

Predict y for a query point x from the k closest neighbors of x among the xi

[Figure: a query point x and its k closest training points]

Examples:

k-NN classification: predict the majority label of the k closest neighbors

k-NN regression: predict the average label of the k closest neighbors
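As a concrete illustration (ours, not from the slides), here is a minimal NumPy sketch of both prediction rules; the function name knn_predict and its arguments are illustrative choices, not part of the talk.

    import numpy as np

    def knn_predict(X_train, y_train, x, k, mode="classification"):
        # Euclidean distances from the query point x to every training point.
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k nearest neighbors of x.
        nn = np.argsort(dists)[:k]
        if mode == "regression":
            return y_train[nn].mean()            # average label of the k neighbors
        return int(y_train[nn].mean() > 0.5)     # majority vote for labels in {0, 1}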

The Metric Space

Data points lie in a metric space with distance function d

Examples:

X = R^D with d = Euclidean distance
X = R^D with d = l_p distance
A metric based on user preferences

[Figure: two points x and x' at distance d(x, x')]

Notation

$X^{(i)}(x)$ = the i-th nearest neighbor of x among the training points

$Y^{(i)}(x)$ = the label of $X^{(i)}(x)$

[Figure: a query point x and its second nearest neighbor $X^{(2)}(x)$]

Tutorial Outline

• Nearest Neighbor Regression
  • The Setting
  • Universal Consistency
  • Rates of Convergence

• Nearest Neighbor Classification
  • The Statistical Learning Framework
  • Consistency
  • Rates of Convergence

NN Regression Setting

Compact metric space (X, d)

Uniform measure $\mu$ on X (for now)

Input: training data (x1, y1), …, (xn, yn), where

$x_i \sim \mu$ and $y_i = f(x_i) + \text{noise}$ for an unknown function f

k-NN Regressor: $\hat{f}_k(x) = \frac{1}{k} \sum_{i=1}^{k} Y^{(i)}(x)$

Universality

k-NN Regressor: $\hat{f}_k(x) = \frac{1}{k} \sum_{i=1}^{k} Y^{(i)}(x)$

Input: training data (x1, y1), …, (xn, yn), where $y_i = f(x_i) + \text{noise}$

What f can k-NN regression represent?

Answer: any f, provided k grows suitably with n

[Devroye, Gyorfi, Krzyzak, Lugosi, 1994]

More Formally…

$k_n$-NN regression: k grows with n

Theorem: If $k_n \to \infty$ and $k_n / n \to 0$, then for any f,

$\mathbb{E}_{X \sim \mu}\left[ \, |f(X) - \hat{f}_{k_n}(X)| \, \right] \to 0$ as $n \to \infty$

That is, $k_n$-NN regression is universally consistent.
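A quick simulation (ours, not from the slides) makes the statement concrete: with $k_n = \lceil \sqrt{n} \rceil$, which satisfies both conditions, the error of the k-NN regressor on a simple one-dimensional problem shrinks as n grows. It reuses the knn_predict sketch above.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x[:, 0])            # the "unknown" regression function

    for n in [100, 1000, 10000]:
        k = int(np.ceil(np.sqrt(n)))                      # k_n -> infinity, k_n / n -> 0
        X_train = rng.uniform(size=(n, 1))
        y_train = f(X_train) + 0.1 * rng.standard_normal(n)
        X_test = rng.uniform(size=(200, 1))
        preds = np.array([knn_predict(X_train, y_train, x, k, mode="regression")
                          for x in X_test])
        print(n, k, np.mean(np.abs(preds - f(X_test))))   # mean absolute error decreases with n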

Intuition: Universal Consistency

As n grows, the neighbors $X^{(i)}(x)$ move closer to x (for continuous $\mu$)

If $k_n$ is constant or grows slowly ($k_n / n \to 0$), then $X^{(i)}(x) \to x$ for all $i \le k_n$

If f is continuous, then $f(X^{(i)}(x)) \to f(x)$ for $1 \le i \le k_n$

If $k_n \to \infty$, the noise averages out:

$\frac{1}{k_n} \sum_{i=1}^{k_n} \left( f(X^{(i)}(x)) + \text{noise} \right) \to f(x)$

Finally, any f can be approximated arbitrarily well by continuous functions

Tutorial Outline (recap): next, rates of convergence for NN regression

Convergence Rates

Definition: f is L-Lipschitz if for all x and x',

$|f(x) - f(x')| \le L \cdot d(x, x')$

Theorem: If f is L-Lipschitz, then for $k_n = \Theta(n^{2/(2+D)})$ there exists a constant C such that

$\mathbb{E}_{x \sim \mu}\left[ \| \hat{f}_k(x) - f(x) \|^2 \right] \le C \, n^{-2/(2+D)}$   (D = data dimension)

$k_n = \Theta(n^{2/(2+D)})$ is the optimal choice of $k_n$

Better bounds hold for data of low intrinsic dimension [Kpotufe 2011]
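A back-of-the-envelope reading of the rate (our arithmetic, not on the slide): since the error scales as $n^{-2/(2+D)}$, halving the error requires growing the sample size by a factor of $2^{(2+D)/2}$.

$D = 2 \Rightarrow n^{-1/2}$ (about $4\times$ more data to halve the error), $\qquad D = 18 \Rightarrow n^{-1/10}$ (about $2^{10} \approx 1000\times$ more data).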


How fast is convergence?

- How small are k-NN distances?

- From distances to convergence rates

k-NN Distances

Given i.i.d. $x_1, \ldots, x_n \sim \mu$

Define $r_k(x) = d(x, X^{(k)}(x))$, the distance from x to its k-th nearest neighbor

How small is $r_k(x)$?

Let $B_x = \mathrm{Ball}(x, r_k(x))$

$\hat{\mu}(B_x) = k/n \approx \mu(B_x)$   (with high probability, for large k and n)

$\mu(B_x) = \int_{B_x} \mu(x')\, dx' \approx \mu(x) \int_{B_x} dx' \approx \mu(x)\, r_k(x)^D$

Therefore  $r_k(x) \approx \left( \frac{1}{\mu(x)} \cdot \frac{k}{n} \right)^{1/D}$   (D = data dimension)

The exponent 1/D is the curse of dimensionality: in high dimension, even the k-th nearest neighbor is far away unless n is very large.

Better bounds hold for data with low intrinsic dimension
[Kpotufe 2011], [Samworth 2012], [Costa and Hero 2004]
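A small experiment (ours, not from the slides) shows the effect: for n points uniform on $[0,1]^D$, the k-NN distance at a random query grows quickly with D. The last column is the heuristic $(k/n)^{1/D}$, which ignores boundary effects and is only indicative in high dimension, but it also climbs toward 1.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 2000, 10
    for D in [1, 2, 5, 10, 20]:
        X = rng.uniform(size=(n, D))                     # n points uniform on [0,1]^D
        q = rng.uniform(size=D)                          # one query point
        dists = np.sort(np.linalg.norm(X - q, axis=1))
        print(D, round(dists[k - 1], 3), round((k / n) ** (1 / D), 3))  # observed r_k vs. heuristic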


From Distances to Rates

1. Bias-Variance Decomposition

2. Bound Bias and Variance in terms of distances

3. Integrate over the space

Bias-Variance Decomposition

For a fixed x and fixed $\{x_i\}$, define

$\tilde{f}_k(x) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}\left[ Y^{(i)}(x) \mid \{x_i\} \right]$

Then:

$\mathbb{E}\left[ \| \hat{f}_k(x) - f(x) \|^2 \right] = \underbrace{\mathbb{E}\left[ \| \tilde{f}_k(x) - f(x) \|^2 \right]}_{\text{Bias}} + \underbrace{\mathbb{E}\left[ \| \hat{f}_k(x) - \tilde{f}_k(x) \|^2 \right]}_{\text{Variance}}$
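Why the cross term vanishes (a standard step not spelled out on the slide): $\tilde{f}_k(x)$ and $f(x)$ are determined by $\{x_i\}$, while $\hat{f}_k(x) - \tilde{f}_k(x)$ has conditional mean zero given $\{x_i\}$, so

$\mathbb{E}\left[ \left( \tilde{f}_k(x) - f(x) \right)\left( \hat{f}_k(x) - \tilde{f}_k(x) \right) \right] = \mathbb{E}\left[ \left( \tilde{f}_k(x) - f(x) \right) \, \mathbb{E}\left[ \hat{f}_k(x) - \tilde{f}_k(x) \mid \{x_i\} \right] \right] = 0.$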

Bounding Bias and Variance

Bounding the bias: for any x,

$\| \tilde{f}_k(x) - f(x) \|^2 \le \left( \frac{1}{k} \sum_{i=1}^{k} |f(x) - f(X^{(i)}(x))| \right)^2$

$\le \left( L \cdot d(x, X^{(k)}(x)) \right)^2$   (by Lipschitzness)

$\lesssim \left( \frac{k}{n} \right)^{2/D}$   (from the k-NN distance bound)

Bounding the variance:

$\mathbb{E}\left[ \| \hat{f}_k(x) - \tilde{f}_k(x) \|^2 \right] = \mathbb{E}\left[ \left( \frac{1}{k} \sum_{i=1}^{k} \left( Y^{(i)}(x) - \mathbb{E}[Y^{(i)}(x)] \right) \right)^2 \right] = \frac{\sigma_Y^2}{k}$

Integrating across the space

$\mathbb{E}\left[ \| \hat{f}_k(x) - f(x) \|^2 \right] = \underbrace{\mathbb{E}\left[ \| \tilde{f}_k(x) - f(x) \|^2 \right]}_{\text{Bias}} + \underbrace{\mathbb{E}\left[ \| \hat{f}_k(x) - \tilde{f}_k(x) \|^2 \right]}_{\text{Variance}} \;\lesssim\; \frac{1}{k} + \left( \frac{k}{n} \right)^{2/D}$

Optimizing over k gives $k_n = \Theta(n^{2/(2+D)})$,

which leads to $\mathbb{E}\left[ \| \hat{f}_k(x) - f(x) \|^2 \right] \lesssim n^{-2/(2+D)}$

The bound is optimal; better rates hold for data of low intrinsic dimension.
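A one-line check of the optimization (our arithmetic): balancing the two terms by setting $1/k = (k/n)^{2/D}$ gives

$k^{1 + 2/D} = n^{2/D} \;\Longrightarrow\; k = n^{2/(2+D)}, \qquad \text{and then}\ \ \frac{1}{k} = n^{-2/(2+D)},$

which matches both the optimal $k_n$ and the rate above.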

Tutorial Outline (recap): next, nearest neighbor classification and the statistical learning framework

Nearest Neighbor Classification

Given: training data (x1, y1), …, (xn, yn) in X × {0, 1}

Predict the majority label of the k points closest to the query point x

Let $h_{n,k}$ denote the k-NN classifier trained on n points:

$h_{n,k}(x) = 0$ if $\frac{1}{k} \sum_{i=1}^{k} Y^{(i)}(x) \le \frac{1}{2}$, and $h_{n,k}(x) = 1$ otherwise
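For example (ours): with k = 5 and neighbor labels 1, 0, 1, 1, 0, the average is 3/5 > 1/2, so $h_{n,5}(x) = 1$; this is exactly the majority vote, and it is what the classification mode of the knn_predict sketch above computes.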

The Statistical Learning Framework

Metric space (X, d)

Underlying measure $\mu$ on X from which points are drawn

The label of x is a coin flip with bias $\eta(x) = \Pr(y = 1 \mid x)$

Risk (error) of a classifier h: $R(h) = \Pr(h(X) \ne Y)$

Goal: find h that minimizes risk, i.e. maximizes accuracy, where Accuracy(h) = 1 - R(h)

The Bayes Optimal Classifier

The Bayes optimal classifier h minimizes the risk:

$h(x) = 0$ if $\eta(x) \le 1/2$, and $h(x) = 1$ otherwise

Its risk is $R^* = R(h) = \mathbb{E}_X\left[ \min(\eta(X), 1 - \eta(X)) \right]$
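A toy calculation (ours, not from the slides): if $\eta(x) = 0.9$ on half of X (under $\mu$) and $\eta(x) = 0.5$ on the other half, the Bayes classifier predicts 1 on the first half and is indifferent on the second, and $R^* = 0.5 \cdot 0.1 + 0.5 \cdot 0.5 = 0.3$.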

Tutorial Outline (recap): next, consistency of NN classification

Consistency

Does $R(h_{n,k})$ converge to $R^*$ as n goes to infinity?

Consistency of 1-NN

Assume: $\eta$ is continuous and $\mu$ is absolutely continuous

Then $R(h_{n,1}) \to \mathbb{E}_X\left[ 2\eta(X)(1 - \eta(X)) \right] \ne R^*$

So 1-NN is inconsistent, and k-NN for any constant k is also inconsistent

[Cover and Hart, 1967]

Proof Intuition

Assume: $\eta$ is continuous and $\mu$ is absolutely continuous

For any x, $X^{(1)}(x)$ converges to x

By continuity, $\eta(X^{(1)}(x)) \to \eta(x)$

$\Pr(Y^{(1)}(x) \ne y) = \eta(x)\left(1 - \eta(X^{(1)}(x))\right) + \eta(X^{(1)}(x))\left(1 - \eta(x)\right) \to 2\eta(x)(1 - \eta(x))$

Thus $R(h_{n,1}) \to \mathbb{E}_X\left[ 2\eta(X)(1 - \eta(X)) \right] \ne R^*$
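A simulation (ours, not from the slides) of this limit: with constant $\eta(x) = 0.3$, the Bayes risk is 0.3, but the 1-NN error concentrates near $2 \cdot 0.3 \cdot 0.7 = 0.42$.

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 0.3                                            # constant eta(x); Bayes risk R* = 0.3
    n, m = 2000, 2000
    X_train = rng.uniform(size=n)
    y_train = (rng.uniform(size=n) < eta).astype(int)
    X_test = rng.uniform(size=m)
    y_test = (rng.uniform(size=m) < eta).astype(int)
    nn = np.abs(X_test[:, None] - X_train[None, :]).argmin(axis=1)   # 1-NN index for each test point
    err = np.mean(y_train[nn] != y_test)
    print(err, 2 * eta * (1 - eta))                      # empirical 1-NN error vs. 0.42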

Consistency under Continuity

Theorem: Assume $\eta$ is continuous. If $k_n \to \infty$ and $k_n / n \to 0$, then

$R(h_{n,k_n}) \to R^*$ as $n \to \infty$

[Fix and Hodges 1951, Stone 1977, Cover and Hart 1965, 1967, 1968]

Proof Intuition

$X^{(1)}(x), \ldots, X^{(k_n)}(x)$ lie in a ball around x of probability mass $\approx k_n / n$

Since $k_n / n \to 0$, this ball shrinks: $X^{(1)}(x), \ldots, X^{(k_n)}(x) \to x$

By continuity, $\eta(X^{(1)}(x)), \ldots, \eta(X^{(k_n)}(x)) \to \eta(x)$

As $k_n$ grows, $\frac{1}{k_n} \sum_{i=1}^{k_n} Y^{(i)}(x) \to \eta(x)$

Universal Consistency in Metric Spaces

Theorem: Let (X, d, $\mu$) be a separable metric measure space in which the Lebesgue differentiation property holds: for any bounded measurable f,

$\lim_{r \downarrow 0} \frac{1}{\mu(B(x, r))} \int_{B(x, r)} f \, d\mu = f(x)$   for $\mu$-almost every x in X

If $k_n \to \infty$ and $k_n / n \to 0$, then $R(h_{n,k_n}) \to R^*$ in probability

If in addition $k_n / \log n \to \infty$, then $R(h_{n,k_n}) \to R^*$ almost surely

[Chaudhuri and Dasgupta, 2014]

Universal Consistency in Metric Spaces: Proof Idea

Since $k_n / n \to 0$, $X^{(1)}(x), \ldots, X^{(k_n)}(x) \to x$

The earlier continuity argument used $\eta(X^{(1)}(x)), \ldots, \eta(X^{(k_n)}(x)) \to \eta(x)$

In fact it suffices that $\mathrm{avg}\left( \eta(X^{(1)}(x)), \ldots, \eta(X^{(k_n)}(x)) \right) \to \eta(x)$

$X^{(1)}(x), \ldots, X^{(k_n)}(x)$ lie in some ball B(x, r); for suitable r, they are random draws from $\mu$ restricted to B(x, r)

So $\mathrm{avg}\left( \eta(X^{(1)}(x)), \ldots, \eta(X^{(k_n)}(x)) \right)$ is close to the average of $\eta$ over B(x, r)

As n grows this ball shrinks, so it is enough that

$\lim_{r \downarrow 0} \frac{1}{\mu(B(x, r))} \int_{B(x, r)} \eta \, d\mu = \eta(x)$,

which is exactly the Lebesgue differentiation property.

Tutorial Outline (recap): next, rates of convergence for NN classification

Main Idea in Prior Analysis

Smoothness of $\mu$ $\Rightarrow$ small $r_k(x)$

Lipschitzness of $\eta$ $\Rightarrow$ $\eta(X^{(k)}(x)) \approx \eta(x)$

But neither smoothness nor Lipschitzness actually matters!

[Chaudhuri and Dasgupta, 2014]

A Motivating Example

Property of interest: balls of probability mass approximately k/n around points x that are close to the decision boundary

Some Notation

Probability radius: $r_p(x) = \inf\{ r \mid \mu(B(x, r)) \ge p \}$, the smallest radius at which the ball $B(x, r_p(x))$ has probability mass at least p

Conditional probability of a set A: $\eta(A) = \frac{1}{\mu(A)} \int_A \eta \, d\mu$

Effective Interiors and Boundaries

Positive interior:

$X^+_{p,\Delta} = \left\{ x \;\middle|\; \eta(x) \ge 1/2 \ \text{and}\ \eta(B(x, r)) \ge 1/2 + \Delta \ \text{for all}\ r \le r_p(x) \right\}$

The negative interior $X^-_{p,\Delta}$ is defined analogously

$(p, \Delta)$-interior: $X^+_{p,\Delta} \cup X^-_{p,\Delta}$

$(p, \Delta)$-boundary: $\partial_{p,\Delta} = X \setminus \left( X^+_{p,\Delta} \cup X^-_{p,\Delta} \right)$

[Figure: the $(p, \Delta)$-interior and the effective decision boundary $\partial_{p,\Delta}$]

Convergence Rate Theorem

Theorem: The risk $R_{n,k}$ of the k-NN classifier based on n training examples satisfies

$R_{n,k} \le R^* + \delta + \mu(\partial_{p,\Delta})$

for

$p = \frac{k}{n} \cdot \frac{1}{1 - \sqrt{(4/k) \log(2/\delta)}}$   and   $\Delta = \min\left( \frac{1}{2}, \sqrt{\frac{\log(2/\delta)}{k}} \right)$
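To make the constants concrete (our arithmetic, not on the slide): with $k = 1000$ and $\delta = 0.01$, $\log(2/\delta) \approx 5.3$, so $p \approx 1.17\,k/n$ and $\Delta \approx 0.073$, and the bound reads $R_{n,k} \le R^* + 0.01 + \mu(\partial_{1.17 k/n,\ 0.073})$.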

Proof Intuition 1

For a fixed x, let $B = B(x, r_p(x))$

If $h_{n,k}(x) \ne h(x)$, then one of the following holds:

1. $x \in \partial_{p,\Delta}$

2. $d(x, X^{(k)}(x)) > r_p(x)$   (the k-th nearest neighbor of x falls outside B)

3. $\left| \frac{1}{|B|} \sum_i Y_i \cdot \mathbf{1}(X_i \in B) - \eta(B) \right| \ge \Delta$   (the average label of the training points in B is far from $\eta(B)$)

Why: if (1) does not hold, say with $\eta(x) \ge 1/2$, then $\eta(B) \ge 1/2 + \Delta$; so for the k-NN vote at x to come out wrong, either the k-th nearest neighbor of x lies outside B, or (3) holds.

Proof Intuition 2

If $p = \frac{k}{n} \cdot \frac{1}{1 - \sqrt{(4/k) \log(2/\delta)}}$, then the probability of (2) is at most $\delta / 2$   (Chernoff bound)

Proof Intuition 3

If $\Delta = \min\left( \frac{1}{2}, \sqrt{\frac{\log(2/\delta)}{k}} \right)$, then the probability of (3) is at most $\delta / 2$   (Chernoff bound)

Putting it all together…

The risk $R_{n,k}$ of the k-NN classifier based on n training examples satisfies

$R_{n,k} \le \Pr(h(x) \ne y) + \Pr(x \in \partial_{p,\Delta}) + \Pr(2.) + \Pr(3.)$

$\Pr(h(x) \ne y) = R^*$   (by definition)

$\Pr(x \in \partial_{p,\Delta}) = \mu(\partial_{p,\Delta})$

With p and $\Delta$ chosen as above, $\Pr(2.) + \Pr(3.) \le \delta$,

which yields the Convergence Rate Theorem: $R_{n,k} \le R^* + \delta + \mu(\partial_{p,\Delta})$

Smoothness

$\eta$ is $\alpha$-Hölder continuous if, for a constant L and all x, x',

$|\eta(x) - \eta(x')| \le L \, \|x - x'\|^{\alpha}$

Margin condition: for a constant C and any t,

$\mu\left( \{ x \mid |\eta(x) - 1/2| \le t \} \right) \le C \, t^{\beta}$

Under these two conditions, plus the assumption that $\mu$ is supported on a regular set with $\mu_{\min} \le \mu \le \mu_{\max}$,

$\mathbb{E}[R] - R^*$ is $\Theta\left( n^{-\alpha(\beta+1)/(2\alpha + d)} \right)$

This rate is also achieved by k-NN for a suitable choice of k

A Better Smoothness Condition

A more natural notion relates the smoothness of $\eta$ to the probability mass of balls, $\mu(B(x, \|x - x'\|))$, rather than to the distance $\|x - x'\|$ itself:

$\eta$ is $\alpha$-smooth if, for some constant L, for all x and all r > 0,

$|\eta(x) - \eta(B(x, r))| \le L \, \mu(B(x, r))^{\alpha}$

Smoothness Bounds

Suppose $\eta$ is $\alpha$-smooth. Then for any n and k, with probability at least $1 - \delta$,

$\Pr(h_{n,k}(X) \ne h(X)) \le \delta + \mu\left( \left\{ x \,\middle|\, |\eta(x) - 1/2| \le C_1 \sqrt{\frac{1}{k} \log \frac{1}{\delta}} \right\} \right)$

(the best rates follow for $k \propto n^{2\alpha/(2\alpha+1)}$)

Lower bound: with constant probability,

$\Pr(h_{n,k}(X) \ne h(X)) \ge C_2 \, \mu\left( \left\{ x \,\middle|\, |\eta(x) - 1/2| \le C_3 \sqrt{\frac{1}{k}} \right\} \right)$

Implications

1. Recovers the previous bounds for smooth functions under margin conditions

2. Faster rates in special cases:
   - Zero Bayes risk: 1-NN achieves the best rates
   - $|\eta - 1/2|$ bounded away from 0: exponential convergence

Conclusion

1. $k_n$-NN is universally consistent provided $k_n$ grows suitably with n ($k_n \to \infty$, $k_n / n \to 0$)

2. k-NN regression suffers from the curse of dimensionality

3. k-NN classification does too, but it can do better

Acknowledgements

Thanks to Sanjoy Dasgupta and Samory Kpotufe. A chunk of this talk is based on a tutorial from ICML 2018 by Sanjoy and Samory.