Advanced Introduction to Machine Learning
— Spring Quarter, Week 7 —
https://canvas.uw.edu/courses/1372141
Prof. Jeff Bilmes
University of Washington, Seattle
Departments of: Electrical & Computer Engineering, Computer Science & Engineering
http://melodi.ee.washington.edu/~bilmes
May 11th/13th, 2020
Prof. Jeff Bilmes, EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020, F1/57 (pg.1/188)
Logistics Review
Announcements
HW3 is due May 15th, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).
Virtual office hours this week, Thursday night at 10:00pm via Zoom (same link as class).
Class Road Map
W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future
Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).
Class (and Machine Learning) overview
1. Introduction
• What is ML
• What is AI
• Why are we so interested in these topics right now?
2. ML Paradigms/Concepts
• Overfitting/Underfitting, model complexity, bias/variance
• size of data, big data, sample complexity
• ERM, loss + regularization, loss functions, regularizers
• supervised, unsupervised, and semi-supervised learning
• reinforcement learning, RL, multi-agent, planning/control
• transfer and multi-task learning
• federated and distributed learning
• active learning, machine teaching
• self-supervised, zero/one-shot, open-set learning
3. Dealing with Features
• dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP
• locality sensitive hashing (LSH)
• feature selection
• feature engineering
• matrix factorization & feature engineering
• representation learning
4. Evaluation
• accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin
• train/eval/test data splits
• n-fold cross validation
• method of the bootstrap
6. Inference Methods
• probabilistic inference
• MLE, MAP
• belief propagation
• forward/backpropagation
• Monte Carlo methods
7. Models & Representation
• linear least squares, linear regression, logistic regression, sparsity, ridge, lasso
• generative vs. discriminative models
• Naive Bayes
• k-nearest neighbors
• clustering, k-means, k-medoids, EM & GMMs, single linkage
• decision trees and random forests
• support vector machines, kernel methods, max margin
• perceptron, neural networks, DNNs
• Gaussian processes
• Bayesian nonparametric methods
• ensemble methods
• the bootstrap, bagging, and boosting
• graphical models
• time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers
• structured prediction
• grammars (as in NLP)
12. Other Techniques
• compressed sensing
• submodularity, diversity/homogeneity modeling
8. Philosophy, Humanity, Spirituality
• artificial intelligence (AI)
• artificial general intelligence (AGI)
• artificial intelligence vs. science fiction
9. Applications
• computational biology
• social networks
• computer vision
• speech recognition
• natural language processing
• information retrieval
• collaborative filtering/matrix factorization
10. Programming
• python
• libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.)
• HPC: C/C++, CUDA, vector processing
11. Background
• linear algebra
• multivariate calculus
• probability theory and statistics
• information theory
• mathematical (e.g., convex) optimization
The marginal p(x1, x2) via variable elimination, first eliminating in the order x6, x3, x4, x5:

\begin{align*}
p(x_1, x_2) &= \sum_{x_3} \sum_{x_4} \cdots \sum_{x_6} p(x_1, x_2, \dots, x_6)\\
&= \sum_{x_3} \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5) \underbrace{\sum_{x_6} \psi(x_2, x_6)}_{\phi_{\bar{6},2}(x_2)}\\
&= \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_3} \psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5)}_{\phi_{\bar{3},1,4,5}(x_1, x_4, x_5)}\\
&= \sum_{x_5} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_4} \phi_{\bar{3},1,4,5}(x_1, x_4, x_5)}_{\phi_{\bar{3},\bar{4},1,5}(x_1, x_5)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_5} \phi_{\bar{3},\bar{4},1,5}(x_1, x_5)}_{\phi_{\bar{3},\bar{4},\bar{5},1}(x_1)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\phi_{\bar{3},\bar{4},\bar{5},1}(x_1)
\end{align*}

Eliminating instead in the order x6, x5, x4, x3:

\begin{align*}
p(x_1, x_2) &= \sum_{x_3} \sum_{x_4} \cdots \sum_{x_6} p(x_1, x_2, \dots, x_6)\\
&= \sum_{x_3} \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5) \underbrace{\sum_{x_6} \psi(x_2, x_6)}_{\phi_{\bar{6},2}(x_2)}\\
&= \sum_{x_3} \sum_{x_4} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4) \underbrace{\sum_{x_5} \psi(x_3, x_5)}_{\phi_{\bar{5},3}(x_3)}\\
&= \sum_{x_3} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\psi(x_1, x_3)\,\phi_{\bar{5},3}(x_3) \underbrace{\sum_{x_4} \psi(x_3, x_4)}_{\phi_{\bar{4},3}(x_3)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_3} \psi(x_1, x_3)\,\phi_{\bar{5},3}(x_3)\,\phi_{\bar{4},3}(x_3)}_{\phi_{\bar{5},\bar{4},\bar{3},1}(x_1)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\phi_{\bar{5},\bar{4},\bar{3},1}(x_1)
\end{align*}
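The two elimination orders can be checked numerically. A minimal sketch, assuming arbitrary random pairwise potentials ψ on the tree edges (x1,x2), (x1,x3), (x3,x4), (x3,x5), (x2,x6), with r = 3 values per variable (all names and sizes here are illustrative): it compares the cheap elimination order x6, x5, x4, x3 against brute-force summation.

```python
import itertools
import random

random.seed(0)
r = 3  # each variable takes r values

# Pairwise potentials psi on the tree edges (x1,x2), (x1,x3), (x3,x4),
# (x3,x5), (x2,x6); variables are 0-indexed below (x1 -> 0, ..., x6 -> 5).
edges = [(0, 1), (0, 2), (2, 3), (2, 4), (1, 5)]
psi = {e: [[random.random() + 0.1 for _ in range(r)] for _ in range(r)]
       for e in edges}

def joint(x):
    # Unnormalized joint: the product of the edge potentials.
    p = 1.0
    for (i, j) in edges:
        p *= psi[(i, j)][x[i]][x[j]]
    return p

# Brute force: sum over x3..x6 for every value of (x1, x2): O(r^4) terms.
brute = [[sum(joint((a, b) + t) for t in itertools.product(range(r), repeat=4))
          for b in range(r)] for a in range(r)]

# Variable elimination in the order x6, x5, x4, x3: every step is O(r^2).
phi_62 = [sum(psi[(1, 5)][x2][x6] for x6 in range(r)) for x2 in range(r)]
phi_53 = [sum(psi[(2, 4)][x3][x5] for x5 in range(r)) for x3 in range(r)]
phi_43 = [sum(psi[(2, 3)][x3][x4] for x4 in range(r)) for x3 in range(r)]
phi_1 = [sum(psi[(0, 2)][x1][x3] * phi_53[x3] * phi_43[x3] for x3 in range(r))
         for x1 in range(r)]
elim = [[psi[(0, 1)][a][b] * phi_62[b] * phi_1[a] for b in range(r)]
        for a in range(r)]

# Both computations agree on the (unnormalized) marginal of (x1, x2).
assert all(abs(brute[a][b] - elim[a][b]) < 1e-9
           for a in range(r) for b in range(r))
```

The brute-force sum touches r^4 joint configurations per (x1, x2) pair, while each elimination step only ever sums an r × r table, which is the point of the complexity comparison.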
[Figure: for each variable to eliminate, the graphical transformation (reconstituted graph) and the corresponding marginalization operation, with its complexity. Eliminating x6, x3, x4, x5 costs O(r^2), O(r^4), O(r^3), O(r^2) per step; eliminating x6, x5, x4, x3 costs O(r^2) at every step.]
[Figure: a deep neural network with an input layer, hidden layers 1-7, and an output unit.]
5. Optimization Methods
• Unconstrained Continuous Optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton
• Constrained Continuous Optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming
• Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization
Using 2D to represent High-D as if 2D were High-D

Relationship between the unit-radius (r = 1) sphere and the unit-volume (side length = 1) cube as the dimension grows, drawn as if it were also true in 2D.
Note that for the cube, as m gets higher, the distance from center to face stays at 1/2, but the distance from center to vertex grows as √m/2.
For m = 4 the vertex distance is √m/2 = 1.
[Figure: the unit-radius sphere and the unit-volume cube, with side length 1, center-to-face distance 1/2, and center-to-vertex distances √2/2 (for m = 2) and √m/2 in general; for large m, nearly all the cube's volume lies near its vertices, outside the unit-radius sphere.]
Illustration of the relationship between the sphere and the cube in 2, 4, and m dimensions (from Blum, Hopcroft, & Kannan, 2016).
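The distances above are easy to verify, and a small Monte Carlo estimate (a sketch; the helper names and sample counts are illustrative) also shows how little of the unit-volume cube the unit-radius sphere captures once m is large.

```python
import math
import random

def vertex_distance(m):
    # Center-to-vertex distance of the unit-volume cube: sqrt(m * (1/2)^2) = sqrt(m)/2.
    return math.sqrt(m) / 2.0

def fraction_inside_sphere(m, n_samples=100_000, seed=0):
    # Monte Carlo estimate of the fraction of the unit-volume cube
    # [-1/2, 1/2]^m that lies inside the unit-radius sphere.
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        if sum(rng.uniform(-0.5, 0.5) ** 2 for _ in range(m)) <= 1.0:
            inside += 1
    return inside / n_samples

# Center-to-face distance is always 1/2; the vertex distance grows as sqrt(m)/2,
# so for m = 4 it is exactly 1.
assert vertex_distance(4) == 1.0
```

For m ≤ 4 the whole cube fits inside the unit-radius sphere, while for large m (say m = 30) the estimated fraction inside is close to 0: nearly all of the cube's volume lies near its vertices, outside the sphere.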
The Blessing of Dimensionality
More dimensions can help to distinguish categories, and make pattern recognition easier (and even possible). Recall the Voronoi tessellation.
Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!!!
For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.
Feature Selection vs. Dimensionality Reduction
We've already seen that feature selection is one strategy to reduce feature dimensionality.
In feature selection, each feature is either selected or not: all or nothing.
Other dimensionality reduction strategies take the input x ∈ ℝ^m and encode each x into e(x) ∈ ℝ^{m′}, a lower dimensional space, m′ < m.
Core advantage: if U = {1, 2, . . . , m}, there may be no subset A ⊆ U such that x_A will work well, but there might be a combination e(x) that works well. On the right, the simple linear combination a₁x₁ + a₂x₂ works quite well.
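This core advantage can be made concrete with a toy example (entirely synthetic data; the threshold-at-zero classifier is an illustrative simplification): neither coordinate alone separates the classes well, but the linear combination x₁ + x₂ separates them perfectly.

```python
import random

random.seed(0)

# Two classes that no single coordinate separates well, but x1 + x2 does.
data = []
for _ in range(200):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x1 + x2) < 0.2:
        continue  # leave a margin so the separation is clean
    data.append(((x1, x2), 1 if x1 + x2 > 0 else 0))

def accuracy(score):
    # Accuracy of the threshold-at-zero classifier on a 1-D encoding.
    return sum((score(x) > 0) == (y == 1) for x, y in data) / len(data)

acc_x1 = accuracy(lambda x: x[0])            # feature selection: keep x1 only
acc_combo = accuracy(lambda x: x[0] + x[1])  # linear combination e(x) = x1 + x2
```

Here acc_combo is exactly 1.0 by construction, while keeping only x₁ misclassifies roughly a quarter of the points: no singleton subset A works, but a one-dimensional linear encoding does.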
PCA: Minimizing reconstruction error, m′ < m

Let u_i, i ∈ [m], be a set of orthonormal vectors in ℝ^m.
Basis decomposition: any sample x^{(i)} can be written as

x^{(i)} = \sum_{j=1}^{m} \alpha_{i,j} u_j = \sum_{j=1}^{m} \langle x^{(i)}, u_j \rangle u_j.

Suppose we decide to use only m′ < m dimensions, so

\hat{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j.

Reconstruction error:

J_{m'} = \frac{1}{n} \sum_{i=1}^{n} \| x^{(i)} - \hat{x}^{(i)} \|_2^2 \qquad (7.1)

To minimize this, it can be shown that we should choose u_i to be the eigenvector of S corresponding to the i-th largest eigenvalue.
Let W be the matrix of column eigenvectors of S sorted decreasingly by eigenvalue, and W_{m′} = W(:, 1:m′) the m × m′ matrix of its first m′ columns.
Then XW_{m′} projects down to the first m′ principal components, and is also widely known as the Karhunen-Loève transform (KLT).
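A small numerical sketch of this claim (with NumPy, taking S to be the empirical covariance X^T X / n of centered data; the synthetic data and scales are made up): projecting onto the top m′ eigenvectors of S gives reconstruction error J_{m′} equal to the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data with most variance along a few directions.
n, m, m_prime = 500, 5, 2
X = rng.normal(size=(n, m)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])
X = X - X.mean(axis=0)

# Covariance S and its eigendecomposition, sorted by decreasing eigenvalue.
S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order]

# Keep the first m' columns: project down, then reconstruct.
W_mp = W[:, :m_prime]
X_hat = X @ W_mp @ W_mp.T

# J_{m'} from (7.1) equals the sum of the discarded eigenvalues of S.
J = np.mean(np.sum((X - X_hat) ** 2, axis=1))
assert np.isclose(J, eigvals[order][m_prime:].sum())
```

The identity holds because the residual lies in the span of the discarded eigenvectors, so its average squared norm is the trace of S restricted to that subspace.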
Example: 2D and two principal directions
PCA vs. LDA
From Bishop 2006
PCA vs. Linear Discriminant Analysis (LDA).
Axis parallel projection vs. LDA
From Hastie et al., 2009
Axis parallel (left) vs. LDA (right).
Linear Discriminant Analysis (LDA), 2 classes
Consider class-conditional Gaussian data, so p(x|y) = N(x | μ_y, C_y) for mean vectors {μ_y}_y and covariance matrices {C_y}_y, x ∈ ℝ^m.

p(x|y) = \frac{1}{(2\pi)^{m/2} |C_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^\top C_y^{-1} (x - \mu_y) \right) \qquad (7.2)

In the two-class case y ∈ {0, 1} with equal covariances C_0 = C_1 = C (the homoscedasticity property) and priors p(y = 0) = p(y = 1), consider the log posterior odds ratio:

\log \frac{p(y = 1|x)}{p(y = 0|x)} = -\frac{1}{2} (x - \mu_1)^\top C_1^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^\top C_0^{-1} (x - \mu_0) \qquad (7.3)
= (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \frac{1}{2}\mu_0^\top C^{-1}\mu_0 - \frac{1}{2}\mu_1^\top C^{-1}\mu_1 \qquad (7.4)
= \theta^\top x + c \qquad (7.5)

θ is a projection (an m × 1 matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.
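Equations (7.4) and (7.5) translate directly into code. A sketch on synthetic homoscedastic Gaussian data (the particular μ's and C below are made up for illustration): with equal priors, classify by the sign of θ^T x + c.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two class-conditional Gaussians sharing a covariance C (homoscedastic).
mu0 = np.zeros(3)
mu1 = np.array([2.0, 1.0, -1.0])
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])

n = 2000
X0 = rng.multivariate_normal(mu0, C, size=n)
X1 = rng.multivariate_normal(mu1, C, size=n)

# LDA projection and offset, from equations (7.4)-(7.5).
Cinv = np.linalg.inv(C)
theta = Cinv @ (mu1 - mu0)
c = 0.5 * mu0 @ Cinv @ mu0 - 0.5 * mu1 @ Cinv @ mu1

# With equal priors, classify by the sign of the log odds theta^T x + c.
correct0 = np.sum(X0 @ theta + c <= 0)
correct1 = np.sum(X1 @ theta + c > 0)
accuracy = (correct0 + correct1) / (2 * n)
```

The one-dimensional score X @ theta + c is all the classifier ever looks at, which is the "sufficient for prediction" claim in the slide.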
Linear Discriminant Analysis (LDA), ℓ classes

Within-class scatter matrix:

S_W = \sum_{c=1}^{\ell} S_c \quad \text{where} \quad S_c = \sum_{i:\, i \text{ is class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top \qquad (7.4)

and μ_c is the class-c mean, μ_c = (1/n_c) \sum_{i \text{ is of class } c} x^{(i)}.
Between-class scatter matrix, with n_c = number of samples of class c:

S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (7.5)

First projection: maximize the ratio of variances (recall PCA), a_1 ∈ argmax_{a ∈ ℝ^m} (a^⊤ S_B a)/(a^⊤ S_W a). Second orthogonal to the first, a_2 ∈ argmax_{a ∈ ℝ^m : ⟨a, a_1⟩ = 0} (a^⊤ S_B a)/(a^⊤ S_W a), and so on.
In general, the directions of greatest variance of the matrix S_W^{-1} S_B maximize the between-class spread relative to the within-class spread. I.e., we can do PCA on S_W^{-1} S_B to get the LDA reduction.
LDA dimensionality reduction: find an ℓ′ × m matrix linear transform on x, projecting to the subspace corresponding to the ℓ′ greatest eigenvalues of S_W^{-1} S_B.
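A sketch of the "PCA on S_W^{-1} S_B" recipe on synthetic three-class data (the class means, counts, and ℓ′ = 2 below are illustrative). Note that S_B has rank at most ℓ - 1, so at most ℓ - 1 = 2 eigenvalues are nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three classes in R^4 whose means differ only in the first two coordinates.
means = [np.array([0.0, 0.0, 0.0, 0.0]),
         np.array([3.0, 0.0, 0.0, 0.0]),
         np.array([0.0, 3.0, 0.0, 0.0])]
X = np.vstack([rng.normal(mu, 1.0, size=(100, 4)) for mu in means])
y = np.repeat([0, 1, 2], 100)

# Within- and between-class scatter matrices, equations (7.4) and (7.5).
mu_all = X.mean(axis=0)
S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for c in range(3):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_B += len(Xc) * np.outer(mu_c - mu_all, mu_c - mu_all)

# "PCA on S_W^{-1} S_B": top eigenvectors give the LDA directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
A = eigvecs[:, order[:2]].real  # l' = 2 projection matrix (m x l')
Z = X @ A                       # reduced (n x l') representation

# S_B has rank at most l - 1 = 2, so the remaining eigenvalues vanish.
assert eigvals.real[order][2] < 1e-6 * eigvals.real[order][0]
```

Since S_W^{-1} S_B is not symmetric, np.linalg.eig is used here rather than eigh; its eigenvalues are real and nonnegative for this pencil, up to floating-point noise.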
Uniformly at random unit vectors on the unit sphere

Let S^{m-1}(r) indicate the surface "area" of the m-dimensional sphere of radius r.
Define the uniform distribution U_{S^{m-1}(1)} on S^{m-1}(1) and independently draw two random vectors x, y ~ U_{S^{m-1}(1)}, hence ‖x‖₂ = ‖y‖₂ = 1.
Two vectors are orthogonal if ⟨x, y⟩ = 0 and are nearly so if |⟨x, y⟩| < ε for small ε.
It can be shown that if x, y are independent random vectors uniformly distributed on the m-dimensional sphere, then:

\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \qquad (7.11)

which means that uniformly at random vectors in high dimensional space are almost always nearly orthogonal!
One of the reasons why high dimensional random projections preserve information: if two random high dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
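Bound (7.11) is easy to check by simulation (a sketch; the helper names and trial counts are illustrative). A uniform point on the sphere is obtained by normalizing an i.i.d. Gaussian vector.

```python
import math
import random

def random_unit_vector(m, rng):
    # Normalizing an i.i.d. Gaussian vector gives a uniform point on S^{m-1}(1).
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def near_orthogonal_fraction(m, eps, trials=1000, seed=0):
    # Monte Carlo estimate of Pr(|<x, y>| < eps) for independent uniform x, y.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = random_unit_vector(m, rng)
        y = random_unit_vector(m, rng)
        if abs(sum(a * b for a, b in zip(x, y))) < eps:
            hits += 1
    return hits / trials

m, eps = 1000, 0.1
frac = near_orthogonal_fraction(m, eps)
bound = 1 - math.exp(-m * eps * eps / 2)  # right-hand side of (7.11)
```

For m = 1000 and ε = 0.1 the bound is 1 - e^{-5}, and the estimated fraction should exceed it up to Monte Carlo noise: almost every pair is nearly orthogonal.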
k-NN classifier, multiclass data
Let D = {(x^{(i)}, y^{(i)})}_{i=1}^{n} be a training set of n samples (as always).
Let N_k(x, D) ⊆ D be the subset of D of size k consisting of the k nearest neighbors of x. That is, |N_k(x, D)| = k, and for any (x^{(i)}, y^{(i)}) ∈ N_k(x, D) we have that d(x^{(i)}, x) is no more than the distance from x to the k-th nearest data point in D, under distance d(·, ·).
We can then estimate a posterior probability:

p_{kNN}(y = c \,|\, x) = \frac{1}{k} \sum_{(x^{(i)}, y^{(i)}) \in N_k(x, D)} 1\{y^{(i)} = c\} \qquad (7.15)

This is a valid probability: 0 ≤ p_{kNN}(y = c | x) ≤ 1 and \sum_{c=1}^{\ell} p_{kNN}(y = c | x) = 1.
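Equation (7.15) in code: a minimal sketch, assuming Euclidean distance and a tiny made-up two-class training set (the function name and data are illustrative).

```python
import math
from collections import Counter

def knn_posterior(x, D, k, num_classes):
    # D is a list of (x_i, y_i) pairs; Euclidean distance is an assumption here.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(D, key=lambda pair: dist(pair[0], x))[:k]
    counts = Counter(y for _, y in neighbors)
    # Equation (7.15): fraction of the k nearest neighbors falling in each class.
    return [counts.get(c, 0) / k for c in range(num_classes)]

# Tiny made-up training set with two classes.
D = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1),
     ((0.9, 1.1), 1), ((1.2, 0.8), 1)]
post = knn_posterior((1.0, 0.9), D, k=3, num_classes=2)
```

The returned list is a valid distribution over the classes: each entry lies in [0, 1] and the entries sum to 1, exactly as the slide notes.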
k-NN classifier, various k
From Hastie et al., 2009
[Figure: two-class training data ("o" points) with the decision regions of the k-NN classifier, with k = 15 (from Hastie et al., 2009).]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . 
. . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . ..................... ...................... . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . ....................... . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . ............. . . . . . . . . . . .. . . . . . . . . . .. . . . . . . . . . . ................. ................... .................... ................... .............. . . . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . .
oo
ooo
o
o
o
o
o
o
o
o
oo
o
o o
oo
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
oo
o
o
o
o
o
o
o
o
o
oo o
oo
oo
o
oo
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
ooo
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o oooo
o
ooo o
o
o
o
o
o
o
o
ooo
ooo
ooo
o
o
ooo
o
o
o
o
o
o
o
o o
o
o
o
o
o
o
oo
ooo
o
o
o
o
o
o
ooo
oo oo
o
o
o
o
o
o
o
o
o
o
k-NN classifier, with k = 1
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F16/57 (pg.16/188)
k-NN classifiers, e↵ective degrees of freedom
The hyperparameter is k (similar to \lambda in ridge or lasso). A useful measure is n/k, the effective degrees of freedom of a k-NN classifier: n/k is the effective number of neighborhoods within which we might fit a mean.

We can look at train/test error as a function of n/k and compare it to logistic regression.
[Figure: train and test error vs. effective degrees of freedom n/k, with Bayes error and linear-model baselines. From Hastie et al., 2009.]
k-NN LSH DTs
1D manifolds in 2D ambient space

One-dimensional spiral manifold in 2D space, along with PCA projection, 2-class case.

[Figure: spiral data in the (x1, x2) plane; x1 ranges over [-300, 100], x2 over [-200, 400].]
2D Manifold Examples in 3D Ambient Space

(A) Manifolds. (B) Samples from the manifolds. (C) Inherent flattened 2D manifolds.

From: Roweis & Saul, 2000
Manifold Learning in ML

Goal: find the lower-dimensional manifold on which the data lies, within the ambient high-dimensional space in which the data is coordinatized.

I.e., the high-dimensional data rests on or near an inherently low-dimensional manifold (if flattened) within the high-dimensional ambient (input) space.

Useful in machine learning for non-linear methods for dimensionality reduction.

Example algorithms: Isomap, Laplacian Eigenmaps, Kernel PCA, etc.

Rather than find the coordinates of the data items along the manifold, the k-NN approaches approximate distances on the manifold, i.e., geodesic distance as the distance between points along the k-NN graph-represented manifold.
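The geodesic-distance idea above can be sketched with a tiny example (not from the slides; a minimal sketch assuming Euclidean distance, with hypothetical helper names `knn_graph` and `geodesic`): build a k-NN graph over points sampled from a spiral, and take shortest-path lengths along the graph as approximate geodesic distances.

```python
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-NN graph: index -> list of (neighbor index, distance)."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            adj[i].append((j, d))
            adj[j].append((i, d))  # symmetrize so the graph is undirected
    return adj

def geodesic(adj, src, dst):
    """Approximate geodesic distance: shortest path along the k-NN graph (Dijkstra)."""
    best = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf

# Points along a spiral: nearby in the ambient plane is not nearby on the manifold.
spiral = [(t * math.cos(t), t * math.sin(t)) for t in [0.5 + 0.1 * i for i in range(100)]]
amb = math.dist(spiral[0], spiral[-1])       # straight-line (ambient) distance
geo = geodesic(knn_graph(spiral, 3), 0, 99)  # distance along the manifold
print(geo > amb)  # the geodesic is much longer than the ambient shortcut
```

With k small enough that the graph never shortcuts across spiral arms, the shortest path tracks the curve, so the graph distance approximates arc length rather than chord length.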
k-NN, memory, parametric vs. non-parametric

k-NN classifiers do not require an explicit model (e.g., linear, non-linear) and are examples of non-parametric models, unlike the parametric models we've seen so far (but more like t-SNE).

k-NN classifiers need to memorize the entire training data set.

Parametric models include: linear models, logistic regression, most neural networks (including the deep variety), or anything involving a fixed and finite number of parameters regardless of the training data size.

Other non-parametric models: histogram density estimates, kernel density estimates, kernel/spline/wavelet-based regression, support vector machines, kernel machines.

Non-parametric methods are useful in that methods to learn them automatically adjust the model complexity to the natural complexity of the data.
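A minimal illustration of the memorization point (not from the slides; the class and names are hypothetical): "fitting" a k-NN classifier just stores the data, so its memory grows with n, and all real computation is deferred to query time.

```python
import math
from collections import Counter

class KNNClassifier:
    """Non-parametric: 'training' memorizes the data; all work happens at query time."""
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # memory grows with n, unlike a parametric model
        return self

    def predict(self, x):
        # Majority vote among the k nearest training points (Euclidean distance).
        nearest = sorted(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))[: self.k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
clf = KNNClassifier(k=3).fit(X, y)
print(clf.predict((0.2, 0.2)), clf.predict((5.5, 5.5)))  # a b
```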
Computational Considerations

k-NN classifiers require memory at least sufficient to store the n data items.

With no pre-computation, a naive implementation of a query takes O(n log n + nm) or O(n log k + nm) time (computing each of the n distances is O(m), and the distances are then fully sorted, or reduced to the top k). Also, we need n such queries, e.g., to build the full k-NN graph!

There are data structures (spatial access methods) for fast nearest-neighbor search, using methods analogous to things such as the R-tree.

[Figure: R-tree nearest-neighbor search around a query point Q, with bounding boxes B1-B13. From Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree".]

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.
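The top-k selection cost can be seen in a small sketch (not from the slides): computing all n distances costs O(nm), and a size-k heap (here via `heapq.nsmallest`) reduces the selection step to O(n log k) instead of a full O(n log n) sort.

```python
import heapq
import math
import random

random.seed(0)
m, n, k = 10, 10_000, 5
data = [[random.random() for _ in range(m)] for _ in range(n)]
query = [random.random() for _ in range(m)]

# O(nm) to compute all distances; heapq.nsmallest keeps a heap of size k,
# so selecting the k nearest costs O(n log k) rather than O(n log n).
dists = ((math.dist(query, x), i) for i, x in enumerate(data))
nearest = heapq.nsmallest(k, dists)

full_sort = sorted((math.dist(query, x), i) for i, x in enumerate(data))[:k]
print(nearest == full_sort)  # both give the same k nearest neighbors
```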
Recall: Bayes error
Recall next two slides from lecture 4.
0/1-loss and Bayes error

Let L be the 0/1 loss, i.e., L(y, y') = 1_{\{y \neq y'\}}.

Probability of error for a given x:

\Pr(h_\theta(x) \neq Y) = \int p(y|x) \, 1_{\{h_\theta(x) \neq y\}} \, dy = 1 - p(h_\theta(x)|x)   (7.7)

To minimize the probability of error, h_\theta(x) should ideally choose a class having maximum posterior probability for the current x.

The smallest probability of error is known as the Bayes error for x:

\text{Bayes error}(x) = \min_y (1 - p(y|x)) = 1 - \max_y p(y|x)   (7.8)

The Bayes classifier (or predictor) predicts using the highest-posterior class, assuming we have access to p(y|x). I.e.,

\text{BayesPredictor}(x) \triangleq \mathrm{argmax}_y \, p(y|x)   (7.9)

The Bayes predictor has the Bayes error as its error; this is the irreducible error rate.
Bayes Error, overall di�culty of a problem

For Bayes error, we often take the expected value w.r.t. x. This gives an overall indication of how difficult a classification problem is, since we can never do better than Bayes error.

Bayes error, various equivalent forms (Exercise: show the last equality):

\text{Bayes Error} = \min_h \int p(x, y) \, 1_{\{h(x) \neq y\}} \, dy \, dx   (7.13)
= \min_h \Pr(h(X) \neq Y)   (7.14)
= E_{p(x)}[\min(p(y|x), 1 - p(y|x))]   (7.15)
= \frac{1}{2} - \frac{1}{2} E_{p(x)}[\,|2 p(y|x) - 1|\,]   (7.16)

(In (7.15) and (7.16) the problem is binary, and p(y|x) denotes the posterior of one fixed class, e.g., p(y = 1|x).)

For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., x tells us nothing about y).

Bayes error is a property of the distribution p(x, y); it can be useful to decide between, say, p(x, y) vs. p(x', y) for two di↵erent feature sets x and x'.
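A quick numeric check of the equivalent forms (not from the slides; the distribution is a made-up binary toy problem): forms (7.15) and (7.16) give the same Bayes error.

```python
# Toy binary problem: X takes 3 values; p1[x] = p(y = 1 | x).
px = {0: 0.5, 1: 0.3, 2: 0.2}
p1 = {0: 0.9, 1: 0.5, 2: 0.1}

# Form (7.15): E_x[ min(p(y|x), 1 - p(y|x)) ]
bayes_15 = sum(px[x] * min(p1[x], 1 - p1[x]) for x in px)

# Form (7.16): 1/2 - 1/2 E_x[ |2 p(y|x) - 1| ]
bayes_16 = 0.5 - 0.5 * sum(px[x] * abs(2 * p1[x] - 1) for x in px)

print(round(bayes_15, 6), round(bayes_16, 6))  # identical; here 0.22
```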
Aside: modes of convergence

X_n \to X almost surely (X_n \xrightarrow{a.s.} X) if the set

\{\omega \in \Omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}   (7.1)

is an event with probability 1. So all such events combined together have probability 1, and any event left out has probability zero, according to the underlying probability measure.

X_n \to X in the rth mean (r \geq 1) (written X_n \xrightarrow{r} X) if

E|X_n^r| < \infty \;\forall n, \text{ and } E(|X_n - X|^r) \to 0 \text{ as } n \to \infty   (7.2)

X_n \to X in probability (written X_n \xrightarrow{p} X) if

p(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty, \text{ for all } \epsilon > 0   (7.3)

X_n \to X in distribution (written X_n \xrightarrow{D} X) if

p(X_n \leq x) \to P(X \leq x) \text{ as } n \to \infty   (7.4)

for all points x at which F_X(x) = p(X \leq x) is continuous.

a.s. \Rightarrow p, \; r \Rightarrow p, \; p \Rightarrow D, and if r > s \geq 1, then r \Rightarrow s.
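Convergence in probability can be illustrated with a small simulation (not from the slides; a minimal sketch): taking X_n to be the mean of n fair coin flips, the Monte Carlo estimate of p(|X_n - 1/2| > \epsilon) shrinks as n grows.

```python
import random

random.seed(1)

def prob_deviation(n, eps=0.1, trials=2000):
    """Monte Carlo estimate of p(|X_n - 1/2| > eps), X_n = mean of n fair coin flips."""
    hits = 0
    for _ in range(trials):
        xn = sum(random.random() < 0.5 for _ in range(n)) / n
        hits += abs(xn - 0.5) > eps
    return hits / trials

small, large = prob_deviation(10), prob_deviation(500)
print(small, large)  # the deviation probability shrinks toward 0 as n grows
```

This is the weak law of large numbers in action: X_n \xrightarrow{p} 1/2.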
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).
Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.45/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.
Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.46/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.
✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.47/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.
h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.48/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).
Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.49/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).
We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.50/188)
k-NN LSH DTs

How far away are we from Bayes error

Why are consistency and universal consistency desirable properties?
As we get more data, we can't get worse, and we will eventually approach the best we can ever do for a given distribution $p(x,y)$. Universality means that we can expect this to be true regardless of the distribution.
Ideally, we'd like to show that there is a family of classification methods that are universally consistent.
This wasn't known until relatively recently (in the "Stone age"): in 1977, Chuck Stone proved a famous theorem, that the k-NN classifier is universally consistent!
Let $h_{k\text{-NN},D_n}(x)$ be the decision of the k-NN binary classifier (majority vote over neighbors) on data set $D_n$ drawn randomly from $p(x,y)$. Stone showed that:

Theorem 7.3.3 (Stone, 1977)
Let $k \to \infty$ with increasing $n$ such that $k/n \to 0$ as $n \to \infty$. Then for all probability distributions $p(x,y)$, $\mathbb{E}_{D_n}[R(h_{k\text{-NN},D_n})] \to R^*$ as $n \to \infty$.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F28/57
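A minimal simulation (not from the slides) of Stone's conditions: the schedule $k(n) = \lceil\sqrt{n}\rceil$ satisfies both $k \to \infty$ and $k/n \to 0$, and on an assumed 1-D two-Gaussian mixture ($X \mid Y{=}y \sim N(\pm 1, 1)$, equal priors, Bayes error $\approx 0.1587$) the resulting k-NN test error should land near the Bayes error even at modest $n$.

```python
import math
import random

def knn_schedule(n):
    """A k(n) satisfying Stone's theorem: k -> infinity while k/n -> 0."""
    return max(1, math.ceil(math.sqrt(n)))

def sample(n, rng):
    """Draw n labeled points from the mixture N(-1,1) / N(+1,1), equal priors."""
    return [(rng.gauss(1.0 if y else -1.0, 1.0), y)
            for y in (rng.randint(0, 1) for _ in range(n))]

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (1-D distance)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(2 * sum(y for _, y in nearest) >= k)

rng = random.Random(0)
train, test = sample(600, rng), sample(600, rng)
k = knn_schedule(len(train))  # k = 25 for n = 600
test_error = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
```

The naive sort-per-query here is $O(n \log n)$ per prediction; it is only meant to make the $k(n)$ schedule concrete, not to be an efficient nearest-neighbor search.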
k-NN LSH DTs

k-NN and Bayes error

3-class example for various k, compared to the Bayes decision boundaries (purple dashed lines). From Hastie et al., 2009.

[Figure: two scatter plots of the same three-class data, one panel titled "15-Nearest Neighbors" and one titled "1-Nearest Neighbor", each showing the learned decision regions against the Bayes boundaries.]

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F29/57
k-NN LSH DTs

What about 1-NN?

Even the 1-NN classifier is good in a certain way. Let $h_{1\text{-NN},D_n}(x)$ be the nearest-neighbor classifier based on data set $D_n$. We have:

Theorem 7.3.4 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x,y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}[R(h_{1\text{-NN},D_n})] = \mathbb{E}[2\eta(X)(1 - \eta(X))] \qquad (7.5)$$
where $\eta(x) = \mathbb{E}[Y \mid X = x] = p(y = 1 \mid x)$.

As a consequence, we have

Theorem 7.3.5 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x,y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}[R(h_{1\text{-NN},D_n})] \leq 2R^* \qquad (7.6)$$

Thus, the simple 1-NN rule is, asymptotically, no worse than twice the Bayes error. If the Bayes error is small, so is the 1-NN error (eventually, and on average!).

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F30/57
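The Cover-Hart limit $\mathbb{E}[2\eta(X)(1-\eta(X))]$ in (7.5) can be computed numerically for an assumed example distribution (not from the slides): the 1-D mixture $X \mid Y{=}0 \sim N(-1,1)$, $X \mid Y{=}1 \sim N(+1,1)$ with equal priors, where $R^* = \Phi(-1)$. The sketch below checks that this asymptotic 1-NN risk indeed sits between $R^*$ and $2R^*$, as (7.6) promises.

```python
import math

def normal_pdf(x, mu):
    """Unit-variance Gaussian density with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def eta(x):
    """eta(x) = P(Y=1 | X=x) for the mixture N(-1,1)/N(+1,1), equal priors."""
    f0, f1 = normal_pdf(x, -1.0), normal_pdf(x, 1.0)
    return f1 / (f0 + f1)

def asymptotic_1nn_risk(lo=-8.0, hi=8.0, steps=4000):
    """E[2 eta(X)(1 - eta(X))]: the Cover-Hart limit of the 1-NN risk,
    integrated against the marginal density m(x) by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        m = 0.5 * normal_pdf(x, -1.0) + 0.5 * normal_pdf(x, 1.0)
        total += 2.0 * eta(x) * (1.0 - eta(x)) * m * h
    return total

R_star = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))  # Phi(-1), ~0.1587
R_1nn = asymptotic_1nn_risk()
```

Pointwise, $\min(\eta, 1-\eta) \le 2\eta(1-\eta) \le 2\min(\eta, 1-\eta)$, which is exactly why the integral lands in the interval $[R^*, 2R^*]$.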
k-NN LSH DTs
What about 1-NN?Even the 1-NN classifier is good in a certain way. Let h1-NN,Dn(x) bethe nearest neighbor, classifier, based on data set Dn.
We have:
Theorem 7.3.4 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution p(x, y), we have
limn!1
EDn [R(h1-NN,Dn)] = E[2⌘(X)(1 � ⌘(X)] (7.5)
where ⌘(x) = E[Y |X = x] = p(y|x).
As a consequence, we have
Theorem 7.3.5 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution p(x, y), we have
limn!1
EDn [R(h1-NN,Dn)] 2R⇤ (7.6)
Thus, simple 1-NN rule is, asymptotically, no worse than twice Bayeserror. If Bayes is small, so is 1-NN (eventually, and on average!).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F30/57 (pg.64/188)
Aspects of k-NN

Universal consistency: k-NN converges on average to the Bayes rule if k → ∞ and k/n → 0 as n → ∞.
How does it behave for finite k? k is a smoothing parameter (analogous to the λ tradeoff parameter), as we have seen. How to choose k? Via cross-validation.
Can we weight the nearest neighbors? Why should they always count equally? Perhaps weight by distance (suggesting kernel estimators)?
What distance metric? Euclidean, or more generally the Lp distance d(x, x′) = ‖x − x′‖_p.
Memory: what if we reduce the training data size and come up with exemplars? Clustering, EM, unsupervised learning.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F31/57
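The "choose k via cross-validation" point can be sketched as follows; the dataset, fold count, and candidate k values here are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-class data in 2-D: class y is shifted by 1.5 in each coordinate.
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(0, 1, size=(n, 2)) + 1.5 * y[:, None]

def knn_predict(Xtr, ytr, Xte, k):
    """Brute-force k-NN: majority vote among the k nearest training points."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)   # squared distances
    nn = np.argpartition(d2, k, axis=1)[:, :k]                # k nearest indices
    return (ytr[nn].mean(axis=1) > 0.5).astype(int)

def cv_error(X, y, k, folds=5):
    """Average validation error of k-NN over `folds` random splits."""
    idx = rng.permutation(len(X))
    errs = []
    for f in np.array_split(idx, folds):
        mask = np.ones(len(X), bool)
        mask[f] = False
        pred = knn_predict(X[mask], y[mask], X[f], k)
        errs.append(np.mean(pred != y[f]))
    return float(np.mean(errs))

# Odd k values avoid voting ties in the binary case.
scores = {k: cv_error(X, y, k) for k in [1, 3, 5, 11, 21, 51]}
best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Small k gives a jagged (low-bias, high-variance) decision boundary; large k smooths it, exactly the smoothing-parameter tradeoff named above.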
Computational Considerations

k-NN classifiers require memory at least sufficient to store the n data items. With no pre-computation, a naive implementation of a query takes O((n log n) + nm) or O((n log k) + nm) time (O(n) distances, fully sorted or reduced to the top k, with each distance costing O(m)). Also, we need n queries!

There are data structures for spatial access methods for doing fast nearest neighbor search, using methods analogous to things such as the R-tree.

[Figure: a query point Q surrounded by R-tree bounding boxes B1-B13 within a root region R. From Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree".]

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F32/57
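The O((n log k) + nm) query cost above can be realized with a bounded max-heap that keeps only the k best candidates seen so far; a minimal sketch (function name and data are made up for illustration):

```python
import heapq
import numpy as np

def knn_query(points, q, k):
    """Scan once, keeping the k closest candidates in a max-heap keyed by
    negated distance: O(n log k) heap work plus O(nm) distance computation."""
    heap = []                                  # entries: (-distance, index)
    for i, x in enumerate(points):
        d = float(np.linalg.norm(x - q))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:                  # closer than current worst kept
            heapq.heapreplace(heap, (-d, i))
    return [i for _, i in sorted(heap, reverse=True)]   # nearest first

rng = np.random.default_rng(2)
P = rng.normal(size=(1000, 8))
q = np.zeros(8)
top5 = knn_query(P, q, 5)

# Sanity check against a full O(n log n) sort of all distances.
brute = np.argsort(np.linalg.norm(P - q, axis=1))[:5].tolist()
assert top5 == brute
```

The heap only pays log k per update rather than log n, which is the source of the O(n log k) term.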
Nearest Neighbor Search and Computation

Nearest Neighbor Search (NNS) Problem: given a set of n points {x^(i)}_{i=1}^n with x^(i) ∈ R^m, finding the point nearest to a query q ∈ R^m is argmin_i d(x^(i), q).
A naive query costs O(nm) in both time and memory. Finding all points' nearest neighbors (1-NN) is O(n²m) naively.
Recall the m-dimensional spherical Gaussian with zero mean and variance σ²:
p(x) = (2πσ²)^{−m/2} exp(−‖x‖²₂ / (2σ²))   (7.7)
As m gets large, this Gaussian (as we saw in HW3) starts to behave very differently.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F33/57
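The naive costs can be sketched directly (sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 16
X = rng.normal(size=(n, m))
q = rng.normal(size=m)

# One query: compute all n distances (O(nm)), then take the argmin (O(n)).
nearest = int(np.argmin(np.linalg.norm(X - q, axis=1)))

# All-pairs 1-NN: O(n^2 m) naively. Mask the diagonal so that no point
# is returned as its own nearest neighbor.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)
nn = D.argmin(axis=1)
print("query NN:", nearest, " first few 1-NNs:", nn[:5])
```

The (n, n, m) broadcast makes the O(n²m) cost visible as the literal shape of the intermediate array.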
Gaussian Annulus Theorem: HW3 Redux

HW3: when σ² = 1, there is very little probability mass over the unit r = 1 ball since there is so little volume there; only when r = √m is the mass over the ball significant. Beyond r = √m the mass does not increase much (as fast as the ball volume continues to increase, the density drops at a higher rate). I.e., HW3 proves:

Theorem 7.4.1 (Gaussian Annulus Theorem)
For any m-dimensional spherical Gaussian with common unit variance, for any β ≤ √m, all but at most 3e^{−cβ²} of the probability mass lies within the annulus √m − β ≤ ‖x‖₂ ≤ √m + β, where c > 0 is a constant.

The theorem says that Pr(|‖X‖₂ − √m| ≤ β) > 1 − 3e^{−cβ²}.
Choose any 0 < ε = 3e^{−cβ²} to be the fraction to ignore and set β to achieve that fraction (easy thanks to exp(·)). As m gets big, the β range becomes negligible since |√m + β − (√m − β)| / √m → 0 as m → ∞.
E[‖X‖²₂] = Σ_{i=1}^m E[X_i²] = m E[X_i²] = m, so the mean squared distance of a point to the center is m. As m increases, things concentrate at this distance. √m is sometimes called the radius of the Gaussian.
For details, see the text by Blum, Hopcroft, and Kannan, 2018.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F34/57
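The concentration in Theorem 7.4.1 is easy to see empirically; a quick sketch (sample counts and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Norms of unit-variance spherical Gaussians concentrate near sqrt(m):
# the mean radius grows like sqrt(m) while the spread stays O(1).
for m in [10, 100, 1000, 10000]:
    X = rng.normal(size=(2000, m))
    r = np.linalg.norm(X, axis=1)
    print(f"m={m:6d}  mean(r)/sqrt(m)={r.mean()/np.sqrt(m):.4f}  std(r)={r.std():.3f}")
    assert abs((r ** 2).mean() / m - 1) < 0.1   # E[‖X‖²] = m
    assert r.std() < 1.0                        # annulus width stays O(1)
```

As m grows, mean(r)/√m approaches 1 while std(r) stays near a constant, so the relative width of the annulus shrinks like 1/√m, matching the theorem.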
Random Projections to speed NN Search (NNS)

In a lower dimensional space m′ ≪ m, NNS is much faster. Possible solution: let f : R^m → R^{m′} be a projection from m to m′ ≪ m dimensions. Are x and y near if f(x) is near f(y)? Consider x = (100, 0, 0, . . . , 0) ∈ R^m and y = (0, 0, . . . , 0) ∈ R^m. Ex-a: f_a(x) = x₁ ∈ R? Ex-b: f_b(x) = x₂ ∈ R? (f_a preserves their separation, while f_b maps both to 0, so the choice of projection matters.)
Approach: pick m′ vectors u^(1), u^(2), . . . , u^(m′) ∈ R^m drawn independently at random from the unit-variance spherical Gaussian p(x). Note, the u^(i)s themselves are neither unit length (mean radius √m) nor necessarily orthogonal. Define f : R^m → R^{m′} as
f(x) = (⟨u^(1), x⟩, ⟨u^(2), x⟩, . . . , ⟨u^(m′), x⟩) ∈ R^{m′}   (7.8)
the vector of inner (dot) products.
We claim that for any x, ‖f(x)‖₂ ≈ √m′ ‖x‖₂ with high probability. Note, the u^(i)s are not unit length, so norms in the lower dimensional space are larger. To the extent this is true, distances can be approximated in low dimension:
‖x − y‖₂ = (√m′/√m′) ‖x − y‖₂ ≈ (1/√m′) ‖f(x − y)‖₂ = (1/√m′) ‖f(x) − f(y)‖₂   (7.9)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F35/57
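The claim ‖f(x)‖₂ ≈ √m′ ‖x‖₂ can be checked numerically; a sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
m, m_prime = 10000, 200

# Rows of U are the random Gaussian vectors u^(i); f(x) = U x as in (7.8).
U = rng.normal(size=(m_prime, m))

x = rng.normal(size=m)
y = rng.normal(size=m)

# Norms scale by roughly sqrt(m'):
ratio = np.linalg.norm(U @ x) / (np.sqrt(m_prime) * np.linalg.norm(x))
print(f"norm ratio ‖f(x)‖ / (√m′ ‖x‖) = {ratio:.3f}")   # close to 1

# So distances are preserved after dividing by sqrt(m'), as in (7.9):
approx = np.linalg.norm(U @ x - U @ y) / np.sqrt(m_prime)
exact = np.linalg.norm(x - y)
print(f"distance: exact {exact:.2f}  projected estimate {approx:.2f}")
```

Conditioned on x, the entries of f(x)/‖x‖₂ are m′ i.i.d. standard normals, so ‖f(x)‖₂/‖x‖₂ is exactly the setting of the Gaussian Annulus Theorem in dimension m′.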
Random Projection Theorem

Specifically, it can be shown that:

Theorem 7.4.2 (Random Projection Theorem)
Let x ∈ R^m and f as given. Then for ε ∈ (0, 1),
Pr(|‖f(x)‖₂ − √m′ ‖x‖₂| ≥ ε √m′ ‖x‖₂) ≤ 3e^{−cm′ε²}   (7.10)
where the probability is taken over random draws of the u vectors.

This follows from the Gaussian Annulus Theorem. So the probability of the norm of the projection deviating from √m′ ‖x‖₂ is exponentially small in m′.
Given n samples x^(i), the probability of any of the O(n²) distances deviating by such an amount is also small (say δ) by the union bound (Pr(∪_i A_i) ≤ Σ_i Pr(A_i)), as long as m′ is not too small, i.e., m′ ≥ 3 log(n)/(cε²), giving δ = 3/n.
This means we can accurately calculate approximate distances in the m′-dimensional space after this random projection.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F36/57
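A sketch checking that all O(n²) pairwise distances survive a random projection. The constant c in the theorem is unspecified, so the target dimension m′ below is simply chosen by hand for illustration rather than computed from the bound m′ ≥ 3 log(n)/(cε²).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, m, eps = 40, 2000, 0.2
m_prime = 300                     # hand-picked; the theorem's c is unspecified

X = rng.normal(size=(n, m))
U = rng.normal(size=(m_prime, m))
F = X @ U.T / np.sqrt(m_prime)    # scaled projection of every point

# Worst relative distortion over all n(n-1)/2 pairwise distances.
worst = max(
    abs(np.linalg.norm(F[i] - F[j]) / np.linalg.norm(X[i] - X[j]) - 1)
    for i, j in combinations(range(n), 2)
)
print(f"worst relative distortion over {n * (n - 1) // 2} pairs: {worst:.3f}")
```

One draw of U handles every pair simultaneously; that simultaneity is exactly what the union-bound step above guarantees.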
k-NN LSH DTs
Random Projection TheoremSpecifically, it can be shown that:
Theorem 7.4.2 (Random Projection Theorem)
Let x 2 Rm and f as given. Then for ✏ 2 (0, 1),
Pr⇣���kf(x)k2 �
pm0kxk2
��� � ✏p
m0kxk2⌘
3e�cm0✏2 (7.10)
where probability is taken over random draws of u vectors.
Follows from the Gaussian Annulus Theorem.So probability of norm of projection deviating from
pm0kxk2 is
exponentially small in m0.Given n samples x(i), the probability of any of the O(n2) distancesdeviating by such an amount is also small (say �) by the unionbound (Pr([iAi)
Pi Pr(Ai)) as long as if m0 is not too small, i.e.,
m0 � 3 log(n)/(c✏2), giving � = 3/n.This means we can accurately calculate approximate distances inm0-dimensional space after this random projection.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F36/57 (pg.100/188)
k-NN LSH DTs

Johnson-Lindenstrauss Lemma

Theorem 7.4.3 (Johnson-Lindenstrauss Lemma)

For any $\epsilon \in (0, 1)$ and any $n$, let $m' \ge \frac{3}{c\epsilon^2} \log n$. Let $X$ be a set of $n$ points in $\mathbb{R}^m$. The random projection $f : \mathbb{R}^m \to \mathbb{R}^{m'}$ has the property that, for all $x^{(i)}, x^{(j)}$, with probability at least $1 - 3/(2n)$ we have:
$$(1 - \epsilon)\sqrt{m'}\,\|x^{(i)} - x^{(j)}\| \le \|f(x^{(i)}) - f(x^{(j)})\| \le (1 + \epsilon)\sqrt{m'}\,\|x^{(i)} - x^{(j)}\| \qquad (7.11)$$

I.e., the probability of any of the $\binom{n}{2} < n^2/2$ pairs of points so deviating is at most $3/(2n)$, so this can be used to do fast NNS.

$m'$ grows only logarithmically with $n$. In high dimensions $m' \ll m$ even for small $\epsilon$.

Example: $n = 10^{10}$; in the proof of the theorem $c = 96$ works; with $\epsilon = 0.05$, $m' \ge 3/(96 \cdot 0.05^2)\,\log(10^{10}) \approx 288$. Speedup: $10^{10}/288 \approx 34$ million!

Other examples: $(n = 10000, m' = 115)$, $(n = 1000, m' = 86)$, $(n = 100, m' = 57)$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F37/57 (pg.105/188)
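The lemma's pairwise-distance guarantee can be checked empirically. A sketch, again under the Gaussian-projection assumption, with illustrative values of $n$, $m$, $m'$ (not those of the example above):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, m, m_prime = 50, 5_000, 400

X = rng.standard_normal((n, m))
U = rng.standard_normal((m_prime, m))
FX = X @ U.T                       # f(x) for every point, shape (n, m')

# Worst relative distortion of any pairwise distance vs. sqrt(m') * original.
worst = 0.0
for i, j in combinations(range(n), 2):
    d_orig = np.linalg.norm(X[i] - X[j])
    d_proj = np.linalg.norm(FX[i] - FX[j])
    worst = max(worst, abs(d_proj / (np.sqrt(m_prime) * d_orig) - 1.0))
print(f"worst relative distortion over all pairs: {worst:.3f}")
```

Even over all $\binom{50}{2}$ pairs, the worst distortion stays small, as the union-bound argument predicts.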
k-NN LSH DTs

Approximate Nearest Neighbor Search: LSH

Recall the nearest neighbor search (NNS) problem: given a set of $n$ points $\{x^{(i)}\}_{i=1}^{n}$, $x^{(i)} \in \mathbb{R}^m$, finding the nearest to a query $q \in \mathbb{R}^m$ is $\mathrm{argmin}_i\, d(x^{(i)}, q)$.

Rather than trying to find $\mathrm{argmin}_i\, d(x^{(i)}, q)$ exactly, we find some $x^{(i)}$ such that $d(q, x^{(i)}) \le c \min_j d(q, x^{(j)})$ for some $c \ge 1$, with some (high) probability.

Locality Sensitive Hashing (LSH) helps us achieve this.

Normally, cryptographic hash functions (e.g., SHA-1, SHA-256, etc., computed with Linux's shasum) ensure that even small differences in input map to different buckets, making collisions unlikely.

In LSH, by contrast, small differences are likely to map to the same bucket: collisions are likely for similar inputs.

LSH is similar to clustering (which we will discuss in a future lecture).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F38/57 (pg.110/188)
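The $c$-approximate criterion above can be stated directly in code. A sketch: `approx_ok` (a hypothetical helper) only checks the condition against a brute-force scan; it is not itself an LSH method.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 32))   # data points x^(i)
q = rng.standard_normal(32)           # query

dists = np.linalg.norm(X - q, axis=1)
exact_best = dists.min()

def approx_ok(candidate_idx: int, c: float = 2.0) -> bool:
    """True if X[candidate_idx] is a c-approximate nearest neighbor of q,
    i.e., d(q, x) <= c * min_j d(q, x^(j))."""
    return dists[candidate_idx] <= c * exact_best

print(approx_ok(int(np.argmin(dists))))   # the exact NN always qualifies
```

An LSH index only needs to return *some* point passing this check with high probability, which is what buys the speedup over the exact $O(n)$ scan.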
k-NN LSH DTs

LSH definition

Given two probabilities $p_1, p_2$ with $p_1 \ge p_2$, a constant $c > 1$, and a distance metric $d : X \times X \to \mathbb{R}$, where $X$ is the domain of the input (e.g., $\mathbb{R}^m$). A family of hash functions is defined relative to the probabilities $p_1, p_2$, the constant $c$, a notion of closeness (radius $r$), and the distance metric $d$.

Definition 7.4.4 (LSH)

A family $H$ of hash functions is said to be $(r, c, p_1, p_2)$-LSH with $p_1 \ge p_2$, $r > 0$, and $c > 1$ if (when $h$ is drawn uniformly at random from $H$):

(a) $\Pr[h(x) = h(y)] \ge p_1$ when $d(x, y) \le r$ (close points have a high probability of mapping to the same bucket).

(b) $\Pr[h(x) = h(y)] \le p_2$ when $d(x, y) \ge cr$ (distant points have a low probability of mapping to the same bucket).

When things are close ($d(x, y) \le r$), we have a high probability ($\ge p_1$) that the two vectors $x, y$ hash to the same place ($h(x) = h(y)$). When things are far ($d(x, y) \ge cr$), we have a low probability ($\le p_2$) that they hash to the same place.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F39/57 (pg.116/188)
LSH allows a tradeoff between computation and statistical accuracy.
k-NN LSH DTs

LSH instances

We can define LSH hash functions for a variety of problems, including for the Hamming metric: $X = \{0, 1\}^m$, $d(x, y) = \sum_{i=1}^{m} 1\{x_i \ne y_i\}$. A very simple example of an LSH function for Hamming is $h(x) = x[a]$ for some $a \in \{0, 1, \ldots, m-1\}$, with $a$ drawn uniformly at random. We study this in HW4.

There also exist LSH functions for: (1) Jaccard similarity, $X = 2^V$, $d(A, B) = \frac{|A \cap B|}{|A \cup B|}$; and (2) Euclidean distance, $X = \mathbb{R}^m$, $d(x, y) = \|x - y\|_2$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F40/57 (pg.123/188)
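For this bit-sampling family, the collision probability is exactly $\Pr[h(x) = h(y)] = 1 - d_H(x, y)/m$, which a small sketch can verify by averaging over all $m$ choices of $a$:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 64
x = rng.integers(0, 2, m)
y = x.copy()
y[:8] ^= 1                       # flip 8 bits: Hamming distance 8

# h_a(x) = x[a]; collision probability over a uniform choice of a.
collide = np.mean([x[a] == y[a] for a in range(m)])
print(collide, 1 - 8 / m)        # both 0.875
```

So with $d(x, y) \le r$ we get $p_1 \ge 1 - r/m$, and with $d(x, y) \ge cr$ we get $p_2 \le 1 - cr/m$: a valid $(r, c, p_1, p_2)$-LSH family.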
k-NN LSH DTs

LSH

In general, $h$ is drawn uniformly at random from $H$ in order to get the above probabilities (i.e., $\Pr[h(x) = h(y)] \ge p_1$ when $d(x, y) \le r$ and $\Pr[h(x) = h(y)] \le p_2$ when $d(x, y) \ge cr$).

While $p_1 \ge p_2$, sometimes $p_1$ is too small (and thus $\Pr[h(x) = h(y)]$ is too small, meaning we never get a hit).

To obtain a more flexible and wider range of values for $p_1, p_2$, we construct compound LSHs by using a set of LSHs multiple times, drawn randomly, i.e., $g(x) = (h_1(x), h_2(x), \ldots, h_k(x))$ where each $h_i$ is randomly drawn from $H$.

We sometimes use multiple $g$'s, drawn randomly, too.

Homework 4 explores this further, and we will see that we can get more flexible values for $p_1$ and $p_2$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F41/57 (pg.127/188)
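A sketch of the compound hash $g(x) = (h_1(x), \ldots, h_k(x))$ for the bit-sampling family: concatenating $k$ independent hashes turns a collision probability $p$ into roughly $p^k$, which separates close from far pairs much more sharply (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, trials = 64, 8, 20_000

x = rng.integers(0, 2, m)
close = x.copy(); close[:4] ^= 1      # Hamming distance 4  -> p1 = 1 - 4/64
far = x.copy();   far[:32] ^= 1       # Hamming distance 32 -> p2 = 1 - 32/64

def g_collides(u, v):
    """One random compound hash g = (h_{a_1}, ..., h_{a_k}); True on collision."""
    idx = rng.integers(0, m, k)       # k random bit positions (with replacement)
    return np.array_equal(u[idx], v[idx])

p1k = np.mean([g_collides(x, close) for _ in range(trials)])
p2k = np.mean([g_collides(x, far) for _ in range(trials)])
print(p1k, (1 - 4 / 64) ** k)    # ~0.60: close pairs still collide often
print(p2k, (1 - 32 / 64) ** k)   # ~0.004: far pairs almost never collide
```

Using several independent $g$'s then boosts the close-pair hit probability back up (an OR over tables) without restoring many far-pair collisions.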
Board note: suppose $D = \{x^{(1)}, \ldots, x^{(n)}\}$ with $n$ large, and each $x \in X$ takes many bits to represent. To answer "is $q \in D$?", a linear scan is $O(n)$ and binary search is $O(\log n)$; can we do $O(1)$? Use a hash $h : X \to \mathbb{Z}$: to check whether $x$ is in the set, allocate an array of size $L$ and check slot $h(x) \bmod L$ to see if it is occupied.
k-NN LSH DTs
Recall, from lecture 3
Recall from lecture 3 the following slide.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F42/57 (pg.132/188)
k-NN LSH DTs
Linear Voronoi tessellation
When the discriminant functions are linear (i.e., $\langle \theta^{(j)}, x \rangle$), the decision boundaries are lines. Example in two dimensions:

From https://en.wikipedia.org/wiki/Voronoi_diagram

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F43/57 (pg.133/188)
k-NN LSH DTs
Using a Tree to Make Decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

4 leaf (or terminal) nodes
3 internal nodes
1 root node (as always)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F44/57 (pg.134/188)
k-NN LSH DTs
Iris flower classifier via petal length and width
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F45/57 (pg.135/188)
k-NN LSH DTs
Regression with a tree, constant decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F46/57 (pg.136/188)
k-NN LSH DTs
Regression with a tree, linear decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F47/57 (pg.137/188)
k-NN LSH DTs

Decision Trees

Partition the space into rectangular regions; in each region we make a final decision (i.e., a regression value or a classification).

Regions are formed by recursively splitting the space based on binary questions about one coordinate of the input at a time (e.g., "Is $x_7 < 92$?").

Example (from Hastie et al., 2009):

This corresponds to
$$h(x) = \sum_{i=1}^{5} c_i\, 1\{x \in R_i\} \qquad (7.12)$$

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F48/57 (pg.138/188)
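The piecewise-constant form of (7.12) can be written out directly; a sketch with made-up rectangles and constants (not those of the figure):

```python
# Disjoint axis-aligned regions ((x1_lo, x1_hi), (x2_lo, x2_hi)) -> constant c_i.
regions = {
    ((0.0, 0.5), (0.0, 1.0)): 1.0,
    ((0.5, 1.0), (0.0, 0.5)): 2.0,
    ((0.5, 1.0), (0.5, 1.0)): 3.0,
}

def h(x1, x2):
    """h(x) = sum_i c_i * 1{x in R_i} for disjoint rectangles R_i."""
    for ((a, b), (c, d)), val in regions.items():
        if a <= x1 < b and c <= x2 < d:
            return val
    return 0.0   # outside all regions

print(h(0.25, 0.9), h(0.75, 0.25), h(0.9, 0.9))   # 1.0 2.0 3.0
```

Because the regions are disjoint, the sum over indicators reduces to looking up the single region containing the point.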
k-NN LSH DTs

Regression Trees

Each of the $r$ regions could correspond to its own function, i.e., $h_{\theta_k}(x)$. Here we keep it simple: each region is constant, $h_{\theta_k}(x) = c_k$.
$$h_\theta(x) = \sum_{k=1}^{r} c_k\, 1\{x \in R_k\} \qquad (7.13)$$

Given data $D = \{(x^{(i)}, y^{(i)})\}$, how do we grow the tree? Minimize the squared error $\sum_i (y^{(i)} - h_\theta(x^{(i)}))^2$.

Once the regions are decided, the best within-region constant predictor is the simple average $c_k = \mathrm{average}(y^{(i)} \mid x^{(i)} \in R_k)$, i.e.:
$$c_k = \frac{1}{|\{i : x^{(i)} \in R_k\}|} \sum_{i : x^{(i)} \in R_k} y^{(i)} \qquad (7.15)$$

How do we find the $r$ regions?

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F49/57 (pg.142/188)
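The within-region averages of (7.15) in code, on toy data (a sketch; the regions here are fixed by hand rather than learned):

```python
import numpy as np

# Toy 1-D data split into r = 2 regions at x = 0.5.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
y = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

regions = [x < 0.5, x >= 0.5]
# c_k = mean of y^(i) over {i : x^(i) in R_k} -- the best constant under squared error.
c = [y[mask].mean() for mask in regions]
print(c)   # [2.0, 11.0]
```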
k-NN LSH DTs

Greedy strategy, binary splitting

Splitting variable j \in [m] with split point s \in R defines two half-planes:

    R_1(j, s) = {x : x_j <= s},    R_2(j, s) = {x : x_j > s}    (7.16)

Use optimization: find the j, s that achieve the best (minimum) in

    \min_{j,s} [ \sum_{x^{(i)} \in R_1(j,s)} (y^{(i)} - c_1)^2 + \sum_{x^{(i)} \in R_2(j,s)} (y^{(i)} - c_2)^2 ]    (7.17)

where c_k = average(y^{(i)} | x^{(i)} \in R_k).

For each j, we can find s by sorting the data by x_j and scanning in order to achieve the min. Select that j, s as the split point.

This can also be seen as maximizing the MSE gain:

    \max_{j,s} ( \sum_{x^{(i)}} (y^{(i)} - c_0)^2 - [ \sum_{x^{(i)} \in R_1(j,s)} (y^{(i)} - c_1)^2 + \sum_{x^{(i)} \in R_2(j,s)} (y^{(i)} - c_2)^2 ] )    (7.18)

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F50/57
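One way to implement the sort-and-scan search for the minimizer of Eq. (7.17). It uses the identity \sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2 / n, so after the O(n log n) sort each candidate threshold costs O(1). The function name and toy data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse): the feature, threshold, and squared error of the
    best single binary split, found by sorting each feature and scanning."""
    n, m = X.shape
    best = (None, None, np.inf)
    for j in range(m):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        left_sum = left_sq = 0.0
        tot_sum, tot_sq = ys.sum(), (ys ** 2).sum()
        for i in range(n - 1):  # split between sorted positions i and i+1
            left_sum += ys[i]
            left_sq += ys[i] ** 2
            if xs[i] == xs[i + 1]:
                continue  # cannot split between equal feature values
            nl, nr = i + 1, n - i - 1
            sse = (left_sq - left_sum ** 2 / nl) \
                + ((tot_sq - left_sq) - (tot_sum - left_sum) ** 2 / nr)
            if sse < best[2]:
                best = (j, (xs[i] + xs[i + 1]) / 2, sse)
    return best
```

For example, with y = (0, 0, 10, 10) ordered along one feature, the scan finds the threshold between the second and third points, where both halves are constant and the SSE drops to zero.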
k-NN LSH DTs

Greedy strategy, recursive binary splitting

Given R_1 and R_2, apply the same strategy to each of R_1 and R_2.

When to stop? One option: stop when the gain falls below a threshold.

Better: minimize cost/complexity. Grow a large tree T_0. Let T \subset T_0 be any subtree (obtained by pruning internal nodes, merging regions).

Let: k be a terminal (leaf) node with region R_k; |T| be the number of terminal nodes; n_k = |\{x^{(i)} \in D : x^{(i)} \in R_k\}| be the number of data points in region k; c_k = (1/n_k) \sum_{x^{(i)} \in R_k} y^{(i)} be the prediction in region k; and Q_k(T) = (1/n_k) \sum_{x^{(i)} \in R_k} (y^{(i)} - c_k)^2 be the error in region k.

The overall cost complexity is defined as

    C_\alpha(T) = \sum_{k=1}^{|T|} n_k Q_k(T) + \alpha |T|    (7.19)

Goal: for each \alpha, find the T_\alpha \subset T_0 that minimizes C_\alpha(T_\alpha). Weakest-link pruning (Breiman et al., 1984) solves this efficiently by successively collapsing the internal node with the smallest per-node increase in \sum_{k=1}^{|T|} n_k Q_k(T).

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F51/57
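Eq. (7.19) itself is a one-liner; a tiny sketch (function name and the leaf tuples are illustrative) makes the trade-off concrete: a deeper tree wins for small \alpha and loses for large \alpha, which is exactly the range the pruning sweep explores.

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_k n_k Q_k(T) + alpha * |T|  (Eq. 7.19).

    `leaves` is a list of (n_k, Q_k) pairs, one per terminal node of T."""
    return sum(n_k * q_k for n_k, q_k in leaves) + alpha * len(leaves)

# A deeper tree has more leaves but lower per-leaf error; a pruned one the reverse.
deep = [(3, 0.0), (2, 0.0), (5, 0.2)]   # 3 leaves, total error 1.0
pruned = [(5, 0.8), (5, 0.2)]           # 2 leaves, total error 5.0
```

With \alpha small the fit term dominates and the deep tree is preferred; with \alpha large the \alpha|T| penalty dominates and the pruned tree wins.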
k-NN LSH DTs

Classification Trees

At each leaf, predict a class rather than a real value. Produce a region-specific posterior distribution to make classification decisions for an x \in R^m. With n_k = count of training samples in region k, we have:

    p(y = j | x) = \frac{1}{n_{k(x)}} \sum_{x^{(i)} \in R_{k(x)}} 1\{y^{(i)} = j\} = p_{k(x)}(y = j)    (7.20)

where k(x) is the region containing x. Make decisions using y(x) = argmax_j p(y = j | x). So region k always decides class y_k.

With this we can use the same greedy strategy to grow the tree from the top down.

How to judge each region? Several ways:

    Classification error:    \frac{1}{n_k} \sum_{i \in R_k} 1\{y^{(i)} \neq y_k\}    (7.21)

    Entropy in region k:    -\sum_j p_k(y = j) \log p_k(y = j)    (7.22)

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F52/57
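The region posterior and the two impurity measures above can be sketched directly from their definitions (the function names here are illustrative, not from the slides):

```python
from collections import Counter
import math

def region_posterior(labels):
    """Empirical posterior p_k(y = j): the histogram of class labels in a region (Eq. 7.20)."""
    n = len(labels)
    return {j: c / n for j, c in Counter(labels).items()}

def classification_error(labels):
    # Eq. (7.21): fraction of the region not equal to its majority class y_k.
    return 1.0 - max(region_posterior(labels).values())

def region_entropy(labels):
    # Eq. (7.22): -sum_j p_k(j) log2 p_k(j), in bits.
    return -sum(p * math.log2(p) for p in region_posterior(labels).values())
```

A pure region has error 0 and entropy 0; a 50/50 binary region has error 0.5 and entropy 1 bit.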
k-NN LSH DTs

Entropy

Definition 7.5.1 (Entropy)
Given a discrete random variable X over a finite-sized alphabet, the entropy of the random variable is:

    H(X) \triangleq E[\log \frac{1}{p(X)}] = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)    (7.23)

1/p(x) is the surprise of x; -\log p(x) is the log surprise.

Entropy is typically in units of "bits" (logs base 2) but can also be in units of "nats" (logs base e). For optimization it doesn't matter.

Measures the degree of uncertainty in a distribution.

Measures the disorder or spread of a distribution.

Measures the "choice" that a source has in choosing symbols according to the density (higher entropy means more choice).

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F53/57
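Eq. (7.23) in a few lines, with the base as a parameter since only the units change (the example distributions are illustrative):

```python
import math

def entropy(p, base=2.0):
    """H(X) = -sum_x p(x) log p(x)  (Eq. 7.23); base 2 gives bits, base e gives nats.
    Terms with p(x) = 0 contribute 0, by the convention 0 log 0 = 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal for 4 symbols: log2(4) = 2 bits
peaked = [0.97, 0.01, 0.01, 0.01]   # much lower: the source has little "choice"
```

The uniform distribution over an alphabet maximizes entropy; a deterministic one (all mass on one symbol) has entropy 0.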
k-NN LSH DTs

Entropy Of Distributions

[Figure: three example distributions p(x) plotted over x, illustrating low entropy (sharply peaked), high entropy (near-uniform), and a case in between.]

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F54/57
k-NN LSH DTs

Binary Entropy

Binary alphabet, X \in {0, 1} say. p(X = 1) = p = 1 - p(X = 0).

    H(X) = -p \log p - (1 - p) \log(1 - p) = H(p)

As a function of p, we get:

[Figure: plot of H(p) versus p on [0, 1], rising from 0 at p = 0 to its maximum of 1 at p = 0.5 and falling back to 0 at p = 1.]

Note: greatest uncertainty (value 1) when p = 0.5, and least uncertainty (value 0) when p = 0 or p = 1.

Note also: concave in p.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F55/57
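The properties on this slide are easy to check numerically (the function name `H` mirrors the slide's notation; the endpoint convention H(0) = H(1) = 0 is the usual 0 log 0 = 0 limit):

```python
import math

def H(p):
    """Binary entropy H(p) = -p log2 p - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

Symmetry gives H(p) = H(1 - p), the maximum of 1 bit is at p = 0.5, and concavity means the chord between any two points lies below the curve.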
k-NN LSH DTs

Classification Trees and Entropy

We can measure entropy in each region.

Again, splitting variable j \in [m] with split point s \in R defines two half-planes:

    R_1(j, s) = {x : x_j <= s},    R_2(j, s) = {x : x_j > s}    (7.24)

A region's posterior is essentially a histogram of the class labels in the region.

Starting from one large region, we find the variable j and split s that, when made, reduce the entropy the most.

Maximizing the information gain, split region R_0 into R_1(j, s) and R_2(j, s):

    \max_{j,s} [ H(p_0(y)) - ( H(p_{R_1(j,s)}(y)) + H(p_{R_2(j,s)}(y)) ) ]    (7.25)

This leads to regions with lower entropy, higher certainty, and the least diversity in each region, creating homogeneous regions.

Term: CART, for classification and regression trees.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F56/57
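A sketch of one level of this search (function names and toy data are illustrative). Note it follows Eq. (7.25) literally, summing the child entropies unweighted; weighting each child by its sample count is the other common convention.

```python
import numpy as np

def entropy_of(labels):
    """Entropy (bits) of the empirical label histogram of a region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_info_gain_split(X, y):
    """Pick (j, s) maximizing the gain of Eq. (7.25):
    H(parent posterior) - (H(left posterior) + H(right posterior))."""
    n, m = X.shape
    h0 = entropy_of(y)
    best = (None, None, -np.inf)
    for j in range(m):
        for s in np.unique(X[:, j])[:-1]:  # candidate thresholds between distinct values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            gain = h0 - (entropy_of(left) + entropy_of(right))
            if gain > best[2]:
                best = (j, s, gain)
    return best
```

On perfectly separable data the best split makes both children pure, so the gain equals the parent entropy.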
k-NN LSH DTs

Trees, Expressivity, Bias, and Variance

Decision trees are extremely flexible.

Like k-NN methods, DT methods in general have high variance and low bias (they can flexibly fit any data set). With a tall tree, a DT can perfectly fit any training data with non-conflicting labels (i.e., no repeated x with different y). Like nearest neighbor, but the neighborhoods are rectangular (for the top-down greedy tree procedure).

A top-down binary-split tree can't achieve all region arrangements, e.g.:

[Figure: a 2-D arrangement of rectangular regions over X1 and X2 that no sequence of recursive axis-aligned binary splits can produce.]

DTs can be constructed using other procedures that can do this.

Another advantage of trees: they lead to interpretable machine learning models, since all decisions are based on the original inputs rather than on mysterious learned non-convex combinations thereof.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F57/57