Advanced Introduction to Machine Learning
— Spring Quarter, Week 7 —
https://canvas.uw.edu/courses/1372141
Prof. Jeff Bilmes
University of Washington, Seattle
Departments of: Electrical & Computer Engineering, Computer Science & Engineering
http://melodi.ee.washington.edu/~bilmes
May 11th/13th, 2020
Prof. Jeff Bilmes, EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020, F1/57 (pg.1/188)
Logistics Review
Announcements
HW3 is due May 15th, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).
Virtual office hours this week, Thursday night at 10:00pm via Zoom (same link as class).
Class Road Map
W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future
Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).
Class (and Machine Learning) overview
1. Introduction
• What is ML
• What is AI
• Why are we so interested in these topics right now?
2. ML Paradigms/Concepts
• Overfitting/Underfitting, model complexity, bias/variance
• size of data, big data, sample complexity
• ERM, loss + regularization, loss functions, regularizers
• supervised, unsupervised, and semi-supervised learning
• reinforcement learning, RL, multi-agent, planning/control
• transfer and multi-task learning
• federated and distributed learning
• active learning, machine teaching
• self-supervised, zero/one-shot, open-set learning
3. Dealing with Features
• dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP
• locality sensitive hashing (LSH)
• feature selection
• feature engineering
• matrix factorization & feature engineering
• representation learning
4. Evaluation
• accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin
• train/eval/test data splits
• n-fold cross validation
• method of the bootstrap
6. Inference Methods
• probabilistic inference
• MLE, MAP
• belief propagation
• forward/backpropagation
• Monte Carlo methods
7. Models & Representation
• linear least squares, linear regression, logistic regression, sparsity, ridge, lasso
• generative vs. discriminative models
• Naive Bayes
• k-nearest neighbors
• clustering, k-means, k-medoids, EM & GMMs, single linkage
• decision trees and random forests
• support vector machines, kernel methods, max margin
• perceptron, neural networks, DNNs
• Gaussian processes
• Bayesian nonparametric methods
• ensemble methods
• the bootstrap, bagging, and boosting
• graphical models
• time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers
• structured prediction
• grammars (as in NLP)
12. Other Techniques
• compressed sensing
• submodularity, diversity/homogeneity modeling
8. Philosophy, Humanity, Spirituality
• artificial intelligence (AI)
• artificial general intelligence (AGI)
• artificial intelligence vs. science fiction
9. Applications
• computational biology
• social networks
• computer vision
• speech recognition
• natural language processing
• information retrieval
• collaborative filtering/matrix factorization
10. Programming
• python
• libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.)
• HPC: C/C++, CUDA, vector processing
11. Background
• linear algebra
• multivariate calculus
• probability theory and statistics
• information theory
• mathematical (e.g., convex) optimization
The marginal p(x1, x2) via variable elimination, first eliminating in the order x6, x3, x4, x5:

\begin{align*}
p(x_1, x_2) &= \sum_{x_3} \sum_{x_4} \cdots \sum_{x_6} p(x_1, x_2, \dots, x_6)\\
&= \sum_{x_3} \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5) \underbrace{\sum_{x_6} \psi(x_2, x_6)}_{\phi_{\bar{6},2}(x_2)}\\
&= \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_3} \psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5)}_{\phi_{\bar{3},1,4,5}(x_1, x_4, x_5)}\\
&= \sum_{x_5} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_4} \phi_{\bar{3},1,4,5}(x_1, x_4, x_5)}_{\phi_{\bar{3},\bar{4},1,5}(x_1, x_5)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_5} \phi_{\bar{3},\bar{4},1,5}(x_1, x_5)}_{\phi_{\bar{3},\bar{4},\bar{5},1}(x_1)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\phi_{\bar{3},\bar{4},\bar{5},1}(x_1)
\end{align*}

Eliminating instead in the order x6, x5, x4, x3:

\begin{align*}
p(x_1, x_2) &= \sum_{x_3} \sum_{x_4} \cdots \sum_{x_6} p(x_1, x_2, \dots, x_6)\\
&= \sum_{x_3} \sum_{x_4} \sum_{x_5} \psi(x_1, x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4)\,\psi(x_3, x_5) \underbrace{\sum_{x_6} \psi(x_2, x_6)}_{\phi_{\bar{6},2}(x_2)}\\
&= \sum_{x_3} \sum_{x_4} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\psi(x_1, x_3)\,\psi(x_3, x_4) \underbrace{\sum_{x_5} \psi(x_3, x_5)}_{\phi_{\bar{5},3}(x_3)}\\
&= \sum_{x_3} \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\psi(x_1, x_3)\,\phi_{\bar{5},3}(x_3) \underbrace{\sum_{x_4} \psi(x_3, x_4)}_{\phi_{\bar{4},3}(x_3)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2) \underbrace{\sum_{x_3} \psi(x_1, x_3)\,\phi_{\bar{5},3}(x_3)\,\phi_{\bar{4},3}(x_3)}_{\phi_{\bar{5},\bar{4},\bar{3},1}(x_1)}\\
&= \psi(x_1, x_2)\,\phi_{\bar{6},2}(x_2)\,\phi_{\bar{5},\bar{4},\bar{3},1}(x_1)
\end{align*}
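The two elimination orders can be checked numerically. A minimal sketch, assuming arbitrary random pairwise potentials ψ on the tree edges (x1,x2), (x1,x3), (x3,x4), (x3,x5), (x2,x6), with r = 3 values per variable (all names and sizes here are illustrative): it compares the cheap elimination order x6, x5, x4, x3 against brute-force summation.

```python
import itertools
import random

random.seed(0)
r = 3  # each variable takes r values

# Pairwise potentials psi on the tree edges (x1,x2), (x1,x3), (x3,x4),
# (x3,x5), (x2,x6); variables are 0-indexed below (x1 -> 0, ..., x6 -> 5).
edges = [(0, 1), (0, 2), (2, 3), (2, 4), (1, 5)]
psi = {e: [[random.random() + 0.1 for _ in range(r)] for _ in range(r)]
       for e in edges}

def joint(x):
    # Unnormalized joint: the product of the edge potentials.
    p = 1.0
    for (i, j) in edges:
        p *= psi[(i, j)][x[i]][x[j]]
    return p

# Brute force: sum over x3..x6 for every value of (x1, x2): O(r^4) terms.
brute = [[sum(joint((a, b) + t) for t in itertools.product(range(r), repeat=4))
          for b in range(r)] for a in range(r)]

# Variable elimination in the order x6, x5, x4, x3: every step is O(r^2).
phi_62 = [sum(psi[(1, 5)][x2][x6] for x6 in range(r)) for x2 in range(r)]
phi_53 = [sum(psi[(2, 4)][x3][x5] for x5 in range(r)) for x3 in range(r)]
phi_43 = [sum(psi[(2, 3)][x3][x4] for x4 in range(r)) for x3 in range(r)]
phi_1 = [sum(psi[(0, 2)][x1][x3] * phi_53[x3] * phi_43[x3] for x3 in range(r))
         for x1 in range(r)]
elim = [[psi[(0, 1)][a][b] * phi_62[b] * phi_1[a] for b in range(r)]
        for a in range(r)]

# Both computations agree on the (unnormalized) marginal of (x1, x2).
assert all(abs(brute[a][b] - elim[a][b]) < 1e-9
           for a in range(r) for b in range(r))
```

The brute-force sum touches r^4 joint configurations per (x1, x2) pair, while each elimination step only ever sums an r × r table, which is the point of the complexity comparison.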
[Figure: for each variable to eliminate, the graphical transformation (reconstituted graph) and the corresponding marginalization operation, with its complexity. Eliminating x6, x3, x4, x5 costs O(r^2), O(r^4), O(r^3), O(r^2) per step; eliminating x6, x5, x4, x3 costs O(r^2) at every step.]
[Figure: a deep neural network with an input layer, hidden layers 1-7, and an output unit.]
5. Optimization Methods
• Unconstrained Continuous Optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton
• Constrained Continuous Optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming
• Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization
Using 2D to represent High-D as if 2D were High-D

Relationship between the unit-radius (r = 1) sphere and the unit-volume (side length = 1) cube as the dimension grows, drawn as if it were also true in 2D.
Note that for the cube, as m gets higher, the distance from center to face stays at 1/2, but the distance from center to vertex grows as √m/2.
For m = 4 the vertex distance is √m/2 = 1.
[Figure: the unit-radius sphere and the unit-volume cube, with side length 1, center-to-face distance 1/2, and center-to-vertex distances √2/2 (for m = 2) and √m/2 in general; for large m, nearly all the cube's volume lies near its vertices, outside the unit-radius sphere.]
Illustration of the relationship between the sphere and the cube in 2, 4, and m dimensions (from Blum, Hopcroft, & Kannan, 2016).
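The distances above are easy to verify, and a small Monte Carlo estimate (a sketch; the helper names and sample counts are illustrative) also shows how little of the unit-volume cube the unit-radius sphere captures once m is large.

```python
import math
import random

def vertex_distance(m):
    # Center-to-vertex distance of the unit-volume cube: sqrt(m * (1/2)^2) = sqrt(m)/2.
    return math.sqrt(m) / 2.0

def fraction_inside_sphere(m, n_samples=100_000, seed=0):
    # Monte Carlo estimate of the fraction of the unit-volume cube
    # [-1/2, 1/2]^m that lies inside the unit-radius sphere.
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        if sum(rng.uniform(-0.5, 0.5) ** 2 for _ in range(m)) <= 1.0:
            inside += 1
    return inside / n_samples

# Center-to-face distance is always 1/2; the vertex distance grows as sqrt(m)/2,
# so for m = 4 it is exactly 1.
assert vertex_distance(4) == 1.0
```

For m ≤ 4 the whole cube fits inside the unit-radius sphere, while for large m (say m = 30) the estimated fraction inside is close to 0: nearly all of the cube's volume lies near its vertices, outside the sphere.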
The Blessing of Dimensionality
More dimensions can help to distinguish categories, and make pattern recognition easier (and even possible). Recall the Voronoi tessellation.
Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!!!
For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.
Feature Selection vs. Dimensionality Reduction
We've already seen that feature selection is one strategy to reduce feature dimensionality.
In feature selection, each feature is either selected or not: all or nothing.
Other dimensionality reduction strategies take the input x ∈ ℝ^m and encode each x into e(x) ∈ ℝ^{m′}, a lower dimensional space, m′ < m.
Core advantage: if U = {1, 2, . . . , m}, there may be no subset A ⊆ U such that x_A will work well, but there might be a combination e(x) that works well. On the right, the simple linear combination a₁x₁ + a₂x₂ works quite well.
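This core advantage can be made concrete with a toy example (entirely synthetic data; the threshold-at-zero classifier is an illustrative simplification): neither coordinate alone separates the classes well, but the linear combination x₁ + x₂ separates them perfectly.

```python
import random

random.seed(0)

# Two classes that no single coordinate separates well, but x1 + x2 does.
data = []
for _ in range(200):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x1 + x2) < 0.2:
        continue  # leave a margin so the separation is clean
    data.append(((x1, x2), 1 if x1 + x2 > 0 else 0))

def accuracy(score):
    # Accuracy of the threshold-at-zero classifier on a 1-D encoding.
    return sum((score(x) > 0) == (y == 1) for x, y in data) / len(data)

acc_x1 = accuracy(lambda x: x[0])            # feature selection: keep x1 only
acc_combo = accuracy(lambda x: x[0] + x[1])  # linear combination e(x) = x1 + x2
```

Here acc_combo is exactly 1.0 by construction, while keeping only x₁ misclassifies roughly a quarter of the points: no singleton subset A works, but a one-dimensional linear encoding does.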
PCA: Minimizing reconstruction error, m′ < m

Let u_i, i ∈ [m], be a set of orthonormal vectors in ℝ^m.
Basis decomposition: any sample x^{(i)} can be written as

x^{(i)} = \sum_{j=1}^{m} \alpha_{i,j} u_j = \sum_{j=1}^{m} \langle x^{(i)}, u_j \rangle u_j.

Suppose we decide to use only m′ < m dimensions, so

\hat{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j.

Reconstruction error:

J_{m'} = \frac{1}{n} \sum_{i=1}^{n} \| x^{(i)} - \hat{x}^{(i)} \|_2^2 \qquad (7.1)

To minimize this, it can be shown that we should choose u_i to be the eigenvector of S corresponding to the i-th largest eigenvalue.
Let W be the matrix of column eigenvectors of S sorted decreasingly by eigenvalue, and W_{m′} = W(:, 1:m′) the m × m′ matrix of its first m′ columns.
Then XW_{m′} projects down to the first m′ principal components, and is also widely known as the Karhunen-Loève transform (KLT).
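A small numerical sketch of this claim (with NumPy, taking S to be the empirical covariance X^T X / n of centered data; the synthetic data and scales are made up): projecting onto the top m′ eigenvectors of S gives reconstruction error J_{m′} equal to the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data with most variance along a few directions.
n, m, m_prime = 500, 5, 2
X = rng.normal(size=(n, m)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])
X = X - X.mean(axis=0)

# Covariance S and its eigendecomposition, sorted by decreasing eigenvalue.
S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order]

# Keep the first m' columns: project down, then reconstruct.
W_mp = W[:, :m_prime]
X_hat = X @ W_mp @ W_mp.T

# J_{m'} from (7.1) equals the sum of the discarded eigenvalues of S.
J = np.mean(np.sum((X - X_hat) ** 2, axis=1))
assert np.isclose(J, eigvals[order][m_prime:].sum())
```

The identity holds because the residual lies in the span of the discarded eigenvectors, so its average squared norm is the trace of S restricted to that subspace.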
Example: 2D and two principal directions
PCA vs. LDA
From Bishop 2006
PCA vs. Linear Discriminant Analysis (LDA).
Axis parallel projection vs. LDA
From Hastie et al., 2009
Axis parallel (left) vs. LDA (right).
Linear Discriminant Analysis (LDA), 2 classes
Consider class-conditional Gaussian data, so p(x|y) = N(x | μ_y, C_y) for mean vectors {μ_y}_y and covariance matrices {C_y}_y, x ∈ ℝ^m.

p(x|y) = \frac{1}{(2\pi)^{m/2} |C_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^\top C_y^{-1} (x - \mu_y) \right) \qquad (7.2)

In the two-class case y ∈ {0, 1} with equal covariances C_0 = C_1 = C (the homoscedasticity property) and priors p(y = 0) = p(y = 1), consider the log posterior odds ratio:

\log \frac{p(y = 1|x)}{p(y = 0|x)} = -\frac{1}{2} (x - \mu_1)^\top C_1^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^\top C_0^{-1} (x - \mu_0) \qquad (7.3)
= (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \frac{1}{2}\mu_0^\top C^{-1}\mu_0 - \frac{1}{2}\mu_1^\top C^{-1}\mu_1 \qquad (7.4)
= \theta^\top x + c \qquad (7.5)

θ is a projection (an m × 1 matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.
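Equations (7.4) and (7.5) translate directly into code. A sketch on synthetic homoscedastic Gaussian data (the particular μ's and C below are made up for illustration): with equal priors, classify by the sign of θ^T x + c.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two class-conditional Gaussians sharing a covariance C (homoscedastic).
mu0 = np.zeros(3)
mu1 = np.array([2.0, 1.0, -1.0])
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])

n = 2000
X0 = rng.multivariate_normal(mu0, C, size=n)
X1 = rng.multivariate_normal(mu1, C, size=n)

# LDA projection and offset, from equations (7.4)-(7.5).
Cinv = np.linalg.inv(C)
theta = Cinv @ (mu1 - mu0)
c = 0.5 * mu0 @ Cinv @ mu0 - 0.5 * mu1 @ Cinv @ mu1

# With equal priors, classify by the sign of the log odds theta^T x + c.
correct0 = np.sum(X0 @ theta + c <= 0)
correct1 = np.sum(X1 @ theta + c > 0)
accuracy = (correct0 + correct1) / (2 * n)
```

The one-dimensional score X @ theta + c is all the classifier ever looks at, which is the "sufficient for prediction" claim in the slide.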
Linear Discriminant Analysis (LDA), ℓ classes

Within-class scatter matrix:

S_W = \sum_{c=1}^{\ell} S_c \quad \text{where} \quad S_c = \sum_{i:\, i \text{ is class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top \qquad (7.4)

and μ_c is the class-c mean, μ_c = (1/n_c) \sum_{i \text{ is of class } c} x^{(i)}.
Between-class scatter matrix, with n_c = number of samples of class c:

S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (7.5)

First projection: maximize the ratio of variances (recall PCA), a_1 ∈ argmax_{a ∈ ℝ^m} (a^⊤ S_B a)/(a^⊤ S_W a). Second orthogonal to the first, a_2 ∈ argmax_{a ∈ ℝ^m : ⟨a, a_1⟩ = 0} (a^⊤ S_B a)/(a^⊤ S_W a), and so on.
In general, the directions of greatest variance of the matrix S_W^{-1} S_B maximize the between-class spread relative to the within-class spread. I.e., we can do PCA on S_W^{-1} S_B to get the LDA reduction.
LDA dimensionality reduction: find an ℓ′ × m matrix linear transform on x, projecting to the subspace corresponding to the ℓ′ greatest eigenvalues of S_W^{-1} S_B.
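A sketch of the "PCA on S_W^{-1} S_B" recipe on synthetic three-class data (the class means, counts, and ℓ′ = 2 below are illustrative). Note that S_B has rank at most ℓ - 1, so at most ℓ - 1 = 2 eigenvalues are nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three classes in R^4 whose means differ only in the first two coordinates.
means = [np.array([0.0, 0.0, 0.0, 0.0]),
         np.array([3.0, 0.0, 0.0, 0.0]),
         np.array([0.0, 3.0, 0.0, 0.0])]
X = np.vstack([rng.normal(mu, 1.0, size=(100, 4)) for mu in means])
y = np.repeat([0, 1, 2], 100)

# Within- and between-class scatter matrices, equations (7.4) and (7.5).
mu_all = X.mean(axis=0)
S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for c in range(3):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    S_B += len(Xc) * np.outer(mu_c - mu_all, mu_c - mu_all)

# "PCA on S_W^{-1} S_B": top eigenvectors give the LDA directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
A = eigvecs[:, order[:2]].real  # l' = 2 projection matrix (m x l')
Z = X @ A                       # reduced (n x l') representation

# S_B has rank at most l - 1 = 2, so the remaining eigenvalues vanish.
assert eigvals.real[order][2] < 1e-6 * eigvals.real[order][0]
```

Since S_W^{-1} S_B is not symmetric, np.linalg.eig is used here rather than eigh; its eigenvalues are real and nonnegative for this pencil, up to floating-point noise.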
Uniformly at random unit vectors on the unit sphere

Let S^{m-1}(r) indicate the surface "area" of the m-dimensional sphere of radius r.
Define the uniform distribution U_{S^{m-1}(1)} on S^{m-1}(1) and independently draw two random vectors x, y ~ U_{S^{m-1}(1)}, hence ‖x‖₂ = ‖y‖₂ = 1.
Two vectors are orthogonal if ⟨x, y⟩ = 0 and are nearly so if |⟨x, y⟩| < ε for small ε.
It can be shown that if x, y are independent random vectors uniformly distributed on the m-dimensional sphere, then:

\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \qquad (7.11)

which means that uniformly at random vectors in high dimensional space are almost always nearly orthogonal!
One of the reasons why high dimensional random projections preserve information: if two random high dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
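Bound (7.11) is easy to check by simulation (a sketch; the helper names and trial counts are illustrative). A uniform point on the sphere is obtained by normalizing an i.i.d. Gaussian vector.

```python
import math
import random

def random_unit_vector(m, rng):
    # Normalizing an i.i.d. Gaussian vector gives a uniform point on S^{m-1}(1).
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def near_orthogonal_fraction(m, eps, trials=1000, seed=0):
    # Monte Carlo estimate of Pr(|<x, y>| < eps) for independent uniform x, y.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = random_unit_vector(m, rng)
        y = random_unit_vector(m, rng)
        if abs(sum(a * b for a, b in zip(x, y))) < eps:
            hits += 1
    return hits / trials

m, eps = 1000, 0.1
frac = near_orthogonal_fraction(m, eps)
bound = 1 - math.exp(-m * eps * eps / 2)  # right-hand side of (7.11)
```

For m = 1000 and ε = 0.1 the bound is 1 - e^{-5}, and the estimated fraction should exceed it up to Monte Carlo noise: almost every pair is nearly orthogonal.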
k-NN classifier, multiclass data
Let D = {(x^{(i)}, y^{(i)})}_{i=1}^{n} be a training set of n samples (as always).
Let N_k(x, D) ⊆ D be the subset of D of size k consisting of the k nearest neighbors of x. That is, |N_k(x, D)| = k, and for any (x^{(i)}, y^{(i)}) ∈ N_k(x, D) we have that d(x^{(i)}, x) is no more than the distance from x to the k-th nearest data point in D, under distance d(·, ·).
We can then estimate a posterior probability:

p_{kNN}(y = c \,|\, x) = \frac{1}{k} \sum_{(x^{(i)}, y^{(i)}) \in N_k(x, D)} 1\{y^{(i)} = c\} \qquad (7.15)

This is a valid probability: 0 ≤ p_{kNN}(y = c | x) ≤ 1 and \sum_{c=1}^{\ell} p_{kNN}(y = c | x) = 1.
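Equation (7.15) in code: a minimal sketch, assuming Euclidean distance and a tiny made-up two-class training set (the function name and data are illustrative).

```python
import math
from collections import Counter

def knn_posterior(x, D, k, num_classes):
    # D is a list of (x_i, y_i) pairs; Euclidean distance is an assumption here.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(D, key=lambda pair: dist(pair[0], x))[:k]
    counts = Counter(y for _, y in neighbors)
    # Equation (7.15): fraction of the k nearest neighbors falling in each class.
    return [counts.get(c, 0) / k for c in range(num_classes)]

# Tiny made-up training set with two classes.
D = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1),
     ((0.9, 1.1), 1), ((1.2, 0.8), 1)]
post = knn_posterior((1.0, 0.9), D, k=3, num_classes=2)
```

The returned list is a valid distribution over the classes: each entry lies in [0, 1] and the entries sum to 1, exactly as the slide notes.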
k-NN classifier, various k
From Hastie et al., 2009
[Figure: two-class training data ("o" points) with the decision regions of the k-NN classifier, with k = 15 (from Hastie et al., 2009).]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . 
. . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . ..................... ...................... . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . ....................... . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . ............. . . . . . . . . . . .. . . . . . . . . . .. . . . . . . . . . . ................. ................... .................... ................... .............. . . . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . . .. . . . . . . . .
oo
ooo
o
o
o
o
o
o
o
o
oo
o
o o
oo
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
oo
o
o
o
o
o
o
o
o
o
oo o
oo
oo
o
oo
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
ooo
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o oooo
o
ooo o
o
o
o
o
o
o
o
ooo
ooo
ooo
o
o
ooo
o
o
o
o
o
o
o
o o
o
o
o
o
o
o
oo
ooo
o
o
o
o
o
o
ooo
oo oo
o
o
o
o
o
o
o
o
o
o
k-NN classifier, with k = 1
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F16/57 (pg.16/188)
k-NN classifiers, e↵ective degrees of freedom
The hyperparameter is k (similar to \lambda in ridge or lasso). A useful measure is n/k, the effective degrees of freedom of a k-NN classifier: n/k is the effective number of neighborhoods within which we might fit a mean.

We can look at train/test error as a function of n/k and compare it to logistic regression.
[Figure: train and test error vs. effective degrees of freedom n/k, with Bayes error and linear-model baselines. From Hastie et al., 2009.]
k-NN LSH DTs
1D manifolds in 2D ambient space

One-dimensional spiral manifold in 2D space, along with PCA projection, 2-class case.

[Figure: spiral data in the (x1, x2) plane; x1 ranges over [-300, 100], x2 over [-200, 400].]
2D Manifold Examples in 3D Ambient Space

(A) Manifolds. (B) Samples from the manifolds. (C) Inherent flattened 2D manifolds.

From: Roweis & Saul, 2000
Manifold Learning in ML

Goal: find the lower-dimensional manifold on which the data lies, within the ambient high-dimensional space in which the data is coordinatized.

I.e., the high-dimensional data rests on or near an inherently low-dimensional manifold (if flattened) within the high-dimensional ambient (input) space.

Useful in machine learning for non-linear methods for dimensionality reduction.

Example algorithms: Isomap, Laplacian Eigenmaps, Kernel PCA, etc.

Rather than find the coordinates of the data items along the manifold, the k-NN approaches approximate distances on the manifold, i.e., geodesic distance as the distance between points along the k-NN graph-represented manifold.
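The geodesic-distance idea above can be sketched with a tiny example (not from the slides; a minimal sketch assuming Euclidean distance, with hypothetical helper names `knn_graph` and `geodesic`): build a k-NN graph over points sampled from a spiral, and take shortest-path lengths along the graph as approximate geodesic distances.

```python
import heapq
import math

def knn_graph(points, k):
    """Build a symmetric k-NN graph: index -> list of (neighbor index, distance)."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            adj[i].append((j, d))
            adj[j].append((i, d))  # symmetrize so the graph is undirected
    return adj

def geodesic(adj, src, dst):
    """Approximate geodesic distance: shortest path along the k-NN graph (Dijkstra)."""
    best = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf

# Points along a spiral: nearby in the ambient plane is not nearby on the manifold.
spiral = [(t * math.cos(t), t * math.sin(t)) for t in [0.5 + 0.1 * i for i in range(100)]]
amb = math.dist(spiral[0], spiral[-1])       # straight-line (ambient) distance
geo = geodesic(knn_graph(spiral, 3), 0, 99)  # distance along the manifold
print(geo > amb)  # the geodesic is much longer than the ambient shortcut
```

With k small enough that the graph never shortcuts across spiral arms, the shortest path tracks the curve, so the graph distance approximates arc length rather than chord length.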
k-NN, memory, parametric vs. non-parametric

k-NN classifiers do not require an explicit model (e.g., linear, non-linear) and are examples of non-parametric models, unlike the parametric models we've seen so far (but more like t-SNE).

k-NN classifiers need to memorize the entire training data set.

Parametric models include: linear models, logistic regression, most neural networks (including the deep variety), or anything involving a fixed and finite number of parameters regardless of the training data size.

Other non-parametric models: histogram density estimates, kernel density estimates, kernel/spline/wavelet-based regression, support vector machines, kernel machines.

Non-parametric methods are useful in that methods to learn them automatically adjust the model complexity to the natural complexity of the data.
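A minimal illustration of the memorization point (not from the slides; the class and names are hypothetical): "fitting" a k-NN classifier just stores the data, so its memory grows with n, and all real computation is deferred to query time.

```python
import math
from collections import Counter

class KNNClassifier:
    """Non-parametric: 'training' memorizes the data; all work happens at query time."""
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # memory grows with n, unlike a parametric model
        return self

    def predict(self, x):
        # Majority vote among the k nearest training points (Euclidean distance).
        nearest = sorted(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))[: self.k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
clf = KNNClassifier(k=3).fit(X, y)
print(clf.predict((0.2, 0.2)), clf.predict((5.5, 5.5)))  # a b
```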
Computational Considerations

k-NN classifiers require memory at least sufficient to store the n data items.

With no pre-computation, a naive implementation of a query takes O(n log n + nm) or O(n log k + nm) time (computing each of the n distances is O(m), and the distances are then fully sorted, or reduced to the top k). Also, we need n such queries, e.g., to build the full k-NN graph!

There are data structures (spatial access methods) for fast nearest-neighbor search, using methods analogous to things such as the R-tree.

[Figure: R-tree nearest-neighbor search around a query point Q, with bounding boxes B1-B13. From Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree".]

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.
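The top-k selection cost can be seen in a small sketch (not from the slides): computing all n distances costs O(nm), and a size-k heap (here via `heapq.nsmallest`) reduces the selection step to O(n log k) instead of a full O(n log n) sort.

```python
import heapq
import math
import random

random.seed(0)
m, n, k = 10, 10_000, 5
data = [[random.random() for _ in range(m)] for _ in range(n)]
query = [random.random() for _ in range(m)]

# O(nm) to compute all distances; heapq.nsmallest keeps a heap of size k,
# so selecting the k nearest costs O(n log k) rather than O(n log n).
dists = ((math.dist(query, x), i) for i, x in enumerate(data))
nearest = heapq.nsmallest(k, dists)

full_sort = sorted((math.dist(query, x), i) for i, x in enumerate(data))[:k]
print(nearest == full_sort)  # both give the same k nearest neighbors
```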
Recall: Bayes error
Recall next two slides from lecture 4.
0/1-loss and Bayes error

Let L be the 0/1 loss, i.e., L(y, y') = 1_{\{y \neq y'\}}.

Probability of error for a given x:

\Pr(h_\theta(x) \neq Y) = \int p(y|x) \, 1_{\{h_\theta(x) \neq y\}} \, dy = 1 - p(h_\theta(x)|x)   (7.7)

To minimize the probability of error, h_\theta(x) should ideally choose a class having maximum posterior probability for the current x.

The smallest probability of error is known as the Bayes error for x:

\text{Bayes error}(x) = \min_y (1 - p(y|x)) = 1 - \max_y p(y|x)   (7.8)

The Bayes classifier (or predictor) predicts using the highest-posterior class, assuming we have access to p(y|x). I.e.,

\text{BayesPredictor}(x) \triangleq \mathrm{argmax}_y \, p(y|x)   (7.9)

The Bayes predictor has the Bayes error as its error; this is the irreducible error rate.
Bayes Error, overall di�culty of a problem

For Bayes error, we often take the expected value w.r.t. x. This gives an overall indication of how difficult a classification problem is, since we can never do better than Bayes error.

Bayes error, various equivalent forms (Exercise: show the last equality):

\text{Bayes Error} = \min_h \int p(x, y) \, 1_{\{h(x) \neq y\}} \, dy \, dx   (7.13)
= \min_h \Pr(h(X) \neq Y)   (7.14)
= E_{p(x)}[\min(p(y|x), 1 - p(y|x))]   (7.15)
= \frac{1}{2} - \frac{1}{2} E_{p(x)}[\,|2 p(y|x) - 1|\,]   (7.16)

(In (7.15) and (7.16) the problem is binary, and p(y|x) denotes the posterior of one fixed class, e.g., p(y = 1|x).)

For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., x tells us nothing about y).

Bayes error is a property of the distribution p(x, y); it can be useful to decide between, say, p(x, y) vs. p(x', y) for two di↵erent feature sets x and x'.
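A quick numeric check of the equivalent forms (not from the slides; the distribution is a made-up binary toy problem): forms (7.15) and (7.16) give the same Bayes error.

```python
# Toy binary problem: X takes 3 values; p1[x] = p(y = 1 | x).
px = {0: 0.5, 1: 0.3, 2: 0.2}
p1 = {0: 0.9, 1: 0.5, 2: 0.1}

# Form (7.15): E_x[ min(p(y|x), 1 - p(y|x)) ]
bayes_15 = sum(px[x] * min(p1[x], 1 - p1[x]) for x in px)

# Form (7.16): 1/2 - 1/2 E_x[ |2 p(y|x) - 1| ]
bayes_16 = 0.5 - 0.5 * sum(px[x] * abs(2 * p1[x] - 1) for x in px)

print(round(bayes_15, 6), round(bayes_16, 6))  # identical; here 0.22
```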
Aside: modes of convergence

X_n \to X almost surely (X_n \xrightarrow{a.s.} X) if the set

\{\omega \in \Omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}   (7.1)

is an event with probability 1. So all such events combined together have probability 1, and any event left out has probability zero, according to the underlying probability measure.

X_n \to X in the rth mean (r \geq 1) (written X_n \xrightarrow{r} X) if

E|X_n^r| < \infty \;\forall n, \text{ and } E(|X_n - X|^r) \to 0 \text{ as } n \to \infty   (7.2)

X_n \to X in probability (written X_n \xrightarrow{p} X) if

p(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty, \text{ for all } \epsilon > 0   (7.3)

X_n \to X in distribution (written X_n \xrightarrow{D} X) if

p(X_n \leq x) \to P(X \leq x) \text{ as } n \to \infty   (7.4)

for all points x at which F_X(x) = p(X \leq x) is continuous.

a.s. \Rightarrow p, \; r \Rightarrow p, \; p \Rightarrow D, and if r > s \geq 1, then r \Rightarrow s.
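Convergence in probability can be illustrated with a small simulation (not from the slides; a minimal sketch): taking X_n to be the mean of n fair coin flips, the Monte Carlo estimate of p(|X_n - 1/2| > \epsilon) shrinks as n grows.

```python
import random

random.seed(1)

def prob_deviation(n, eps=0.1, trials=2000):
    """Monte Carlo estimate of p(|X_n - 1/2| > eps), X_n = mean of n fair coin flips."""
    hits = 0
    for _ in range(trials):
        xn = sum(random.random() < 0.5 for _ in range(n)) / n
        hits += abs(xn - 0.5) > eps
    return hits / trials

small, large = prob_deviation(10), prob_deviation(500)
print(small, large)  # the deviation probability shrinks toward 0 as n grows
```

This is the weak law of large numbers in action: X_n \xrightarrow{p} 1/2.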
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).
Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.45/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.
Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.46/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.
✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.47/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.
h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.48/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).
Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.49/188)
k-NN LSH DTs
On Bayes error and ConsistencyLet h be a classifier, its error (or risk) is R(h) = Pr(h(X) 6= Y ).Define R⇤ = minh Pr(h(X) 6= Y ) to be Bayes error.Dn is a training set with n samples, randomly drawn from p(x, y),hence Dn is a random variable.✓(Dn) is learning process, mapping from data to learnt model.h✓(Dn) is the learnt model from training data Dn also a randomvariable with a conditional probability of error Pr(h✓(Dn)(X) 6= Y |Dn).Define R(h✓(Dn)|Dn) = Pr(h✓(Dn)(X) 6= Y |Dn).
We consider how EDn
⇥R(h✓(Dn)|Dn)
⇤improves with n.
Definition 7.3.1 (Consistency)
A classification process is consistent for distribution p(x, y) if
EDn [R(h✓(Dn)|Dn)]p�! R⇤ as n ! 1.
Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for alldistributions p(x, y).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F27/57 (pg.50/188)
k-NN LSH DTs

How far away are we from Bayes error

Why are consistency and universal consistency desirable properties?
As we get more data, we can't get worse, and we will eventually approach the best we can ever do for a given distribution $p(x,y)$. Universality means that we can expect this to be true regardless of the distribution.
Ideally, we'd like to show that there is a family of classification methods that are universally consistent.
This wasn't known until relatively recently (in the "Stone age"): in 1977, Chuck Stone proved a famous theorem, that the k-NN classifier is universally consistent!
Let $h_{k\text{-NN},D_n}(x)$ be the decision of the k-NN binary classifier (majority vote over neighbors) on data set $D_n$ drawn randomly from $p(x,y)$. Stone showed that:

Theorem 7.3.3 (Stone, 1977)
Let $k \to \infty$ with increasing $n$ such that $k/n \to 0$ as $n \to \infty$. Then for all probability distributions $p(x,y)$, $\mathbb{E}_{D_n}[R(h_{k\text{-NN},D_n})] \to R^*$ as $n \to \infty$.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F28/57
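A minimal simulation (not from the slides) of Stone's conditions: the schedule $k(n) = \lceil\sqrt{n}\rceil$ satisfies both $k \to \infty$ and $k/n \to 0$, and on an assumed 1-D two-Gaussian mixture ($X \mid Y{=}y \sim N(\pm 1, 1)$, equal priors, Bayes error $\approx 0.1587$) the resulting k-NN test error should land near the Bayes error even at modest $n$.

```python
import math
import random

def knn_schedule(n):
    """A k(n) satisfying Stone's theorem: k -> infinity while k/n -> 0."""
    return max(1, math.ceil(math.sqrt(n)))

def sample(n, rng):
    """Draw n labeled points from the mixture N(-1,1) / N(+1,1), equal priors."""
    return [(rng.gauss(1.0 if y else -1.0, 1.0), y)
            for y in (rng.randint(0, 1) for _ in range(n))]

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (1-D distance)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(2 * sum(y for _, y in nearest) >= k)

rng = random.Random(0)
train, test = sample(600, rng), sample(600, rng)
k = knn_schedule(len(train))  # k = 25 for n = 600
test_error = sum(knn_predict(train, x, k) != y for x, y in test) / len(test)
```

The naive sort-per-query here is $O(n \log n)$ per prediction; it is only meant to make the $k(n)$ schedule concrete, not to be an efficient nearest-neighbor search.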
k-NN LSH DTs

k-NN and Bayes error

3-class example for various k, compared to the Bayes decision boundaries (purple dashed lines). From Hastie et al., 2009.

[Figure: two scatter plots of the same three-class data, one panel titled "15-Nearest Neighbors" and one titled "1-Nearest Neighbor", each showing the learned decision regions against the Bayes boundaries.]

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F29/57
k-NN LSH DTs

What about 1-NN?

Even the 1-NN classifier is good in a certain way. Let $h_{1\text{-NN},D_n}(x)$ be the nearest-neighbor classifier based on data set $D_n$. We have:

Theorem 7.3.4 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x,y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}[R(h_{1\text{-NN},D_n})] = \mathbb{E}[2\eta(X)(1 - \eta(X))] \qquad (7.5)$$
where $\eta(x) = \mathbb{E}[Y \mid X = x] = p(y = 1 \mid x)$.

As a consequence, we have

Theorem 7.3.5 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x,y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}[R(h_{1\text{-NN},D_n})] \leq 2R^* \qquad (7.6)$$

Thus, the simple 1-NN rule is, asymptotically, no worse than twice the Bayes error. If the Bayes error is small, so is the 1-NN error (eventually, and on average!).

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F30/57
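The Cover-Hart limit $\mathbb{E}[2\eta(X)(1-\eta(X))]$ in (7.5) can be computed numerically for an assumed example distribution (not from the slides): the 1-D mixture $X \mid Y{=}0 \sim N(-1,1)$, $X \mid Y{=}1 \sim N(+1,1)$ with equal priors, where $R^* = \Phi(-1)$. The sketch below checks that this asymptotic 1-NN risk indeed sits between $R^*$ and $2R^*$, as (7.6) promises.

```python
import math

def normal_pdf(x, mu):
    """Unit-variance Gaussian density with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def eta(x):
    """eta(x) = P(Y=1 | X=x) for the mixture N(-1,1)/N(+1,1), equal priors."""
    f0, f1 = normal_pdf(x, -1.0), normal_pdf(x, 1.0)
    return f1 / (f0 + f1)

def asymptotic_1nn_risk(lo=-8.0, hi=8.0, steps=4000):
    """E[2 eta(X)(1 - eta(X))]: the Cover-Hart limit of the 1-NN risk,
    integrated against the marginal density m(x) by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        m = 0.5 * normal_pdf(x, -1.0) + 0.5 * normal_pdf(x, 1.0)
        total += 2.0 * eta(x) * (1.0 - eta(x)) * m * h
    return total

R_star = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))  # Phi(-1), ~0.1587
R_1nn = asymptotic_1nn_risk()
```

Pointwise, $\min(\eta, 1-\eta) \le 2\eta(1-\eta) \le 2\min(\eta, 1-\eta)$, which is exactly why the integral lands in the interval $[R^*, 2R^*]$.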
k-NN LSH DTs
What about 1-NN?Even the 1-NN classifier is good in a certain way. Let h1-NN,Dn(x) bethe nearest neighbor, classifier, based on data set Dn.
We have:
Theorem 7.3.4 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution p(x, y), we have
limn!1
EDn [R(h1-NN,Dn)] = E[2⌘(X)(1 � ⌘(X)] (7.5)
where ⌘(x) = E[Y |X = x] = p(y|x).
As a consequence, we have
Theorem 7.3.5 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution p(x, y), we have
limn!1
EDn [R(h1-NN,Dn)] 2R⇤ (7.6)
Thus, simple 1-NN rule is, asymptotically, no worse than twice Bayeserror. If Bayes is small, so is 1-NN (eventually, and on average!).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F30/57 (pg.64/188)
Aspects of k-NN

Universal consistency: k-NN converges on average to the Bayes rule if k → ∞ and k/n → 0 as n → ∞.
How does it behave for finite k? k is a smoothing parameter (analogous to the λ tradeoff parameter), as we have seen. How to choose k? Via cross-validation.
Can we weight the nearest neighbors? Why should they always count equally? Perhaps weight by distance (suggesting kernel estimators)?
What distance metric? Euclidean, or more generally the Lp distance d(x, x′) = ‖x − x′‖_p.
Memory: what if we reduce the training data size and come up with exemplars? Clustering, EM, unsupervised learning.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F31/57
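The "choose k via cross-validation" point can be sketched as follows; the dataset, fold count, and candidate k values here are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-class data in 2-D: class y is shifted by 1.5 in each coordinate.
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(0, 1, size=(n, 2)) + 1.5 * y[:, None]

def knn_predict(Xtr, ytr, Xte, k):
    """Brute-force k-NN: majority vote among the k nearest training points."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)   # squared distances
    nn = np.argpartition(d2, k, axis=1)[:, :k]                # k nearest indices
    return (ytr[nn].mean(axis=1) > 0.5).astype(int)

def cv_error(X, y, k, folds=5):
    """Average validation error of k-NN over `folds` random splits."""
    idx = rng.permutation(len(X))
    errs = []
    for f in np.array_split(idx, folds):
        mask = np.ones(len(X), bool)
        mask[f] = False
        pred = knn_predict(X[mask], y[mask], X[f], k)
        errs.append(np.mean(pred != y[f]))
    return float(np.mean(errs))

# Odd k values avoid voting ties in the binary case.
scores = {k: cv_error(X, y, k) for k in [1, 3, 5, 11, 21, 51]}
best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Small k gives a jagged (low-bias, high-variance) decision boundary; large k smooths it, exactly the smoothing-parameter tradeoff named above.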
Computational Considerations

k-NN classifiers require memory at least sufficient to store the n data items. With no pre-computation, a naive implementation of a query takes O((n log n) + nm) or O((n log k) + nm) time (O(n) distances, fully sorted or reduced to the top k, with each distance costing O(m)). Also, we need n queries!

There are data structures for spatial access methods for doing fast nearest neighbor search, using methods analogous to things such as the R-tree.

[Figure: a query point Q surrounded by R-tree bounding boxes B1-B13 within a root region R. From Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree".]

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F32/57
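The O((n log k) + nm) query cost above can be realized with a bounded max-heap that keeps only the k best candidates seen so far; a minimal sketch (function name and data are made up for illustration):

```python
import heapq
import numpy as np

def knn_query(points, q, k):
    """Scan once, keeping the k closest candidates in a max-heap keyed by
    negated distance: O(n log k) heap work plus O(nm) distance computation."""
    heap = []                                  # entries: (-distance, index)
    for i, x in enumerate(points):
        d = float(np.linalg.norm(x - q))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:                  # closer than current worst kept
            heapq.heapreplace(heap, (-d, i))
    return [i for _, i in sorted(heap, reverse=True)]   # nearest first

rng = np.random.default_rng(2)
P = rng.normal(size=(1000, 8))
q = np.zeros(8)
top5 = knn_query(P, q, 5)

# Sanity check against a full O(n log n) sort of all distances.
brute = np.argsort(np.linalg.norm(P - q, axis=1))[:5].tolist()
assert top5 == brute
```

The heap only pays log k per update rather than log n, which is the source of the O(n log k) term.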
Nearest Neighbor Search and Computation

Nearest Neighbor Search (NNS) Problem: given a set of n points {x^(i)}_{i=1}^n with x^(i) ∈ R^m, finding the point nearest to a query q ∈ R^m is argmin_i d(x^(i), q).
A naive query costs O(nm) in both time and memory. Finding all points' nearest neighbors (1-NN) is O(n²m) naively.
Recall the m-dimensional spherical Gaussian with zero mean and variance σ²:
p(x) = (2πσ²)^{−m/2} exp(−‖x‖²₂ / (2σ²))   (7.7)
As m gets large, this Gaussian (as we saw in HW3) starts to behave very differently.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F33/57
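The naive costs can be sketched directly (sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 16
X = rng.normal(size=(n, m))
q = rng.normal(size=m)

# One query: compute all n distances (O(nm)), then take the argmin (O(n)).
nearest = int(np.argmin(np.linalg.norm(X - q, axis=1)))

# All-pairs 1-NN: O(n^2 m) naively. Mask the diagonal so that no point
# is returned as its own nearest neighbor.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)
nn = D.argmin(axis=1)
print("query NN:", nearest, " first few 1-NNs:", nn[:5])
```

The (n, n, m) broadcast makes the O(n²m) cost visible as the literal shape of the intermediate array.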
Gaussian Annulus Theorem: HW3 Redux

HW3: when σ² = 1, there is very little probability mass over the unit r = 1 ball since there is so little volume there; only when r = √m is the mass over the ball significant. Beyond r = √m the mass does not increase much (as fast as the ball volume continues to increase, the density drops at a higher rate). I.e., HW3 proves:

Theorem 7.4.1 (Gaussian Annulus Theorem)
For any m-dimensional spherical Gaussian with common unit variance, for any β ≤ √m, all but at most 3e^{−cβ²} of the probability mass lies within the annulus √m − β ≤ ‖x‖₂ ≤ √m + β, where c > 0 is a constant.

The theorem says that Pr(|‖X‖₂ − √m| ≤ β) > 1 − 3e^{−cβ²}.
Choose any 0 < ε = 3e^{−cβ²} to be the fraction to ignore and set β to achieve that fraction (easy thanks to exp(·)). As m gets big, the β range becomes negligible since |√m + β − (√m − β)| / √m → 0 as m → ∞.
E[‖X‖²₂] = Σ_{i=1}^m E[X_i²] = m E[X_i²] = m, so the mean squared distance of a point to the center is m. As m increases, things concentrate at this distance. √m is sometimes called the radius of the Gaussian.
For details, see the text by Blum, Hopcroft, and Kannan, 2018.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F34/57
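The concentration in Theorem 7.4.1 is easy to see empirically; a quick sketch (sample counts and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Norms of unit-variance spherical Gaussians concentrate near sqrt(m):
# the mean radius grows like sqrt(m) while the spread stays O(1).
for m in [10, 100, 1000, 10000]:
    X = rng.normal(size=(2000, m))
    r = np.linalg.norm(X, axis=1)
    print(f"m={m:6d}  mean(r)/sqrt(m)={r.mean()/np.sqrt(m):.4f}  std(r)={r.std():.3f}")
    assert abs((r ** 2).mean() / m - 1) < 0.1   # E[‖X‖²] = m
    assert r.std() < 1.0                        # annulus width stays O(1)
```

As m grows, mean(r)/√m approaches 1 while std(r) stays near a constant, so the relative width of the annulus shrinks like 1/√m, matching the theorem.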
Random Projections to speed NN Search (NNS)

In a lower dimensional space m′ ≪ m, NNS is much faster. Possible solution: let f : R^m → R^{m′} be a projection from m to m′ ≪ m dimensions. Are x and y near if f(x) is near f(y)? Consider x = (100, 0, 0, . . . , 0) ∈ R^m and y = (0, 0, . . . , 0) ∈ R^m. Ex-a: f_a(x) = x₁ ∈ R? Ex-b: f_b(x) = x₂ ∈ R? (f_a preserves their separation, while f_b maps both to 0, so the choice of projection matters.)
Approach: pick m′ vectors u^(1), u^(2), . . . , u^(m′) ∈ R^m drawn independently at random from the unit-variance spherical Gaussian p(x). Note, the u^(i)s themselves are neither unit length (mean radius √m) nor necessarily orthogonal. Define f : R^m → R^{m′} as
f(x) = (⟨u^(1), x⟩, ⟨u^(2), x⟩, . . . , ⟨u^(m′), x⟩) ∈ R^{m′}   (7.8)
the vector of inner (dot) products.
We claim that for any x, ‖f(x)‖₂ ≈ √m′ ‖x‖₂ with high probability. Note, the u^(i)s are not unit length, so norms in the lower dimensional space are larger. To the extent this is true, distances can be approximated in low dimension:
‖x − y‖₂ = (√m′/√m′) ‖x − y‖₂ ≈ (1/√m′) ‖f(x − y)‖₂ = (1/√m′) ‖f(x) − f(y)‖₂   (7.9)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F35/57
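The claim ‖f(x)‖₂ ≈ √m′ ‖x‖₂ can be checked numerically; a sketch (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
m, m_prime = 10000, 200

# Rows of U are the random Gaussian vectors u^(i); f(x) = U x as in (7.8).
U = rng.normal(size=(m_prime, m))

x = rng.normal(size=m)
y = rng.normal(size=m)

# Norms scale by roughly sqrt(m'):
ratio = np.linalg.norm(U @ x) / (np.sqrt(m_prime) * np.linalg.norm(x))
print(f"norm ratio ‖f(x)‖ / (√m′ ‖x‖) = {ratio:.3f}")   # close to 1

# So distances are preserved after dividing by sqrt(m'), as in (7.9):
approx = np.linalg.norm(U @ x - U @ y) / np.sqrt(m_prime)
exact = np.linalg.norm(x - y)
print(f"distance: exact {exact:.2f}  projected estimate {approx:.2f}")
```

Conditioned on x, the entries of f(x)/‖x‖₂ are m′ i.i.d. standard normals, so ‖f(x)‖₂/‖x‖₂ is exactly the setting of the Gaussian Annulus Theorem in dimension m′.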
Random Projection Theorem

Specifically, it can be shown that:

Theorem 7.4.2 (Random Projection Theorem)
Let x ∈ R^m and f as given. Then for ε ∈ (0, 1),
Pr(|‖f(x)‖₂ − √m′ ‖x‖₂| ≥ ε √m′ ‖x‖₂) ≤ 3e^{−cm′ε²}   (7.10)
where the probability is taken over random draws of the u vectors.

This follows from the Gaussian Annulus Theorem. So the probability of the norm of the projection deviating from √m′ ‖x‖₂ is exponentially small in m′.
Given n samples x^(i), the probability of any of the O(n²) distances deviating by such an amount is also small (say δ) by the union bound (Pr(∪_i A_i) ≤ Σ_i Pr(A_i)), as long as m′ is not too small, i.e., m′ ≥ 3 log(n)/(cε²), giving δ = 3/n.
This means we can accurately calculate approximate distances in the m′-dimensional space after this random projection.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F36/57
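A sketch checking that all O(n²) pairwise distances survive a random projection. The constant c in the theorem is unspecified, so the target dimension m′ below is simply chosen by hand for illustration rather than computed from the bound m′ ≥ 3 log(n)/(cε²).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, m, eps = 40, 2000, 0.2
m_prime = 300                     # hand-picked; the theorem's c is unspecified

X = rng.normal(size=(n, m))
U = rng.normal(size=(m_prime, m))
F = X @ U.T / np.sqrt(m_prime)    # scaled projection of every point

# Worst relative distortion over all n(n-1)/2 pairwise distances.
worst = max(
    abs(np.linalg.norm(F[i] - F[j]) / np.linalg.norm(X[i] - X[j]) - 1)
    for i, j in combinations(range(n), 2)
)
print(f"worst relative distortion over {n * (n - 1) // 2} pairs: {worst:.3f}")
```

One draw of U handles every pair simultaneously; that simultaneity is exactly what the union-bound step above guarantees.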
k-NN LSH DTs
Random Projection TheoremSpecifically, it can be shown that:
Theorem 7.4.2 (Random Projection Theorem)
Let x 2 Rm and f as given. Then for ✏ 2 (0, 1),
Pr⇣���kf(x)k2 �
pm0kxk2
��� � ✏p
m0kxk2⌘
3e�cm0✏2 (7.10)
where probability is taken over random draws of u vectors.
Follows from the Gaussian Annulus Theorem.So probability of norm of projection deviating from
pm0kxk2 is
exponentially small in m0.Given n samples x(i), the probability of any of the O(n2) distancesdeviating by such an amount is also small (say �) by the unionbound (Pr([iAi)
Pi Pr(Ai)) as long as if m0 is not too small, i.e.,
m0 � 3 log(n)/(c✏2), giving � = 3/n.This means we can accurately calculate approximate distances inm0-dimensional space after this random projection.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F36/57 (pg.100/188)
k-NN LSH DTs

Johnson-Lindenstrauss Lemma

Theorem 7.4.3 (Johnson-Lindenstrauss Lemma)

For any $\epsilon \in (0, 1)$ and any $n$, let $m' \ge \frac{3}{c\epsilon^2} \log n$. Let $X$ be a set of $n$ points in $\mathbb{R}^m$. The random projection $f : \mathbb{R}^m \to \mathbb{R}^{m'}$ has the property that, for all $x^{(i)}, x^{(j)}$, with probability at least $1 - 3/(2n)$ we have:
$$(1 - \epsilon)\sqrt{m'}\,\|x^{(i)} - x^{(j)}\| \le \|f(x^{(i)}) - f(x^{(j)})\| \le (1 + \epsilon)\sqrt{m'}\,\|x^{(i)} - x^{(j)}\| \qquad (7.11)$$

I.e., the probability of any of the $\binom{n}{2} < n^2/2$ pairs of points so deviating is at most $3/(2n)$, so this can be used to do fast NNS.

$m'$ grows only logarithmically with $n$. In high dimensions $m' \ll m$ even for small $\epsilon$.

Example: $n = 10^{10}$; in the proof of the theorem $c = 96$ works; with $\epsilon = 0.05$, $m' \ge 3/(96 \cdot 0.05^2)\,\log(10^{10}) \approx 288$. Speedup: $10^{10}/288 \approx 34$ million!

Other examples: $(n = 10000, m' = 115)$, $(n = 1000, m' = 86)$, $(n = 100, m' = 57)$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F37/57 (pg.105/188)
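The lemma's pairwise-distance guarantee can be checked empirically. A sketch, again under the Gaussian-projection assumption, with illustrative values of $n$, $m$, $m'$ (not those of the example above):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, m, m_prime = 50, 5_000, 400

X = rng.standard_normal((n, m))
U = rng.standard_normal((m_prime, m))
FX = X @ U.T                       # f(x) for every point, shape (n, m')

# Worst relative distortion of any pairwise distance vs. sqrt(m') * original.
worst = 0.0
for i, j in combinations(range(n), 2):
    d_orig = np.linalg.norm(X[i] - X[j])
    d_proj = np.linalg.norm(FX[i] - FX[j])
    worst = max(worst, abs(d_proj / (np.sqrt(m_prime) * d_orig) - 1.0))
print(f"worst relative distortion over all pairs: {worst:.3f}")
```

Even over all $\binom{50}{2}$ pairs, the worst distortion stays small, as the union-bound argument predicts.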
k-NN LSH DTs

Approximate Nearest Neighbor Search: LSH

Recall the nearest neighbor search (NNS) problem: given a set of $n$ points $\{x^{(i)}\}_{i=1}^{n}$, $x^{(i)} \in \mathbb{R}^m$, finding the nearest to a query $q \in \mathbb{R}^m$ is $\mathrm{argmin}_i\, d(x^{(i)}, q)$.

Rather than trying to find $\mathrm{argmin}_i\, d(x^{(i)}, q)$ exactly, we find some $x^{(i)}$ such that $d(q, x^{(i)}) \le c \min_j d(q, x^{(j)})$ for some $c \ge 1$, with some (high) probability.

Locality Sensitive Hashing (LSH) helps us achieve this.

Normally, cryptographic hash functions (e.g., SHA-1, SHA-256, etc., computed with Linux's shasum) ensure that even small differences in input map to different buckets, making collisions unlikely.

In LSH, by contrast, small differences are likely to map to the same bucket: collisions are likely for similar inputs.

LSH is similar to clustering (which we will discuss in a future lecture).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F38/57 (pg.110/188)
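The $c$-approximate criterion above can be stated directly in code. A sketch: `approx_ok` (a hypothetical helper) only checks the condition against a brute-force scan; it is not itself an LSH method.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 32))   # data points x^(i)
q = rng.standard_normal(32)           # query

dists = np.linalg.norm(X - q, axis=1)
exact_best = dists.min()

def approx_ok(candidate_idx: int, c: float = 2.0) -> bool:
    """True if X[candidate_idx] is a c-approximate nearest neighbor of q,
    i.e., d(q, x) <= c * min_j d(q, x^(j))."""
    return dists[candidate_idx] <= c * exact_best

print(approx_ok(int(np.argmin(dists))))   # the exact NN always qualifies
```

An LSH index only needs to return *some* point passing this check with high probability, which is what buys the speedup over the exact $O(n)$ scan.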
k-NN LSH DTs

LSH definition

Given two probabilities $p_1, p_2$ with $p_1 \ge p_2$, a constant $c > 1$, and a distance metric $d : X \times X \to \mathbb{R}$, where $X$ is the domain of the input (e.g., $\mathbb{R}^m$). A family of hash functions is defined relative to the probabilities $p_1, p_2$, the constant $c$, a notion of closeness (radius $r$), and the distance metric $d$.

Definition 7.4.4 (LSH)

A family $H$ of hash functions is said to be $(r, c, p_1, p_2)$-LSH with $p_1 \ge p_2$, $r > 0$, and $c > 1$ if (when $h$ is drawn uniformly at random from $H$):

(a) $\Pr[h(x) = h(y)] \ge p_1$ when $d(x, y) \le r$ (close points have a high probability of mapping to the same bucket).

(b) $\Pr[h(x) = h(y)] \le p_2$ when $d(x, y) \ge cr$ (distant points have a low probability of mapping to the same bucket).

When things are close ($d(x, y) \le r$), we have a high probability ($\ge p_1$) that the two vectors $x, y$ hash to the same place ($h(x) = h(y)$). When things are far ($d(x, y) \ge cr$), we have a low probability ($\le p_2$) that they hash to the same place.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F39/57 (pg.116/188)
LSH allows a tradeoff between computation and statistical accuracy.
k-NN LSH DTs

LSH instances

We can define LSH hash functions for a variety of problems, including for the Hamming metric: $X = \{0, 1\}^m$, $d(x, y) = \sum_{i=1}^{m} 1\{x_i \ne y_i\}$. A very simple example of an LSH function for Hamming is $h(x) = x[a]$ for some $a \in \{0, 1, \ldots, m-1\}$, with $a$ drawn uniformly at random. We study this in HW4.

There also exist LSH functions for: (1) Jaccard similarity, $X = 2^V$, $d(A, B) = \frac{|A \cap B|}{|A \cup B|}$; and (2) Euclidean distance, $X = \mathbb{R}^m$, $d(x, y) = \|x - y\|_2$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F40/57 (pg.123/188)
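For this bit-sampling family, the collision probability is exactly $\Pr[h(x) = h(y)] = 1 - d_H(x, y)/m$, which a small sketch can verify by averaging over all $m$ choices of $a$:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 64
x = rng.integers(0, 2, m)
y = x.copy()
y[:8] ^= 1                       # flip 8 bits: Hamming distance 8

# h_a(x) = x[a]; collision probability over a uniform choice of a.
collide = np.mean([x[a] == y[a] for a in range(m)])
print(collide, 1 - 8 / m)        # both 0.875
```

So with $d(x, y) \le r$ we get $p_1 \ge 1 - r/m$, and with $d(x, y) \ge cr$ we get $p_2 \le 1 - cr/m$: a valid $(r, c, p_1, p_2)$-LSH family.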
k-NN LSH DTs

LSH

In general, $h$ is drawn uniformly at random from $H$ in order to get the above probabilities (i.e., $\Pr[h(x) = h(y)] \ge p_1$ when $d(x, y) \le r$ and $\Pr[h(x) = h(y)] \le p_2$ when $d(x, y) \ge cr$).

While $p_1 \ge p_2$, sometimes $p_1$ is too small (and thus $\Pr[h(x) = h(y)]$ is too small, meaning we never get a hit).

To obtain a more flexible and wider range of values for $p_1, p_2$, we construct compound LSHs by using a set of LSHs multiple times, drawn randomly, i.e., $g(x) = (h_1(x), h_2(x), \ldots, h_k(x))$ where each $h_i$ is randomly drawn from $H$.

We sometimes use multiple $g$'s, drawn randomly, too.

Homework 4 explores this further, and we will see that we can get more flexible values for $p_1$ and $p_2$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F41/57 (pg.127/188)
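A sketch of the compound hash $g(x) = (h_1(x), \ldots, h_k(x))$ for the bit-sampling family: concatenating $k$ independent hashes turns a collision probability $p$ into roughly $p^k$, which separates close from far pairs much more sharply (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, trials = 64, 8, 20_000

x = rng.integers(0, 2, m)
close = x.copy(); close[:4] ^= 1      # Hamming distance 4  -> p1 = 1 - 4/64
far = x.copy();   far[:32] ^= 1       # Hamming distance 32 -> p2 = 1 - 32/64

def g_collides(u, v):
    """One random compound hash g = (h_{a_1}, ..., h_{a_k}); True on collision."""
    idx = rng.integers(0, m, k)       # k random bit positions (with replacement)
    return np.array_equal(u[idx], v[idx])

p1k = np.mean([g_collides(x, close) for _ in range(trials)])
p2k = np.mean([g_collides(x, far) for _ in range(trials)])
print(p1k, (1 - 4 / 64) ** k)    # ~0.60: close pairs still collide often
print(p2k, (1 - 32 / 64) ** k)   # ~0.004: far pairs almost never collide
```

Using several independent $g$'s then boosts the close-pair hit probability back up (an OR over tables) without restoring many far-pair collisions.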
Board note: suppose $D = \{x^{(1)}, \ldots, x^{(n)}\}$ with $n$ large, and each $x \in X$ takes many bits to represent. To answer "is $q \in D$?", a linear scan is $O(n)$ and binary search is $O(\log n)$; can we do $O(1)$? Use a hash $h : X \to \mathbb{Z}$: to check whether $x$ is in the set, allocate an array of size $L$ and check slot $h(x) \bmod L$ to see if it is occupied.
k-NN LSH DTs
Recall, from lecture 3
Recall from lecture 3 the following slide.
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F42/57 (pg.132/188)
k-NN LSH DTs
Linear Voronoi tessellation
When the discriminant functions are linear (i.e., $\langle \theta^{(j)}, x \rangle$), the decision boundaries are lines. Example in two dimensions:

From https://en.wikipedia.org/wiki/Voronoi_diagram

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F43/57 (pg.133/188)
k-NN LSH DTs
Using a Tree to Make Decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

4 leaf (or terminal) nodes
3 internal nodes
1 root node (as always)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F44/57 (pg.134/188)
k-NN LSH DTs
Iris flower classifier via petal length and width
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F45/57 (pg.135/188)
k-NN LSH DTs
Regression with a tree, constant decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F46/57 (pg.136/188)
k-NN LSH DTs
Regression with a tree, linear decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c
Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F47/57 (pg.137/188)
k-NN LSH DTs

Decision Trees

Partition the space into rectangular regions; in each region we make a final decision (i.e., a regression value or a classification).

Regions are formed by recursively splitting the space based on binary questions about one coordinate of the input at a time (e.g., "Is $x_7 < 92$?").

Example (from Hastie et al., 2009):

This corresponds to
$$h(x) = \sum_{i=1}^{5} c_i\, 1\{x \in R_i\} \qquad (7.12)$$

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F48/57 (pg.138/188)
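The piecewise-constant form of (7.12) can be written out directly; a sketch with made-up rectangles and constants (not those of the figure):

```python
# Disjoint axis-aligned regions ((x1_lo, x1_hi), (x2_lo, x2_hi)) -> constant c_i.
regions = {
    ((0.0, 0.5), (0.0, 1.0)): 1.0,
    ((0.5, 1.0), (0.0, 0.5)): 2.0,
    ((0.5, 1.0), (0.5, 1.0)): 3.0,
}

def h(x1, x2):
    """h(x) = sum_i c_i * 1{x in R_i} for disjoint rectangles R_i."""
    for ((a, b), (c, d)), val in regions.items():
        if a <= x1 < b and c <= x2 < d:
            return val
    return 0.0   # outside all regions

print(h(0.25, 0.9), h(0.75, 0.25), h(0.9, 0.9))   # 1.0 2.0 3.0
```

Because the regions are disjoint, the sum over indicators reduces to looking up the single region containing the point.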
k-NN LSH DTs

Regression Trees

Each of the $r$ regions could correspond to its own function, i.e., $h_{\theta_k}(x)$. Here we keep it simple: each region is constant, $h_{\theta_k}(x) = c_k$.
$$h_\theta(x) = \sum_{k=1}^{r} c_k\, 1\{x \in R_k\} \qquad (7.13)$$

Given data $D = \{(x^{(i)}, y^{(i)})\}$, how do we grow the tree? Minimize the squared error $\sum_i (y^{(i)} - h_\theta(x^{(i)}))^2$.

Once the regions are decided, the best within-region constant predictor is the simple average $c_k = \mathrm{average}(y^{(i)} \mid x^{(i)} \in R_k)$, i.e.:
$$c_k = \frac{1}{|\{i : x^{(i)} \in R_k\}|} \sum_{i : x^{(i)} \in R_k} y^{(i)} \qquad (7.15)$$

How do we find the $r$ regions?

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F49/57 (pg.142/188)
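The within-region averages of (7.15) in code, on toy data (a sketch; the regions here are fixed by hand rather than learned):

```python
import numpy as np

# Toy 1-D data split into r = 2 regions at x = 0.5.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
y = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

regions = [x < 0.5, x >= 0.5]
# c_k = mean of y^(i) over {i : x^(i) in R_k} -- the best constant under squared error.
c = [y[mask].mean() for mask in regions]
print(c)   # [2.0, 11.0]
```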
k-NN LSH DTs

Greedy strategy, binary splitting

Splitting variable j \in [m] with split point s \in R defines two half-planes:

    R_1(j, s) = {x : x_j <= s},    R_2(j, s) = {x : x_j > s}    (7.16)

Use optimization: find the j, s that achieve the best (minimum) in

    \min_{j,s} [ \sum_{x^{(i)} \in R_1(j,s)} (y^{(i)} - c_1)^2 + \sum_{x^{(i)} \in R_2(j,s)} (y^{(i)} - c_2)^2 ]    (7.17)

where c_k = average(y^{(i)} | x^{(i)} \in R_k).

For each j, we can find s by sorting the data by x_j and scanning in order to achieve the min. Select that j, s as the split point.

This can also be seen as maximizing the MSE gain:

    \max_{j,s} ( \sum_{x^{(i)}} (y^{(i)} - c_0)^2 - [ \sum_{x^{(i)} \in R_1(j,s)} (y^{(i)} - c_1)^2 + \sum_{x^{(i)} \in R_2(j,s)} (y^{(i)} - c_2)^2 ] )    (7.18)

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F50/57
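One way to implement the sort-and-scan search for the minimizer of Eq. (7.17). It uses the identity \sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2 / n, so after the O(n log n) sort each candidate threshold costs O(1). The function name and toy data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse): the feature, threshold, and squared error of the
    best single binary split, found by sorting each feature and scanning."""
    n, m = X.shape
    best = (None, None, np.inf)
    for j in range(m):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        left_sum = left_sq = 0.0
        tot_sum, tot_sq = ys.sum(), (ys ** 2).sum()
        for i in range(n - 1):  # split between sorted positions i and i+1
            left_sum += ys[i]
            left_sq += ys[i] ** 2
            if xs[i] == xs[i + 1]:
                continue  # cannot split between equal feature values
            nl, nr = i + 1, n - i - 1
            sse = (left_sq - left_sum ** 2 / nl) \
                + ((tot_sq - left_sq) - (tot_sum - left_sum) ** 2 / nr)
            if sse < best[2]:
                best = (j, (xs[i] + xs[i + 1]) / 2, sse)
    return best
```

For example, with y = (0, 0, 10, 10) ordered along one feature, the scan finds the threshold between the second and third points, where both halves are constant and the SSE drops to zero.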
k-NN LSH DTs

Greedy strategy, recursive binary splitting

Given R_1 and R_2, apply the same strategy to each of R_1 and R_2.

When to stop? One option: stop when the gain falls below a threshold.

Better: minimize cost/complexity. Grow a large tree T_0. Let T \subset T_0 be any subtree (obtained by pruning internal nodes, merging regions).

Let: k be a terminal (leaf) node with region R_k; |T| be the number of terminal nodes; n_k = |\{x^{(i)} \in D : x^{(i)} \in R_k\}| be the number of data points in region k; c_k = (1/n_k) \sum_{x^{(i)} \in R_k} y^{(i)} be the prediction in region k; and Q_k(T) = (1/n_k) \sum_{x^{(i)} \in R_k} (y^{(i)} - c_k)^2 be the error in region k.

The overall cost complexity is defined as

    C_\alpha(T) = \sum_{k=1}^{|T|} n_k Q_k(T) + \alpha |T|    (7.19)

Goal: for each \alpha, find the T_\alpha \subset T_0 that minimizes C_\alpha(T_\alpha). Weakest-link pruning (Breiman et al., 1984) solves this efficiently by successively collapsing the internal node with the smallest per-node increase in \sum_{k=1}^{|T|} n_k Q_k(T).

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F51/57
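Eq. (7.19) itself is a one-liner; a tiny sketch (function name and the leaf tuples are illustrative) makes the trade-off concrete: a deeper tree wins for small \alpha and loses for large \alpha, which is exactly the range the pruning sweep explores.

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_k n_k Q_k(T) + alpha * |T|  (Eq. 7.19).

    `leaves` is a list of (n_k, Q_k) pairs, one per terminal node of T."""
    return sum(n_k * q_k for n_k, q_k in leaves) + alpha * len(leaves)

# A deeper tree has more leaves but lower per-leaf error; a pruned one the reverse.
deep = [(3, 0.0), (2, 0.0), (5, 0.2)]   # 3 leaves, total error 1.0
pruned = [(5, 0.8), (5, 0.2)]           # 2 leaves, total error 5.0
```

With \alpha small the fit term dominates and the deep tree is preferred; with \alpha large the \alpha|T| penalty dominates and the pruned tree wins.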
k-NN LSH DTs

Classification Trees

At each leaf, predict a class rather than a real value. Produce a region-specific posterior distribution to make classification decisions for an x \in R^m. With n_k = count of training samples in region k, we have:

    p(y = j | x) = \frac{1}{n_{k(x)}} \sum_{x^{(i)} \in R_{k(x)}} 1\{y^{(i)} = j\} = p_{k(x)}(y = j)    (7.20)

where k(x) is the region containing x. Make decisions using y(x) = argmax_j p(y = j | x). So region k always decides class y_k.

With this we can use the same greedy strategy to grow the tree from the top down.

How to judge each region? Several ways:

    Classification error:    \frac{1}{n_k} \sum_{i \in R_k} 1\{y^{(i)} \neq y_k\}    (7.21)

    Entropy in region k:    -\sum_j p_k(y = j) \log p_k(y = j)    (7.22)

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F52/57
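The region posterior and the two impurity measures above can be sketched directly from their definitions (the function names here are illustrative, not from the slides):

```python
from collections import Counter
import math

def region_posterior(labels):
    """Empirical posterior p_k(y = j): the histogram of class labels in a region (Eq. 7.20)."""
    n = len(labels)
    return {j: c / n for j, c in Counter(labels).items()}

def classification_error(labels):
    # Eq. (7.21): fraction of the region not equal to its majority class y_k.
    return 1.0 - max(region_posterior(labels).values())

def region_entropy(labels):
    # Eq. (7.22): -sum_j p_k(j) log2 p_k(j), in bits.
    return -sum(p * math.log2(p) for p in region_posterior(labels).values())
```

A pure region has error 0 and entropy 0; a 50/50 binary region has error 0.5 and entropy 1 bit.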
k-NN LSH DTs

Entropy

Definition 7.5.1 (Entropy)
Given a discrete random variable X over a finite-sized alphabet, the entropy of the random variable is:

    H(X) \triangleq E[\log \frac{1}{p(X)}] = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)    (7.23)

1/p(x) is the surprise of x; -\log p(x) is the log surprise.

Entropy is typically in units of "bits" (logs base 2) but can also be in units of "nats" (logs base e). For optimization it doesn't matter.

Measures the degree of uncertainty in a distribution.

Measures the disorder or spread of a distribution.

Measures the "choice" that a source has in choosing symbols according to the density (higher entropy means more choice).

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F53/57
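Eq. (7.23) in a few lines, with the base as a parameter since only the units change (the example distributions are illustrative):

```python
import math

def entropy(p, base=2.0):
    """H(X) = -sum_x p(x) log p(x)  (Eq. 7.23); base 2 gives bits, base e gives nats.
    Terms with p(x) = 0 contribute 0, by the convention 0 log 0 = 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal for 4 symbols: log2(4) = 2 bits
peaked = [0.97, 0.01, 0.01, 0.01]   # much lower: the source has little "choice"
```

The uniform distribution over an alphabet maximizes entropy; a deterministic one (all mass on one symbol) has entropy 0.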
k-NN LSH DTs

Entropy Of Distributions

[Figure: three example distributions p(x) plotted over x, illustrating low entropy (sharply peaked), high entropy (near-uniform), and a case in between.]

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F54/57
k-NN LSH DTs

Binary Entropy

Binary alphabet, X \in {0, 1} say. p(X = 1) = p = 1 - p(X = 0).

    H(X) = -p \log p - (1 - p) \log(1 - p) = H(p)

As a function of p, we get:

[Figure: plot of H(p) versus p on [0, 1], rising from 0 at p = 0 to its maximum of 1 at p = 0.5 and falling back to 0 at p = 1.]

Note: greatest uncertainty (value 1) when p = 0.5, and least uncertainty (value 0) when p = 0 or p = 1.

Note also: concave in p.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F55/57
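The properties on this slide are easy to check numerically (the function name `H` mirrors the slide's notation; the endpoint convention H(0) = H(1) = 0 is the usual 0 log 0 = 0 limit):

```python
import math

def H(p):
    """Binary entropy H(p) = -p log2 p - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

Symmetry gives H(p) = H(1 - p), the maximum of 1 bit is at p = 0.5, and concavity means the chord between any two points lies below the curve.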
k-NN LSH DTs

Classification Trees and Entropy

We can measure entropy in each region.

Again, splitting variable j \in [m] with split point s \in R defines two half-planes:

    R_1(j, s) = {x : x_j <= s},    R_2(j, s) = {x : x_j > s}    (7.24)

A region's posterior is essentially a histogram of the class labels in the region.

Starting from one large region, we find the variable j and split s that, when made, reduce the entropy the most.

Maximizing the information gain, split region R_0 into R_1(j, s) and R_2(j, s):

    \max_{j,s} [ H(p_0(y)) - ( H(p_{R_1(j,s)}(y)) + H(p_{R_2(j,s)}(y)) ) ]    (7.25)

This leads to regions with lower entropy, higher certainty, and the least diversity in each region, creating homogeneous regions.

Term: CART, for classification and regression trees.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F56/57
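A sketch of one level of this search (function names and toy data are illustrative). Note it follows Eq. (7.25) literally, summing the child entropies unweighted; weighting each child by its sample count is the other common convention.

```python
import numpy as np

def entropy_of(labels):
    """Entropy (bits) of the empirical label histogram of a region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_info_gain_split(X, y):
    """Pick (j, s) maximizing the gain of Eq. (7.25):
    H(parent posterior) - (H(left posterior) + H(right posterior))."""
    n, m = X.shape
    h0 = entropy_of(y)
    best = (None, None, -np.inf)
    for j in range(m):
        for s in np.unique(X[:, j])[:-1]:  # candidate thresholds between distinct values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            gain = h0 - (entropy_of(left) + entropy_of(right))
            if gain > best[2]:
                best = (j, s, gain)
    return best
```

On perfectly separable data the best split makes both children pure, so the gain equals the parent entropy.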
k-NN LSH DTs

Trees, Expressivity, Bias, and Variance

Decision trees are extremely flexible.

Like k-NN methods, DT methods in general have high variance and low bias (they can flexibly fit any data set). With a tall tree, a DT can perfectly fit any training data with non-conflicting labels (i.e., no repeated x with different y). Like nearest neighbor, but the neighborhoods are rectangular (for the top-down greedy tree procedure).

A top-down binary-split tree can't achieve all region arrangements, e.g.:

[Figure: a 2-D arrangement of rectangular regions over X1 and X2 that no sequence of recursive axis-aligned binary splits can produce.]

DTs can be constructed using other procedures that can do this.

Another advantage of trees: they lead to interpretable machine learning models, since all decisions are based on the original inputs rather than on mysterious learned non-convex combinations thereof.

Prof. Jeff Bilmes    EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020    F57/57