
Advanced Introduction to Machine Learning — Spring Quarter, Week 7 —

https://canvas.uw.edu/courses/1372141

Prof. Jeff Bilmes

University of Washington, Seattle

Departments of: Electrical & Computer Engineering, Computer Science & Engineering

http://melodi.ee.washington.edu/~bilmes

May 11th/13th, 2020


Announcements

HW3 is due May 15th, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).

Virtual office hours this week, Thursday night at 10:00pm via zoom (same link as class).


Class Road Map

W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naïve Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future

Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).


Class (and Machine Learning) overview

1. Introduction
• What is ML
• What is AI
• Why are we so interested in these topics right now?

2. ML Paradigms/Concepts
• Overfitting/Underfitting, model complexity, bias/variance
• size of data, big data, sample complexity
• ERM, loss + regularization, loss functions, regularizers
• supervised, unsupervised, and semi-supervised learning
• reinforcement learning, RL, multi-agent, planning/control
• transfer and multi-task learning
• federated and distributed learning
• active learning, machine teaching
• self-supervised, zero/one-shot, open-set learning

3. Dealing with Features
• dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP
• locality sensitive hashing (LSH)
• feature selection
• feature engineering
• matrix factorization & feature engineering
• representation learning

4. Evaluation
• accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin
• train/eval/test data splits
• n-fold cross validation
• method of the bootstrap

5. Optimization Methods
• unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton
• constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming
• discrete optimization: greedy, beam search, branch-and-bound, submodular optimization

6. Inference Methods
• probabilistic inference
• MLE, MAP
• belief propagation
• forward/backpropagation
• Monte Carlo methods

7. Models & Representation
• linear least squares, linear regression, logistic regression, sparsity, ridge, lasso
• generative vs. discriminative models
• Naive Bayes
• k-nearest neighbors
• clustering, k-means, k-medoids, EM & GMMs, single linkage
• decision trees and random forests
• support vector machines, kernel methods, max margin
• perceptron, neural networks, DNNs
• Gaussian processes
• Bayesian nonparametric methods
• ensemble methods
• the bootstrap, bagging, and boosting
• graphical models
• time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers
• structured prediction
• grammars (as in NLP)

8. Philosophy, Humanity, Spirituality
• artificial intelligence (AI)
• artificial general intelligence (AGI)
• artificial intelligence vs. science fiction

9. Applications
• computational biology
• social networks
• computer vision
• speech recognition
• natural language processing
• information retrieval
• collaborative filtering/matrix factorization

10. Programming
• python
• libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.)
• HPC: C/C++, CUDA, vector processing

11. Background
• linear algebra
• multivariate calculus
• probability theory and statistics
• information theory
• mathematical (e.g., convex) optimization

12. Other Techniques
• compressed sensing
• submodularity, diversity/homogeneity modeling

[Thumbnail figures on this slide: a variable-elimination derivation for the marginal p(x1, x2) on a six-node graphical model (x1, ..., x6), showing the reconstituted graph after each elimination, the corresponding marginalization operation, and the per-step complexities (O(r^2) up to O(r^4)); and a deep network diagram with an input layer, hidden layers 1 through 7, and an output unit.]


Using 2D to represent High-D as if 2D was High-D

Relationship between the unit-radius (r = 1) sphere and the unit-volume (side length = 1) cube as the dimensions grow, if they were to also be true in 2D.

Note that for the cube, as m gets higher, the distance from center to face stays at 1/2 but the distance from center to vertex grows as $\sqrt{m}/2$. For m = 4 the vertex distance is $\sqrt{m}/2 = 1$.

[Figure: the unit-radius sphere and the unit-volume cube, labeled "Unit radius sphere", "Nearly all the volume", and "Vertex of hypercube", with center-to-face distance 1/2 and center-to-vertex distances $\sqrt{2}/2$ (in 2D) and $\sqrt{m}/2$ (in m dimensions). Illustration of the relationship between the sphere and the cube in 2, 4, and m dimensions (from Blum, Hopcroft, & Kannan, 2016).]
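As a quick numeric illustration of the statement above (not from the slides; the dimensions chosen are arbitrary), the center-to-face distance of the unit cube stays at 1/2 while the center-to-vertex distance $\sqrt{m}/2$ grows without bound:

    import numpy as np

    # Center-to-face vs. center-to-vertex distance of the side-length-1 cube.
    for m in (2, 4, 16, 100, 10000):
        print(f"m = {m:6d}   face distance = 0.5   vertex distance = {np.sqrt(m) / 2:.2f}")

At m = 4 the cube's vertices already reach the unit-radius sphere, and for larger m they lie far outside it even though every face stays at distance 1/2.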


The Blessing of Dimensionality

More dimensions can help to distinguish categories and make pattern recognition easier (and even possible). Recall Voronoi tessellation.

Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!

For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.


Feature Selection vs. Dimensionality Reduction

We've already seen that feature selection is one strategy to reduce feature dimensionality.

In feature selection, each feature is either selected or not: all or nothing.

Other dimensionality reduction strategies take the input $x \in \mathbb{R}^m$ and encode each $x$ into $e(x) \in \mathbb{R}^{m'}$, a lower-dimensional space, $m' < m$.

Core advantage: if $U = \{1, 2, \ldots, m\}$, there may be no subset $A \subseteq U$ such that $x_A$ will work well, yet there might be a combination $e(x)$ that works well. On the right (figure), the simple linear combination $a_1 x_1 + a_2 x_2$ works quite well.
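To make this concrete, here is a small NumPy sketch (illustrative only; the data and the combination weights $a = (-1, 1)$ are made up, not from the slides) in which neither raw feature separates two classes well on its own, yet the linear combination $x_2 - x_1$ does:

    import numpy as np

    # Neither x1 nor x2 alone separates the classes, but a1*x1 + a2*x2 does.
    rng = np.random.default_rng(6)
    n = 1000
    u = rng.normal(scale=3.0, size=n)                  # shared, high-variance direction
    X0 = np.column_stack([u, u + rng.normal(scale=0.3, size=n)])        # class 0
    X1 = np.column_stack([u, u + 2.0 + rng.normal(scale=0.3, size=n)])  # class 1

    def best_threshold_error(s0, s1):
        # Error of the best single-threshold rule on a 1-D score (brute force).
        s = np.concatenate([s0, s1])
        y = np.r_[np.zeros(len(s0)), np.ones(len(s1))]
        errs = [min(np.mean((s > t) != y), np.mean((s < t) != y)) for t in s]
        return min(errs)

    print("x1 alone:", best_threshold_error(X0[:, 0], X1[:, 0]))
    print("x2 alone:", best_threshold_error(X0[:, 1], X1[:, 1]))
    print("x2 - x1 :", best_threshold_error(X0[:, 1] - X0[:, 0], X1[:, 1] - X1[:, 0]))

Each single feature leaves substantial error, while the combination is nearly perfectly separable.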


PCA: Minimizing reconstruction error, $m' < m$

Let $u_i$, $i \in [m]$, be a set of orthonormal vectors in $\mathbb{R}^m$. Basis decomposition: any sample $x^{(i)}$ can be written as $x^{(i)} = \sum_{j=1}^{m} \alpha_{i,j} u_j = \sum_{j=1}^{m} \langle x^{(i)}, u_j \rangle u_j$.

Suppose we decide to use only $m' < m$ dimensions, so $\tilde{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j$.

Reconstruction error:
$$J_{m'} = \frac{1}{n} \sum_{i=1}^{n} \| x^{(i)} - \tilde{x}^{(i)} \|_2^2 \qquad (7.1)$$

To minimize this, it can be shown that we should choose $u_i$ to be the eigenvector of $S$ (the sample covariance matrix) corresponding to the $i$-th largest eigenvalue. Let $W$ be the matrix of column eigenvectors of $S$ sorted decreasing by eigenvalue, and $W_{m'} = W(:, 1{:}m')$ the $m \times m'$ matrix of its first $m'$ columns. Then $X W_{m'}$ projects down to the first $m'$ principal components, and is also widely known as the Karhunen-Loève transform (KLT).
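A minimal NumPy sketch of the above (illustrative names and synthetic data, not the course's code; the data is centered first, with the mean added back for the reconstruction):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, m_prime = 500, 10, 2                        # n samples, ambient dim m, reduced dim m'
    X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))   # correlated synthetic data

    Xc = X - X.mean(axis=0)                           # center the data
    S = (Xc.T @ Xc) / n                               # sample covariance matrix S (m x m)

    evals, evecs = np.linalg.eigh(S)                  # eigh: ascending eigenvalues of symmetric S
    order = np.argsort(evals)[::-1]                   # sort eigenvectors by decreasing eigenvalue
    W = evecs[:, order]                               # columns = principal directions
    W_mprime = W[:, :m_prime]                         # first m' columns, i.e. W(:, 1:m')

    Z = Xc @ W_mprime                                 # first m' principal components (KLT coordinates)
    X_recon = Z @ W_mprime.T + X.mean(axis=0)         # reconstruction from m' components
    J = np.mean(np.sum((X - X_recon) ** 2, axis=1))   # reconstruction error, as in (7.1)
    print(Z.shape, J)

Here Z corresponds to $X W_{m'}$ and J to the reconstruction error $J_{m'}$ in (7.1).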


Example: 2D and two principal directions


PCA vs. LDA

From Bishop 2006

PCA vs. Linear Discriminant Analysis (LDA).


Axis parallel projection vs. LDA

From Hastie et al., 2009


Axis parallel (left) vs. LDA (right).


Linear Discriminant Analysis (LDA), 2 classes

Consider class-conditional Gaussian data, so $p(x|y) = \mathcal{N}(x \mid \mu_y, C_y)$ for mean vectors $\{\mu_y\}_y$ and covariance matrices $\{C_y\}_y$, with $x \in \mathbb{R}^m$:
$$p(x|y) = \frac{1}{(2\pi)^{m/2} |C_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^\top C_y^{-1} (x - \mu_y) \right) \qquad (7.2)$$

In the two-class case $y \in \{0, 1\}$ with equal covariances $C_0 = C_1 = C$ (the homoscedasticity property) and equal priors $p(y{=}0) = p(y{=}1)$, consider the log posterior odds ratio:
$$\log \frac{p(y{=}1 \mid x)}{p(y{=}0 \mid x)} = -\frac{1}{2}(x - \mu_1)^\top C_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^\top C_0^{-1} (x - \mu_0) \qquad (7.3)$$
$$= (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \tfrac{1}{2}\mu_0^\top C^{-1}\mu_0 - \tfrac{1}{2}\mu_1^\top C^{-1}\mu_1 \qquad (7.4)$$
$$= \theta^\top x + c \qquad (7.5)$$

$\theta$ is a projection (an $m \times 1$ matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.
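A minimal NumPy sketch of this two-class rule (illustrative data and names; it assumes equal priors and estimates the shared covariance $C$ by averaging the two class covariances):

    import numpy as np

    # Two-class LDA sketch: theta = C^{-1}(mu1 - mu0), c from the class means.
    rng = np.random.default_rng(1)
    X0 = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], size=200)   # class 0 samples
    X1 = rng.multivariate_normal([2, 2], [[2, 1], [1, 2]], size=200)   # class 1 samples

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled (shared) covariance estimate, assuming homoscedasticity.
    C = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2

    theta = np.linalg.solve(C, mu1 - mu0)                   # C^{-1}(mu1 - mu0)
    c = 0.5 * (mu0 @ np.linalg.solve(C, mu0) - mu1 @ np.linalg.solve(C, mu1))

    def predict(x):
        # Log posterior odds theta^T x + c > 0  =>  predict class 1.
        return int(theta @ x + c > 0)

    print(predict(np.array([0.0, 0.0])), predict(np.array([2.0, 2.0])))

The sign of $\theta^\top x + c$ gives the predicted class, exactly the linear rule in (7.5).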


Linear Discriminant Analysis (LDA), $\ell$ classes

Within-class scatter matrix:
$$S_W = \sum_{c=1}^{\ell} S_c \quad \text{where} \quad S_c = \sum_{i:\, i \text{ is class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top \qquad (7.4)$$
and $\mu_c$ is the class-$c$ mean, $\mu_c = \frac{1}{n_c} \sum_{i \text{ is of class } c} x^{(i)}$.

Between-class scatter matrix, with $n_c$ = number of samples of class $c$:
$$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (7.5)$$

First projection, maximize the ratio of variances (recall PCA): $a_1 \in \operatorname{argmax}_{a \in \mathbb{R}^m} \frac{a^\top S_B a}{a^\top S_W a}$. Second orthogonal to the first, $a_2 \in \operatorname{argmax}_{a \in \mathbb{R}^m:\, \langle a, a_1 \rangle = 0} \frac{a^\top S_B a}{a^\top S_W a}$, and so on.

In general, the directions of greatest variance of the matrix $S_W^{-1} S_B$ maximize the between-class spread relative to the within-class spread. I.e., we can do PCA on $S_W^{-1} S_B$ to get the LDA reduction.

LDA dimensionality reduction: find an $\ell' \times m$ matrix linear transform on $x$, projecting to the subspace corresponding to the $\ell'$ greatest eigenvalues of $S_W^{-1} S_B$.
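A hedged NumPy sketch of the multi-class reduction (illustrative data and names, not the course's code): build $S_W$ and $S_B$ as above, then take the leading eigenvectors of $S_W^{-1} S_B$ as the projection:

    import numpy as np

    def lda_directions(X, y, n_components):
        classes = np.unique(y)
        mu = X.mean(axis=0)
        m = X.shape[1]
        S_W = np.zeros((m, m))
        S_B = np.zeros((m, m))
        for c in classes:
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            S_W += (Xc - mu_c).T @ (Xc - mu_c)                   # within-class scatter
            S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)      # between-class scatter
        # Directions of greatest variance of S_W^{-1} S_B (not symmetric, so use eig).
        evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
        order = np.argsort(evals.real)[::-1]
        return evecs[:, order[:n_components]].real               # m x l' projection matrix

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 5))
    y = rng.integers(0, 3, size=300)
    X[y == 1] += 3.0                                             # shift the classes apart
    X[y == 2, 0] -= 3.0
    A = lda_directions(X, y, n_components=2)
    Z = X @ A                                                    # reduced representation
    print(A.shape, Z.shape)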


Uniformly at random unit vectors on unit sphere

Let $S^{m-1}(r)$ indicate the surface "area" of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S^{m-1}(1)}$ on $S^{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S^{m-1}(1)}$, hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y \rangle = 0$ and are nearly so if $|\langle x, y \rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then
$$\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \qquad (7.11)$$
which means that uniformly at random vectors in high-dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
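A quick Monte Carlo check of (7.11) (illustrative; the choices of m, epsilon, and sample count are arbitrary). Sampling uniform unit vectors by normalizing Gaussian draws, the observed fraction of nearly orthogonal pairs should exceed the bound $1 - e^{-m\epsilon^2/2}$:

    import numpy as np

    rng = np.random.default_rng(3)

    def random_unit_vectors(m, n):
        V = rng.normal(size=(n, m))
        return V / np.linalg.norm(V, axis=1, keepdims=True)     # uniform on S^{m-1}(1)

    eps = 0.1
    for m in (3, 30, 300, 3000):
        x = random_unit_vectors(m, 10000)
        y = random_unit_vectors(m, 10000)
        inner = np.abs(np.sum(x * y, axis=1))                   # |<x, y>| per pair
        frac = np.mean(inner < eps)
        bound = 1 - np.exp(-m * eps**2 / 2)                     # lower bound from (7.11)
        print(f"m={m:5d}  Pr(|<x,y>| < {eps}) ~ {frac:.3f}   bound = {bound:.3f}")

As m grows, nearly all sampled pairs are within epsilon of orthogonal.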


k-NN classifier, multiclass data

Let $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ be a training set of $n$ samples (as always).

Let $N_k(x, \mathcal{D}) \subseteq \mathcal{D}$ be the subset of $\mathcal{D}$ of size $k$ consisting of the $k$ nearest neighbors of $x$. That is, $|N_k(x, \mathcal{D})| = k$ and for any $(x^{(i)}, y^{(i)}) \in N_k(x, \mathcal{D})$ we have that $d(x^{(i)}, x)$ is no more than the distance from $x$ to its $k$-th nearest data point in $\mathcal{D}$, under distance $d(\cdot, \cdot)$.

We can then estimate a posterior probability for each class $c$:
$$p_{k\text{NN}}(y = c \mid x) = \frac{1}{k} \sum_{(x^{(i)}, y^{(i)}) \in N_k(x, \mathcal{D})} \mathbf{1}\{y^{(i)} = c\} \qquad (7.15)$$

This is a valid probability: $0 \le p_{k\text{NN}}(y = c \mid x) \le 1$ and $\sum_{c=1}^{\ell} p_{k\text{NN}}(y = c \mid x) = 1$.
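A direct NumPy sketch of (7.15) (illustrative; brute-force distances over made-up training data, not the course's code):

    import numpy as np

    def knn_posterior(x, X_train, y_train, k, n_classes):
        d = np.linalg.norm(X_train - x, axis=1)      # distances d(x^(i), x)
        nn = np.argsort(d)[:k]                       # indices of the k nearest points
        p = np.zeros(n_classes)
        for i in nn:
            p[y_train[i]] += 1.0 / k                 # (1/k) * sum of indicators
        return p                                     # sums to 1, each entry in [0, 1]

    rng = np.random.default_rng(4)
    X_train = rng.normal(size=(200, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # two synthetic classes
    p = knn_posterior(np.array([0.5, 0.5]), X_train, y_train, k=15, n_classes=2)
    print(p, p.argmax())    # posterior over classes and the k-NN prediction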


k-NN classifier, various k

From Hastie et al., 2009

[Figures: two-class training data (points marked "o") with the resulting decision regions; one panel shows the k-NN classifier with k = 15, the other the k-NN classifier with k = 1.]


k-NN classifiers, effective degrees of freedom

The hyperparameter is $k$ (similar to $\lambda$ in ridge or lasso). A useful measure is $n/k$, which is the effective degrees of freedom of a $k$-NN classifier: $n/k$ is the effective number of neighborhoods within which we might fit a mean.

We can look at train/test error as a function of $n/k$ and consider it relative to logistic regression.

[Figure: train and test misclassification error (roughly 0.10 to 0.30) versus effective degrees of freedom n/k, with the Bayes error and a linear (logistic regression) classifier shown for comparison. From Hastie et al., 2009.]


1D manifolds in 2D ambient space

One-dimensional spiral manifold in 2D space



1D manifolds in 2D ambient space, two classes

One-dimensional spiral manifold in 2D space, along with PCA projection.



1D manifolds in 2D ambient space

One-dimensional spiral manifold in 2D space, along with PCA projection, 2-class case.



2D Manifold Examples in 3D Ambient Space

(A) manifolds; (B) samples from the manifolds; (C) the inherent, flattened 2D manifolds.

From Roweis & Saul, 1999


Manifold Learning in ML

Goal: find the lower-dimensional manifold on which the data lies, within the ambient high-dimensional space in which the data is coordinatized.

I.e., the high-dimensional data rests near or on an inherently low-dimensional manifold (if flattened) within the high-dimensional ambient (input) space.

Useful in machine learning for non-linear methods of dimensionality reduction.

Example algorithms: Isomap, Laplacian Eigenmaps, Kernel PCA, etc.

Rather than finding the coordinates of the data items along the manifold, the k-NN approach approximates distances on the manifold, i.e., geodesic distance as the distance between points along the k-NN-graph-represented manifold.
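A hedged, Isomap-style sketch of this idea (not the course's code; the spiral data, the choice of k, and the use of scipy's shortest-path routine are my own assumptions): build a k-NN graph on the data and use graph shortest-path distances as geodesic estimates:

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    t = np.linspace(0, 4 * np.pi, 400)
    X = np.column_stack([t * np.cos(t), t * np.sin(t)])     # 1D spiral manifold in 2D space

    k = 6
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise ambient distances
    G = np.full_like(D, np.inf)                                 # inf = no edge
    nn = np.argsort(D, axis=1)[:, 1:k + 1]                      # k nearest neighbors (skip self)
    for i in range(len(X)):
        G[i, nn[i]] = D[i, nn[i]]                               # keep only k-NN edges
    G = np.minimum(G, G.T)                                      # symmetrize the graph

    geo = shortest_path(G, directed=False)    # graph shortest paths ~ geodesic distances
    i, j = 0, len(X) - 1
    print("ambient distance:", D[i, j], " geodesic estimate:", geo[i, j])

For the spiral, the graph distance tracks arc length along the curve rather than the straight-line (ambient) distance.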


k-NN, memory, parametric vs. non-parametric

k-NN classifiers do not require an explicit model (e.g., linear, non-linear) and are examples of non-parametric models, unlike the parametric models we've seen so far (but more like t-SNE).

k-NN classifiers need to memorize the entire training data set.

Other parametric models include: linear models, logistic regression, most neural networks (including the deep variety), or anything involving a fixed and finite number of parameters regardless of the training data size.

Other non-parametric models: k-NN, histogram density estimates, kernel density estimates, kernel/spline/wavelet-based regression, support vector machines, kernel machines.

Non-parametric methods are useful in that the procedures used to learn them automatically adjust their complexity to the natural complexity in the data.


Computational Considerations

k-NN classifiers require memory at least to store the n data items.

With no pre-computation, a naive implementation of a query takes O((n log n) + nm) or O((n log k) + nm) time (computing the n distances costs O(m) each, and we then either sort all n of them or select the top k). Also, we need n queries (e.g., one per data point when constructing a k-NN graph)!

There are data structures for spatial access methods for doing fast nearest-neighbor search, using methods analogous to things such as the R-tree (see, e.g., Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree").

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.
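To make the per-query cost concrete, here is a minimal brute-force sketch (illustrative, not the reference approach from the slides): the distance pass costs O(nm), and keeping the k smallest with a bounded heap costs O(n log k).

```python
import heapq
import numpy as np

def knn_query_bruteforce(X_train, y_train, x_query, k=5):
    """Majority label among the k nearest training points to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)        # O(n*m) distance pass
    nearest = heapq.nsmallest(k, range(len(X_train)),
                              key=dists.__getitem__)         # O(n log k) selection
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]                           # majority vote
```

Spatial indices such as the R-tree, or the approximate k-NN graph construction in the reference above, aim to avoid touching all n points for every query.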


Recall: Bayes error

Recall next two slides from lecture 4.


0/1-loss and Bayes error

Let $L$ be the 0/1 loss, i.e., $L(y, y') = \mathbf{1}\{y \neq y'\}$.

Probability of error for a given $x$:
$$\Pr(h_\theta(x) \neq Y) = \int p(y|x)\,\mathbf{1}\{h_\theta(x) \neq y\}\,dy = 1 - p(h_\theta(x)\,|\,x) \qquad (7.7)$$

To minimize the probability of error, $h_\theta(x)$ should ideally choose a class having maximum posterior probability for the current $x$.

The smallest probability of error is known as the Bayes error for $x$:
$$\text{Bayes error}(x) = \min_y\,(1 - p(y|x)) = 1 - \max_y\, p(y|x) \qquad (7.8)$$

The Bayes classifier (or predictor) predicts using the highest-probability class, assuming we have access to $p(y|x)$. I.e.,
$$\text{BayesPredictor}(x) \triangleq \operatorname*{argmax}_y\, p(y|x) \qquad (7.9)$$

The Bayes predictor has the Bayes error as its error; this is the irreducible error rate.
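As a tiny numerical illustration of (7.8)-(7.9) (the posterior table below is made up for the example), the Bayes predictor takes the argmax of each row of p(y|x), and the Bayes error at x is one minus that row's maximum.

```python
import numpy as np

# Hypothetical posteriors p(y|x) for 3 classes at 4 values of x (each row sums to 1).
p_y_given_x = np.array([[0.70, 0.20, 0.10],
                        [0.30, 0.45, 0.25],
                        [0.10, 0.10, 0.80],
                        [0.34, 0.33, 0.33]])

bayes_prediction = p_y_given_x.argmax(axis=1)     # (7.9): argmax_y p(y|x) -> [0 1 2 0]
bayes_error_at_x = 1.0 - p_y_given_x.max(axis=1)  # (7.8): 1 - max_y p(y|x) -> [0.30 0.55 0.20 0.66]
print(bayes_prediction, bayes_error_at_x)
```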


Bayes Error, overall difficulty of a problem

For the Bayes error, we often take the expected value w.r.t. $x$. This gives an overall indication of how difficult a classification problem is, since we can never do better than the Bayes error.

Bayes error, various equivalent forms (Exercise: show the last equality):
$$\text{Bayes Error} = \min_h \int p(x, y)\,\mathbf{1}\{h(x) \neq y\}\,dy\,dx \qquad (7.13)$$
$$= \min_h \Pr(h(X) \neq Y) \qquad (7.14)$$
$$= \mathbb{E}_{p(x)}\big[\min\big(p(y|x),\, 1 - p(y|x)\big)\big] \qquad (7.15)$$
$$= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}_{p(x)}\big[\,|2p(y|x) - 1|\,\big] \qquad (7.16)$$

For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., $x$ tells us nothing about $y$).

The Bayes error is a property of the distribution $p(x, y)$; it can be useful to decide between, say, $p(x, y)$ vs. $p(x', y)$ for two different feature sets $x$ and $x'$.
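For a binary problem, the equivalence of (7.15) and (7.16) follows from min(p, 1 − p) = 1/2 − (1/2)|2p − 1|; the Monte Carlo sketch below (with an illustrative pair of Gaussian class-conditionals) estimates both forms.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """p(Y=1 | x) for equal priors, X|Y=0 ~ N(-1,1), X|Y=1 ~ N(+1,1) (illustrative)."""
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    return p1 / (p0 + p1)

y = rng.integers(0, 2, size=200_000)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)          # samples from p(x)

p = eta(x)
form_15 = np.mean(np.minimum(p, 1.0 - p))             # (7.15): E_x[min(p, 1-p)]
form_16 = 0.5 - 0.5 * np.mean(np.abs(2.0 * p - 1.0))  # (7.16)
print(form_15, form_16)                               # identical estimates, roughly 0.16 here
```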


Aside: modes of convergence

$X_n \to X$ almost surely (written $X_n \xrightarrow{a.s.} X$) if the set
$$\{\omega \in \Omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\} \qquad (7.1)$$
is an event with probability 1. So all such events combined together have probability 1, and any event left out has probability zero, according to the underlying probability measure.

$X_n \to X$ in the $r$th mean ($r \geq 1$) (written $X_n \xrightarrow{r} X$) if
$$\mathbb{E}|X_n^r| < \infty \;\;\forall n, \quad \text{and} \quad \mathbb{E}\big(|X_n - X|^r\big) \to 0 \text{ as } n \to \infty \qquad (7.2)$$

$X_n \to X$ in probability (written $X_n \xrightarrow{p} X$) if
$$p(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty, \quad \text{for all } \epsilon > 0 \qquad (7.3)$$

$X_n \to X$ in distribution (written $X_n \xrightarrow{D} X$) if
$$p(X_n \leq x) \to P(X \leq x) \text{ as } n \to \infty \qquad (7.4)$$
for all points $x$ at which $F_X(x) = p(X \leq x)$ is continuous.

Relations: a.s. $\Rightarrow$ p, $\; r \Rightarrow$ p, $\;$ p $\Rightarrow$ D, and if $r > s \geq 1$, then $r \Rightarrow s$.
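A small simulation sketch (illustrative) of convergence in probability, definition (7.3): taking X_n to be the sample mean of n i.i.d. Uniform(0,1) draws and X = 1/2, the estimated probability P(|X_n − 1/2| > ε) shrinks toward 0 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.05, 2000

for n in [10, 100, 1000]:
    xbar = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)  # X_n = sample mean
    prob_far = np.mean(np.abs(xbar - 0.5) > eps)                 # estimate of p(|X_n - 1/2| > eps)
    print(n, prob_far)                                           # decreases toward 0 with n
```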


On Bayes error and Consistency

Let $h$ be a classifier; its error (or risk) is $R(h) = \Pr(h(X) \neq Y)$.

Define $R^* = \min_h \Pr(h(X) \neq Y)$ to be the Bayes error.

$D_n$ is a training set with $n$ samples, randomly drawn from $p(x, y)$; hence $D_n$ is a random variable.

$\theta(D_n)$ is the learning process, a mapping from data to a learnt model.

$h_{\theta(D_n)}$ is the model learnt from training data $D_n$, also a random variable, with a conditional probability of error $\Pr(h_{\theta(D_n)}(X) \neq Y \mid D_n)$.

Define $R(h_{\theta(D_n)} \mid D_n) = \Pr(h_{\theta(D_n)}(X) \neq Y \mid D_n)$.

We consider how $\mathbb{E}_{D_n}\big[R(h_{\theta(D_n)} \mid D_n)\big]$ improves with $n$.

Definition 7.3.1 (Consistency)
A classification process is consistent for distribution $p(x, y)$ if $\mathbb{E}_{D_n}\big[R(h_{\theta(D_n)} \mid D_n)\big] \xrightarrow{p} R^*$ as $n \to \infty$.

Definition 7.3.2 (Universal Consistency)
A classification process is universally consistent if it is consistent for all distributions $p(x, y)$.


How far away are we from Bayes error

Why are consistency and universal consistency desirable properties?

As we get more data, we know we can't get worse and will eventually approach the best we can ever do, for a given distribution $p(x, y)$. Universality means that we can expect this to be true regardless of the distribution.

Ideally, we'd like to show that there is a family of classification methods that are universally consistent.

This wasn't known until relatively recently (in the "Stone age"): in 1977, Chuck Stone proved a famous theorem showing that the k-NN classifier is universally consistent!

Let $h_{k\text{-NN},D_n}(x)$ be the decision of the k-NN binary classifier (majority vote over the neighbors) on data set $D_n$ drawn randomly from $p(x, y)$. Stone showed that:

Theorem 7.3.3 (Stone, 1977)
Let $k \to \infty$ with increasing $n$ such that $k/n \to 0$ as $n \to \infty$. Then for all probability distributions $p(x, y)$, $\mathbb{E}_{D_n}\big[R(h_{k\text{-NN},D_n})\big] \to R^*$ as $n \to \infty$.
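Below is a simulation sketch consistent with the conditions of Theorem 7.3.3 (k ≈ √n, so k → ∞ while k/n → 0), on an illustrative one-dimensional problem whose Bayes error is known in closed form (two unit-variance Gaussians at ±1 with equal priors, R* = Φ(−1) ≈ 0.159); the k-NN test error should drift toward R* as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
R_STAR = 0.159   # Bayes error for this toy problem: classes N(-1,1) vs N(+1,1), equal priors

def sample(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(2.0 * y - 1.0, 1.0).reshape(-1, 1)
    return x, y

def knn_predict(X_train, y_train, X_test, k):
    d = np.abs(X_test - X_train.T)                     # (n_test, n_train) distances in 1-D
    idx = np.argpartition(d, k - 1, axis=1)[:, :k]     # indices of the k nearest neighbors
    return (y_train[idx].mean(axis=1) > 0.5).astype(int)   # majority vote

X_test, y_test = sample(1000)
for n in [100, 400, 1600, 6400]:
    X_train, y_train = sample(n)
    k = max(1, int(round(np.sqrt(n))))                 # k grows, but k/n -> 0
    err = np.mean(knn_predict(X_train, y_train, X_test, k) != y_test)
    print(f"n={n:5d}  k={k:3d}  test error={err:.3f}  (Bayes ~ {R_STAR})")
```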


k-NN and Bayes error

3-class example for various k, comparing to the Bayes error (purple dashed lines). From Hastie et al., 2009.

[Figure: decision boundaries of the 15-Nearest Neighbors classifier and the 1-Nearest Neighbor classifier on the example data, with the Bayes decision boundary shown as purple dashed lines.]


What about 1-NN?

Even the 1-NN classifier is good in a certain way. Let $h_{1\text{-NN},D_n}(x)$ be the nearest-neighbor classifier based on data set $D_n$. We have:

Theorem 7.3.4 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x, y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}\big[R(h_{1\text{-NN},D_n})\big] = \mathbb{E}\big[2\eta(X)(1 - \eta(X))\big] \qquad (7.5)$$
where $\eta(x) = \mathbb{E}[Y \mid X = x] = p(y = 1 \mid x)$.

As a consequence, we have

Theorem 7.3.5 (Cover & Hart, 1967)
For the 1-NN rule and for any distribution $p(x, y)$, we have
$$\lim_{n \to \infty} \mathbb{E}_{D_n}\big[R(h_{1\text{-NN},D_n})\big] \leq 2R^* \qquad (7.6)$$

Thus, the simple 1-NN rule is, asymptotically, no worse than twice the Bayes error. If the Bayes error is small, so is the 1-NN error (eventually, and on average!).
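A quick numerical sketch of what (7.5)-(7.6) say, using the same illustrative two-Gaussian problem as earlier: the asymptotic 1-NN error E[2η(X)(1 − η(X))] sits between the Bayes error R* = E[min(η, 1 − η)] and the bound 2R*.

```python
import numpy as np

rng = np.random.default_rng(3)

def eta(x):
    """eta(x) = p(Y=1 | x) for equal priors, X|Y=0 ~ N(-1,1), X|Y=1 ~ N(+1,1)."""
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    return p1 / (p0 + p1)

x = rng.normal(2.0 * rng.integers(0, 2, size=500_000) - 1.0, 1.0)   # x ~ p(x)
e = eta(x)

bayes_error = np.mean(np.minimum(e, 1.0 - e))       # R*
one_nn_asym = np.mean(2.0 * e * (1.0 - e))          # right-hand side of (7.5)
print(bayes_error, one_nn_asym, 2.0 * bayes_error)  # R* <= asymptotic 1-NN error <= 2 R*
```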


Aspects of k-NN

Universal consistency: k-NN converges on average to the Bayes rule if k → ∞ and k/n → 0 as n → ∞.

How does it behave for finite k? k is a smoothing parameter (analogous to the regularization tradeoff parameter λ), as we have seen. How to choose k? Via cross-validation (see the sketch below).

Can we weight the nearest neighbors? Why should they always count equally? Perhaps weight by distance (suggesting kernel estimators)?

What distance metric? Euclidean, or more generally the Lp distance d(x, x') = ||x − x'||_p.

Memory: what if we reduce the training data size and come up with exemplars? Clustering, EM, unsupervised learning.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F31/57
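Here is a minimal sketch (not from the slides; the helper functions, inverse-distance weighting, and candidate k values are our own illustrative choices) of a distance-weighted k-NN classifier for binary labels, together with a plain K-fold cross-validation loop for choosing k.

import numpy as np

def knn_predict(Xtr, ytr, Xq, k=5, weighted=True, p=2):
    # Lp distances from every query to every training point: O(n*m) per query.
    # The full broadcast below is fine for small illustrative datasets.
    d = (np.abs(Xq[:, None, :] - Xtr[None, :, :]) ** p).sum(-1) ** (1.0 / p)
    idx = np.argpartition(d, k, axis=1)[:, :k]        # indices of the k nearest
    dk = np.take_along_axis(d, idx, axis=1)
    yk = ytr[idx]
    # Inverse-distance weights (uniform weights if weighted=False).
    w = 1.0 / (dk + 1e-12) if weighted else np.ones_like(dk)
    score1 = (w * yk).sum(1) / w.sum(1)               # weighted vote for label 1
    return (score1 > 0.5).astype(int)

def choose_k_by_cv(X, y, ks=(1, 3, 5, 9, 15), n_folds=5, seed=0):
    # Plain K-fold cross-validation over the candidate k values.
    perm = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(perm, n_folds)
    errs = {}
    for k in ks:
        fold_err = []
        for f in range(n_folds):
            va = folds[f]
            tr = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            yhat = knn_predict(X[tr], y[tr], X[va], k=k)
            fold_err.append((yhat != y[va]).mean())
        errs[k] = float(np.mean(fold_err))
    return min(errs, key=errs.get), errs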


Computational Considerations

k-NN classifiers require memory at least sufficient to store the n data items. With no pre-computation, a naive implementation of a query takes O((n log n) + nm) or O((n log k) + nm) time (each of the O(n) distances costs O(m), and we then either sort all of them or extract the top k). Also, we need n such queries!

There are data structures for spatial access methods for doing fast nearest neighbor search, using methods analogous to things such as the R-tree.

[Figure: an R-tree partition of the plane into nested bounding boxes B1–B13 around a query point Q, with root R. From Cheung & Fu, "Enhanced Nearest Neighbour Search on the R-Tree".]

Good reference: Chen, Fang, Saad, "Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection", 2009.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F32/57
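To see the gap between a naive scan and a spatial index, here is a small sketch (an illustration, not part of the slides; the sizes are arbitrary) comparing per-query brute force against SciPy's k-d tree, which is in the same spirit as the R-tree above. In low dimensions the tree answers queries much faster; its advantage tends to fade as the dimension grows.

import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n, m, n_queries = 50_000, 8, 500                # illustrative sizes
X = rng.standard_normal((n, m))
Q = rng.standard_normal((n_queries, m))

# Naive scan: O(n*m) work per query.
t0 = time.perf_counter()
nn_naive = np.array([((X - q) ** 2).sum(axis=1).argmin() for q in Q])
t_naive = time.perf_counter() - t0

# Spatial index: build a k-d tree once, then answer each query by tree descent.
t0 = time.perf_counter()
tree = cKDTree(X)
_, nn_tree = tree.query(Q, k=1)
t_tree = time.perf_counter() - t0

print("answers agree:", bool((nn_naive == nn_tree).all()))
print(f"naive: {t_naive:.2f}s   kd-tree (build + query): {t_tree:.2f}s")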


Nearest Neighbor Search and Computation

Nearest Neighbor Search (NNS) Problem: given a set of n points {x^(i)}_{i=1}^n with x^(i) ∈ R^m, finding the point nearest to a query q ∈ R^m is argmin_i d(x^(i), q).

A naive query costs O(nm) in both time and memory. Finding every point's nearest neighbor (all-pairs 1-NN) is O(n^2 m) naively (a vectorized sketch follows below).

Recall the m-dimensional spherical Gaussian with zero mean and variance σ^2:

p(x) = (1 / (2π σ^2)^{m/2}) exp( −||x||_2^2 / (2σ^2) )   (7.7)

As m gets large, this Gaussian (as we saw in HW3) starts to behave very differently.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F33/57
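For the all-pairs 1-NN cost mentioned above, a minimal vectorized sketch (illustrative sizes, not from the slides) that uses the identity ||a − b||² = ||a||² + ||b||² − 2⟨a, b⟩ so that the O(n²m) work becomes a single matrix product:

import numpy as np

rng = np.random.default_rng(0)
n, m = 5_000, 64
X = rng.standard_normal((n, m))

# Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>.
sq = (X ** 2).sum(axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # (n, n) matrix, O(n^2 m) flops
np.fill_diagonal(D2, np.inf)                       # a point is not its own neighbor
nn = D2.argmin(axis=1)                             # index of each point's 1-NN
print(nn[:10])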


Gaussian Annulus Theorem: HW3 Redux

HW3: when σ² = 1, very little probability mass lies over the unit ball r = 1, since there is so little volume there; only when r = √m is the mass over the ball significant. Beyond r = √m the mass does not increase much (as fast as the ball volume continues to increase, the density drops at a higher rate). I.e., HW3 proves:

Theorem 7.4.1 (Gaussian Annulus Theorem)

For any m-dimensional spherical Gaussian with common unit variance, for any β ≤ √m, all but at most 3 e^{−c β²} of the probability mass lies within the annulus √m − β ≤ ||x||_2 ≤ √m + β, where c > 0 is a constant.

The theorem says that Pr( | ||X||_2 − √m | ≤ β ) ≥ 1 − 3 e^{−c β²}.

Choose any 0 < ε = 3 e^{−c β²} to be the fraction to ignore and set β to achieve that fraction (easy thanks to exp(·)). As m gets big, the β range becomes negligible since |√m + β − (√m − β)| / √m → 0 as m → ∞.

E[ ||X||_2² ] = Σ_{i=1}^m E[X_i²] = m E[X_i²] = m, so the mean squared distance of a point to the center is m. As m increases, points concentrate at this distance. √m is sometimes called the radius of the Gaussian.

For details, see the text by Blum, Hopcroft, and Kannan, 2018.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F34/57
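A quick empirical illustration of this concentration (a simulation sketch, not part of HW3 or the slides; the sample size is arbitrary): the mean of ||X||₂ tracks √m while its spread stays O(1), so the relative width of the annulus shrinks as m grows.

import numpy as np

rng = np.random.default_rng(0)
for m in (2, 10, 100, 1_000, 10_000):
    X = rng.standard_normal((2_000, m))          # 2000 samples from N(0, I_m)
    norms = np.linalg.norm(X, axis=1)
    print(f"m={m:6d}  sqrt(m)={np.sqrt(m):8.2f}  "
          f"mean ||X||={norms.mean():8.2f}  std ||X||={norms.std():.3f}")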


Random Projections to speed NN Search (NNS)

In a lower-dimensional space m' << m, NNS is much faster.

Possible solution: let f : R^m → R^{m'} be a projection from m to m' ≪ m dimensions. Are x and y near whenever f(x) is near f(y)? Consider x = (100, 0, 0, ..., 0) ∈ R^m and y = (0, 0, ..., 0) ∈ R^m. Ex-a: f_a(x) = x_1 ∈ R? Ex-b: f_b(x) = x_2 ∈ R? (A single fixed coordinate may either separate these two points or collapse them.)

Approach: pick m' vectors u^(1), u^(2), ..., u^(m') ∈ R^m drawn independently at random from the unit-variance spherical Gaussian p(x). Note, the u^(i)'s themselves are neither unit length (mean radius ≈ √m) nor necessarily orthogonal. Define f : R^m → R^{m'} as

f(x) = ( ⟨u^(1), x⟩, ⟨u^(2), x⟩, ..., ⟨u^(m'), x⟩ ) ∈ R^{m'}   (7.8)

i.e., the vector of inner (dot) products.

We claim that for any x, ||f(x)||_2 ≈ √m' ||x||_2 with high probability. Note, the u^(i)'s are not unit vectors, so norms in the lower-dimensional space are larger. To the extent this is true, distances can be approximated in low dimension:

||x − y||_2 = (√m' / √m') ||x − y||_2 ≈ (1/√m') ||f(x − y)||_2 = (1/√m') ||f(x) − f(y)||_2   (7.9)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F35/57
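A minimal sketch of this construction (illustrative sizes, not from the slides): stack the u^(i) as rows of a Gaussian matrix U, compute f(x) = U x for every point, and compare a few true distances to the rescaled projected distances from Eq. (7.9).

import numpy as np

rng = np.random.default_rng(0)
n, m, m_prime = 200, 10_000, 300                 # illustrative sizes
X = rng.standard_normal((n, m))

U = rng.standard_normal((m_prime, m))            # rows are u^(1), ..., u^(m')
F = X @ U.T                                      # f(x) for every point, shape (n, m')

for i, j in [(0, 1), (2, 3), (4, 5)]:
    true_d = np.linalg.norm(X[i] - X[j])
    proj_d = np.linalg.norm(F[i] - F[j]) / np.sqrt(m_prime)
    print(f"pair ({i},{j}): true {true_d:8.3f}   (1/sqrt(m')) * projected {proj_d:8.3f}")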


Random Projection Theorem

Specifically, it can be shown that:

Theorem 7.4.2 (Random Projection Theorem)

Let x ∈ R^m and f be as given. Then for ε ∈ (0, 1),

Pr( | ||f(x)||_2 − √m' ||x||_2 | ≥ ε √m' ||x||_2 ) ≤ 3 e^{−c m' ε²}   (7.10)

where the probability is taken over the random draws of the u vectors.

This follows from the Gaussian Annulus Theorem. So the probability of the norm of the projection deviating from √m' ||x||_2 is exponentially small in m'.

Given n samples x^(i), the probability of any of the O(n²) pairwise distances deviating by such an amount is also small (say δ) by the union bound (Pr(∪_i A_i) ≤ Σ_i Pr(A_i)), as long as m' is not too small, i.e., m' ≥ 3 log(n)/(c ε²), giving δ = 3/n.

This means we can accurately calculate approximate distances in the m'-dimensional space after this random projection.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F36/57


Johnson-Lindenstrauss Lemma

Theorem 7.4.3 (Johnson-Lindenstrauss Lemma)

For any ε ∈ (0, 1) and any n, let m' ≥ (3/(c ε²)) log n. Let X be a set of n points in R^m. The random projection f : R^m → R^{m'} has the property that, for all pairs x^(i), x^(j), with probability at least 1 − 3/(2n) we have

(1 − ε) √m' ||x^(i) − x^(j)|| ≤ ||f(x^(i)) − f(x^(j))|| ≤ (1 + ε) √m' ||x^(i) − x^(j)||   (7.11)

I.e., the probability that any of the (n choose 2) < n²/2 pairs of points deviates in this way is at most 3/(2n), so the projection can be used to do fast NNS.

m' grows only logarithmically with n. In high dimensions m' ≪ m even for small ε.

Example: n = 10^10; in the proof of the theorem c = 96 works, so with ε = 0.05 we need m' ≥ 3/(96 · 0.05²) · log(10^10) ≈ 288 (see the small calculation below). Speedup: 10^10/288 ≈ 34 million!

Other examples: (n = 10000, m' = 115), (n = 1000, m' = 86), (n = 100, m' = 57).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F37/57
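A tiny sketch of the arithmetic behind these target dimensions (assuming the natural log and the c = 96 constant quoted from the proof; rounding may differ from the slide's figures by ±1):

import numpy as np

def jl_dim(n, eps, c=96.0):
    # Target dimension m' from m' >= 3 * log(n) / (c * eps^2).
    return int(round(3.0 * np.log(n) / (c * eps ** 2)))

for n in (1e10, 1e4, 1e3, 1e2):
    print(f"n = {n:>14,.0f}  ->  m' ~ {jl_dim(n, eps=0.05)}")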


Approximate Nearest Neighbor Search: LSH

Recall the nearest neighbor search (NNS) problem: given a set of n points {x^(i)}_{i=1}^n with x^(i) ∈ R^m, finding the point nearest to a query q ∈ R^m is argmin_i d(x^(i), q).

Rather than trying to find argmin_i d(x^(i), q), we find some x^(i) so that d(q, x^(i)) ≤ c · min_j d(q, x^(j)) for some c ≥ 1 and with some (high) probability.

Locality Sensitive Hashing (LSH) helps us achieve this.

Normally, cryptographic hash functions (e.g., SHA-1, SHA-256, etc., computed with Linux's shasum) are designed so that small differences in the input map to different buckets: a collision is unlikely (see the small illustration below).

In LSH, by contrast, small differences are likely to map to the same bucket: a collision is likely for similar inputs.

LSH is similar to clustering (which we will discuss in a future lecture).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F38/57
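A tiny illustration of that contrast (using Python's hashlib rather than shasum; the input strings are arbitrary): two inputs differing in a single byte get essentially unrelated SHA-256 digests, which is the opposite of the locality we want for NNS.

import hashlib

a = b"nearest neighbor search"
b = b"nearest neighbor searcx"            # differs from a in the last byte only

ha = hashlib.sha256(a).hexdigest()
hb = hashlib.sha256(b).hexdigest()
print(ha)
print(hb)
# Position-by-position agreement of the 64 hex digits (about 1/16 expected by chance).
print(sum(c1 == c2 for c1, c2 in zip(ha, hb)), "of 64 hex digits match")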


LSH definition

Given two probabilities p1, p2 with p1 ≥ p2, a constant c > 1, and a distance metric d : X × X → R, where X is the domain of the input (e.g., R^m), a family of hash functions is defined relative to the probabilities p1, p2, the constant c, a notion of closeness (radius r), and the distance metric d.

Definition 7.4.4 (LSH)

A family H of hash functions is said to be (r, c, p1, p2)-LSH, with p1 ≥ p2, r > 0, and c > 1, if (when h is drawn uniformly at random from H):

(a) Pr[h(x) = h(y)] ≥ p1 when d(x, y) ≤ r (close points have a high probability of mapping to the same bucket).

(b) Pr[h(x) = h(y)] ≤ p2 when d(x, y) ≥ cr (distant points have a low probability of mapping to the same bucket).

When things are close (d(x, y) ≤ r), we have a high probability (≥ p1) that the two vectors x, y will hash to the same place (h(x) = h(y)). When things are far (d(x, y) ≥ cr), we have a low probability (≤ p2) that the two vectors x, y will hash to the same place.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F39/57
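As a concrete example of such a family for Euclidean distance (one standard choice, the random-projection bucket hash of Datar et al., 2004, used here as an illustrative sketch rather than the construction developed later in the lecture): h(x) = floor((⟨u, x⟩ + b)/w) with Gaussian u, offset b uniform on [0, w), and bucket width w. A quick Monte Carlo check that close pairs collide far more often than distant pairs:

import numpy as np

rng = np.random.default_rng(0)
m, w = 50, 4.0                                   # dimension and bucket width (illustrative)

def make_hash():
    u = rng.standard_normal(m)                   # random direction
    b = rng.uniform(0.0, w)                      # random offset
    return lambda x: int(np.floor((u @ x + b) / w))

def collision_rate(x, y, trials=2000):
    # Fraction of randomly drawn hash functions on which x and y land in the same bucket.
    hits = 0
    for _ in range(trials):
        h = make_hash()
        hits += (h(x) == h(y))
    return hits / trials

x = rng.standard_normal(m)
near = x + 0.1 * rng.standard_normal(m)          # close pair (small perturbation)
far = x + 10.0 * rng.standard_normal(m)          # distant pair (large perturbation)

print("P[h(x) = h(near)] ~", collision_rate(x, near))   # high: plays the role of p1
print("P[h(x) = h(far)]  ~", collision_rate(x, far))    # low: plays the role of p2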


LSH allows a tradeoff between computation and statistical accuracy.


k-NN LSH DTs

LSH instances

We can define LSH hash functions for a variety of problems, including for the Hamming metric, X = {0, 1}^m, d(x, y) = Σ_{i=1}^{m} 1{x_i ≠ y_i}. A very simple example of an LSH function for Hamming is h(x) = x[a] for some a ∈ {0, 1, . . . , m − 1} with a drawn uniformly at random. We study this in HW4.

Also there exist LSH functions for: (1) Jaccard similarity, X = 2^V, d(A, B) = |A ∩ B| / |A ∪ B|, and (2) Euclidean distance, X = R^m, d(x, y) = ‖x − y‖_2.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F40/57 (pg.123/188)
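To make the single-coordinate Hamming hash concrete, here is a minimal sketch (my own illustration, not from the slides) that draws h(x) = x[a] at random and empirically checks that close pairs collide far more often than far pairs; all function names here are made up for the example.

```python
import random

def hamming_distance(x, y):
    # number of coordinates where the two bit vectors differ
    return sum(xi != yi for xi, yi in zip(x, y))

def draw_hamming_lsh(m):
    # one draw from the family: h(x) = x[a], with coordinate a uniform in {0, ..., m-1}
    a = random.randrange(m)
    return lambda x: x[a]

def collision_rate(x, y, m, trials=10000):
    # empirically estimate Pr[h(x) = h(y)] over random draws of h
    hits = 0
    for _ in range(trials):
        h = draw_hamming_lsh(m)      # the same h is applied to both points
        hits += (h(x) == h(y))
    return hits / trials

m = 32
x = [random.randint(0, 1) for _ in range(m)]
y = x[:]; y[0] ^= 1                  # d(x, y) = 1   (a "close" pair)
z = [1 - b for b in x]               # d(x, z) = m   (a "far" pair)

print(hamming_distance(x, y), hamming_distance(x, z))
# for this family Pr[h(x) = h(y)] = 1 - d(x, y)/m, so roughly 0.97 vs. 0.0 here
print(collision_rate(x, y, m), collision_rate(x, z, m))
```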


k-NN LSH DTs

LSH

In general, h is drawn uniformly at random from H in order to get the above probabilities (i.e., Pr[h(x) = h(y)] ≥ p1 when d(x, y) ≤ r and Pr[h(x) = h(y)] ≤ p2 when d(x, y) ≥ cr).

While p1 ≥ p2, sometimes p1 is too small (and thus Pr[h(x) = h(y)] is too small, meaning we never get a hit).

To allow a more flexible and wider range of values for p1, p2, we construct compound LSHs by using a set of LSHs multiple times, drawn randomly; i.e., g(x) = (h1(x), h2(x), . . . , hk(x)) where each hi is randomly drawn from H.

We sometimes use multiple g’s drawn randomly too.

Homework 4 explores this further, and we see we can get more flexible values for p1 and p2.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F41/57 (pg.127/188)
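A minimal sketch (my own illustration, not from the slides) of a compound hash g(x) = (h1(x), ..., hk(x)) built from the Hamming family above, with several independently drawn g's used as separate tables; the table/bucket layout is an assumption made purely for illustration.

```python
import random
from collections import defaultdict

def draw_compound_lsh(m, k):
    # g(x) = (h_1(x), ..., h_k(x)), each h_i(x) = x[a_i] with a_i uniform in {0, ..., m-1}
    coords = [random.randrange(m) for _ in range(k)]
    return lambda x: tuple(x[a] for a in coords)

def build_tables(data, m, k, num_tables):
    # one hash table per independently drawn g; points sharing g(x) land in the same bucket
    gs = [draw_compound_lsh(m, k) for _ in range(num_tables)]
    tables = [defaultdict(list) for _ in range(num_tables)]
    for idx, x in enumerate(data):
        for g, table in zip(gs, tables):
            table[g(x)].append(idx)
    return gs, tables

def candidate_neighbors(q, gs, tables):
    # union of the buckets that the query q falls into, across all tables
    cands = set()
    for g, table in zip(gs, tables):
        cands.update(table.get(g(q), []))
    return cands
```

Intuitively, raising k drives both collision probabilities down (roughly p1^k and p2^k for independently drawn coordinates), while adding more independently drawn g's raises the chance that a near pair collides in at least one table.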


[Handwritten aside: exact-membership hashing. Each x ∈ X takes many bits to represent, and D = {x^(1), ..., x^(n)} has large n. To answer the query “is x in Set(D)?”: a linear scan is O(n) and binary search is O(log n); can we do O(1)? With a hash h : X → [L], we allocate an array of size L and, to check whether x is in the set, look at slot (h(x) mod L) to see if it’s occupied.]
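A tiny sketch of the idea in that aside (my own illustration), assuming a fixed table size L and ignoring collision handling for brevity; a real table would chain or probe instead of overwriting.

```python
def build_table(data, hash_fn, L):
    # array of L slots; each occupied slot remembers the item stored there
    table = [None] * L
    for x in data:
        table[hash_fn(x) % L] = x
    return table

def maybe_member(table, x, hash_fn, L):
    # O(1) check of the single slot x would hash to (false negatives are possible
    # here only because this toy version overwrites on collision)
    return table[hash_fn(x) % L] == x

L = 1024
table = build_table(["cat", "dog", "bird"], hash, L)
print(maybe_member(table, "dog", hash, L))   # True
print(maybe_member(table, "fish", hash, L))  # False
```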


k-NN LSH DTs

Recall, from lecture 3

Recall from lecture 3 the following slide.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F42/57 (pg.132/188)


k-NN LSH DTs

Linear Voronoi tessellation

When discriminant functions are linear (i.e., ⟨θ^(j), x⟩), the boundaries are lines. Example in two dimensions:

From https://en.wikipedia.org/wiki/Voronoi_diagram

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F43/57 (pg.133/188)


k-NN LSH DTs

Using a Tree to Make Decisions
https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

4 leaf (or terminal) nodes, 3 internal nodes, 1 root node (as always)

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F44/57 (pg.134/188)


k-NN LSH DTs

Iris flower classifier via petal length and width

https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F45/57 (pg.135/188)


k-NN LSH DTs

Regression with a tree, constant decisions

https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F46/57 (pg.136/188)


k-NN LSH DTs

Regression with a tree, linear decisions

https://medium.com/x8-the-ai-community/decision-trees-an-intuitive-introduction-86c2b39c1a6c

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F47/57 (pg.137/188)


k-NN LSH DTs

Decision Trees

Partition space into rectangular regions; in each region we make a final decision (i.e., a regression value, or a classification).

Regions are formed by recursively splitting space based on binary questions about one coordinate of the input at a time (e.g., “Is x7 < 92?”).

Example (from Hastie et al., 2009 text). [Figure: the (X1, X2) plane partitioned into regions R1, ..., R5 by split points t1, ..., t4, together with the corresponding binary tree of splits.]

Corresponds to

h(x) = Σ_{i=1}^{5} c_i 1{x ∈ R_i}   (7.12)

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F48/57 (pg.138/188)
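As a minimal sketch of equation (7.12) (my own illustration): a tree's prediction is just a lookup of the constant attached to whichever region contains x. The region representation below (axis-aligned boxes as dicts of per-coordinate bounds) is an assumption matching the rectangular-partition picture, not anything from the slides.

```python
# Each region is an axis-aligned box: dict of coordinate -> (low, high) bounds,
# paired with the constant c_k predicted inside it.
regions = [
    ({0: (-float("inf"), 2.0)}, 5.0),                          # R1: x0 <= 2
    ({0: (2.0, float("inf")), 1: (-float("inf"), 1.0)}, 1.5),  # R2: x0 > 2, x1 <= 1
    ({0: (2.0, float("inf")), 1: (1.0, float("inf"))}, 3.2),   # R3: x0 > 2, x1 > 1
]

def h(x):
    # piecewise-constant prediction: h(x) = sum_k c_k * 1{x in R_k}
    for bounds, c in regions:
        if all(lo < x[j] <= hi for j, (lo, hi) in bounds.items()):
            return c
    raise ValueError("x falls outside every region")

print(h([1.0, 0.0]))   # 5.0
print(h([3.0, 4.0]))   # 3.2
```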


k-NN LSH DTs

Regression Trees

Each of the r regions could correspond to its own function, i.e., h_{θk}(x). Here we keep it simple: each region is constant, h_{θk}(x) = c_k.

h_θ(x) = Σ_{k=1}^{r} c_k 1{x ∈ R_k}   (7.13)

Given data D = {(x^(i), y^(i))}, how to grow the tree? Minimize the squared error Σ_i (y^(i) − h_θ(x^(i)))².

Once regions are decided, the best within-region constant predictor is the simple average c_k = average(y^(i) | x^(i) ∈ R_k), i.e.:

c_k = (1 / |{i : x^(i) ∈ R_k}|) Σ_{i : x^(i) ∈ R_k} y^(i)   (7.14)–(7.15)

How to find the r regions?

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F49/57 (pg.142/188)
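A small sketch of the within-region constant (7.14)–(7.15), my own illustration: given an assignment of training points to regions, the best constant under squared error is the region mean.

```python
from collections import defaultdict

def region_means(xs, ys, region_of):
    # region_of(x) -> region index k; returns c_k = mean of y^(i) with x^(i) in R_k
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        k = region_of(x)
        sums[k] += y
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

# toy 1-D example: R_1 = {x <= 0}, R_2 = {x > 0}
region_of = lambda x: 1 if x <= 0 else 2
print(region_means([-2.0, -1.0, 3.0], [1.0, 3.0, 10.0], region_of))  # {1: 2.0, 2: 10.0}
```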


k-NN LSH DTs

Greedy strategy, binary splitting

Splitting variable j ∈ [m] with split point s ∈ R defines two half-planes

R_1(j, s) = {x : x_j ≤ s},   R_2(j, s) = {x : x_j > s}   (7.16)

Use optimization: find the (j, s) that achieves the best (minimum) in

min_{j,s} [ Σ_{x^(i) ∈ R_1(j,s)} (y^(i) − c_1)² + Σ_{x^(i) ∈ R_2(j,s)} (y^(i) − c_2)² ]   (7.17)

where c_k = average(y^(i) | x^(i) ∈ R_k).

For each j, we can find s by sorting the data by x_j and scanning in order to achieve the min. Select that (j, s) as the split point.

This can also be seen as maximizing the MSE gain

max_{j,s} ( Σ_{x^(i)} (y^(i) − c_0)² − [ Σ_{x^(i) ∈ R_1(j,s)} (y^(i) − c_1)² + Σ_{x^(i) ∈ R_2(j,s)} (y^(i) − c_2)² ] )   (7.18)

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F50/57 (pg.146/188)
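A sketch of the sort-and-scan split search for one coordinate (my own illustration, not the slides' code): running sums make each candidate split O(1) after an O(n log n) sort; the splitting variable j is then the coordinate whose best split has the smallest error.

```python
def best_split_1d(xs, ys):
    """Return (s, sse) minimizing squared error of a two-constant fit x<=s vs x>s."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs_s = [xs[i] for i in order]
    ys_s = [ys[i] for i in order]
    n = len(ys_s)
    total_sum = sum(ys_s)
    total_sq = sum(y * y for y in ys_s)

    best = (None, float("inf"))
    left_sum = left_sq = 0.0
    for i in range(n - 1):                      # candidate split between i and i+1
        left_sum += ys_s[i]; left_sq += ys_s[i] ** 2
        if xs_s[i] == xs_s[i + 1]:
            continue                            # can't split between equal x values
        nl, nr = i + 1, n - i - 1
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # SSE of a constant fit is sum(y^2) - n * mean^2, computed per side
        sse = (left_sq - left_sum ** 2 / nl) + (right_sq - right_sum ** 2 / nr)
        if sse < best[1]:
            best = ((xs_s[i] + xs_s[i + 1]) / 2, sse)
    return best

print(best_split_1d([1.0, 2.0, 10.0, 11.0], [1.0, 1.2, 5.0, 5.2]))  # splits at x = 6.0
```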


[Handwritten aside: why the within-region constant is the mean. Consider J(c) = Σ_i (x^(i) − c)² and find argmin over c ∈ R: setting dJ/dc = −2 Σ_i (x^(i) − c) = 0 gives Σ_i x^(i) = n·c, i.e., c = (1/n) Σ_i x^(i).]


[Handwritten aside on cost: with D = {(x^(i), y^(i))}, x^(i) ∈ R^m, considering all splits at a node costs about O(m·n) per scan once each coordinate is sorted, or O(m·n log n) including the sorts.]


k-NN LSH DTs

Greedy strategy, recursive binary splitting

Given R_1 and R_2, apply the same strategy to each of R_1 and R_2.

When to stop? One option: stop when the gain falls below a threshold.

Better: minimize cost/complexity. Grow a large tree T_0, and let T ⊂ T_0 be any subtree (obtained via pruning internal nodes, merging regions).

Let: k be a terminal (leaf) node with region R_k; |T| be the number of terminal nodes; n_k = |{x^(i) ∈ D : x^(i) ∈ R_k}| be the number of data points in region k; c_k = (1/n_k) Σ_{x^(i) ∈ R_k} y^(i) be the prediction in region k; and Q_k(T) = (1/n_k) Σ_{x^(i) ∈ R_k} (y^(i) − c_k)² be the error in region k.

The overall cost complexity is defined as

C_α(T) = Σ_{k=1}^{|T|} n_k Q_k(T) + α|T|   (7.19)

(Here α is a tradeoff coefficient between the loss term and the |T| regularizer.)

Goal: for each α, find the T_α ⊂ T_0 that minimizes C_α(T_α). Weakest link pruning (Breiman et al., 1984) solves this efficiently by successively collapsing the internal node with the smallest per-node increase in Σ_{k=1}^{|T|} n_k Q_k(T).

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F51/57 (pg.150/188)
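A small sketch of the cost-complexity score (7.19), my own illustration: given per-leaf counts n_k and errors Q_k(T), C_α(T) is just the weighted error plus α times the leaf count, and a larger α starts to favor smaller (pruned) trees.

```python
def cost_complexity(leaf_stats, alpha):
    """leaf_stats: list of (n_k, Q_k) pairs, one per terminal node of T."""
    loss = sum(n_k * q_k for n_k, q_k in leaf_stats)
    return loss + alpha * len(leaf_stats)

# a 5-leaf tree vs. a pruned 3-leaf tree: at alpha = 5 the smaller tree wins
big   = [(10, 0.1), (8, 0.2), (12, 0.05), (5, 0.3), (9, 0.15)]
small = [(18, 0.4), (17, 0.25), (9, 0.15)]
for alpha in (0.0, 2.0, 5.0):
    print(alpha, cost_complexity(big, alpha), cost_complexity(small, alpha))
```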


k-NN LSH DTs

Classification Trees

At each leaf, predict a class rather than a real value. Produce a region-specific posterior distribution to make classification decisions for an x ∈ R^m. With n_k = count of training samples in region k, we have:

p(y = j | x) = (1 / n_{k(x)}) Σ_{x^(i) ∈ R_{k(x)}} 1{y^(i) = j} = p_{k(x)}(y = j)   (7.20)

where k(x) is the region containing x. Make decisions using ŷ(x) = argmax_j p(y = j | x). So region k always decides class y_k.

With this we can use the same greedy strategy to grow the tree from the top down. How do we judge each region? Several ways:

Classification error:  (1 / n_k) Σ_{i ∈ R_k} 1{y^(i) ≠ y_k}   (7.21)

Entropy in region k:  − Σ_j p_k(y = j) log p_k(y = j)   (7.22)

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F52/57 (pg.162/188)
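A small sketch (my own illustration) of the region posterior (7.20) and the two impurity measures (7.21)–(7.22), computed from the labels that fall in one region:

```python
import math
from collections import Counter

def region_posterior(labels):
    # p_k(y = j): empirical class histogram of the labels in region k
    n = len(labels)
    return {j: c / n for j, c in Counter(labels).items()}

def classification_error(labels):
    p = region_posterior(labels)
    return 1.0 - max(p.values())          # fraction not equal to the majority class y_k

def entropy(labels):
    p = region_posterior(labels)
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

labels = ["setosa", "setosa", "setosa", "versicolor"]
print(region_posterior(labels))           # {'setosa': 0.75, 'versicolor': 0.25}
print(classification_error(labels))       # 0.25
print(entropy(labels))                    # ~0.811 bits
```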


k-NN LSH DTs

Entropy

Definition 7.5.1 (Entropy)

Given a discrete random variable X over a finite-sized alphabet, the entropy of the random variable is:

H(X) ≜ E[log(1/p(X))] = Σ_x p(x) log(1/p(x)) = − Σ_x p(x) log p(x)   (7.23)

1/p(x) is the surprise of x; − log p(x) is the log surprise.

Entropy is typically in units of “bits” (logs base 2) but can also be in units of “nats” (logs base e). For optimization it doesn’t matter.

Measures the degree of uncertainty in a distribution.

Measures the disorder or spread of a distribution.

Measures the “choice” that a source has in choosing symbols according to the density (higher entropy means more choice).

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F53/57 (pg.165/188)
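A quick numeric illustration (my own) of (7.23): the uniform distribution over four symbols attains the maximum entropy of 2 bits, while more peaked distributions have less.

```python
import math

def H(p):
    # entropy in bits of a probability vector p (zero-probability terms contribute 0)
    return -sum(q * math.log2(q) for q in p if q > 0)

print(H([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (maximum uncertainty over 4 symbols)
print(H([0.7, 0.1, 0.1, 0.1]))       # ~1.36 bits
print(H([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits (no uncertainty)
```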


k-NN LSH DTs

Entropy Of Distributions

[Figure: three example distributions p(x) over x: a sharply peaked one (low entropy), a nearly uniform one (high entropy), and one in between.]

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F54/57 (pg.170/188)


k-NN LSH DTs

Binary Entropy

Binary alphabet, X ∈ {0, 1} say.

p(X = 1) = p = 1 − p(X = 0).

H(X) = −p log p − (1 − p) log(1 − p) = H(p).

As a function of p, we get: [Figure: the binary entropy curve H(p) for p ∈ [0, 1].]

Note: greatest uncertainty (value 1) when p = 0.5 and least uncertainty (value 0) when p = 0 or p = 1.

Note also: H(p) is concave in p.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F55/57 (pg.176/188)
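A tiny numeric check of H(p), my own illustration:

```python
import math

def binary_entropy(p):
    # H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))   # peaks at 1.0 bit when p = 0.5
```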


k-NN LSH DTs

Classification Trees and Entropy

We can measure entropy in each region.

Again, splitting variable j ∈ [m] with split point s ∈ R defines two half-planes

R_1(j, s) = {x : x_j ≤ s},   R_2(j, s) = {x : x_j > s}   (7.24)

A region’s posterior is essentially a histogram of the class labels in the region.

Starting from one large region, we find the variable j and split s that, when made, reduce the entropy the most.

Maximizing information gain, split region R_0 into R_1(j, s) and R_2(j, s):

max_{j,s} [ H(p_0(y)) − ( H(p_{R_1(j,s)}(y)) + H(p_{R_2(j,s)}(y)) ) ]   (7.25)

This leads to regions with lower entropy, higher certainty, and less diversity in each region, creating homogeneous regions.

Term: CART, for classification and regression trees.

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F56/57 (pg.177/188)
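A sketch of the information-gain criterion (7.25) applied to one coordinate (my own illustration, using the unweighted child entropies exactly as written on the slide; weighted versions are also common):

```python
import math
from collections import Counter

def entropy(labels):
    # empirical entropy (in bits) of the class histogram of `labels`
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain_split(xs, ys):
    """Best threshold s on one coordinate by the gain H(parent) - (H(left) + H(right))."""
    parent = entropy(ys)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = (None, -float("inf"))
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue                     # no threshold separates equal x values
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        gain = parent - (entropy(left) + entropy(right))
        if gain > best[1]:
            best = ((xs[order[cut - 1]] + xs[order[cut]]) / 2, gain)
    return best

# petal lengths with species labels: the best cut cleanly separates the two classes
xs = [1.4, 1.3, 4.7, 5.1]
ys = ["setosa", "setosa", "versicolor", "versicolor"]
print(information_gain_split(xs, ys))    # (3.05, 1.0): one bit of information gained
```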


k-NN LSH DTs

Trees, Expressivity, Bias, and Variance

Decision trees are extremely flexible.

Like k-NN methods, DT methods in general have high variance and low bias (they can flexibly fit any data set). With a tall tree, a DT can perfectly fit any training data that has no conflicting labels (i.e., no two examples with different y for the same x). Like nearest neighbor, but the neighborhoods are rectangular (for the top-down greedy tree procedure); a small numerical sketch of this is given below.

A top-down binary-split tree can't achieve all possible region layouts, e.g.: [Figure: an example partition of the (X1, X2) plane that recursive axis-aligned binary splits cannot produce.]

DTs can be constructed using other procedures that can do this.

Another advantage of trees: they lead to interpretable machine learning models, since all decisions are based on the original inputs rather than on mysterious learned non-convex combinations thereof.
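
To see the high-variance / low-bias point numerically, here is a quick sketch (my own illustration, assuming scikit-learn and NumPy are available; the synthetic data is made up) comparing an unrestricted tall tree, which memorizes non-conflicting training labels, against a depth-limited tree.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# Noisy XOR-like labels; the continuous x's are all distinct, so labels are non-conflicting.
y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.random(400) < 0.3)).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

tall = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)             # unrestricted depth
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xtr, ytr)

print("tall tree    train/test accuracy:", tall.score(Xtr, ytr), tall.score(Xte, yte))
print("depth-2 tree train/test accuracy:", shallow.score(Xtr, ytr), shallow.score(Xte, yte))

The tall tree reaches 1.0 training accuracy (it perfectly fits the non-conflicting labels), and the gap between its training and test accuracy on this noisy data illustrates the variance side of the tradeoff; the depth-limited tree trades some of that flexibility for lower variance.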

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 7 - May 11th/13th, 2020 F57/57 (pg.184/188)


