Nonparametric Active Learning
Robert Nowak, UW-Madison, [email protected]
Steve Hanneke, Toyota Technological Institute at Chicago, [email protected]


Page 1: Nonparametric Active Learning - Robert Nowak

Nonparametric Active Learning

Robert Nowak, UW-Madison, [email protected]

Steve Hanneke, Toyota Technological Institute at Chicago, [email protected]

Page 2: Nonparametric Active Learning - Robert Nowak

Active Learning and Inductive Bias

Model Bias: Standard active learning algorithms perform no better than passive or might even converge to suboptimal solutions!

[Figure: two-dimensional data (feature 1 vs. feature 2) showing the best linear classifier and a suboptimal linear classifier.]

Page 3: Nonparametric Active Learning - Robert Nowak

Two Faces of Active Learning

[Figure panels (a)-(c): binary (+1 / -1) labeled data illustrating the two approaches.]

• label examples close to the estimated decision boundary

• find clusters in the unlabeled data and label one representative from each

Dasgupta, Sanjoy. "Two faces of active learning." Theoretical computer science 412.19 (2011): 1767-1781.

Goal: Use nonparametric (or overparameterized) models to avoid bias and design active learning algorithms that exploit intrinsic structure in data

Page 4: Nonparametric Active Learning - Robert Nowak

Hierarchical Clustering for Active Learning

Theorem: If it is possible to prune the cluster tree to m leaves that are fairly pure in the labels of their constituent points, then O(m) labeled examples suffice to accurately label the entire dataset

Dasgupta and Hsu. "Hierarchical sampling for active learning." ICML 2008

1. Hierarchically cluster the unlabeled data.

2. Label examples to test the homogeneity of each cluster (see the sketch below).
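The following is a minimal sketch of this cluster-then-query idea, not the Dasgupta-Hsu algorithm itself (which adaptively refines the pruning rather than fixing a number of clusters); the `query_label` oracle and the fixed pruning size are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_active_label(X, query_label, n_clusters=8, queries_per_cluster=3, seed=0):
    """Label all of X by querying a few labels per cluster of a hierarchical clustering."""
    rng = np.random.default_rng(seed)
    Z = linkage(X, method="ward")                        # 1. hierarchically cluster unlabeled data
    assignment = fcluster(Z, t=n_clusters, criterion="maxclust")
    y_hat = np.empty(len(X), dtype=int)
    for c in np.unique(assignment):
        idx = np.flatnonzero(assignment == c)
        probe = rng.choice(idx, size=min(queries_per_cluster, len(idx)), replace=False)
        labels = [query_label(i) for i in probe]         # 2. query labels to test homogeneity
        y_hat[idx] = max(set(labels), key=labels.count)  # propagate the majority label
    return y_hat
```

In the theorem above, if a pruning into m fairly pure leaves exists, a constant number of queries per leaf, O(m) labels in total, suffices; the actual algorithm finds such a pruning adaptively rather than fixing the number of clusters in advance.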

Page 5: Nonparametric Active Learning - Robert Nowak

Combining Active and Semi-Supervised Learning

1. Construct a nearest-neighbor graph of the unlabeled dataset.

2. Using a prior model based on the graph Laplacian, select labels to minimize the predicted risk.

3. Propagate labels to the rest of the graph: e.g., nearest neighbors, graph Laplacian, etc. (see the sketch below).

Zhu, Xiaojin, John Lafferty, and Zoubin Ghahramani. "Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions." ICML 2003 workshop. Vol. 3. 2003.
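Here is a minimal sketch, under stated assumptions, of harmonic-function label propagation on a k-NN graph in the spirit of Zhu, Lafferty, and Ghahramani (2003); it is not their code, it omits the risk-minimizing query selection, and it assumes every unlabeled point is connected to some labeled point.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def propagate_labels(X, y_labeled, labeled_idx, k=10):
    """Predict +/-1 labels for all points from a few labeled ones via the graph Laplacian."""
    labeled_idx = np.asarray(labeled_idx)
    W = kneighbors_graph(X, k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T).toarray()                  # symmetrized k-NN adjacency
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    n = len(X)
    u = np.setdiff1d(np.arange(n), labeled_idx)    # unlabeled indices
    f = np.empty(n)
    f[labeled_idx] = y_labeled
    # harmonic solution: L_uu f_u = -L_ul y_l
    rhs = -L[np.ix_(u, labeled_idx)] @ np.asarray(y_labeled, dtype=float)
    f[u] = np.linalg.solve(L[np.ix_(u, u)], rhs)
    return np.sign(f)
```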

Page 6: Nonparametric Active Learning - Robert Nowak

Provably Correct Algorithm: Binary Search on Graph

Basic idea: randomly select nodes for labeling UNTIL at least one is red and one is blue; then find all shortest paths between red and blue labeled nodes and bisect the shortest-shortest path.
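A rough sketch of the bisection step is below; this is not the full S² algorithm of Dasarathy, Nowak, and Zhu (which also removes identified cut edges and repeats the process), and the function name is mine.

```python
import itertools
import networkx as nx

def next_query(G, labels):
    """Return the vertex to label next: the midpoint of the shortest path between
    any oppositely labeled pair, or None if every such pair is already adjacent."""
    pos = [v for v, y in labels.items() if y == +1]
    neg = [v for v, y in labels.items() if y == -1]
    best = None
    for s, t in itertools.product(pos, neg):
        try:
            path = nx.shortest_path(G, s, t)
        except nx.NetworkXNoPath:
            continue
        if best is None or len(path) < len(best):
            best = path                          # shortest of the shortest paths
    if best is None or len(best) <= 2:
        return None                              # cut edge already localized (or no path)
    return best[len(best) // 2]                  # bisect: query the midpoint vertex
```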

Page 7: Nonparametric Active Learning - Robert Nowak

Binary Search on Graph

Propagate labels to the rest of the graph: e.g., nearest neighbors, graph Laplacian, etc.

Dasarathy, Gautam, Robert Nowak, and Xiaojin Zhu. "S²: An efficient graph based active learning algorithm with application to nonparametric classification." Conference on Learning Theory. 2015.

Recursive application of shortest-shortest path bisection efficiently identifies cutset, partitioning graph into pure-labeled components

Theorem: If cutset consists of m edges, then O(m) labeled examples suffice to accurately label the entire dataset

Page 8: Nonparametric Active Learning - Robert Nowak

Active Learning with Kernels and Neural Nets

Active learning based on nearest-neighbor graphs and clustering can be effective, but requires two separate steps:

1. Build a graph or partition on the unlabeled dataset.

2. Exploit the graph/cluster structure for active learning.

Can we develop similar procedures that can be applied directly to popular classifiers like kernel methods and neural networks?

Introduction to Neural Networks (© 2018 Robert Nowak)

1 Neural Networks

A multilayer neural network [1] is a mapping of the following form:

y = W_L f(W_{L-1} \cdots f(W_1 x) \cdots),

where L is the number of layers, W_j is the weight matrix for layer j = 1, ..., L, and f is a (sub)differentiable "activation" function (e.g., ReLU or sigmoid); for any vector v \in \mathbb{R}^k the nonlinearity is applied element-wise, f(v) = [f(v_1), f(v_2), ..., f(v_k)]^T. In this note we will focus on the case where the output y is scalar (e.g., binary classification), and so W_L is a vector. Often a bias or offset is also included. For example, if x_j denotes the input to layer j, then the output can be written as x_{j+1} = f(W_j x_j + b_j), where b_j is a vector of the biases for each node in that layer. Equivalently, we can augment the vector x_j with a 1 and add a corresponding additional column equal to b_j to the matrix W_j, and thus maintain the notationally compact form above.

2 Kernel Machines are Two-Layer Networks

Recall the idea of a kernel machine, defined by a positive definite kernel k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}. Let \{x_i, y_i\}_{i=1}^n be a set of training data and consider mappings of the form

y = \sum_{i=1}^n \alpha_i k(x_i, x).

Many kernels are expressed in terms of the inner product x_i^T x or the norm \|x_i - x\|^2 = x_i^T x_i - 2 x_i^T x + x^T x. In such cases we can interpret the kernel machine as a two-layer network in which the first-layer weights are W_1^T = [x_1, ..., x_n], the activation function f is defined by the kernel, and the output-layer weight vector is W_2 = [\alpha_1, ..., \alpha_n]. In the case of kernels defined in terms of the norm \|x_i - x\|^2, the terms b_i := x_i^T x_i + x^T x can be represented by the bias of that node.

3 Linear-in-Parameters Models vs. Nonlinear Parameterizations

Neural networks with nonlinear activation functions and kernel methods with nonlinear kernels both represent nonlinear mappings. However, there is one key distinction. In kernel machines, the input-layer weights are fixed to be the training example feature vectors \{x_i\} and only the output-layer weights are adjusted via optimization to fit the model to the training labels \{y_i\}. This sort of model is called a linear-in-parameters model, and it leads to simple and well-understood optimization problems using convex loss functions (e.g., squared error, hinge, logistic). Neural networks are trained by adjusting the weights in all layers of the network. Hence, neural network models are nonlinear in the parameters, which makes optimization more difficult.
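As a quick illustration of the forward pass just described, here is a minimal numpy sketch (my own, with ReLU chosen as the activation and biases handled as in the x_{j+1} = f(W_j x_j + b_j) form):

```python
import numpy as np

def forward(weights, biases, x, f=lambda v: np.maximum(v, 0.0)):
    """Compute y = W_L f(W_{L-1} ... f(W_1 x) ...); weights = [W_1, ..., W_L]."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = f(W @ h + b)                  # hidden layers: affine map, then element-wise f
    return weights[-1] @ h + biases[-1]   # output layer: linear; scalar if W_L is a row vector
```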


multilayer neural net vs. kernelized classifier
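The two-layer-network view of a kernel machine from the notes above can be made concrete with a short numpy sketch (the Gaussian/RBF kernel here is just an illustrative choice; the first "layer" is fixed to the training inputs and only the output weights alpha are learned):

```python
import numpy as np

def rbf(sq_dist, bandwidth=1.0):
    return np.exp(-sq_dist / (2.0 * bandwidth**2))   # the kernel plays the role of the activation

def kernel_machine_predict(X_train, alpha, x, bandwidth=1.0):
    """y(x) = sum_i alpha_i k(x_i, x), written as a two-layer forward pass."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)    # layer 1: fixed "weights" W_1^T = [x_1, ..., x_n]
    h = rbf(sq_dists, bandwidth)                     # hidden activations k(x_i, x)
    return alpha @ h                                 # layer 2: learned output weights W_2 = alpha
```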

Page 9: Nonparametric Active Learning - Robert Nowak

Rethinking Conventional Wisdom

[Plot: Inception net on CIFAR10, error vs. training steps, showing no bias and low variance.]

Contrary to conventional wisdom, deep nets are trained to perfectly fit the training data, yet still generalize well.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).

Page 10: Nonparametric Active Learning - Robert Nowak

Double-Descent

from Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. "Reconciling modern machine learning and the bias-variance trade-off." arXiv preprint arXiv:1812.11118 (2018).

[Figure 1 from Belkin et al.: (a) the classical U-shaped "bias-variance" risk curve, plotting training and test risk against the complexity of H, with under-fitting and over-fitting regimes around a "sweet spot"; (b) the "double descent" risk curve, with an under-parameterized "classical" regime and an over-parameterized "modern" interpolating regime separated by the interpolation threshold.]

Figure 1: Curves for training risk (dashed line) and test risk (solid line). (a) The classical U-shaped risk curve arising from the bias-variance trade-off. (b) The double descent risk curve, which incorporates the U-shaped risk curve (i.e., the "classical" regime) together with the observed behavior from using high complexity function classes (i.e., the "modern" interpolating regime), separated by the interpolation threshold. The predictors to the right of the interpolation threshold have zero training risk.

Most classical analyses of ERM are based on controlling the complexity of the function class H (sometimes called the model complexity) by managing the bias-variance trade-off (cf. [15]):

1. If H is too small, all predictors in H may under-fit the training data (i.e., have large empirical risk) and hence predict poorly on new data.

2. If H is too large, the empirical risk minimizer may over-fit spurious patterns in the training data that result in poor accuracy on new examples (i.e., small empirical risk but large true risk).

The classical analysis is concerned with finding the "sweet spot" for the function class complexity that balances these two concerns. The control of the function class complexity may be explicit, via the choice of H (e.g., picking the neural network architecture), or it may be implicit, using regularization (e.g., early stopping, complexity penalization). When a suitable balance is achieved, the performance of h_n on the training data is said to generalize to the population P. This is summarized in the classical U-shaped risk curve, shown in Figure 1(a).

Conventional wisdom in machine learning expounds the hazards of over-fitting. The textbook corollary of these hazards is that "a model with zero training error is overfit to the training data and will typically generalize poorly" [15].

However, practitioners routinely use modern machine learning methods to fit the training data perfectly or near-perfectly. For instance, highly complex neural networks and other non-linear predictors are often trained to have very low or even zero training risk. In spite of the high function class complexity and near-perfect fit to training data, these predictors often have excellent generalization performance, i.e., they give accurate predictions on new data. Indeed, this behavior has guided a best practice in deep learning for choosing the neural network architecture, specifically that the network architecture should be large enough to permit effortless interpolation of the training data [21]. Moreover, recent evidence indicates that neural networks and kernel machines trained to interpolate the training data obtain near-optimal test results even when the training data are corrupted with high levels of noise [28, 3].

In this work, we bridge the gap between classical statistical analyses and the modern practice of machine learning. We show that the classical U-shaped risk curve of the bias-variance trade-off, as well as the modern behavior where high function class complexity is compatible with good generalization behavior, can both be empirically witnessed with some important function classes including neural networks.

[Schematic: error vs. model complexity (number of parameters), showing test error and training error.]

Maximizing model smoothness subject to the interpolation constraints is a form of Occam's razor.
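A minimal sketch of this idea for kernel machines is below (the Laplace kernel is used as an example, matching the experiments later in the talk): solving K alpha = y yields the interpolant of minimum RKHS norm, i.e., the smoothest function consistent with the labels.

```python
import numpy as np

def laplace_kernel(A, B, bandwidth=1.0):
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)

def min_norm_interpolator(X, y, bandwidth=1.0):
    """Return the minimum-RKHS-norm interpolant of (X, y) and its squared norm."""
    K = laplace_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K, y)                  # enforces f(x_i) = y_i exactly

    def f(X_new):
        return laplace_kernel(X_new, X, bandwidth) @ alpha

    rkhs_norm_sq = float(alpha @ K @ alpha)        # ||f||_H^2 = alpha^T K alpha
    return f, rkhs_norm_sq
```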

Page 11: Nonparametric Active Learning - Robert Nowak

Theory Meets Practice

Figure 3: Training and test risks of fully connected neural networks with a single layer of H hidden units, learned on a subset of MNIST (n = 4 × 10^3, K = 10 classes; left: zero-one loss; right: squared loss). The number of parameters is (d + 1) × H + (H + 1) × K. The interpolation threshold is observed at n × K.

3 Decision trees and ensemble methods

Does the double descent risk curve manifest with other prediction methods besides neural networks? We give empirical evidence that the families of functions explored by boosting with decision trees and Random Forests also show similar generalization behavior as neural nets, both before and after the interpolation threshold.

AdaBoost and Random Forests have recently been investigated in the interpolation regime by Wyner et al. [27]. In particular, they give empirical evidence that, when AdaBoost and Random Forests are used with maximally large (interpolating) decision trees, the flexibility of the fitting methods yields interpolating predictors that are more robust to noise in the training data than the predictors produced by rigid, non-interpolating methods (e.g., AdaBoost or Random Forests with shallow trees). This in turn is said to yield better generalization. A crucial aspect of the predictors is that they are averages of these (near) interpolating trees (a property that is patent with Random Forests, but also observed for AdaBoost, as discussed by Wyner et al.): the averaging ensures that the resulting function is substantially smoother than any individual tree, which aligns with an inductive bias that is compatible with many real world problems.

We can understand these flexible fitting methods in the context of the double descent risk curve. Observe that the size of a decision tree (controlled by the number of leaves) is a natural way to parameterize the function class complexity: a tree with only two leaves corresponds to two-piece constant functions with an axis-aligned boundary, while a tree with n leaves can interpolate n training examples. It is a classical observation that the U-shaped bias-variance trade-off curve manifests in many problems when the class complexity is considered this way [15]. (The interpolation threshold may be reached with fewer than n leaves in many cases, but n is clearly an upper bound.) To further enlarge the function class, we consider ensembles (averages) of several interpolating trees.¹ So, beyond the interpolation threshold, we use the number of such trees to parameterize the function class complexity.

¹ These trees are trained in the way proposed in Random Forest except without bootstrap re-sampling. This is similar to the PERT method of Cutler and Zhao [12].

from Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. "Reconciling modern machine learning and the bias-variance trade-off." arXiv preprint arXiv:1812.11118 (2018).

MNIST experiments with kernels, random features, and neural nets

Page 12: Nonparametric Active Learning - Robert Nowak

Rethinking Active Learning

Standard active learning theory and methods are based on bounding test error in terms of training error (e.g., VC theory)

In the interpolating regime, the training error is identically zero and yields no information about the expected loss

Page 13: Nonparametric Active Learning - Robert Nowak

Kernel Machines and Neural Networks

[Diagram: a network with inputs x_1, ..., x_d, a hidden layer, and output y.]

• A kernel machine is a single hidden-layer neural network

• Interpolation is possible with an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) or an overparameterized neural net

Page 14: Nonparametric Active Learning - Robert Nowak

Active Learning in Overparameterized Setting

[Figure: f = model interpolating 6 initial labeled examples (+1 and -1), together with unlabeled examples and a candidate point u.]

Which example should be labeled next?

Page 15: Nonparametric Active Learning - Robert Nowak

Active Learning in Overparameterized Setting

New example u in between identically labeled examples:

• assigning u the same label as its neighbors is easy to interpolate: the interpolant f_{u^-} is smoother and \|f_{u^-}\| is small

• assigning u the opposite label is difficult to interpolate: the interpolant f_{u^+} is less smooth and \|f_{u^+}\| is large

\|\cdot\| = RKHS norm or norm of neural network weights

Page 16: Nonparametric Active Learning - Robert Nowak

Active Learning in Overparameterized Setting

New example u in between oppositely labeled examples:

• either labeling of u is difficult to interpolate: both interpolants f_{u^+} and f_{u^-} are less smooth, and both \|f_{u^+}\| and \|f_{u^-}\| are large

\|\cdot\| = RKHS norm or norm of neural network weights

Page 17: Nonparametric Active Learning - Robert Nowak

Max-Min Sampling Criterion

Selection of next example to label

u^\star = \arg\max_{u \in U} \; \min\big\{ \|f_{u^-}\|, \ \|f_{u^+}\| \big\}

where f_{u^+} and f_{u^-} denote the minimum-norm interpolators of the labeled examples together with (u, +1) and (u, -1), respectively, and U is the pool of unlabeled examples.

Intuition: attacking the most challenging points in the input space first may eliminate the need to label other “easier” examples later

Mina Karzand and Robert Nowak. "Active Learning in the Overparameterized and Interpolating Regime." arXiv preprint arXiv:1905.12782 (2019).
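Here is a hedged sketch of this max-min rule for the RKHS case (my own helper names, not the paper's code): for each candidate u, fit the minimum-norm interpolator of the labeled data plus (u, +1) and plus (u, -1), and score u by the smaller of the two squared RKHS norms.

```python
import numpy as np

def laplace_kernel(A, B, bandwidth=1.0):
    return np.exp(-np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) / bandwidth)

def max_min_query(X_labeled, y_labeled, U, bandwidth=1.0):
    """Return the index into the unlabeled pool U of the next example to label."""
    scores = []
    for u in U:
        X_aug = np.vstack([X_labeled, u[None, :]])
        K = laplace_kernel(X_aug, X_aug, bandwidth)
        norms = []
        for y_u in (+1.0, -1.0):
            y_aug = np.append(y_labeled, y_u)
            alpha = np.linalg.solve(K, y_aug)
            norms.append(float(alpha @ K @ alpha))   # squared RKHS norm of the interpolator
        scores.append(min(norms))                    # how hard even the "easier" labeling is
    return int(np.argmax(scores))                    # query the hardest-to-interpolate point
```

On a one-dimensional pool this rule reproduces the binary-search behavior described on the following slides.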

Page 18: Nonparametric Active Learning - Robert Nowak

Properties of Max-Min Sampling in RKHS

Assume that f ∈ H, where H is a Reproducing Kernel Hilbert Space (RKHS), and that f is the minimum RKHS-norm interpolator of the labeled examples.

Selection of next example to label:

u^\star = \arg\max_{u \in U} \; \min\big\{ \|f_{u^-}\|, \ \|f_{u^+}\| \big\}

Key Properties of the RKHS Active Learner:

• The minimum-norm labeling of a new example u is given by the sign of the current interpolator f

• Selects samples near the current decision boundary and closest to oppositely labeled examples

• Yields optimal binary search behavior in one dimension

Page 19: Nonparametric Active Learning - Robert Nowak

Kernel Active Learner in One Dimension

[Figure: one-dimensional example with +1 and -1 labeled regions.]

Theorem: Consider N points uniformly distributed in [0, 1] and labeled according to a piecewise-constant binary-valued function g(x) with k pieces. Then the Laplace kernel active learner perfectly predicts the labels of all N points after labeling O(k log N) examples.
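A toy illustration of why O(k log N) labels suffice (this is plain binary search on sorted 1D points, not the kernel learner itself; the labels_at oracle is hypothetical): once two oppositely labeled points are known, about log2(N) further queries localize the boundary between them, and repeating this for each of the k pieces gives the stated bound.

```python
def locate_boundary(labels_at, lo, hi):
    """labels_at(i): label of the i-th point (sorted by position); labels at lo and hi differ.
    Returns the last index whose label matches labels_at(lo)."""
    y_lo = labels_at(lo)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if labels_at(mid) == y_lo:
            lo = mid          # boundary lies to the right of mid
        else:
            hi = mid          # boundary lies between lo and mid
    return lo                 # ~log2(hi - lo) label queries in total
```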

Page 20: Nonparametric Active Learning - Robert Nowak

Kernel Active Learner in Multiple Dimensions

[Figure: two-dimensional example with +1 and -1 labeled regions.]

Page 21: Nonparametric Active Learning - Robert Nowak

Strength and Weakness of Max-Min Criterion

Limitation of max-min criterion:

• Can be too myopically focused on learning the decision boundary

• The RKHS norm is insensitive to the data distribution

[Figures 4 and 5 from Karzand and Nowak (2019): uniform distribution of samples, smooth boundary, Laplace kernel, bandwidth = 0.1. On the left, sampling behavior of score(1) and score(2) at progressive stages (left to right); on the right, error probabilities as a function of the number of labeled examples (curves: max-min vs. passive).]

Excerpt from the same paper:

4 Interpolating Neural Network Active Learners

Here we briefly examine the extension of the max-min criterion and its variants to neural network learners. Neural network complexity or capacity can be controlled by limiting the magnitude of the network weights [6, 29, 40]. A number of weight norms and related measures have been recently proposed in the literature [30, 7, 21, 2, 28]. For example, ReLU networks with a single hidden layer and minimum ℓ2-norm weights coincide with linear spline interpolation [32]. With this in mind, we provide empirical evidence showing that defining the max-min criterion with the norm of the network weights yields a neural network active learning algorithm with properties analogous to those obtained in the RKHS setting.

Consider a single hidden layer network with ReLU activation units trained using MSE loss. In Figure 6 we show the results of an experiment implemented in PyTorch in the same settings considered above for kernel machines in Figures 2 and 3. We trained an overparameterized network with 100 hidden layer units to perfectly interpolate four training points with locations and binary labels as depicted in Figure 6(a). The color depicts the magnitude of the learned interpolating function: dark blue is 0, indicating the "decision boundary," and bright yellow is approximately 3.5. Figure 6(b) shows score(1) computed with the weight norm (i.e., the ℓ2 norm of the resulting network weights).


Page 22: Nonparametric Active Learning - Robert Nowak

Data-Based Norm

Data-based Criterion:

f = minimum RKHS-norm interpolator of \{(x_i, y_i)\}_{i=1}^n

f_u = minimum RKHS-norm interpolator after adding the new example u

u^\star = \arg\max_{u \in U} \sum_{x \in U} \big( f_u(x) - f(x) \big)^2

select new example that leads to greatest change in interpolating function on the dataset
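Below is a sketch of this data-based score under one stated assumption: the slide leaves the label attached to u implicit, so here u is added with its minimum-norm label, sign(f(u)), following the earlier "Properties of Max-Min Sampling in RKHS" slide; the helper names are mine.

```python
import numpy as np

def laplace_kernel(A, B, bandwidth=1.0):
    return np.exp(-np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) / bandwidth)

def data_based_query(X_labeled, y_labeled, U, bandwidth=1.0):
    """Pick the u in U whose addition changes the interpolator the most over the pool U."""
    K = laplace_kernel(X_labeled, X_labeled, bandwidth)
    alpha = np.linalg.solve(K, y_labeled)
    f_pool = laplace_kernel(U, X_labeled, bandwidth) @ alpha      # current predictions f(x), x in U
    scores = []
    for i, u in enumerate(U):
        y_u = np.sign(f_pool[i]) if f_pool[i] != 0 else 1.0       # assumed minimum-norm label for u
        X_aug = np.vstack([X_labeled, u[None, :]])
        y_aug = np.append(y_labeled, y_u)
        alpha_u = np.linalg.solve(laplace_kernel(X_aug, X_aug, bandwidth), y_aug)
        fu_pool = laplace_kernel(U, X_aug, bandwidth) @ alpha_u   # updated predictions f_u(x)
        scores.append(float(np.sum((fu_pool - f_pool) ** 2)))     # total change over the dataset
    return int(np.argmax(scores))
```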

Page 23: Nonparametric Active Learning - Robert Nowak

Max-Min RKHS norm vs. Data-based norm

Max-min criterion (selection of next example to label):

u^\star = \arg\max_{u \in U} \; \min\big\{ \|f_{u^-}\|, \ \|f_{u^+}\| \big\}

Data-based criterion:

f = minimum RKHS-norm interpolator of \{(x_i, y_i)\}_{i=1}^n
f_u = minimum RKHS-norm interpolator after adding the new example u

u^\star = \arg\max_{u \in U} \sum_{x \in U} \big( f_u(x) - f(x) \big)^2

Selection based on the data-based norm strikes a balance between focusing on the decision boundary and exploring more globally.

Page 24: Nonparametric Active Learning - Robert Nowak

Max-Min and Data-Based Criteria

[Plots from Karzand and Nowak (2019): sampling behavior of the max-min and data-based criteria (uniform samples, smooth boundary, Laplace kernel, bandwidth = 0.1), and error probability as a function of the number of labeled examples for data-based, max-min, and passive selection.]

data-based criterion has more graceful error decay

Page 25: Nonparametric Active Learning - Robert Nowak

Cluster-Seeking Nature of Data-Based Criterion

Max-min criterion (selection of next example to label):

u^\star = \arg\max_{u \in U} \; \min\big\{ \|f_{u^-}\|, \ \|f_{u^+}\| \big\}

→ focuses more on finding boundaries

Data-based criterion:

f = minimum RKHS-norm interpolator of \{(x_i, y_i)\}_{i=1}^n
f_u = minimum RKHS-norm interpolator after adding the new example u

u^\star = \arg\max_{u \in U} \sum_{x \in U} \big( f_u(x) - f(x) \big)^2

→ focuses more on finding clusters

Page 26: Nonparametric Active Learning - Robert Nowak

MNIST Experiment using Laplace Kernel

The task is to predict whether a handwritten digit is equal to 5 or not. We used a Laplace kernel with bandwidth 10 on the vectorized version of a dataset of 500 images. Hence, the dimensionality of each sample is 400. For comparison, we implemented a passive learning algorithm which selects the points randomly, the active learning algorithm using score(1) with the RKHS norm, and the active learning algorithm using score(2) with the data-based function norm described in Section 3.5. Figure 8 depicts the decay of the probability of error for each of these algorithms. The behavior is similar to the theory and experiments above. Both the RKHS norm and the data-based norm, score(1) and score(2) respectively, dramatically outperform the passive learning baseline, and the data-based norm results in somewhat more graceful error decay.

[Plots: probability of error vs. number of samples (0 to 500) for random selection, score(1), and score(2); left panel: training error, right panel: test error.]

Figure 8: Probability of error for learning a classification task on the MNIST data set. The performance of three selection criteria for labeling the samples: random selection, active selection based on score(1), and active selection based on score(2). The left curve depicts the probability of error on the training set and the right curve is the probability of error on the test set.



Page 27: Nonparametric Active Learning - Robert Nowak

Conclusions

• theory and methods of active learning are well-developed in the classical statistical learning framework (e.g., VC theory)

• classical theory may not be applicable in overparameterized regime

• new framework for active learning based on minimum norm interpolators shows promise in theory and practice, for both kernel machines and neural networks

• many opportunities to develop new theory for modern deep learning methods and new computationally efficient algorithms for active learning

Thanks!

slides: http://nowak.ece.wisc.edu/ActiveML.html

Page 28: Nonparametric Active Learning - Robert Nowak

Recommended Reading

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. "Reconciling modern machine learning and the bias-variance trade-off." arXiv preprint arXiv:1812.11118 (2018).

Zhu, Xiaojin, John Lafferty, and Zoubin Ghahramani. "Combining active learning and semi-supervised learning using gaussian fields and harmonic functions." ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining. Vol. 3. 2003.

Dasarathy, Gautam, Robert Nowak, and Xiaojin Zhu. "S2: An efficient graph based active learning algorithm with application to nonparametric classification." Conference on Learning Theory. 2015.

Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." Journal of machine learning research 2.Nov (2001): 45-66.

Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. "Deep bayesian active learning with image data." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.

Sener, Ozan, and Silvio Savarese. "Active learning for convolutional neural networks: A core-set approach." arXiv preprint arXiv:1708.00489 (2017).

Savarese, Pedro, Itay Evron, Daniel Soudry, and Nathan Srebro. "How do infinite width bounded norm networks look in function space?." arXiv preprint arXiv:1902.05040 (2019).

Karzand, Mina, and Robert Nowak. "Active Learning in the Overparameterized and Interpolating Regime." arXiv preprint arXiv:1905.12782 (2019).

Dasgupta and Hsu. "Hierarchical sampling for active learning." ICML 2008