Close Category GeneralizationClose Category Generalization to unseen categories. Our goal also di ers from adver-sarial robustness because we focus on natural inputs (e.g., real images)

Close Category Generalization for Out-of-Distribution Classification

Yao-Yuan Yang 1 Cyrus Rashtchian 1 Ruslan Salakhutdinov 2 Kamalika Chaudhuri 1

AbstractOut-of-distribution generalization is a core chal-lenge in machine learning. We introduce and pro-pose a solution to a new type of out-of-distributionevaluation, which we call close category gener-alization. This task specifies how a classifiershould extrapolate to unseen classes by consider-ing a bi-criteria objective: (i) on in-distributionexamples, output the correct label, and (ii) on out-of-distribution examples, output the label of thenearest neighbor in the training set. In additionto formalizing this problem, we present a newtraining algorithm to improve the close categorygeneralization of neural networks. We compareto many baselines, including robust algorithmsand out-of-distribution detection methods, and weshow that our method has better or comparableclose category generalization. Then, we investi-gate a related representation learning task, andwe find that performing well on close categorygeneralization correlates with learning a good rep-resentation of an unseen class and with finding agood initialization for few-shot learning.1

1. IntroductionClassifiers encounter inputs at deployment time from cate-gories that they have not seen before, yet the task of gen-eralizing to out-of-distribution data remains a major openproblem (Amodei et al., 2016; Arora et al., 2018; Neyshaburet al., 2017; Wang et al., 2020a). Many objectives have beenproposed to formalize and evalute generalization. Adversar-ial robustness deals with enforcing invariance under smallbut arbitrary perturbations (Szegedy et al., 2013). Anomalyor outlier detection addresses the identification of out-of-distribution examples (Chandola et al., 2009). Few-shotlearning uses a handful of examples from new categoriesduring training (Li et al., 2006).

*Equal contribution 1University of California San Diego2Carnegie Mellon University. Correspondence to: KamalikaChaudhuri <[email protected]>.

1The code is available at https://github.com/yangarbiter/close-category-generalization.

While these objectives are well-motivated, adequate solu-tions remain elusive. One of the challenges of adversarialrobustness is that the adversarial examples lie outside of thedata manifold, and requiring invariance in arbitrary direc-tions seems challenging to achieve in practice (Goodfellowet al., 2014; Utrera et al., 2020). Unsupervised approachesfor anomaly detection only work well when the anoma-lous inputs are far from the support of the training distribu-tion (Carlini & Wagner, 2017; Liang et al., 2017; Lee et al.,2018; Bitterwolf et al., 2020). Few-shot learning assumesthat there is a way to augment the training data with newcategories, which may not always be feasible (Bendre et al.,2020; Fink, 2005; Wang et al., 2020b).

We propose an alternative generalization formulation thatcaptures some of the core benefits of prior problems whilebeing tractable in general. We consider a problem that wecall close category generalization, where the goal is notto detect out-of-distribution examples, but instead to pre-dict the label of the closest class in the training set. Moreprecisely, we evaluate a classifier on two objectives simulta-neously: on in-distribution examples, we measure the testaccuracy, and on out-of-distribution examples, we measurethe fraction of examples that are predicted to have the samelabel as their nearest neighbor in the training set (under aspecified metric). For example, the classifier may be trainedon a subset of CIFAR-10 or ImageNet, where it only seesimages from certain categories at training time. Out-of-distribution test images come from a held out category, andfor these, it should output the label of the nearest neighborin `2 distance. This task seems to be novel and differs fromsimilar objectives. For example, in contrast to anomaly de-tection, we consider out-of-distribution examples that arerelatively close to the support of the training distribution,while belonging to unseen categories. Our goal also differsfrom adversarial robustness because we focus on natural in-puts (e.g., real images) instead of arbitrary perturbations oftest inputs that may be artificial and off of the data manifold.

1.1. Why close category generalization?

The traditional statistical learning framework assumes thatthe training and testing examples come from the same distri-bution. However, in practice, classifiers are fed with inputsfrom categories that they have not seen before. For example,the ImageNet dataset contains images from 2012. A classi-

arX

iv:2

011.

0848

5v2

[cs

.LG

] 1

6 Fe

b 20

21

https://github.com/yangarbiter/close-category-generalization



fier trained on ImageNet might not be able to predict wellon images from 2020, even if they are non-adversarial.

Close category generalization provides a framework for an-alyzing accuracy on unseen classes. The motivation for ourformulation is threefold. First, predicting the label of a closecategory is useful for many applications. For example, if anautonomous car encounters a new street sign, it would bemore appropriate to output the label of a related sign, as op-posed to reporting an arbitrary class label, which is knownto be a common failure mode of neural networks (Bitterwolfet al., 2020; Meinke & Hein, 2019). Second, we observe thatthe sample complexity needed for close category generaliza-tion may be much less than for out-of-distribution detection.Intuitively, it is difficult to determine if an example is veryrare or actually absent from the training distribution. Third,close category generalization provides a consistent mea-sure of generalization accuracy that can be applied to manydomains, using a predefined metric for the nearest neighbor.

Figure 1 provides images labeled with the nearest class inImageNet. The first four pairs from the left depict real-lifemovie stills, classified based on their nearest neighbor inImageNet. Even though the classifier has no knowledge ofthe recent movies, it can still provide a perceptually similarimage class, giving some insight into the new example. Incontrast, the two rightmost pairs show cases where the out-of-distribution example is too far from the training support,and hence, we do not expect good predictions.

1.2. Our Contributions

We first observe that out-of-distribution (OOD) detection isa more general task than close category generalization. Ifwe had a perfect algorithm to detect new examples, then wecould use the following approach: on in-distribution exam-ples, predict using a neural network, and on OOD examples,predict using the 1-NN classifier. However, as we showin Section 2.1, there are simple distributions where detec-tion requires arbitrarily more samples than close categorygeneralization. More precisely, we provide a separation the-orem in the multi-class setting. We posit that close categorygeneralization may be a easier problem in practice as well.

Next, we propose and evaluate a new training method, Den-sity Based Smoothing (DBS), that aims to interpolate be-tween the accuracy of a neural network and the predic-tive consistency of the 1-NN classifier. At a high-level,we impose a consistency constraint on the approximateVoronoi region around training examples. Unfortunately,this makes training intractable, so instead of we considervariable-radius balls around training examples, where theradius at a training point depends on the local data density.

We then investigate which training algorithms lead to goodperformance on close category generalization by evaluating

DBS along with a number of alternative baselines. We findthat overall, algorithms such as DBS, Adversarial Train-ing (Madry et al., 2017) and TRADES (Zhang et al., 2019),which use inputs that lie off the image manifold in their lossfunction, tend to perform better at close category general-ization than natural training. In particular, our DBS methodis competitive with or better than the baselines.

One novel aspect of our algorithm is that, unlike adversar-ial training, we enforce smoothness adaptively. The normball for each example has its own radius depending on thedataset. The adaptive radius is motivated by the sub-Voronoiregion, which is the Voronoi region when restricting to thetraining example along with other examples with differentlabels. The key observations is that the sub-Voronoi regionwould be larger (in a geometric sense) in the sparser (ina distributional sense) regions of the dataset. Therefore,when training, we adaptively tune the radius for each ex-ample based on the distance to nearby oppositely-labeledexamples. In many cases, this leads to using much largerperturbation balls when training for certain examples whencompared to AT or TRADES, which assume a fixed radius.We also see that DBS better matches the decision bound-ary of 1-NN. As the OOD accuracy is calculated based onthe percentage of data that matches the prediction of 1-NN,DBS is better suited for the close category generalizationtask. While similar ideas have been used in the context ofrobustness (Khoury & Hadfield-Menell, 2019), we find thattraining in this way also has benefits for generalization.

Finally, we investigate whether representations that are tai-lored for close category generalization also lead to better per-formance on other out-of-distribution generalization tasks.We consider the popular task few-shot learning, whereK ex-amples are available from the missing category. We look attwo forms: (i) where we use 1-nearest neighbor with respectto the `2 distance on the learnt representation on the trainingdata and a few inputs from the missing category, and (ii)where we fit a small, one or two layer, neural network fromthe learnt representation space to the output space based onthe training data and data from the unseen category. In bothcases, we find that DBS continues to either outperform orremain competitive with the other baselines. This suggeststhat representations that perform well on close categorygeneralization may also adapt readily to few-shot learning,and hence the close category generalization task may bean effective metric for evaluating how well an algorithmgeneralizes on OOD examples.

2. Problem SetupAt training time, we are given a set Dtr of examples labeledwith one of C categories, which we denote as pairs pxi, yiqas well as a representation φ. During testing, we evaluateour classifier on examples that are drawn from a combina-


(a) C-3PO (b) Chewbacca (c) Groot (d) Jarjar (e) Minion (f) uniform noise

(g) breastplate (h) affenpinscher (i) mink (j) hartebeest (k) ski mask (l) strainer

Figure 1: We demonstrate the close category generalization task through six pairs of images. The top row shows OODexamples, and the bottom shows the closest ImageNet training examples in a CNN feature space from ResNeXt101 (Xieet al., 2017). The first five top row images are movie stills that do not belong to the standard classes, but they are naturalimages. The first four pairs on the left show that the label of the nearest neighbors can be reasonable and informative, giventhat the true labels would be impossible to predict without extra training data. The two pairs on the right show that if thenew images are too unnatural (i.e., too far from the training distribution), the results are less relevant.

tion of the training distribution and from a new pC ` 1qstcategory. We assume that the new category still containsnatural examples (e.g., real images collected in a consistentmanner). Formally, let Xφ denote the support of all C ` 1categories in the representation φ. Our goal is to build aclassifier f that maps from Xφ to t1, . . . , C ` 1u . The testaccuracy of f is the fraction of test examples belonging tocategories t1, . . . , Cu that are classified with their correctlabel. Its out-of-distribution (OOD) accuracy is the fractionof test examples from category C ` 1 that are classifiedwith the class of their nearest neighbor in Dtr. Our goal isto ensure that f has both high test accuracy and high OODaccuracy. This is the close category generalization problem.

2.1. Theoretical Motivation

One approach could be used is an out-of-distribution detec-tion algorithm. If we can reliably identify that a test exampleis from a new class, then we can run the 1-NN classifierafterwards. However, this approach may require a largertraining set. We prove the following theorem, showing asituation where it is easy to perform well on close categorygeneralization, but hard to train a good detector.

Theorem 1 For any ε P p0, 12q, d ě 1, and C ě 2, thereexists distributions µ on training examples from C classesin Rd and ν on OOD test examples from outside of supppµqsuch that (i) detecting whether an example is from µ orν requires ΩpCεq samples from µ, while (ii) classifyingexamples from ν with their nearest neighbor label from thesupport of µ requires only OpC logCq samples.

Figure 2 shows intuition for Theorem 1 in the binary case.Out-of-distribution examples come from parts of the realline outside of the colored cubes. Appendix A has the prooffor the general distributions on C classes in Rd.

We provide intuition behind the theorem, namely how togeneralize Figure 2 for more than two classes and for higherdimensional datasets. The main idea is that we translateand replicate the binary dataset and increase the regions tod-dimensional cubes. For the distribution µ, we have 4Ccubes with side length 1

?d. There will be 2 cubes that

have labels from each of the C classes. The high probabilitycubes emit samples with probability « p1´ εqC and thelower probability with« εC. Due to the side lengths being1?d, the 1-nearest neighbor (1-NN) in `2 of the low proba-

bility region is paired with an adjacent high probability box,and hence, it is easy to predict given samples from the highprobability region. By a coupon collector argument, wesee all high probability regions after OpC logCq samples.On the other hand, by the construction of the probabilitydistributions, we need ΩpCεq samples for OOD detection,where ε is the sample probability from a low probabilityregion. For the OOD distribution ν, we will strategicallysample points from outside of all of these cubes (while guar-anteeing that the nearest neighbor labels are still correct).Thus, OpC logCq samples are sufficient for close categorygeneralization, but ΩpCεq are needed for OOD detection.


´1 `1

ε1´ εε 1´ ε

Figure 2: We represent the sample frequency by the size ofthe red/green shapes. With a few samples from each largeprobability region, we can determine the close categorygeneralization label via the large margin solution. On theother hand, out-of-distribution detection requires samplesfrom the small probability regions. Thus, close categorygeneralization requires fewer samples than OOD detection.

3. AlgorithmWe now propose an algorithm that we call Density BasedSmoothing (DBS) for the close category generalization prob-lem. Recall that our goal is to satisfy a dual objective – hightest accuracy on in-distribution inputs as well as high agree-ment with the nearest neighbor on out-of-distribution inputs.For simplicity of notation, we assume that the training datais given to us in the φ-space – in other words, x P Xφ.

We begin by motivating a loss function that captures thesegoals, and then describe the Density Based Smoothing(DBS) algorithm that approximately minimizes it. The losshas two terms. The first is the standard cross-entropy loss,which ensures that correct labels are predicted on the train-ing data. The second term enforces that OOD examplesin a region around each training input have similar predic-tion as their closest training example; this ensures that thepredictions change slowly in the vicinity of the trainingdata and encourages smoothness. The scope of smoothingis determined by the local density; the region of smooth-ing is smaller around an input that is close to the decisionboundary, and larger for far away points.

Specifically, for a training example pxi, yiq, we consider theregion containing all inputs that are closer to xi than anyexample with a label different from yi. This is equivalentto finding the Voronoi cell of xi in the set txiu Y X‰yi ,where X‰yi “ tx | y ‰ yi and px, yq P Dtru is the setof examples that has a label different from yi. We call thisregion the sub-Voronoi region of pxi, yiq.

To measure smoothness in each Vi, inspired by Zhang et al.(2019), we use the maximum KL divergence in the sub-Voronoi region of each xi; this leads to a loss function termmaxx1iPVi

DKLpfθpx1iq, fθpxiqq. A small value implies that

for every example x1i P Vi, fθpx1iq « fθpxiq. Combiningthe two terms with a trade-off parameter β gives the full lossfunction `pfθpxiq, yiq ` βmaxx1iPVi

DKLpfθpx1iq, fθpxiqq.

Observe that while this resembles the TRADES (Zhanget al., 2019) loss function, a major difference is the density-dependent smoothing region for each training point.

Figure 3 (b), (c), and (d) provides intuition for why smooth-ing works, and in particular, why smoothing in the sub-Voronoi region leads to nearly optimal results. We comparethree methods (Natural, TRADES, and our method DBS).All perform well on the in-distribution examples. TRADESpartially generalizes to the yellow examples, which arecloser to the support of the purple class, but it still splits de-cisions on this new class. Optimizing over the sub-Voronoiregion does even better. It correctly classifies all of theout-of-distribution examples with their closest category byputting the purple boundary in the correct location.

Minimizing the loss function. Directly minimizing theloss function is computationally challenging because ofthe second term; we therefore make some approxima-tions. We compute the inner maximization using the pro-jected gradient descent algorithm (PGD) (Kurakin et al.,2016). PGD is initialized as: x

p1qi “ xi. In iteration t,

we take a gradient step on x1i with step size α towardsmaximizing the KL divergence term (formally: x

ptqi “

xpt´1qi `α∇x1i

DKLpfθpx1iq, fθpxiqq), and then project xptqi

onto the sub-Voronoi region Vi. After T iterations, weuse x

pT qi as the solution to the inner maximization and

update the parameter θ by a stochastic gradient step on`pfθpxiq, yiq `DKLpfθpx

pT qi q, fθpxiqq.

Approximating the sub-Voronoi region. Directly project-ing onto the sub-Voronoi region involves solving a compu-tation intensive quadratic program. We therefore relax thisconstraint by approximating Vi by a ball. We denote Bpx, rqas the ball centered at x with radius r We approximate Viwith a ball Bpxi, εmaxi q, whose radius εmaxi is set to half thedistance from xi to its closest differently-labeled exampleas a reflection of the local density. We pre-compute thisradius for each training example before training begins.

Ideally, X‰yi should be composed of examples from thedistribution that are labeled differently from as yi. In prac-tice, we only have the training set. Since more examplesin X‰yi will only further shrink each ball, we introduce aparameter 0 ď λ ď 1 to control the shrinkage. This leads tothe final loss function:

`pfθpxiq, yiq ` β maxx1iPBpxi,λεmax

i qDKLpfθpx

1iq, fθpxiqq

Here β is the tradeoff parameter; ` is cross-entropy loss.

Smoothness in large radius balls. The radius of the ballλεmaxi can be large when examples are far apart. Chenget al. (2020); Sitawarin et al. (2020) report that enforcingsmoothness in a large radius ball can be difficult and pro-posed methods that can diffuse some of these challenges.We adopt two methods from these works. First, for eachexample xi, we set its radius εi = 0 and then gradually in-crease εi with a step size η after each epoch. Second, ifthe prediction within the ball centered at xi with radius


(a) region diagram (b) natural (c) TRADES (d) Density-Based Smoothing

Figure 3: (a) A diagram showing the difference between the sub-Voronoi region (V) and the ball B used to approximateit. In figure (b), (c), and (d), we plot the decision boundary of neural networks trained with natural training, TRADES,and enforcement on the smoothness in V . The yellow examples are the OOD examples, and they are closer to the purpleexamples. In (b) and (c), we see that the predictions on the yellow examples are not consistent to the nearest neighbor; onthe other hand, in (d), the yellow examples are predicted as purple.

εi is not smooth enough, we decrease εi by η. We set athreshold thresh to determine whether it is smooth enough.Combining these methods, we get the final Density BasedSmoothing (DBS) algorithm. The pseudocode is in Ap-pendix E.

4. ExperimentsIn this section, we investigate to what extent the dual goalsof close category generalization are achievable by evaluatinghow different neural network training algorithms, includ-ing DBS, perform on this task. We then examine whetherhigh performance on close category generalization is corre-lated with performance on other related out-of-distributiongeneralization tasks by looking at how well the representa-tions learnt perform on the popular few-shot learning task.Specifically, we ask the following questions:

• How well do different training algorithms perform onthe close category generalization task?

• Do networks that perform well on close category gen-eralization also do well on few-shot learning?

4.1. Setup

Data. We use MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky, 2009), and CIFAR-100 (Krizhevsky, 2009).We simulate the close category generalization task as fol-lows. We start with the original training and testing setdenoted by Dtr and Dts, respectively. We choose a cate-gory as the unseen category and remove all examples inthis category from Dtr and Dts. The removed examplesare combined into the OOD set rDood. We use the remain-ing train/test examples as the new training and testing set,denoted by rDtr and rDts, respectively.

Unseen categories. For MNIST and CIFAR-10, we chooseeach of the 10 classes one by one as the unseen category,and run 10 separate experiments. The results for a canonical

category are reported in the main paper and the rest in Ap-pendix D. CIFAR-100 has 20 super-classes; we pick the top5 super-classes in alphabetical order as the unseen category,leading to 5 experiments in total. Again, a canonical cate-gory is presented in the paper (full results in Appendix D).

Evaluation. For the close category generalization task,we look at two metrics – the test accuracy and the out-of-distribution (OOD) accuracy. OOD accuracy measuresthe percentage of the neural network’s prediction that matchthe predictions of a 1-NN classifier on OOD examples.

Representations. For the 1-NN classifier, we use `2 dis-tance in the representation space φ, and we experiment witha few natural choices for φ. In addition to the usual pixelspace, we also look at the feature space of a CNN thatconveys more semantic information.

For CIFAR-10 and CIFAR-100, we use the output of anintermediate convolutional layer as well as the last convolu-tional layer for CNN features. For MNIST, we use only thelast convolutional layer since performance is already goodin pixel space. More details regarding the experimentalsetup are in Appendix C. We extract features from CNNstrained on rDtr, without knowledge of the unseen category.

4.2. Close category performance

We consider two types of baselines. First, we directly usethe predictions of the networks themselves. Second, we usea filtering step – first learn a detector to determine whetheran input is in- or out-of-distribution, and then process itwith a neural network or a 1-NN classifier depending on thedetection result.

4.2.1. BASELINES

Training Methods. Broadly speaking, we can categorizeneural network training methods into two categories: on-and off-manifold training. On-manifold training learns a


1-NN natural mixup AT TRADES DBS natural +thresh-90%

DBS +thresh-90%

natural +one-class SVM

DBS +one-class SVM

MNIST

pixeltrain acc. 1.00 1.00 1.00 1.00 0.99 0.98 - - 1.00 0.98test acc. 0.97 1.00 1.00 0.99 0.99 0.98 1.00 0.98 1.00 0.98OOD acc. 1.00 0.59 0.58 0.71 0.70 0.72 0.59 0.72 0.58 0.70

last layertrain acc. 1.00 1.00 1.00 1.00 1.00 0.99 - - 1.00 1.00test acc. 0.99 1.00 1.00 1.00 1.00 0.99 1.00 0.99 0.99 0.99OOD acc. 1.00 0.69 0.76 0.71 0.73 0.81 0.69 0.81 0.97 0.98

CIFAR-10


mid layertrain acc. 1.00 1.00 1.00 1.00 1.00 0.77 - - 1.00 0.94test acc. 0.38 0.83 0.84 0.83 0.83 0.75 0.81 0.74 0.68 0.65OOD acc. 1.00 0.37 0.34 0.37 0.37 0.44 0.47 0.53 0.75 0.76


CIFAR-100


mid layertrain acc. 1.00 1.00 0.99 1.00 1.00 0.48 - - 1.00 0.91test acc. 0.21 0.51 0.58 0.51 0.51 0.46 0.47 0.42 0.28 0.27OOD acc. 1.00 0.12 0.12 0.11 0.12 0.15 0.15 0.18 0.84 0.85


Table 1: The close category generalization results under `2 distance. The canonical unseen category for MNIST andCIFAR-10 are digit 9 and airplane, respectively. For CIFAR-100, we use the coarse labeling and the canonical unseencategory is aquatic mammal. We omit training accuracies for thresh-90% because the distance of a training example to theclosest training example is always 0.

neural network by incorporating examples that are on thedata manifold into the loss function. We use two such meth-ods – natural training and mixup (Zhang et al., 2017); thelatter is a data-augmentation method that trains on convexcombinations of pairs of examples and their labels. In off-manifold training, the loss function involves inputs that alsolie off the natural data manifold. We consider three suchmethods: adversarial training (AT) (Madry et al., 2017),TRADES (Zhang et al., 2019), and density-based smooth-ing (DBS). AT directly incorporates off-manifold examplesinto the cross-entropy loss while TRADES and DBS useoff-manifold examples in their regularization term. To keepthe setup simple, we do not use any other data-augmentationduring training.

Detection. A plausible baseline for solving close categorygeneralization is to learn a detector to determine whether aninput is in- or out-of-distribution. If the input is determinedas an in-distribution example, we use a neural network forprediction. Otherwise, we use a 1-NN classifier for pre-diction. Most out-of-distribution detectors use some sortout-of-distribution data during training; we use two simpleones that do not. For the first, we use the distance to theclosest training example as a feature for detection and seta threshold to determine whether the input is in- or out-of-distribution. If this distance is greater then the threshold,the example is considered as an OOD example. We set the

threshold to be the 90-th percentile of the distance betweenall the training examples to their closest training examplesthat are differently labeled (thresh-90%). For the seconddetector, instead of using a threshold, we train a one-classSVM (Manevitz & Yousef, 2001) as the detector.

Hyper-parameters. For mixup, we set the parameter α to1.0. For AT and TRADES, we set the robustness radius to2.0, which is commonly used for `2 distance. For TRADESand our method, we set the parameter β “ 6. For DBS,we consider λ “ 1, 12 ,

13 ,

15 ,

110 and choose the largest λ

that does not underfit (training accuracy ă 95%). If allunderfit, we set λ “ 1

10 . An ablation study on λ’s value isin Appendix D.

4.2.2. RESULTS

The left half of Table 1 shows the results for direct predic-tion using different training methods. We find that in thepixel space, across all datasets, off-manifold training meth-ods (AT, TRADES and DBS) generally give better OODaccuracy over the on-manifold training methods (naturaland mixup), albeit with a drop in test accuracy. Among off-manifold training methods, DBS performs better than or iscompetitive with AT and TRADES in the pixel space. Whenthe representation is the mid or last layer of a CNN, DBSperforms the best in terms of OOD accuracy, while AT and


TRADES perform poorly – thus illustrating the benefits ofadaptive smoothing. This implies that off-manifold trainingthat can adapt to the data density can be a good direction forsolving close category generalization tasks.

The right half of Table 1 shows the results for combining twostandard OOD detection methods with natural training andDBS. For thresh-90%, the OOD accuracy either stays thesame or improves a little while the test accuracy decreasesa little. For one-class SVM, there is an increase in OODaccuracy in the pixel space of CIFAR-10 and CIFAR-100,and an overall large drop in test accuracy. We also find thatthe OOD detection rate of one-class SVM is low (between35% to 54%, whereas the chance level is 50%), and hencethis result may be due to directing more examples to 1-NNfor prediction. Comparing with thresh-90%, it appears thatthresh-90% may be a better choice as it keeps most of thetest accuracy while improving OOD accuracy. These resultsindicate that a detector alone combined with natural trainingcannot solve close category generalization well enough. De-tailed results on the performance of the training baselinesfor other unseen categories are presented in Appendix D.We see that the overall results are similar.

4.3. Performance on Few Shot Learning

We now investigate whether representations that have highperformance on close category generalization also achievebetter performance on other related out-of-distribution gen-eralization tasks – specifically, few-shot learning.

Setup. In K-shot learning, we first learn a representationusing a training set with examples from the unseen categoryremoved. This representation is then used in conjunctionwith K randomly selected examples from the unseen cate-gory to learn a classifier on all C ` 1 classes (the classesin the training data plus the unseen one). We consider twotypes of classifiers – first, a 1-nearest neighbor (1-NN) clas-sifier on the training data plus K examples from the unseencategory. The second is a neural network classifier that con-sists of all layers of the original network after the last CNNlayer; this is a linear classifier for CIFAR-10 and -100, anda two layer MLP for MNIST. We replace the output layerwith C`1 nodes instead of C, initialize the new layers withrandom weights, and then train these layers on the trainingdata rDtr plus K OOD examples. As is customary withfew-shot learning, we evaluate both classifiers on their accu-racy on examples from the unseen category. We repeat theexperiments ten times and report their mean and standarderrors. More details as well as results with other unseencategories are in Appendix C.

Results. Table 2 shows the results for a canonical unseencategory with more detailed results in the Appendix. For1-NN, we see that DBS performs best for CIFAR-10 andCIFAR-100. Mixup performs best for MNIST with DBS a

close second. Overall, we find that the off-manifold trainingalgorithms do better for CIFAR-10 and CIFAR-100, whilethe performance is more or less similar in MNIST. For theneural network, we find that accuracy of all algorithms in-creases significantly onK-shot learning. DBS still performsthe best on CIFAR-10 and CIFAR-100; TRADES performsbest on MNIST with mixup and DBS close behind. Theoff-manifold training algorithms overall do better than theon-manifold ones. All in all, this shows that the represen-tations that do better on close category generalization alsolead to higher performance on K-shot learning.

Detailed results on the performance of the training baselinesfor other unseen categories are presented in Appendix D.We see that the overall results are also similar.

4.4. When does DBS perform well on OOD examples?

We provide another experiment in Appendix D demonstrat-ing that some OOD examples are easier to predict correctlythan others. In particular, the difficulty of predicting anOOD example depends on the distance to the closest train-ing example. To measure this, we bin the OOD examplesbased on their distance to the closest training example, andwe evaluate the OOD accuracy in each bin.

Across all methods, the OOD accuracy is higher when OODexamples are closer to the training examples. Therefore,the close category generalization problem is harder whenin- and out-of-distribution examples are farther apart. Fortu-nately, far away OOD examples are exactly where existingdetection methods work well, and examples that belong toclose categories are where they perform poorly (Liang et al.,2017). This also motivates our use of the distance-basedthresholding method that is used for Table 1.

4.5. Discussion

Our experimental results lead to two main observations.First, we see that on close category generalization task, on-and off- manifold training algorithms behave differently –off-manifold training methods have better OOD accuracywhile on-manifold training algorithms have higher test ac-curacy. Overall, DBS performs better than or is competi-tive with the other off-manifold training methods. There isalso an observed trade-off between test and OOD accuracy,where an algorithm that is better on OOD accuracy often per-forms worse on test. An interesting avenue of future workis to develop new methods to improve upon this trade-off.

Second, we see that the training methods that perform wellon close category generalization also have high performanceon the few-shot learning task. To perform well on the few-shot learning task, the algorithm will have to generate arepresentation that generalizes well on OOD examples. Thisindicates that the close category generalization task can be


MNIST CIFAR-10 CIFAR-100

# shots 10 20 100 1000 10 20 100 1000 10 20 100 1000

1-NN

natural .22˘ .02 .35˘ .00 .70˘ .00 .94˘ .00 .04˘ .01 .08˘ .01 .26˘ .01 .64˘ .00 .02˘ .00 .02˘ .01 .10˘ .01 .45˘ .00mixup .41˘ .02 .58˘ .01 .85˘ .00 .96˘ .00 .03˘ .00 .06˘ .00 .21˘ .00 .58˘ .00 .02˘ .00 .04˘ .00 .16˘ .00 .54˘ .00AT .30˘ .02 .46˘ .00 .78˘ .00 .95˘ .00 .06˘ .01 .12˘ .01 .34˘ .00 .72˘ .00 .02˘ .00 .04˘ .01 .16˘ .01 .54˘ .00TRADES .35˘ .02 .51˘ .00 .81˘ .00 .96˘ .00 .15˘ .01 .24˘ .01 .46˘ .00 .76˘ .00 .01˘ .00 .03˘ .00 .10˘ .00 .45˘ .00DBS .26˘ .02 .41˘ .00 .75˘ .00 .94˘ .00 .25˘ .01 .35˘ .01 .60˘ .00 .84˘ .00 .07˘ .01 .12˘ .01 .34˘ .00 .76˘ .00

neural network

natural .58˘ .02 .74˘ .01 .90˘ .00 .97˘ .00 .59˘ .01 .70˘ .01 .80˘ .00 .82˘ .00 .30˘ .01 .42˘ .01 .54˘ .00 .58˘ .00mixup .73˘ .01 .81˘ .01 .93˘ .00 .98˘ .00 .54˘ .01 .59˘ .01 .63˘ .00 .64˘ .00 .63˘ .01 .69˘ .00 .72˘ .00 .74˘ .00AT .70˘ .01 .79˘ .01 .92˘ .00 .98˘ .00 .51˘ .01 .62˘ .01 .72˘ .00 .74˘ .00 .36˘ .01 .43˘ .01 .50˘ .00 .52˘ .00TRADES .75˘ .01 .82˘ .01 .93˘ .00 .98˘ .00 .72˘ .02 .78˘ .01 .82˘ .00 .83˘ .00 .27˘ .01 .33˘ .01 .39˘ .00 .40˘ .00DBS .72˘ .01 .81˘ .01 .92˘ .00 .98˘ .00 .78˘ .01 .83˘ .01 .88˘ .00 .89˘ .00 .62˘ .01 .74˘ .01 .84˘ .00 .86˘ .00

Table 2: Mean & standard error of accuracy on OOD examples for few-shot learning (largest in bold).

an effective metric for evaluating how well an algorithmgeneralizes on OOD examples.

5. Related WorkOff-distribution generalization – or, generalization to datathe likes of which has not been seen before – has longbeen a major hurdle in the practical deployment of machinelearning. The main challenge here is that off-distributiongeneralization is impossible in its full generality, and thusany solution has to add some inductive bias that charac-terizes how the solution would behave on data outside thedistribution. There has been a number of lines of workthat look at different ways of adding this inductive bias. Apopular method for adding inductive bias is through dataaugmentation, e.g., augmenting training data with rotatedand scaled versions of the images (Shorten & Khoshgof-taar, 2019). While this enforces correct behavior on knownvariations of training data, it does not accommodate unseenvariations and categories.

In transfer learning (Yosinski et al., 2014; Salman et al.,2020; Utrera et al., 2020), data is available from both asource and target domain. The challenge is to repurpose aclassifier built for a source domain into a target one. Anexample is transferring a classifier for general object recog-nition to medical images with a small amount of medicalimage data (Raghu et al., 2019). Here, unlike us, data fromthe target domain is used to add inductive bias. Domainadaptation (Wang & Deng, 2018), covariate shift (Bickelet al., 2009) and few-shot learning (Dhillon et al., 2019;Wang et al., 2020b; Koch et al., 2015) fall into this category.

Another line of work involves detection of off-distributionexamples – such as anomaly or outlier detection and changepoint detection – typically with the goal of passing on thedetected examples to a human (Manevitz & Yousef, 2001;Liang et al., 2017; Ren et al., 2019). These methods workwell when the off-distribution examples are highly dissimilarfrom the data distribution, and are orthogonal to our setting.

Other work includes causal inference, which is challengingor even impossible without strong assumptions that we donot use (Arjovsky et al., 2019; Mahajan et al., 2020), andadversarial robustness, where the goal is to build classifiersthat are locally Lipschitz or smooth in a given radius aroundtraining data (Goodfellow et al., 2014; Madry et al., 2017;Yang et al., 2020). While close category generalization isrelated to adversarial robustness, the difference is that do nothave a pre-defined radius, and we only care about the neuralnetwork being smooth along the image manifold since closecategory examples are still natural images.

6. ConclusionWe introduced a new generalization objective that we callclose category generalization, which is motivated by de-veloping a consistent and formal framework for evaluatingthe predictions of classifiers on out-of-distribution exam-ples. We provided theoretical and experimental evidencethat close category generalization is easier to solve thanout-of-distribution detection. Then, we exhibited a noveltraining algorithm, DBS, based on approximating the sub-Voronoi region by using balls with data-dependent radii.Experimentally, we showed that our method is comparableto or better than existing training methods for close categorygeneralization. Finally, we showed that performing wellin terms of close category generalization also implies thatthe learned representation can more easily adapt to unseencategories in a few-shot learning setting.

For future work, close category generalization is both aninteresting and tractable benchmark objective for encourag-ing better generalization on real classification tasks. Whatare the best training methods? Can we provide connectionsto or separations from other objectives, such as robustnessor transfer learning? Is it possible to achieve both hightest accuracy and good close category generalizationby tran-sitioning between classifiers based on the distance to thetraining distribution?


AcknowledgementsWe thank Angel Hsing-Chi Hwang and Mary Anne Smartfor providing thoughtful comments on the paper. Ka-malika Chaudhuri and Yao-Yuan Yang thank NSF un-der CIF 1719133 and CNS 1804829 for support. Thiswork was also supported in part by NSF IIS1763562 andONR Grant N000141812861.

ReferencesAmodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul-

man, J., and Mane, D. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprintarXiv:1907.02893, 2019.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Strongergeneralization bounds for deep nets via a compressionapproach. arXiv preprint arXiv:1802.05296, 2018.

Bendre, N., Marın, H. T., and Najafirad, P. Learning fromfew samples: A survey. arXiv preprint arXiv:2007.15484,2020.

Bickel, S., Bruckner, M., and Scheffer, T. Discriminativelearning under covariate shift. Journal of Machine Learn-ing Research, 10(9), 2009.

Bitterwolf, J., Meinke, A., and Hein, M. Provable worstcase guarantees for the detection of out-of-distributiondata. arXiv preprint arXiv:2007.08473, 2020.

Carlini, N. and Wagner, D. Adversarial examples are noteasily detected: Bypassing ten detection methods. InProceedings of the 10th ACM Workshop on ArtificialIntelligence and Security, pp. 3–14, 2017.

Chandola, V., Banerjee, A., and Kumar, V. Anomaly detec-tion: A survey. ACM computing surveys (CSUR), 41(3):1–58, 2009.

Cheng, M., Lei, Q., Chen, P.-Y., Dhillon, I., and Hsieh,C.-J. Cat: Customized adversarial training for improvedrobustness. arXiv preprint arXiv:2002.06789, 2020.

Dhillon, G. S., Chaudhari, P., Ravichandran, A., and Soatto,S. A baseline for few-shot image classification. arXivpreprint arXiv:1909.02729, 2019.

Fink, M. Object classification from a single example uti-lizing class relevance metrics. In Advances in neuralinformation processing systems, pp. 449–456, 2005.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain-ing and harnessing adversarial examples. arXiv preprintarXiv:1412.6572, 2014.

Johnson, J., Douze, M., and Jegou, H. Billion-scale similar-ity search with gpus. arXiv preprint arXiv:1702.08734,2017.

Khoury, M. and Hadfield-Menell, D. Adversarialtraining with voronoi constraints. arXiv preprintarXiv:1905.01019, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980, 2014.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neuralnetworks for one-shot image recognition. 2015.

Krizhevsky, A. Learning multiple layers of features fromtiny images. Technical report, 2009.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial ma-chine learning at scale. arXiv preprint arXiv:1611.01236,2016.

LeCun, Y., Cortes, C., and Burges, C. Mnist hand-written digit database. ATT Labs [Online]. Available:http://yann.lecun.com/exdb/mnist, 2, 2010.

Lee, K., Lee, K., Lee, H., and Shin, J. A simple unifiedframework for detecting out-of-distribution samples andadversarial attacks. In Advances in Neural InformationProcessing Systems, pp. 7167–7177, 2018.

Li, F.-F., Fergus, R., and Perona, P. One-shot learning ofobject categories. IEEE transactions on pattern analysisand machine intelligence, 28(4):594–611, 2006.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliabilityof out-of-distribution image detection in neural networks.arXiv preprint arXiv:1706.02690, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., andVladu, A. Towards deep learning models resistant toadversarial attacks. arXiv preprint arXiv:1706.06083,2017.

Mahajan, D., Tople, S., and Sharma, A. Domain gen-eralization using causal matching. arXiv preprintarXiv:2006.07500, 2020.

Manevitz, L. M. and Yousef, M. One-class svms for doc-ument classification. Journal of machine Learning re-search, 2(Dec):139–154, 2001.

Meinke, A. and Hein, M. Towards neural networks thatprovably know when they don’t know. arXiv preprintarXiv:1909.12180, 2019.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro,N. Exploring generalization in deep learning. In Advancesin neural information processing systems, pp. 5947–5956,2017.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.Scikit-learn: Machine learning in Python. Journal ofMachine Learning Research, 12:2825–2830, 2011.

Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham,K., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adver-sarial robustness through local linearization. In Advancesin Neural Information Processing Systems, pp. 13847–13856, 2019.

Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Trans-fusion: Understanding transfer learning for medical imag-ing. In Advances in neural information processing sys-tems, pp. 3347–3357, 2019.

Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo,M., Dillon, J., and Lakshminarayanan, B. Likelihood ra-tios for out-of-distribution detection. In Advances in Neu-ral Information Processing Systems, pp. 14707–14718,2019.

Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry,A. Do adversarially robust imagenet models transferbetter? arXiv preprint arXiv:2007.08489, 2020.

Shorten, C. and Khoshgoftaar, T. M. A survey on imagedata augmentation for deep learning. Journal of Big Data,6(1):60, 2019.

Sitawarin, C., Chakraborty, S., and Wagner, D. Improv-ing adversarial robustness through progressive hardening.arXiv preprint arXiv:2003.09347, 2020.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,D., Goodfellow, I., and Fergus, R. Intriguing properties ofneural networks. arXiv preprint arXiv:1312.6199, 2013.

Utrera, F., Kravitz, E., Erichson, N. B., Khanna, R., andMahoney, M. W. Adversarially-trained deep nets transferbetter. arXiv preprint arXiv:2007.05869, 2020.

Wang, G., Yang, S., Liu, H., Wang, Z., Yang, Y., Wang, S.,Yu, G., Zhou, E., and Sun, J. High-order informationmatters: Learning relation and topology for occludedperson re-identification. In Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition,pp. 6449–6458, 2020a.

Wang, M. and Deng, W. Deep visual domain adaptation: Asurvey. Neurocomputing, 312:135–153, 2018.

Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Generalizingfrom a few examples: A survey on few-shot learning.ACM Computing Surveys (CSUR), 53(3):1–34, 2020b.

Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K. Aggre-gated residual transformations for deep neural networks.In Proceedings of the IEEE conference on computer vi-sion and pattern recognition, pp. 1492–1500, 2017.

Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R.,and Chaudhuri, K. A closer look at accuracy vs. robust-ness. arXiv preprint arXiv:2003.02460, 2020.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. Howtransferable are features in deep neural networks? InAdvances in neural information processing systems, pp.3320–3328, 2014.

Zagoruyko, S. and Komodakis, N. Wide residual networks.arXiv preprint arXiv:1605.07146, 2016.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz,D. mixup: Beyond empirical risk minimization. arXivpreprint arXiv:1710.09412, 2017.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jor-dan, M. I. Theoretically principled trade-off between ro-bustness and accuracy. arXiv preprint arXiv:1901.08573,2019.


A. Proof of Separation TheoremAs a warm-up, we prove Theorem 1 for R and C “ 2 classes. We will use this as a building block for the general result.

A.1. Warm-up: binary case

For this special case, the universe for the examples will be the real line R, and we consider a binary classification task with athird category that only appears in the testing distribution. Let ε P p0, 12q be a parameter. For the training distribution µ,we define four regions:

1. Positive, large probability. Let P0 “ r1, 2s, labeled as “`”.

2. Positive, small probability. Let P1 “ r3, 4s, labeled as “`”.

3. Negative, large probability. Let N0 “ r´2,´1s, labeled as “´”.

4. Negative, small probability. Let N1 “ r´4,´3s, labeled as “´”.

To sample from the training distribution µ, we first set ` P t´1, 1u randomly with equal probability. Then, we choosei P t0, 1u, where i “ 0 with probability 1´ ε and i “ 1 with probability ε. If ` “ 1, sample a point x uniformly from Pi,and otherwise, if ` “ ´1, we sample uniformly from Ni. Note that with probability 1´ ε, we have that x P P0 YN0, whilethe probability of seeing any point in P1 YN1 is only ε. Finally, let ν be the uniform distribution on r´6,´5s Y r5, 6s,where for x „ ν, we label it as signpxq.

We first argue that close category generalization can be efficiently solved. During training time, if we see at least 32 samplesfrom µ, then with probability at least 99%, we will see samples from both P0 and N0, since 1 ´ ε ą 12, and we seesamples from each class with equal probability. Therefore, once we have at least one sample from each class, we canconstruct the classifier decides ˘1 based on the midpoint of the training examples (which will be between ´2 and `2 withgood probability). Then, on the testing distribution ν, we see that all points will be classified correctly with the label of theirnearest neighbor in the support of µ.

Turning to out-of-distribution detection, we claim that Ωp1εq samples are necessary. Indeed, to distinguish whether a samplecomes from ν or from P1YN1, we must see at least one sample from each of P1 and N1, since the support of ν is unknownat training time. As the probability of sampling from P1 or N1 is only ε, we will miss one of these regions with probability99% if we have fewer than t “ 1p100εq samples from µ. Indeed, with probability p1´ εqt ě e´εt “ e´.01 ą 0.99, wehave that all the samples come from P0 YN0.

A.2. General case

We now provide the proof of Theorem 1 for any number C ě 2 of classes and for any d ě 1 dimensional dataset in Rd withnearest neighbors measured in `2 distance.

For j P t1, 2, . . . , Cu we define the following centers

aj0 “ 1` 10j and aj1 “ 3` 10j and aj2 “ 5` 10j,

where we naturally embed them in d dimensions by using these as the value of the first coordinate and setting the rest of thecoordinates to be zero. In other words, we define aji “ aji ¨ e1, where e1 is the standard basis vector, so that aji P Rd.

Then, for i P t0, 1, 2u and j P t1, 2, . . . , Cu, we define the following regions, which are cubes centered at the points definedabove and have side length 1

?d. Formally, we consider the d-dimensional cubes

Aji “

!

aji ` px1, x2, . . . , xdq | 0 ď xk ď 1?d

)

.

To sample from the training distribution µ, we first choose ` P t1, 2, . . . , Cu uniformly at random. Then, we choosei P t0, 1u, where i “ 0 with probability 1 ´ ε and i “ 1 with probability ε. Given our choice of `, we sample a point xuniformly from A`

i . Note that with probability 1´ ε, we have that x PŤCj“1 A

j0, while the probability of seeing any point


inŤCj“1 A

j1 is only ε. Finally, let ν be the uniform distribution on

ŤCj“1 A

j2. For both distributions, we label x as j if it

comes from Aji for any i P t0, 1, 2u.

Notice that this definition with j “ 0 corresponds to the positively labeled regions pr1, 2s, r3, 4s, r5, 6sq from the proof ofthe binary case in the previous subsection. The probabilities are also the same when C “ 2.

We explain the key properties of these regions, and then we prove the sample complexity results claimed in the theoremstatement. First, for any i P t0, 1, 2u and j P t1, 2, . . . , Cu, if x,y P Aj

i , then x´ y2 ď 1 because each Aji is a cube with

side length 1?d in Rd.

Next, consider x P Aj2. We claim that x is closer to Aj

0 than to any point z P Aj1

0 YAj1

1 for any j1 ‰ j. To see this, we cancheck that the triangle inequality implies that

minyPAj

0

x´ y2 ď 4` 1 “ 5,

while, since the centers satisfy |aj2 ´ aj1

1 | ą |aj2 ´ a

j1

0 | ě 6, we also have that for j1 ‰ j,

minzPAj1

0 YAj1

1

x´ z2 ě 6.

As a consequence, we have that the nearest neighbor in `2 distance for any point x P Aj2 has the same label j as x does. In

particular, this implies that we can solve the close category generalization problem for points sampled from ν. To do so, wefirst sample ΘpC logCq points from µ, so that by a coupon collector argument, we see at least one point from Aj

0 for eachj P t1, 2, . . . , Cu. Then, recall that ν is supported on the union of Aj1

2 over j1 P t1, 2, . . . , Cu. By the above calculations,we have that the nearest neighbor for a point x P Aj

2 is some point from either Aj0 or Aj

1. Therefore, since we have sampledat least one point from Aj

0, we can correctly determine that x has label j by computing the nearest neighbor in our sampledpoints. To be more precise, we can compute the multi-class large-margin classifier, where we have sequential decisionregions (corresponding roughly to the centers defined above), setting the decision boundaries to be equally spaced betweensamples from adjacent regions (i.e., the natural generalization of the 1D large-margin solution). Importantly, this solutiondoes not require any extra knowledge of the support of µ and ν because it can be computed directly from the samples (andwe have argued that with ΘpC logCq samples, we will see all C classes at least once).

We turn our attention to our lower bound, which is that we need at least ΩpCεq samples to solve the OOD detectionproblem. More precisely, we provide a lower bound for the number of samples to guarantee that we see that least one pointfrom each region Aj

1 for each j P t1, 2, . . . , Cu. This is a prerequisite for solving the OOD detection problem, becauseotherwise, we cannot tell whether a point comes from µ or ν without prior knowledge of the regions. For the lower bound,we use the same argument as in the binary case in the previous subsection. This implies that we need ΩpCεq samples to seeone from Aj

1 for each fixed j since the probability of sampling from this region is εC by the definition of µ.

A.3. Alternative generalizations

We could also use a “noisy one-hot encoding” to prove the theorem, replicating and rotating the 1D dataset log2 C times,to get a subset of Rlog2 C for C classes. One dimension is non-zero for each point, and each dimension has points fromtwo possible labels (C total labels). Use 6C regions to define the low probability, high probability, and OOD regions (6 ineach dimension with 3 for each class). Again, by a coupon collector argument, we will see some point from each of thehigh probability regions after OpC logCq samples. This enables close category generalization. On the other hand, for OODdetection, we need ΩpCεq samples, where ε is the sample probability from a low probability region.

Instead of boxes, we could use Gaussian distributions with covariance σ2Id and means shifted by increments of a vector,spacing out the means by distance Ωpσ

a

logpdεqq to get analogous guarantees. Similar ideas work for Hamming distance ont0, 1ud; embed regions as intervals in the partial order along a path from 0d to 1d, spacing them out to ensure 1-NN properties.In general, there are many metric spaces where we can provide a separation between close category generalization and OODdetection by correctly setting up the regions and sampling probabilities. Therefore, we believe it a general phenomena thatclose category generalization is a more tractable goal, in terms of sample complexity, than OOD detection.


B. Image source of the images in Figure 1Images retrieved online:

• c3po: https://hips.hearstapps.com/digitalspyuk.cdnds.net/16/46/1479397679-c-3po-see-threepio-68fe125c.jpeg?crop=1xw:1.0xh;center,top&resize=1200:*.

• chewbacca: https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fspecials-images.forbesimg.com%2Fdam%2Fimageserve%2F958761228%2F960x0.jpg%3Ffit%3Dscal.

• groot: https://cdn.vox-cdn.com/thumbor/2klN1dy4JWIkR76hyxIb5G6z0P8=/1400x1050/filters:format(jpeg)/cdn.vox-cdn.com/uploads/chorus_asset/file/8378039/baby-groot-guardians.0.jpg.

• jarjar: https://boundingintocomics.com/wp-content/uploads/2019/01/2019.01.16-09.34-boundingintocomics-5c3fa35bdfa0d.png.

• Minion: https://i.pinimg.com/736x/15/26/62/152662373b8c743a65c1ae9f42b8f8a2.jpg.

The white noise image is generated by the following code.

import numpy as nprandom_state = np.random.RandomState(0)white_noise = random_state.rand(1, 3, 224, 224).astype(np.float32)

C. More details on the experimental setupExperiments are run with NVIDIA GeForce RTX 2080 Ti GPUs and machines with Intel Core i9 9940X and 128GB ofRAM. We compute nearest neighbor using FAISS (Johnson et al., 2017). We use scikit-learn’s (Pedregosa et al., 2011)implementation for one-class SVM. For efficiency, the test accuracy and OOD accuracy for one-class SVM detectionbaseline is estimated on 1000 randomly sampled examples. The experiments for evaluating the OOD accuracy are conductedwith a single run, and the few-shot learning experiments are conducted with ten runs. For these experiments, no additionaldata augmentation is applied. The code for the experiments is available at https://github.com/yangarbiter/close-category-generalization.

Dataset links. MNIST can be found on a public website2, and CIFAR-10 and CIFAR-100 can be found on another publicwebsite3.

Architechtures. We consider the convolutional neural network (CNN)4 and Wider residual network (WRN-40-10) (Zagoruyko & Komodakis, 2016) for our experiments in the pixel space. In the CNN feature space, we also considerstwo of multi-layer perceptrum (MLP) with different size. The first MLP (MLP1) has two hidden layers each with 256neurons and we ReLU as the activation function. The second MLP (MLP2) has one hidden layer with 4096 neurons. Theoutput of the first layer and the hidden layer both go through a ReLU activation function and a 50% drop rate dropout layer.

MNIST setup. We use the CNN used by Zhang et al. (2019) for training neural networks in the pixel space. The learningrate is decreased by a factor of 0.1 on the 40-th, 50-th, and 60-th epoch. We use the output of the last convolutional CNNoutput as the extracted feature.

CIFAR-10 and CIFAR-100 setup. We use Wider ResNet (WRN-40-10) (Zagoruyko & Komodakis, 2016) for trainingneural networks in the pixel space. The learning rate is decreased by a factor of 0.1 on the 40-th, 50-th, and 60-th epoch.We use the output of the first block and the last block of the WRN-40-10 as the mid layer and last layer of CNN features,respectively.

Setup for the fine-tuning experiment. We fine-tune the layers after the last convolutional layer with 40 epochs and initiallearning rate set to 0.0001. We use Adam (Kingma & Ba, 2014) as the optimizer and decrease the learning rate by a factorof 0.1 on the 20-th and 30-th epoch.

2http://yann.lecun.com/exdb/mnist/3https://www.cs.toronto.edu/˜kriz/cifar.html4CNN is retrieved from TRADES (Zhang et al., 2019) github repository https://github.com/yaodongyu/TRADES/blob/

master/models/small_cnn.py

https://hips.hearstapps.com/digitalspyuk.cdnds.net/16/46/1479397679-c-3po-see-threepio-68fe125c.jpeg?crop=1xw:1.0xh;center,top&resize=1200:*

https://hips.hearstapps.com/digitalspyuk.cdnds.net/16/46/1479397679-c-3po-see-threepio-68fe125c.jpeg?crop=1xw:1.0xh;center,top&resize=1200:*

https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fspecials-images.forbesimg.com%2Fdam%2Fimageserve%2F958761228%2F960x0.jpg%3Ffit%3Dscal

https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fspecials-images.forbesimg.com%2Fdam%2Fimageserve%2F958761228%2F960x0.jpg%3Ffit%3Dscal

https://cdn.vox-cdn.com/thumbor/2klN1dy4JWIkR76hyxIb5G6z0P8=/1400x1050/filters:format(jpeg)/cdn.vox-cdn.com/uploads/chorus_asset/file/8378039/baby-groot-guardians.0.jpg



https://boundingintocomics.com/wp-content/uploads/2019/01/2019.01.16-09.34-boundingintocomics-5c3fa35bdfa0d.png

https://boundingintocomics.com/wp-content/uploads/2019/01/2019.01.16-09.34-boundingintocomics-5c3fa35bdfa0d.png

https://i.pinimg.com/736x/15/26/62/152662373b8c743a65c1ae9f42b8f8a2.jpg



http://yann.lecun.com/exdb/mnist/

https://www.cs.toronto.edu/~kriz/cifar.html

https://github.com/yaodongyu/TRADES/blob/master/models/small_cnn.py

https://github.com/yaodongyu/TRADES/blob/master/models/small_cnn.py


dataset MNIST CIFAR-10 CIFAR-100

network structure CNN WRN-40-10 WRN-40-10optimizer SGD Adam Adambatch size 128 64 64

momentum 0.9 - -epochs 70 70 70

initial learning rate 0.01 0.01 0.01# train examples 60000 50000 50000# test examples 10000 10000 10000

# classes 10 10 20

Table 3: Experimental setup for training in the pixel space. No weight decay is applied.

mid layer last layerdataset CIFAR-10 & CIFAR-100 MNIST CIFAR-10 & CIFAR-100

network structure layers after the first block of WRN-40-10 MLP1 MLP2optimizer SGD SGD SGDbatch size 128 256 128

momentum 0.9 0.9 0.9epochs 70 70 70

initial learning rate 0.001 0.01 0.01

Table 4: Experimental setup for training in the CNN feature space space. No weight decay is applied.

D. Additional experiment resultsHow good is our approximation to the sub-Voronoi region? In DBS, we use a ball (B) to approximate the sub-Voronoiregion (V). However, this approximation may be inaccurate if an example has a differently labeled example very close in onedirection, but in another direction, every example is very far away. This can make the ball too small to cover the sub-Voronoiregion. In fact, for these three datasets that we use in this section, most OOD examples lie outside the ball. However, we stillresult in a higher OOD accuracy with our algorithm. This means that the neural network trained with our algorithm becomessmooth no just in the ball B, but it also becomes smoother outside the ball B. This can explain why AT and TRADES canimprove OOD accuracy even if their robust radii are set very small. One way to achieve better approximation is to exploresome other methods that are directional. We had investigated approaches like training on randomly sampled examples inV or using an ellipsoid instead of a ball for approximation. However, these approaches in general require solving manyquadratic or linear programs with many constraints, which is computationally infeasible. We leave the development of amore sophisticated approximation to V for future work.

D.1. An ablation study on the effect of parameter λ

Table 5 shows how training accuracy, test accuracy, and OOD accuracy change as we change the parameter λ. In the CNNfeature generated by the last convolutional layer, results with different λ perform similarly. In the pixel space, when datasetis easier like MNIST, λ “ 1 gives the best result. However, when the problem gets harder, like for CIFAR-10, we start to seesome underfitting when λ is large. In this case, smaller λ can perform better. We also see an increase in the generalizationgap (difference in training accuracy and testing accuracy) when λ gets larger. This is similar to observation observed foradversarial robust learning, where when the robust radius is increased, the clean test accuracy drops (Qin et al., 2019). Intheory, achieving robustness without reducing natural accuracy is possible (Yang et al., 2020), but how to achieve that isstill an open question. We expect that more sophisticated algorithms are developed for adversarial robustness, we can haveimprovements on the close category generalization task as well.


λ=1 λ=1/2 λ=1/3 λ=1/5 λ=1/10

MNIST

pixeltrain acc. 0.98 1.00 1.00 1.00 1.00test acc. 0.98 1.00 1.00 1.00 1.00OOD acc. 0.72 0.65 0.63 0.57 0.52

last layertrain acc. 0.99 1.00 1.00 1.00 1.00test acc. 0.99 1.00 1.00 1.00 1.00OOD acc. 0.81 0.81 0.79 0.78 0.76

CIFAR-10


mid layertrain acc. 0.30 0.56 0.68 0.71 0.77test acc. 0.30 0.56 0.66 0.70 0.75OOD acc. 0.38 0.46 0.45 0.44 0.44


CIFAR-100




Table 5: An ablation study of DBS on the parameter λ under `2 distance.

MNIST CIFAR-10 CIFAR-100pixel last layer pixel mid layer last layer pixel mid layer last layer

λ 1 1 1/2 1/10 1 1/2 1/10 1

Table 6: lambda choices for our algorithm

D.2. When does DBS perform well on OOD examples?

We bin all OOD examples based on their distance to the closest training example and show the OOD accuracy of each bin inFigure 4.

(a) MNIST (b) CIFAR-10 (c) CIFAR-100

Figure 4: OOD accuracy vs. distance to the closest training example in `2. Each dataset have their canonical class removedas the unseen category (MNIST: digit9, CIFAR-10: airplane, CIFAR-100: aquatic mammals).

D.3. Separation between OOD and in-distribution examples

To understand why off-manifold training performs better, we measure the separation between OOD and training examples(in-distribution examples). More specifically, let x be an OOD example in the representations produced by various methods,x1 be the closest OOD example to x, and x1tr be the closest training example to x; we measure the average distance ratiodistpx,x1qdistpx,x1trq

over all OOD examples. The smaller this ratio is, the more separated in- and out-of-distribution examples are.The results are shown in Table 7, and we see that the best performing algorithm in the without fine-tuning experiment in


MNIST CIFAR-10 CIFAR-100trn OOD ratio trn OOD ratio trn OOD ratio

natural 51.32 34.66 0.68 1.30 1.11 0.85 1.69 1.60 0.94mixup 12.37 8.40 0.68 0.48 0.41 0.85 0.67 0.62 0.93AT 20.73 12.65 0.61 1.21 0.97 0.80 1.49 1.36 0.91TRADES 4.99 2.94 0.59 0.69 0.51 0.73 0.85 0.78 0.93DBS 3.51 2.16 0.62 0.96 0.69 0.71 1.50 1.28 0.85

Table 7: Each dataset have their canonical class removed as the unseen category (MNIST: digit9, CIFAR-10: airplane,CIFAR-100: aquatic mammals). This table shows: (i) trn = avg. dist. between OOD examples and their closest trainingexample, (ii) OOD = avg. dist. between OOD examples and their closest OOD example excluding itself, (iii) OOD:trn =ratio between these two distances. When the ratio is small, OOD examples are closer to same-label than different-labelexamples. Smaller ratio is better; best in bold.

each dataset has the smallest ratio. These results show that the representations obtain through off-manifold trainings adaptbetter to unseen categories by better separating in- and out-of-distribution examples in the feature space.

E. Pseudocode for DBS

Algorithm 1 txi, yiuNi“1, λ, β, η, thresh, T

εmaxi “ λ ¨minxjPX‰yi12distpxi,xjq

εi Ð 0. @i P rN sfor # epoch do

for i “ 1..N doεi Ð εi ` ηεi “ maxpεi, ε

maxi q

δi Ð 0for j “ 1..T doδi Ð α ¨ signp∇δiDKLpfθpxi ` δiq, fθpxiqqproject δi onto Bp0, εiq

end forif DKLpfθpxi ` δiq, fθpxiqq ą thresh thenεi Ð εi ´ 2 ˚ η

end ifupdate θ to minimize `pfθpxiq, yiq ` β ˚DKLpfθpxi ` δiq, fθpxiqq

end forend for

E.1. Using other classes as the unseen category.

We conduct experiments using other classes as the unseen category. The hyperparameters are the same as the experiments inSection 4. We assign different classes for each dataset and evaluate their training, testing, and OOD accuracy. For MNISTand CIFAR-10, we set each of the 10 classes as the unseen category. The results are in Table 10, 11, and 12. For CIFAR-100,we pick top 5 super-classes in alphabetical order as the unseen category, leading to 5 experiments in total, and the results arein Table 13.

These experiments are also conducted on few-shot learning tasks. The experiments are repeated 10 times, and the mean andstandard error are recorded. Table 15, 17, and 19 show the result of using 1-NN as the classifier. Table 16, 18, and 20 showthe result of using neural network as the classifier.

In addition, for CIFAR-100, we also experiment with fine-labeling (instead of coarse labeling used in Section 4). Werandomly select a category in each super-class as the unseen categories. There will be a total of 20 unseen categories and 80seen categories. The results are in Table 14.

Table 8 show the average rank of each method for each dataset across different unseen categories. In the pixel space, we seethat AT, TRADES, and DBS generally have a higher OOD accuracy than natural and mixup. In the CNN feature space, DBS


natural mixup AT TRADES DBS

MNIST pixel 4.2 4.8 1.8 1.3 2.9last layer 4.7 2.3 3.9 2.7 1.4

CIFAR-10pixel 3.4 4.0 2.0 1.2 2.1mid layer 3.8 2.2 3.5 3.7 1.7last layer 1.9 3.9 4.9 3.1 1.2

CIFAR-100pixel 4.2 4.8 2.6 1.2 2.2mid layer 4.2 1.6 4.2 2.8 2.2last layer 3.0 4.0 5.0 1.6 1.4

average 3.7 3.6 3.5 2.2 1.9

Table 8: Average rank for OOD accuracy across different unseen categories. The numbers are lower the better.

generally performs the best or at least equally well to other algorithms. AT and TRADES did not perform that well in theCNN feature space possibly due to a lack of a method to adapt their perturbation distance to different feature spaces. Theseresults align with our previous conclusion that off-manifold training performs better than on-manifold training in OODaccuracy, and DBS can perform better than other off-manifold training methods in CNN feature space.

Table 9 show the average rank of each method for few-shot learning tasks. We see that overall, DBS performs the bestand TRADES performs the second best. AT and mixup follow behind and natural performs the worst. This result is alsoconsistent with what we report in Section 4.

# shots natural mixup AT TRADES DBS

1-NN

MNIST

10 4.50 1.10 3.30 2.10 4.0020 4.50 1.10 3.40 2.10 3.90100 4.50 1.00 3.50 2.20 3.801000 4.40 1.10 3.30 2.20 4.00

CIFAR-10

10 4.80 3.90 3.33 2.00 1.0020 4.80 3.90 3.33 2.00 1.00100 4.70 4.20 3.10 2.00 1.001000 4.70 4.30 2.90 2.11 1.00

CIFAR-100

10 4.40 3.80 2.20 3.60 1.0020 4.60 3.40 2.20 3.80 1.00100 4.80 3.20 2.20 3.80 1.001000 4.50 3.40 2.00 4.10 1.00

average 4.60 2.87 2.90 2.67 1.98

neural network

MNIST

10 5.00 2.70 3.70 1.40 2.2020 5.00 2.50 3.60 1.40 2.50100 5.00 2.20 3.80 1.70 2.301000 5.00 2.50 3.20 2.00 2.30

CIFAR-10

10 4.20 4.30 3.50 2.00 1.0020 3.70 5.00 3.30 2.00 1.00100 3.70 5.00 3.00 2.30 1.001000 3.50 5.00 3.10 2.40 1.00

CIFAR-100

10 3.83 2.33 3.00 4.33 1.5020 3.33 2.83 3.50 4.00 1.33100 2.33 3.50 3.67 4.00 1.501000 2.50 3.33 3.83 4.00 1.33

average 3.92 3.43 3.43 2.63 1.58

Table 9: Average rank for few-shot learning tasks across different unseen categories. The numbers are lower the better.


natural mixup AT TRADES DBSunseen category

digit 0



digit 1



digit 2



digit 3



digit 4



digit 5



digit 6



digit 7



digit 8



digit 9



Table 10: The result of using other digits as the unseen category for MNIST.



airplane




automobile




bird




cat




deer




Table 11: The result of using the first five classes as the unseen category for CIFAR-10.



dog




frog




horse




ship




truck




Table 12: The result of using the last five classes as the unseen category for CIFAR-10.



aquatic mammals




fish




flowers




food containers




fruit and vegetables




Table 13: The result of using a different unseen category for CIFAR-100.


randomly take one categoryfrom each superclass




Table 14: The result of using different unseen categories for CIFAR-100. Here, CIFAR-100 is considered under fine labeling,which has a total of 100 classes. There are 20 super-classes in CIFAR-100 and each super-class is consist of 5 classes. Werandomly remove 1 class from each of the 20 super-classes, resulting in 20 unseen categories.


ood acc (10) ood acc (20) ood acc (100) ood acc (1000)

digit 0

natural 0.60 ˘ 0.01 0.77 ˘ 0.01 0.93 ˘ 0.00 0.99 ˘ 0.00mixup 0.75 ˘ 0.01 0.86 ˘ 0.01 0.96 ˘ 0.00 0.99 ˘ 0.00AT 0.60 ˘ 0.01 0.77 ˘ 0.01 0.93 ˘ 0.00 0.99 ˘ 0.00TRADES 0.64 ˘ 0.01 0.80 ˘ 0.01 0.94 ˘ 0.00 0.99 ˘ 0.00DBS 0.62 ˘ 0.01 0.79 ˘ 0.01 0.94 ˘ 0.00 0.99 ˘ 0.00

digit 1


digit 2


digit 3


digit 4


digit 5


digit 6


digit 7


digit 8


digit 9


Table 15: The result of using different digits as the unseen category for MNIST for the few-shot learning task with 1-NN.



digit 0


digit 1


digit 2


digit 3


digit 4


digit 5


digit 6


digit 7


digit 8


digit 9


Table 16: The result of using different digits as the unseen category for MNIST for the few-shot learning task with a neuralnetwork.



airplane


automobile


bird


cat


deer


dog


frog


horse


ship


truck


Table 17: The result of using different digits as the unseen category for CIFAR-10 for the few-shot learning task with 1-NN.



airplane


automobile


bird


cat


deer


dog


frog


horse


ship


truck


Table 18: The result of using different classes as the unseen category for CIFAR-10 for the few-shot learning task with aneural network.



aquatic mammals


fish


flowers


food containers




Table 19: The result of using different classes as the unseen category for CIFAR-100 for the few-shot learning task with1-NN.


aquatic mammals


fish


flowers


food containers




Table 20: The result of using different classes as the unseen category for CIFAR-100 for the few-shot learning task with aneural network.

Documents

Close Category GeneralizationClose Category Generalization to unseen categories. Our goal also di ers from adver-sarial robustness because we focus on natural inputs (e.g., real images)