
LADA: Look-Ahead Data Acquisition via Augmentation for Active Learning

Yoon-Yeong Kim, Kyungwoo Song, JoonHo Jang, Il-Chul Moon
Department of Industrial and Systems Engineering
KAIST, Daejeon, Republic of Korea
{yoonyeong.kim,gtshs2,adkto8093,icmoon}@kaist.ac.kr

Abstract

Active learning effectively collects data instances for training deep learning models when the labeled dataset is limited and the annotation cost is high. Besides active learning, data augmentation is also an effective technique to enlarge the limited amount of labeled instances. However, the potential gain from virtual instances generated by data augmentation has not yet been considered in the acquisition process of active learning. Looking ahead to the effect of data augmentation during acquisition would select and generate the data instances that are informative for training the model. Hence, this paper proposes Look-Ahead Data Acquisition via augmentation, or LADA, to integrate data acquisition and data augmentation. LADA considers both 1) the unlabeled data instance to be selected and 2) the virtual data instance to be generated by data augmentation, in advance of the acquisition process. Moreover, to enhance the informativeness of the virtual data instances, LADA optimizes the data augmentation policy to maximize the predictive acquisition score, resulting in the proposal of InfoMixup and InfoSTN. As LADA is a generalizable framework, we experiment with various combinations of acquisition and augmentation methods. The performance of LADA shows a significant improvement over recent augmentation and acquisition baselines that were independently applied to the benchmark datasets.

Introduction

Large-scale datasets in the big data era have opened the blooming of artificial intelligence, but data labeling requires significant effort from human annotators. Therefore, adaptive sampling, i.e., Active Learning, has been developed to select the most informative data instances for learning the decision boundary (Cohn, Ghahramani, and Jordan 1996; Tong 2001; Settles 2009). This selection is difficult because it is influenced by the learner and the dataset at the same time. Hence, understanding the relation between the learner and the dataset has become a core component of active learning, which queries the next training example by its informativeness for learning the decision boundary.

Besides active learning, data augmentation is another source of virtual data instances for training models. The labeled data may not cover the full variation of the generalized data instances, so the learner has used data augmentation, particularly in the vision community (Liu and


Figure 1: Illustration of different scenarios for applying acquisition and augmentation; with selected instances (★), virtual instances (×), current decision boundary (solid line), and updated decision boundary (dashed line). (a) Max Entropy selects the instances near the decision boundary and updates the boundary. (b) If we augment the selected instances afterward, the virtual instances may not be useful for updating the boundary. (c) By considering the potential gain of the virtual instances from augmentation in advance of the acquisition, we enhance the informativeness of both the selected and augmented instances for learning the decision boundary. (d) Moreover, we train the augmentation policy to maximize the acquisition score of the virtual instances to generate useful instances.

Ferrari 2017; Perez and Wang 2017; Cubuk et al. 2019). Conventional data augmentation has been a simple transformation of labeled data instances, i.e., flipping, rotating, etc. Recently, data augmentation has expanded to deep generative models, such as Generative Adversarial Networks (GAN) (Goodfellow et al. 2014) or Variational Autoencoders (VAE) (Kingma and Welling 2014), that generate virtual examples. Since the conventional augmentations and the generative model-based augmentations perform Vicinal Risk Minimization (VRM) (Chapelle et al. 2001), they assume that the virtual data instances in the vicinity share the same label, which limits the feasible vicinity. To overcome the limited vicinity of VRM, Mixup and its


variants have been proposed by interpolating multiple data instances (Zhang et al. 2017). The pair of interpolated features and labels, or the Mixup instance, becomes a virtual instance to enlarge the support of the training distribution.

Data augmentation and active learning intend to overcome the scarcity of labeled datasets in different directions. First, active learning emphasizes the optimized selection of the unlabeled real-world instances for the oracle query, so there is no consideration of the benefit of virtual data generation. Second, data augmentation focuses on generating an informative virtual data instance without intervening in the data selection stage, and without the potential assistance from the oracle. These differences motivate us to propose the Look-Ahead Data Acquisition via augmentation, or LADA, framework.

LADA looks ahead to the effect of data augmentation in advance of the acquisition process, and LADA selects data instances by considering both unlabeled data instances and virtual data instances generated by data augmentation at the same time. Whereas the conventional acquisition function does not consider the potential gain of data augmentation, LADA contemplates the informativeness of the virtual data instances by integrating data augmentation into the acquisition process. Figure 1 describes the different behaviors of LADA and conventional acquisition functions when applying data augmentation to active learning.

Here are our contributions from the methodological and the experimental perspectives. First, we propose a generalized framework, named LADA, that looks ahead to the acquisition score of the virtual data instance to be augmented, in advance of the acquisition. Second, we train the data augmentation policy to maximize the acquisition score in order to generate informative virtual instances. In particular, we propose two data augmentation methods, InfoMixup and InfoSTN, which are trained by the feedback of acquisition scores. Third, we substantiate the proposed framework by implementing variations of acquisition-augmentation frameworks with known acquisition and augmentation methods.

Preliminaries

Problem Formulation

This paper trains a classifier network, f_θ, with a dataset X, while our scenario assumes X = X_U ∪ X_L and |X_U| ≫ |X_L|. Here, X_U is a set of unlabeled data instances, and X_L is a labeled dataset. Given these notations, a data augmentation function, f_aug(x; τ): X → V(X), transforms a data instance, x ∈ X, into a modified instance in V(X); where τ is a parameter describing the policy of transformation, and V(X) is the vicinity set of X (Chapelle et al. 2001). On the other hand, a data acquisition function, f_acq(x; f_θ): X_U → R, calculates a score of each data instance, x ∈ X_U, based on the current classifier, f_θ; and f_acq represents the instance selection strategy in the learning procedure of f_θ with the instance, x ∈ X_U.

Data Augmentation

In the conventional data augmentations, τ in f_aug(x; τ) indicates the predefined degree of rotating, flipping, cropping, etc. τ is manually chosen by the modeler to describe the vicinity of each data instance.

Another approach to modeling τ is utilizing the feedback from the current classifier network, f_θ. Spatial Transformer Network (STN) is a transformer that generates a virtual example by training τ to minimize the cross-entropy (CE) loss of the transformed data (Jaderberg et al. 2015):

    τ* = argmin_τ CE(f_θ(f_aug^STN(x; τ)), y),    (1)

where y is the ground-truth label of the data instance, x.

Recently, Mixup-based data augmentations generate a virtual data instance from the vicinity of a pair of data instances. In Mixup, τ becomes the mixing policy of two data instances, x_i and x_j (Zhang et al. 2017):

    f_aug^Mixup(x_i, x_j; τ) = λ x_i + (1 − λ) x_j,   λ ∼ Beta(τ, τ),    (2)

where the labels are also mixed by the proportion λ. While Eq.(2) corresponds to the input feature mixture, Manifold Mixup mixes the hidden feature maps from the middle of the neural network to learn a smoother decision boundary at multiple levels of representation (Verma et al. 2018). Whereas τ is a fixed value without any learning process, AdaMixup learns τ by adopting a discriminator, ϕ_ada (Guo, Mao, and Zhang 2019):

    τ* = argmax_τ  log P(ϕ_ada(f_aug^Mixup(x_i, x_j; τ)) = 1)
         + log P(ϕ_ada(x_i) = 0) + log P(ϕ_ada(x_j) = 0).    (3)
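The following minimal PyTorch-style sketch illustrates the input-level Mixup of Eq.(2); the function name and the use of one-hot labels are illustrative assumptions rather than part of the original implementation.

```python
import torch
from torch.distributions import Beta

def mixup(x_i, x_j, y_i, y_j, tau=1.0):
    """Input-level Mixup (Eq. 2): mix a pair of inputs and their one-hot labels."""
    lam = Beta(tau, tau).sample()          # mixing ratio lambda ~ Beta(tau, tau)
    x_mix = lam * x_i + (1.0 - lam) * x_j  # interpolated input feature
    y_mix = lam * y_i + (1.0 - lam) * y_j  # labels mixed by the same proportion
    return x_mix, y_mix

# usage: two image batches with one-hot labels over 10 classes
x_i, x_j = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
y_i = torch.eye(10)[torch.randint(10, (8,))]
y_j = torch.eye(10)[torch.randint(10, (8,))]
x_mix, y_mix = mixup(x_i, x_j, y_i, y_j, tau=1.0)
```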

Active Learning

We focus on pool-based active learning with an uncertainty score (Settles 2009). Given this scope of active learning, the data acquisition function measures the utility score of the unlabeled data instances, i.e., x* = argmax_x f_acq(x; f_θ).

The traditional acquisition functions measure the predictive entropy, f_acq^Ent(x; f_θ) = H[y|x; f_θ] (Shannon 1948); or the variation ratio, f_acq^Var(x; f_θ) = 1 − max_y P(y|x; f_θ) (Freeman 1965). A more recent acquisition function calculates the hypothetical disagreement by f_θ on a data instance, f_acq^BALD(x; f_θ) = H[y|x; f_θ] − E_{P(θ|D_train)}[H[y|x; f_θ]] (Houlsby et al. 2011).
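As a minimal sketch (assuming a classifier that outputs softmax probabilities), the entropy and variation-ratio scores above can be computed as follows; the helper names are illustrative.

```python
import torch

def entropy_score(probs, eps=1e-12):
    """Predictive entropy H[y|x]: higher means the classifier is more uncertain."""
    return -(probs * (probs + eps).log()).sum(dim=-1)

def var_ratio_score(probs):
    """Variation ratio 1 - max_y P(y|x)."""
    return 1.0 - probs.max(dim=-1).values

probs = torch.softmax(torch.randn(5, 10), dim=-1)  # stand-in for f_theta(x) over 10 classes
scores = entropy_score(probs)
x_star = scores.argmax()  # index of the instance with the highest acquisition score
```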

Besides the classifier network, f_θ, additional modules can be applied to measure the acquisition score. To find the instance in X_U most dissimilar to X_L, a discriminator, ϕ_VAAL, is introduced to estimate the probability of belonging to X_U (Sinha, Ebrahimi, and Darrell 2019):

    f_acq^VAAL(x; ϕ_VAAL) = P(x ∈ X_U; ϕ_VAAL).    (4)

To diversely select uncertain data instances, the gradient embedding from the pseudo label, ŷ, is used in the k-MEANS++ seeding algorithm (Ash et al. 2020):

    f_acq^BADGE(x; f_θ) = ∂/∂θ_out CE(f_θ(x), ŷ).    (5)

Active Learning with Data Augmentation

There are a few prior works on integrating active learning and data augmentation effectively. Bayesian Generative Active Deep Learning (BGADL) integrates acquisition and augmentation by selecting data instances via f_acq, then augmenting the selected instances afterward via f_aug, which is a VAE (Tran et al. 2019). However, BGADL limits the vicinity to preserve the labels, and BGADL demands many labeled instances to train the generative models. More importantly, BGADL does not consider the potential gain of data augmentation in the process of acquisition.

Methodology

A contribution of this paper is proposing an integrated framework of data augmentation and acquisition, so we start by formulating such a framework. Afterward, we propose an integrated function for acquisition and augmentation as an example of the implemented framework.

Look-Ahead Data Acquisition via Augmentation

Since we look ahead to the acquisition score of augmented data instances, it is natural to integrate the functionalities of acquisition and augmentation. This paper proposes the Look-Ahead Data Acquisition via augmentation, or LADA, framework. Figure 2a depicts the LADA framework, which consists of the data augmentation component and the acquisition component. The goal of LADA is to enhance the informativeness of both 1) the real-world data instance, which is currently unlabeled but will be labeled by the oracle in the future; and 2) the virtual data instance, which will be generated from the unlabeled data instances that are selected. This goal is achieved by looking ahead to their acquisition scores before the actual selections for the oracle annotations.

Specifically, LADA trains the data augmentation function, f_aug(x; τ), to maximize the acquisition score of the transformed data instance of x_U before the oracle annotations. Eq.(6) specifies the learning objective of the augmentation policy via the feedback from acquisition:

    τ* = argmax_τ f_acq(f_aug(x_U; τ); f_θ).    (6)

With the optimal τ* corresponding to x_U, f_acq calculates the acquisition score of x_U (see Eq.(7)), and the score also considers the utility of the augmented instance, f_aug(x_U; τ*):

    x_U* = argmax_{x_U ∈ X_U} [ f_acq(x_U; f_θ) + f_acq(f_aug(x_U; τ*); f_θ) ].    (7)

Whereas the proposed LADA framework is a generalized framework that can adopt various types of acquisition and augmentation functions, this section mainly adopts Mixup for f_aug, i.e., f_aug^Mixup, and Max Entropy for f_acq, i.e., f_acq^Ent. To begin with, we introduce an integrated single function to substitute the composition of functions as f_integ = f_acq ∘ f_aug(x_U) = f_acq(f_aug(x_U; τ); f_θ) for generality.
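To make the look-ahead concrete, here is a schematic sketch of Eq.(6)-(7) under simplifying assumptions: the augmentation policy τ is chosen by a coarse grid search rather than the learned policy network introduced below, and `f_aug` and the acquisition helper (e.g., `entropy_score` sketched earlier) are placeholders.

```python
import torch

def look_ahead_score(x_u, f_theta, f_aug, acq, taus=(0.5, 1.0, 2.0)):
    """Eq.(6)-(7): score an unlabeled instance together with its best augmentation.

    A grid search over tau stands in for the learned policy; LADA instead trains
    a policy network to output tau* (see InfoMixup below).
    """
    with torch.no_grad():
        base = acq(torch.softmax(f_theta(x_u), dim=-1)).item()        # f_acq(x_U; f_theta)
        best = max(acq(torch.softmax(f_theta(f_aug(x_u, tau)), dim=-1)).item()
                   for tau in taus)                                    # f_acq(f_aug(x_U; tau*); f_theta)
    return base + best

# selection: pick the unlabeled instance whose combined look-ahead score is highest
# x_star = max(pool, key=lambda x_u: look_ahead_score(x_u, f_theta, f_aug, entropy_score))
```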

Integrated Augmentation and Acquisition: InfoMixup

As we introduce LADA with f_integ to look ahead to the acquisition score of the virtual data instances, f_integ can be a simple composition of well-known acquisition functions and augmentation functions where the policy of augmentation is fixed. However, this does not enhance the informativeness of the virtual data instances. Hence, we propose an integration where the policy of data augmentation is optimized to maximize the acquisition score, within a single function. Here, we introduce InfoMixup as a learnable data augmentation.

Data Augmentation First, we propose InfoMixup, which is an adaptive version of Mixup that integrates the data augmentation into active learning. InfoMixup learns its mixing policy, τ, by the objective function in Eq.(8), where λ ∼ Beta(τ*, τ*) maximizes the acquisition score of the virtual data instance resulting from mixing two randomly paired data instances, x_i and x'_i:

    f_EntMix(x_i, x'_i; τ, f_θ) = f_acq^Ent(f_aug^Mixup(x_i, x'_i; τ); f_θ)
    τ* = argmax_τ f_EntMix(x_i, x'_i; τ, f_θ).    (8)

InfoMixup is the starting point where we correlate the data augmentation with the data acquisition from the perspective of the predictive classifier entropy.

We adopt Manifold Mixup as the data augmentation at the hidden layer. Specifically, the pair (x_i, x'_i) ∈ X_U is processed through the current classifier network, f_θ, until the propagation reaches the randomly selected k-th layer¹. Afterwards, the k-th feature maps (h^k(x_i), h^k(x'_i)) are concatenated and processed by the policy generator network, π_φ, to predict the τ*_i that maximizes the acquisition score.

Data Augmentation Policy Learning As we formulate the Mixup-based augmentation, we propose a policy generator network, π_φ, to perform amortized inference on the Beta distribution of InfoMixup. While we provide the details of the policy network in Appendix A.2 and Figure 2b, we formulate this inference process as Eq.(9) and Eq.(10):

    h^k(x_i) = f_θ^{0:k}(x_i),   h^k(x'_i) = f_θ^{0:k}(x'_i)    (9)
    τ_i = π_φ([h^k(x_i), h^k(x'_i)]) = MLP_φ([h^k(x_i), h^k(x'_i)]).    (10)

To train the parameters, φ, of the policy generator network, π, the paired features are mixed up with N sampled λ_i's. Using the n-th sample λ_i^n from the Beta distribution inferred by π_φ, the feature maps h^k(x_i) and h^k(x'_i) are mixed to produce h_m^{k,n}(x_i, x'_i) as below:

    λ_i^n ∼ Beta(τ_i, τ_i),    (11)
    h_m^{k,n}(x_i, x'_i) = λ_i^n h^k(x_i) + (1 − λ_i^n) h^k(x'_i).    (12)

By processing h_m^{k,n} through the remaining layers of the classifier network, the predictive class probability of the mixed features is obtained as y_i^n = f_θ^{k:L}(h_m^{k,n}(x_i, x'_i)). In order to generate

a useful virtual instance through InfoMixup, the policy generator network has a loss function that minimizes the negative value of the predictive entropy as in Eq.(13), and the predictive entropy is a component of f_acq, which provides the incentive

¹Throughout this paper, we denote the forward path from the i-th layer to the j-th layer of the classifier network as f_θ^{i:j}, where the 0-th is the input layer and the L-th is the output layer. Hence, f_θ = f_θ^{0:L}.

Figure 2: Overall algorithm of LADA and training process of data augmentation. (a) Overview of the LADA framework: f_acq considers both the unlabeled instances and the virtual instances generated from f_aug; moreover, f_aug is optimized with the feedback of f_acq. (b) Training process of the policy generator network, π_φ, in LADA with Max Entropy and Manifold Mixup: the parameters of the classifier network f_θ are fixed during the backpropagation for the policy generator network.

for the integration of acquisition and augmentation. The gradient of this loss function is calculated by averaging the N entropy values of the replicated mixed features. It should be noted that Eq.(13) embeds φ in the generation of h_m^{k,n}(x_i, x'_i), so the gradient can be estimated via Monte-Carlo sampling (Hastings 1970). Figure 2b illustrates the forward and the backward paths for the training process of the policy generator network.

    ∂L_π/∂φ = ∂/∂φ ( −(1/N) Σ_{n=1}^{N} f_acq(h_m^{k,n}(x_i, x'_i); f_θ^{k:L}) )    (13)
            = ∂/∂φ ( −(1/N) Σ_{n=1}^{N} H[y_i^n | h_m^{k,n}(x_i, x'_i); f_θ^{k:L}] )

In the backpropagation, we have a process of sampling λ_i's from the Beta distribution parameterized by τ_i. To enable the backpropagation signals to pass through, we follow the reparameterization technique of the optimal mass transport (OMT) gradient estimator, which utilizes implicit differentiation (Jankowiak and Obermeyer 2018; Jankowiak and Karaletsos 2019). Appendix B provides the details of our OMT gradient estimator in the backpropagation process.
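The sketch below illustrates one policy-generator update for a single pair, under stated assumptions: `f_lower`/`f_upper` stand in for f_θ^{0:k} and f_θ^{k:L} acting on flattened features, the policy MLP is a simplified stand-in for π_φ, and the Beta pathwise gradient uses PyTorch's built-in `rsample` rather than the paper's OMT estimator.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

def policy_step(x_i, x_j, f_lower, f_upper, policy, opt, n_samples=4):
    """One update of pi_phi (Eq. 9-13): maximize the mean predictive entropy
    of N mixed hidden features; only the policy's parameters are in `opt`."""
    with torch.no_grad():                       # Eq.(9): hidden features at layer k
        h_i, h_j = f_lower(x_i), f_lower(x_j)
    tau = nn.functional.softplus(policy(torch.cat([h_i, h_j], dim=-1))) + 1e-3  # Eq.(10)
    loss = 0.0
    for _ in range(n_samples):
        lam = Beta(tau, tau).rsample()          # Eq.(11); pathwise gradient w.r.t. tau
        h_mix = lam * h_i + (1 - lam) * h_j     # Eq.(12)
        probs = torch.softmax(f_upper(h_mix), dim=-1)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
        loss = loss - ent / n_samples           # Eq.(13): minimize negative entropy
    opt.zero_grad(); loss.backward(); opt.step()
    return tau.detach(), loss.item()
```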

Data Acquisition by Learned Policy After optimizing the mixing policy, τ*_i, for the i-th pair of unlabeled data instances, (x_i, x'_i), we calculate the joint acquisition score of the data pair by aggregating the individual acquisition scores of 1) x_i, 2) x'_i, and 3) their mixed feature maps, h_m^{k,n}(x_i, x'_i), as below:

    y_i = f_θ(x_i),   y'_i = f_θ(x'_i),   y_i^n = f_θ^{k:L}(h_m^{k,n}(x_i, x'_i))    (14)

    f_acq((x_i, x'_i); f_θ) = H[y_i | x_i; f_θ] + H[y'_i | x'_i; f_θ]
                              + (1/N) Σ_{n=1}^{N} H[y_i^n | h_m^{k,n}(x_i, x'_i); f_θ^{k:L}]    (15)

As we calculate the acquisition score by including the predictive entropy of the InfoMixup feature map, the acquisition is influenced by the data augmentation. More importantly, this integration is reciprocal because the optimal augmentation policy of InfoMixup comes from the acquisition score. This reciprocal relation is an example of how the LADA framework is motivated by overcoming the separation between the augmentation and the acquisition.

Taking InfoMixup as an example of LADA, we show that InfoMixup generates a virtual sample with a high predictive entropy in the class estimation, which can be regarded as lying in a decision boundary region that has not been clearly explored yet. The unexplored region is identified by the optimal policy τ* in the acquisition.

Here, we introduce a pipelined variant of f_integ to emphasize the worth of the integration. One possible variation is incorporating Mixup-based data augmentation and the acquisition function as a two-step model, where 1) the acquisition function selects the data instances whose individual scores are the highest, and 2) Mixup is afterward applied to the selected instances. However, this method may increase the criteria on individual data instances in the first and second terms of Eq.(15), but it may not optimize the criterion on their mixing process in the last term of Eq.(15), since it has not considered the effect of Mixup in the selection process. This may not enhance the informativeness of the virtual data instances. We compare this variation with LADA in the Experiments section.
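A minimal sketch of the pair scoring in Eq.(14)-(15), assuming the `entropy_score` helper from above and pre-computed mixed hidden features; selecting the top k/2 highest-scoring pairs greedily, as in the commented usage, is an assumption that approximates the argmax over subsets in the selection step below.

```python
import torch

def pair_score(x_i, x_j, h_mix_samples, f_theta, f_upper):
    """Eq.(14)-(15): entropy of each instance plus mean entropy of N mixed features."""
    p_i = torch.softmax(f_theta(x_i), dim=-1)
    p_j = torch.softmax(f_theta(x_j), dim=-1)
    p_m = torch.stack([torch.softmax(f_upper(h), dim=-1) for h in h_mix_samples])
    return (entropy_score(p_i) + entropy_score(p_j)
            + entropy_score(p_m).mean(dim=0))

# With a per-round budget of k oracle labels, greedily keep the k/2 pairs with the
# highest joint scores (`pairs` is a list of (x_i, x_j, h_mix_samples) candidates):
# scored = sorted(pairs, key=lambda p: pair_score(*p, f_theta, f_upper).item(), reverse=True)
# selected = scored[: budget // 2]
```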

Training Set Expansion through Acquisition

We assume that we start the j-th active learning iteration with an already acquired labeled dataset X_L^j. With the allowed budget per acquisition being k, we acquire the top k/2 pairs, i.e., X_S, among the subsets X'_S ⊂ X_U × X_U with |X'_S| = k/2:

    X_S = argmax_{X'_S ⊂ X_U × X_U} Σ_{(x_i, x'_i) ∈ X'_S} f_acq((x_i, x'_i); f_θ).    (16)

At this moment, the oracle annotates the true labels on X_S. Also, we have a virtual instance dataset, X_M, generated by InfoMixup with the optimal mixing policy, τ*:

    X_M = ∪_{(x_i, x'_i) ∈ X_S} { λ_i f_θ^{0:k}(x_i) + (1 − λ_i) f_θ^{0:k}(x'_i) },    (17)

where λ_i ∼ Beta(τ*_i, τ*_i). Here, τ* is dynamically inferred by the neural network π_φ for each pair.

Algorithm 1 LADA with Max Entropy and Manifold Mixup

Input: Labeled dataset X_L^0, Classifier f_θ

 1: for j = 0, 1, 2, . . . do                        ▷ active learning
 2:   Randomly sample X_U^pool ⊂ X_U
 3:   Get X_U'^pool, which is a randomly shuffled X_U^pool
 4:   Randomly choose the layer index k of f_θ
 5:   for i = 0, 1, 2, . . . do                      ▷ for the unlabeled instances
 6:     x_i ∈ X_U^pool, x'_i ∈ X_U'^pool
 7:     Get h^k(x_i), h^k(x'_i) of x_i, x'_i as Eq.(9)
 8:     for q = 0, 1, 2, . . . do                    ▷ training of π_φ
 9:       τ_i^q = π_φ([h^k(x_i); h^k(x'_i)])
10:       Calculate ∂L_π/∂φ as Eq.(13)
11:       φ ← φ − η_π ∂L_π/∂φ
12:       if L_π is minimal then
13:         τ*_i = τ_i^q
14:       end if
15:     end for
16:   end for
17:   Select and query the dataset, X_S, as Eq.(16)
18:   X_L^{j+1} = X_L^j ∪ X_S
19:   for t = 0, 1, 2, . . . do                      ▷ training of f_θ
20:     λ_i ∼ Beta(τ*_i, τ*_i) for (x_i, x'_i) ∈ X_S
21:     Get the virtual dataset, X_M, as Eq.(17)
22:     Calculate L_f as Eq.(18)–(19)
23:     θ ← θ − η_f ∂L_f/∂θ
24:   end for
25: end for

Up to this phase, our training dataset becomes X_L^{j+1} = X_L^j ∪ X_S, together with X_M. Our proposed algorithm, described in Algorithm 1, utilizes X_M for the current active learning iteration only, with various λ_i's sampled at each training epoch. The classifier network's parameters, θ, are learned via the gradient of the cross-entropy loss,

    L_f = CE(f_θ(x_i), y_i) for x_i ∈ X_L^{j+1}    (18)
        + CE(f_θ^{k:L}(x_i), y_i) for x_i ∈ X_M,    (19)

where y_i is the corresponding ground-truth label annotated by the oracle for Eq.(18), or the mixed label according to the mixing policy for Eq.(19).
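A sketch of the classifier update in Eq.(18)-(19), assuming soft (mixed) labels for the virtual features and reusing the `f_upper` stand-in for f_θ^{k:L} from the earlier sketch; the soft cross-entropy helper is an illustrative assumption.

```python
import torch

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a (possibly mixed) label distribution."""
    return -(target_probs * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def classifier_loss(x_labeled, y_onehot, h_mix, y_mix, f_theta, f_upper):
    """Eq.(18)-(19): CE on newly labeled real instances plus CE on the
    virtual layer-k features with their lambda-mixed labels."""
    loss_real = soft_cross_entropy(f_theta(x_labeled), y_onehot)   # Eq.(18)
    loss_virtual = soft_cross_entropy(f_upper(h_mix), y_mix)       # Eq.(19)
    return loss_real + loss_virtual
```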

LADA with Various Augmentation-Acquisition

Since we propose an integrated framework of acquisition function and data augmentation to look ahead to the informativeness of the data, we can use various acquisition functions and data augmentations in our LADA framework. For example, we may substitute Max Entropy, which is the feedback of the acquisition function to the data augmentation in InfoMixup, with another simple feedback, Var Ratio. Also, if we apply the VAAL acquisition function, LADA with VAAL trains the generator network, π_φ, to maximize the discriminator's indication on the unlabeled dataset, P(x ∈ X_U; ϕ_VAAL), for the generated instances.

Similarly, we may substitute the data augmentation of InfoMixup with Spatial Transformer Networks (STN) (Jaderberg et al. 2015), a.k.a. InfoSTN. The STN may be trained with a subset of unlabeled data as input to maximize their predictive entropy when propagated through the current classifier network. The score to pick the most informative data is formulated as f_acq^STN(x_i; f_θ) = H[y_i^STN | x_i^STN], where x_i^STN is the spatially transformed output of the data x_i, and y_i^STN is the corresponding prediction by the current classifier network. We provide more details in Appendix C.
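As a rough sketch of the InfoSTN idea (not the paper's exact architecture), a learnable affine transform can be optimized to raise the classifier's predictive entropy on a batch of unlabeled images; the affine parameterization, step count, and optimizer settings are assumptions.

```python
import torch
import torch.nn.functional as F

def info_stn_transform(x_unlabeled, f_theta, steps=20, lr=0.05):
    """Learn an affine spatial transform that maximizes predictive entropy
    of the transformed images; only the affine parameters are updated."""
    n = x_unlabeled.size(0)
    identity = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
    theta = identity.repeat(n, 1, 1).requires_grad_(True)      # per-image affine params
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        grid = F.affine_grid(theta, x_unlabeled.shape, align_corners=False)
        x_t = F.grid_sample(x_unlabeled, grid, align_corners=False)
        probs = torch.softmax(f_theta(x_t), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        loss = -entropy.mean()                                  # maximize entropy
        opt.zero_grad(); loss.backward(); opt.step()
    return x_t.detach(), entropy.detach()                       # f_acq^STN uses this entropy
```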

Experiments

Baselines and Datasets

This section denotes the proposed framework as LADA, and we specify the instantiated data augmentation and acquisition by a subscript, e.g., the proposed InfoMixup becomes LADA_EntMix, which adopts Max Entropy as the data acquisition and Mixup as the data augmentation to select and generate informative samples. If we change the entropy measure to the Var Ratio or the discriminator logits of VAAL, the subscript becomes VarMix or VaalMix, respectively. Also, if we change the augmentation policy to the STN network, the subscript becomes EntSTN.

We compare our models to 1) Coreset (Sener and Savarese 2018); 2) BADGE (Ash et al. 2020); and 3) VAAL (Sinha, Ebrahimi, and Darrell 2019) as the baselines for active learning. We also include data-augmented active learning methods: 1) BGADL, 2) Manifold Mixup, and 3) AdaMixup. Here, BGADL is an integrated data augmentation and acquisition method, but it should be noted that BGADL has no learning mechanism in the augmentation from the feedback of acquisition. We also add ablated baselines to see the effect of learning τ, so we introduce the fixed-τ case as LADA^fixed. The classifier network, f_θ, adopts ResNet-18 (He et al. 2016), and the policy generator network, π_φ, consists of a much smaller neural network. Appendix A.2 provides more details on the networks and their training.

We experiment with the above-mentioned models on three benchmark datasets: FashionMNIST (Fashion) (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011), and CIFAR-10 (Krizhevsky, Hinton et al. 2009). Throughout our experiments, we repeat each experiment five times to validate the statistical significance, and the maximum number of acquisition iterations is limited to 100. More details about the treatment of each dataset are in Appendix A.1.

We evaluate the models under the pool-based active learning scenario. We assume that the model starts with 20 training instances, which are randomly chosen and class-balanced. As the active learning iteration progresses, we acquire 10 additional training instances at each iteration, and we use the same amount of oracle queries for all models, which results in selecting the top-5 pairs when adopting Mixup as the data augmentation in the LADA framework.

Quantitative Performance Evaluations

Table 1 shows the average test accuracy, where the accuracy of each replication is the best accuracy over the acquisition iterations. Since we introduce a generalizable framework, Table 1 separates the performances by the instantiated acquisition functions. The group of baselines does not

| Method | Fashion | SVHN | CIFAR-10 | Time | Param. |
|---|---|---|---|---|---|
| *Baselines* | | | | | |
| Random | 80.96±0.62 | 73.92±2.80 | **35.27±1.36** | 1 | - |
| BALD | **80.99±0.59** | **75.66±2.07** | 34.71±2.28 | 1.36 | - |
| Coreset | 78.47±0.30 | 68.57±3.13 | 28.25±0.89 | 1.54 | - |
| BADGE | 80.94±0.98 | 70.89±1.91 | 28.60±1.17 | 1.31 | - |
| BGADL | 78.42±1.05 | 63.50±1.56 | 35.08±2.20 | 4.69 | 13M |
| *Entropy-based* | | | | | |
| Max Entropy | 80.93±1.85 | 72.57±0.76 | 34.97±0.71 | 1.01 | - |
| Ent w. Manifold Mixup | 82.31±0.38 | 72.69±1.29 | 35.88±0.85 | 1.03 | - |
| Ent w. AdaMixup | 81.30±0.83 | 73.00±0.39 | 35.67±1.75 | 1.03 | 5K |
| LADA^fixed_EntMix | 83.08±1.34 | 75.73±1.48 | 36.34±0.88 | 1.06 | - |
| LADA_EntMix | **83.67±0.29** | **76.55±0.31** | **37.04±1.34** | 1.32 | 77K |
| LADA^fixed_EntSTN | 82.37±0.58 | 72.08±1.67 | 35.55±1.34 | 1.02 | 5K |
| LADA_EntSTN | 81.83±0.55 | 73.80±0.81 | 36.18±0.69 | 1.20 | 5K |
| *VAAL-based* | | | | | |
| VAAL | **82.67±0.29** | 75.01±0.66 | 39.82±0.86 | 3.55 | 301K |
| LADA^fixed_VaalMix | 82.63±0.29 | 76.83±1.05 | 44.42±2.12 | 3.56 | 301K |
| LADA_VaalMix | 82.60±0.49 | **77.92±0.51** | **44.56±1.40** | 3.60 | 378K |
| *Var Ratio-based* | | | | | |
| Var Ratio | 81.05±0.18 | 74.07±1.87 | 34.99±0.73 | 1.01 | - |
| LADA^fixed_VarMix | 83.11±0.66 | 76.01±2.64 | 35.98±1.68 | 1.06 | - |
| LADA_VarMix | **84.47±0.89** | **76.09±0.94** | **36.84±0.51** | 1.33 | 77K |

Table 1: Comparison of test accuracy, the run-time of one iteration of acquisition (Time), and the number of parameters (Param.). The best performance in each category is indicated in boldface. The run-time is reported as a ratio to the Random acquisition. The number of parameters is only reported for the auxiliary network, and - indicates no auxiliary network in the method.

Figure 3: Test accuracy over the acquisition iterations on (a) Fashion, (b) SVHN, and (c) CIFAR-10.

have any learning mechanism on the acquisition metric, and this group has the worst performances. We suggest three acquisition functions to be adopted by our LADA framework: 1) the predictive entropy by the classifier, 2) the discriminator logits in VAAL, and 3) the classifier variation ratio. Given that VAAL uses a discriminator and a generator, the VAAL-based model has more parameters to optimize in terms of complexity, which provides an advantage on a complex dataset, such as CIFAR-10.

When we examine the general performance gains across datasets, we find the best performers to be LADA_VarMix on Fashion, and LADA_VaalMix on SVHN and CIFAR-10. In terms of the data augmentation, Mixup-based augmentation outperforms STN augmentation. As the dataset becomes complex, a greater performance gain is achieved by LADA_EntSTN on SVHN or CIFAR-10, compared to Fashion. Across the combinations of baselines and datasets, the integrations of augmentation and acquisition, a.k.a. the LADA variations, show the best performance in most cases. In terms of the ablation study, learning the data augmentation policy, τ, is meaningful because the learning case of LADA is better than the fixed case in 10 of the 12 variations of LADA. Figure 3 shows the convergence speed to the best test accuracy of each model. As the dataset becomes complex, the performance gain by LADA becomes apparent.

Additionally, we compare the integrated framework to the pipelined approach. Max Entropy does not have an augmentation part, so it becomes the simplest model. Then, Ent w. Manifold Mixup adds the Manifold Mixup augmentation, but it does not have a learning process on the mixing policy. Finally, Ent w. AdaMixup has a learning process on the mixing policy, but the learning is separated from the acquisition. These pipelined approaches show lower performances than the integrated cases of LADA.

Finally, as LADA is a generalizable framework that works with various acquisition and augmentation functions, Figure 4a and Figure 4b show the ablation studies on the instantiated LADA with the VAAL acquisition function and the STN augmentation function, respectively. The figures confirm the effects of both the integration and the learnable augmentation policy with feedback from the acquisition.

Figure 4: Ablation study of LADA with various acquisition and augmentation functions on the CIFAR-10 dataset: (a) LADA_VaalMix and (b) LADA_EntSTN.

Qualitative Analysis on Acquired Data Instances

Besides the quantitative comparison, we need reasoning on the behavior of LADA. Therefore, we select LADA_EntMix to contrast with the pipelined approach. We investigate 1) the informativeness of the data instances achieved by acquisition, 2) the validity of the optimal τ* in the augmentation learned from the policy generator network π_φ, and 3) the coverage of the explored space.

Figure 5: tSNE (Maaten and Hinton 2008) plots of acquired instances (★) and augmented instances (×), with entropy values, for (a) Max Entropy at an early iteration, (b) Max Entropy at a late iteration, (c) LADA_EntMix at an early iteration, and (d) LADA_EntMix at a late iteration. The numbers written in black indicate the predictive entropy of unlabeled data instances that were selected from the unlabeled pool. The numbers written in red indicate the maximum (*average) value of the predictive entropy of the virtual data instances that were generated by InfoMixup. The early and late acquisition iterations are 7 and 76, respectively.

To check the informativeness of the data instances, Figure 5 shows the different acquisition processes of Max Entropy and LADA_EntMix. Max Entropy selects a data instance with the highest predictive entropy value. Compared to Max Entropy, LADA_EntMix selects a pair of data instances with the highest value of the aggregated predictive entropy, which is the summation of the predictive entropy of the two data instances and one InfoMixup instance. By mixing two unlabeled data instances with the corresponding optimal mixing policy τ*, the virtual data instance, generated in the vicinal space, results in a high entropy value, which can be higher than that of the instance selected by Max Entropy. The virtual data instance helps the current classifier model clarify the decision boundary between two classes along the interpolation line of the two mixed real instances.

To confirm the validity of the optimal τ*, we compare three cases: 1) the inferred τ (LADA_EntMix); 2) the fixed τ (LADA^fixed_EntMix); and 3) the pipelined model's τ (Ent w. Manifold Mixup). Figure 6a shows the entropy of the virtual data instances over the acquisition process. As expected, the optimal τ* learned in LADA_EntMix produces the highest entropy over the acquisition process, but it should be noted that the differentiation becomes significant after some acquisition iterations, which comes from the requirement of training the classifier. Figure 6b shows the distribution of the entropy of the virtual instances, with the median value of each interval on the x-axis. This also shows that the optimal τ* has the highest density beyond the interval of the median 2.2.

To examine the coverage of the explored latent space, Figure 7 illustrates the latent space of the acquired data instances and the augmented data instances. Ent w. AdaMixup has a potential capability of interpolating distantly paired data instances, but its learned τ limits a sample of λ to be placed near either one of the paired instances because

Figure 6: Entropy values of the virtual data generated from LADA_EntMix, LADA^fixed_EntMix, and Ent w. Manifold Mixup: (a) mean entropy of the virtual data over the acquisition process and (b) number of virtual data instances per entropy interval.

of the aversion to manifold intrusion. Therefore, in the experiments, Ent w. AdaMixup ends up exploring the space near the acquired instances. The virtual data instances generated by LADA_EntMix show further exploration than Ent w. AdaMixup. The latent space makes the linear interpolation of LADA_EntMix curve along the manifold, but it keeps the interpolation line on the curved manifold. The extent of the interpolation is broader than AdaMixup because the optimal τ* is guided by the entropy maximization, which is adversarial in a sense. This adversarial approach is different from the aversion to manifold intrusion because the latter is more conservative with respect to the currently learned parameters.

Figure 7: tSNE plots of acquired data instances (★) and generated virtual data instances (×) for (a) LADA_EntMix and (b) Ent w. AdaMixup. The labels are indicated by colors.

Conclusions

In the real world, where gathering a large-scale labeled dataset is difficult because of constrained human or computational resources, learning deep neural networks requires an effective utilization of the limited resources. This limitation motivated the integration of data augmentation and active learning. This paper proposes a generalized framework for such integration, named LADA, which adaptively selects informative data instances by looking ahead to the acquisition scores of both 1) the unlabeled data instances and 2) the virtual data instances to be generated by data augmentation, in advance of the acquisition process. To enhance the effect of the data augmentation, LADA learns the augmentation policy to maximize the acquisition score. In repeated experiments on various datasets against the comparison models, LADA shows considerable performance by selecting and augmenting informative data instances. The qualitative analysis shows the distinct behavior of LADA, which finds the vicinal space of high acquisition score by learning the optimal policy.

Ethics Statement

In the real world, the limited amount of labeled data makes it hard to train deep neural networks, and the high cost of annotation becomes problematic. This leads to the decision of what to select and annotate first, which calls upon active learning. Besides active learning, effectively enlarging the limited amount of labeled data is also worth considering. With these motivations, we propose a framework that can adopt the various types of acquisitions and augmentations that exist in the machine learning field. By looking ahead to the effect of data augmentation in the process of acquisition, we can select data instances that are informative not only when selected and labeled but also when augmented. Moreover, by learning the augmentation policy in advance of the actual acquisition process, we enhance the informativeness of the generated virtual data instances. We believe that the proposed LADA framework can improve the performance of deep learning models, especially when annotation by human experts is expensive.

References

Ash, J. T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; and Agarwal, A. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In ICLR.

Chapelle, O.; Weston, J.; Bottou, L.; and Vapnik, V. 2001. Vicinal risk minimization. In Advances in Neural Information Processing Systems, 416–422.

Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research 4: 129–145.

Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 113–123.

Freeman, L. 1965. Elementary Applied Statistics: For Students in Behavioral Science. Wiley. URL https://books.google.co.kr/books?id=r4VRAAAAMAAJ.

Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, 2672–2680. Cambridge, MA, USA: MIT Press.

Guo, H.; Mao, Y.; and Zhang, R. 2019. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3714–3722.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Houlsby, N.; Huszar, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian Active Learning for Classification and Preference Learning. CoRR abs/1112.5745.

Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2017–2025.

Jankowiak, M.; and Karaletsos, T. 2019. Pathwise Derivatives for Multivariate Distributions. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, 333–342. PMLR. URL http://proceedings.mlr.press/v89/jankowiak19a.html.

Jankowiak, M.; and Obermeyer, F. 2018. Pathwise Derivatives Beyond the Reparameterization Trick. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, 2240–2249. PMLR. URL http://proceedings.mlr.press/v80/jankowiak18a.html.

Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings. URL http://arxiv.org/abs/1312.6114.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.

Liu, B.; and Ferrari, V. 2017. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 4363–4372.

Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov): 2579–2605.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning.

Perez, L.; and Wang, J. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.

Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.

Settles, B. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.

Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3): 379–423.

Sinha, S.; Ebrahimi, S.; and Darrell, T. 2019. Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, 5972–5981.

Tong, S. 2001. Active Learning: Theory and Applications, volume 1. Stanford University, USA.

Tran, T.; Do, T.; Reid, I. D.; and Carneiro, G. 2019. Bayesian Generative Active Deep Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, volume 97 of Proceedings of Machine Learning Research, 6295–6304. PMLR. URL http://proceedings.mlr.press/v97/tran19a.html.

Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Courville, A.; Mitliagkas, I.; and Bengio, Y. 2018. Manifold mixup: Learning better representations by interpolating hidden states.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.