
Sequential Graph Convolutional Network for Active Learning

Razvan Caramalau,1 Binod Bhattarai,1 Tae-Kyun Kim1
1Imperial College London

    {r.caramalau18, b.bhattarrai, tk.kim}@imperial.ac.uk

    Abstract

We propose a novel and generic sequential Graph Convolution Network (GCN) training for pool-based Active Learning. Each of the unlabelled and labelled examples is represented as a node of a graph and their similarities as edges. With the available few labelled examples as seed annotations, the parameters of the graph are optimised to minimise the binary cross-entropy loss for identifying labelled vs unlabelled nodes. Based on the confidence scores of the nodes in the graph, we sub-sample unlabelled examples to annotate where inherited uncertainties correlate. With the newly annotated examples along with the existing ones, the parameters of the graph are learned to minimise the modified objective. We evaluate our method on four publicly available image classification benchmarks and compare it with several existing arts such as VAAL, Learning Loss and CoreSet. Our method outperforms these existing methods and attains a new state-of-the-art. The implementations of this paper can be found here: https://github.com/razvancaramalau/Sequential-GCN-for-Active-Learning

Introduction
Deep learning has shown great advancements in computer vision (He et al. 2016; Krizhevsky, Sutskever, and Hinton 2012) due to the availability of both powerful computing infrastructure and large-scale annotated data sets. Data annotation is a time-consuming task, needs experts and is also expensive. This gets even more challenging in specific domains such as medical image analysis. Moreover, while optimizing deep neural network architectures, a gap remains concerning the representativeness of the data. To overcome these issues, active learning (Gal, Islam, and Ghahramani 2017; Kontorovich, Sabato, and Urner 2016) has been successfully deployed to efficiently select the most meaningful and representative samples.

In this paper, we propose a novel, generic and task-agnostic sequential Graph Convolutional Network (GCN) for Active Learning. The majority of previous works (Cortes and Vapnik 1995; Gal, Islam, and Ghahramani 2017; Beluch Bcai, Nürnberger, and Bcai 2018; Abel and Louzoun 2019; Wu et al. 2019; Cai, Zheng, and Chen-Chuan Chang 2017; Gao et al. 2018) couple the selection method and the learner, the two major components of the active learning framework, together. This design limits the model to being optimal only for the specific type of task under consideration.

In comparison to these works, we propose to train the framework in a task-agnostic manner similar to that of the recent works on VAAL (Sinha, Ebrahimi, and Darrell 2019) and Learning Loss (Yoo and Kweon 2019).

Figure 1 shows the pipeline of the proposed method. We propose to implement the learner and the sampler by a standard Convolutional Neural Network (CNN) and a Graph Convolutional Network (GCN) (Kipf and Welling 2017; Bronstein et al. 2017), respectively. The learner is trained to minimise the objective of the downstream task with the available label-queried examples. At the initial stage, we take very few annotated examples as seed examples. We represent both the labelled and unlabelled data in the form of a graph. Each node in the graph represents an image and the edges between them encode their relations. Initial representations of the nodes are extracted from the learner. We convolve on the graph to induce higher-order representations by performing message passing operations between the nodes. The new representations depend on how strongly a node is connected to the other labelled and unlabelled images. We learn the parameters of the graph to classify the nodes as labelled or unlabelled. As this objective is independent of the downstream (learner) task, our method is task-agnostic. We sort the examples based on the confidence score of a node being a labelled one, apply an uncertainty sampling approach, and send the selected examples to query their labels. In the following iteration, we flip the label of annotated examples from unlabelled to labelled. With these new annotations, we train the framework in an end-to-end fashion. Recently, VAAL (Sinha, Ebrahimi, and Darrell 2019) trained a variational auto-encoder (VAE) (Pu et al. 2016) in an adversarial manner to map images to a latent space where labelled and unlabelled examples are better discriminated. However, this method does not take into account the relative neighbourhood relationship between the examples. Also, there is no such mechanism for message passing between the examples. Furthermore, we adapt the node information under CoreSet (Sener and Savarese 2018) for a new sampling technique by introducing latent space distancing. In principle, CoreSet (Sener and Savarese 2018) uses risk minimisation between core-sets over the learner feature space while ours employs GCN layers over them. Thus, our method has a clear advantage due to the aforementioned strengths of the GCN, which is demonstrated by both the quantitative and qualitative experiments (see the Experiments section).


Active learning for GCNs (Cai, Zheng, and Chen-Chuan Chang 2017; Gao et al. 2018; Abel and Louzoun 2019; Wu et al. 2019) has also been getting attention. However, these methods are constrained to a GCN learner architecture, whilst our setup makes the framework more versatile.

We evaluated our sampling methods on four challenging benchmarks and compared them with several baselines and existing state-of-the-art methods, including CoreSet and VAAL. From both the quantitative and the qualitative comparisons, our method is more accurate and efficient than existing methods.

Related Works
Model-based methods. Recently, a new category of methods has been explored in the active learning paradigm where a model different from the learner is trained for selecting a subset of the most representative data. Our method falls into this category. One of the first approaches (Yoo and Kweon 2019) attached a loss-learning module so that the loss can be predicted offline for the unlabelled samples. In (Sinha, Ebrahimi, and Darrell 2019), another task-agnostic solution deploys a variational auto-encoder (VAE) to map the available data onto a latent space. A discriminator is then trained in an adversarial manner to classify labelled from unlabelled. The advantage of our method over this approach is the exploitation of the relative relationships between the examples by sharing information through message passing operations in the GCN.
GCNs in active learning. GCNs (Kipf and Welling 2017) have opened up new active learning methods that have been successfully applied in (Abel and Louzoun 2019; Wu et al. 2019; Cai, Zheng, and Chen-Chuan Chang 2017; Gao et al. 2018). In comparison to these methods, our approach has a distinct learner and sampler. This makes our approach task-agnostic and also lets it benefit from the model-based methods mentioned above. Moreover, none of these methods is trained in a sequential manner. (Wu et al. 2019) proposes K-Medoids clustering for feature propagation between the labelled and unlabelled nodes. A regional uncertainty algorithm is presented in (Abel and Louzoun 2019) by extending the PageRank (Page et al. 1999) algorithm to the active learning problem. Similarly, (Cai, Zheng, and Chen-Chuan Chang 2017) combines node uncertainty with graph centrality for selecting the new samples. A more complex algorithm is introduced in (Gao et al. 2018) where a reinforcement scheme with multi-armed bandits decides between the three query measurements from (Cai, Zheng, and Chen-Chuan Chang 2017). However, these works (Cai, Zheng, and Chen-Chuan Chang 2017; Gao et al. 2018; Wu et al. 2019) derive their selection mechanism under the assumption of a graph learner, which makes them not directly comparable with our proposal, where a GCN is trained separately with a different objective function than the learner. Moreover, our input nodes together with the graph structure change dynamically at every query stage, whilst the mentioned works rely on fixed pre-computed features.
Uncertainty-based methods. Earlier techniques for sampling unlabelled data have been explored through uncertainty exploration of the CNN.

A Bayesian approximation introduced in (Gal and Ghahramani 2016) produces meaningful uncertainty measurements by variational inference of a Monte Carlo Dropout (MC Dropout) adapted architecture. Hence, it has been successfully integrated into active learning by (Gal, Islam, and Ghahramani 2017; Houlsby et al. 2011; Kirsch, Van Amersfoort, and Gal 2019; Pinsler et al. 2019). With the rise of GPU computation power, the ensemble-based method of (Beluch Bcai, Nürnberger, and Bcai 2018) outperformed MC Dropout.
Geometric-based methods. Although there have been studies exploring the data space through the representations of the learning model (Kontorovich, Sabato, and Urner 2016; Tsang, Kwok, and Cheung 2005; Har-Peled and Kushal 2005), the first work applying it to CNNs as an active learning problem, CoreSet, was presented in (Sener and Savarese 2018). The key principle depends on minimising the difference between the loss of a labelled set and that of a small unlabelled subset through a geometrically defined bound. We include this baseline in our experiments as it has successfully surpassed the uncertainty-based ones.

Method
In this section, we describe the proposed methodology in detail. We have a scenario where there is plenty of unlabelled data and a limited budget to annotate the examples. The objective is to label, within this budget, the most representative examples that yield the most generalisable model parameters on unseen examples. An active learning framework incrementally selects the training examples. At a high level, the active learning framework has three major components: a learning algorithm, a labelling oracle and a sampling/selection mechanism. First, we briefly present the learner for the image classification task under the pool-based active learning scenario. In the second part, we discuss our contribution of a novel sequential GCN selection mechanism adapted with two of the best sampling techniques to date: CoreSet (Sener and Savarese 2018) and uncertainty sampling (Lewis and Gale 1994). We define these query methods as CoreGCN, the geometric approach inspired by (Sener and Savarese 2018), and the uncertainty-based one as UncertainGCN. A visual description of our proposed pipeline is illustrated in Figure 1.

Learner
In our application, the learner acts as an image classifier where C classes are identified. We deploy a deep CNN model M, ResNet-18 (He et al. 2016) or VGG-11 (Simonyan and Zisserman 2015), that maps a set of inputs x ∈ X to a discriminatory space of outputs y ∈ Y with parameters θ. However, as mentioned before, any other type of discriminative model can be deployed as the objective function of the learner is different from that of the sampler. A loss function L(x, y; θ) is minimized during the training process. The objective function of our classifier is the cross-entropy defined below:

$$\mathcal{L}_M(x, y; \theta) = -\frac{1}{N_l}\sum_{i=1}^{N_l} y_i \log\big(f(x_i, y_i; \theta)\big), \qquad (1)$$

Figure 1: Schematic diagram showing the proposed pipeline. Here, the image classifier depicts our learner and the GCN block represents the selection framework (a GCN label classifier producing per-node confidence scores, from which UncertainGCN or CoreGCN select indices to label for the oracle). For sanity, we are not showing all the edges between the nodes in the graph.

where Nl is the number of labelled training examples and f(xi, yi; θ) is the posterior probability of the model M.

Also successfully adopted in recent works (Sinha, Ebrahimi, and Darrell 2019; Yoo and Kweon 2019), the pool-based scenario has become a standard in deep learning systems. Therefore, we consider an unlabelled dataset DU from which we randomly select an initial batch for labelling, D0 ⊂ DU. Without loss of generality, in active learning research, the aim is to identify an acquisition function A with which a minimum loss can be achieved with the least number of batches Dn. This scope can be simply defined for n active learning stages as follows:

$$\min_{n}\; \min_{\mathcal{L}_M}\; \mathcal{A}\big(\mathcal{L}_M(x, y; \theta) \,\big|\, D_0 \subset \cdots \subset D_n \subset D_U\big). \qquad (2)$$

We want to minimise the number of stages so that fewer samples (x, y) require annotation. For the acquisition function A, we bring in the heuristic relation between the discriminative understanding of the model and the unlabelled data space. This is quantified by a performance metric and traced at every querying stage.

Sequential GCN selection process
During sampling, our contribution relies on sequentially training a GCN, initialised with the features extracted from the learner for both labelled and unlabelled images, at every active learning stage. As stated before, similar to VAAL (Sinha, Ebrahimi, and Darrell 2019), we consider this methodology as model-based, where a separate architecture is required for sampling. Our motivation in introducing the graph is primarily to propagate the uncertainty inherited from the learner feature space between the samples (nodes). Thus, message passing between the nodes after applying convolution on the graph induces higher-order representations of the nodes. Finally, our GCN acts as a binary classifier deciding which images are annotated.

Graph Convolutional Network. The key components of a graph G are the nodes, also called vertices V, and the edges E. The edges capture the relationships between the nodes and are encoded in an adjacency matrix A. The nodes v ∈ R^(m×N) of the graph encode image-specific information and are initialised with the features extracted from the learner. Here, N represents the total number of both labelled and unlabelled examples, while m represents the dimension of the features of each node. After we apply l2 normalisation to the features, the initial elements of A result from the vector product between every pair of samples of v, i.e. Sij = vi^T vj, {i, j} ∈ N. Furthermore, we subtract the identity matrix I from S and then normalise with the degree matrix D. Finally, we add the self-connections back so that the closest correlation is with the node itself. This can simply be summarised as:

$$A = D^{-1}(S - I) + I. \qquad (3)$$
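A minimal sketch of this adjacency construction is given below (PyTorch-style; the function name `build_adjacency` and the choice of taking the degree of (S − I) are our assumptions, since the released code may handle the normalisation details differently):

```python
import torch
import torch.nn.functional as F

def build_adjacency(features: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper for Equation 3: A = D^{-1}(S - I) + I.

    `features` is an (N, m) matrix of node features taken from the learner.
    """
    v = F.normalize(features, p=2, dim=1)        # l2-normalise each node feature
    s = v @ v.t()                                # S_ij = v_i^T v_j (cosine similarities)
    eye = torch.eye(s.size(0), device=s.device)
    s_no_self = s - eye                          # subtract the identity matrix
    deg = s_no_self.sum(dim=1, keepdim=True)     # per-node degree of (S - I)
    a = s_no_self / deg.clamp(min=1e-8)          # D^{-1}(S - I)
    return a + eye                               # add the self-connections back
```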

To avoid over-smoothing of the features in the GCN (Kipf and Welling 2017), we design a two-layer architecture. The first GCN layer can be described as a function f1_G(A, V; Θ1): R^(N×N) × R^(m×N) → R^(h×N), where h is the number of hidden units and Θ1 are its parameters. A rectified linear unit activation (Nair and Hinton 2010) is applied after the first layer to maximise feature contribution. However, to map the nodes as labelled or unlabelled, the final layer is activated through a sigmoid function. Thus, the output of fG is a vector of length N with values between 0 and 1 (where 0 is considered unlabelled and 1 labelled). We can further define the entire network function as:

$$f_{\mathcal{G}} = \sigma\big(\Theta_2\,(\mathrm{ReLU}(\Theta_1 A)\, A)\big). \qquad (4)$$
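The two-layer GCN can be sketched as a small PyTorch module. The hidden size (128) and dropout rate (0.3) follow the hyper-parameters reported in the Experiments section, while the class name and the decision to return the first-layer graph features (reused later by CoreGCN) are our assumptions rather than the paper's released implementation:

```python
import torch
import torch.nn as nn

class GCNSampler(nn.Module):
    """Two-layer GCN binary (labelled vs. unlabelled) classifier, a sketch of Equation 4."""

    def __init__(self, in_dim: int = 512, hidden_dim: int = 128, dropout: float = 0.3):
        super().__init__()
        self.theta1 = nn.Linear(in_dim, hidden_dim, bias=False)   # Theta_1
        self.theta2 = nn.Linear(hidden_dim, 1, bias=False)        # Theta_2
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        # First graph convolution: propagate the learner features over A, then ReLU.
        h = torch.relu(adj @ self.theta1(x))
        h = self.dropout(h)
        # Second graph convolution followed by a sigmoid: one confidence score per node.
        scores = torch.sigmoid(adj @ self.theta2(h)).squeeze(-1)
        return scores, h   # scores in [0, 1]; h are the first-layer graph features
```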

In order to satisfy this objective, our loss function is defined as:

$$\mathcal{L}_{\mathcal{G}}(V, A; \Theta_1, \Theta_2) = -\frac{1}{N_l}\sum_{i=1}^{N_l}\log\big(f_{\mathcal{G}}(V, A; \Theta_1, \Theta_2)_i\big) - \frac{\lambda}{N - N_l}\sum_{i=N_l+1}^{N}\log\big(1 - f_{\mathcal{G}}(V, A; \Theta_1, \Theta_2)_i\big), \qquad (5)$$

where λ acts as a bias between the labelled and unlabelled cross-entropy.
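A minimal sketch of this weighted binary cross-entropy, assuming the node scores are ordered with the Nl labelled nodes first; the function name and the numerical clamping are ours, and λ = 1.2 is the value used in the experiments:

```python
import torch

def gcn_loss(scores: torch.Tensor, n_labelled: int, lam: float = 1.2) -> torch.Tensor:
    """Binary cross-entropy of Equation 5 over per-node sigmoid scores."""
    eps = 1e-8
    labelled, unlabelled = scores[:n_labelled], scores[n_labelled:]
    loss_labelled = -torch.log(labelled.clamp(min=eps)).mean()              # labelled nodes pushed towards 1
    loss_unlabelled = -torch.log((1.0 - unlabelled).clamp(min=eps)).mean()  # unlabelled nodes pushed towards 0
    return loss_labelled + lam * loss_unlabelled
```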

Uncertainty sampling for GCN. Once the training of the GCN is complete, we move forward to the sampling stage. From the remaining unlabelled samples DU, we can draw their confidence scores fG(vi; DU) as outputs of the GCN. Similarly to uncertainty sampling, we propose to select with our method, UncertainGCN, the unlabelled images according to their confidence relative to a margin variable smargin. While querying a fixed number of b points for a new subset DL, we apply the following equation:

$$D_L = D_L \cup \operatorname*{arg\,max}_{i=1\cdots b}\, \big|\, s_{margin} - f_{\mathcal{G}}(v_i; D_U)\, \big|. \qquad (6)$$

This stage is repeated until Equation 2 is satisfied. In the experiment section, we also analyse the choice of smargin, while the confident unlabelled scores should be closer to 0. Algorithm 1 summarises the GCN sequential training with the UncertainGCN sampling method.

Algorithm 1: UncertainGCN active learning algorithm
1: Given: initial labelled set D0, unlabelled set DU and query budget b
2: Initialise (xL, yL), (xU) ▷ labelled and unlabelled images
3: repeat
4:   θ ← f(xL, yL) ▷ Train learner with labelled data
5:   V = [vL, vU] ← f(xL ∪ xU; θ) ▷ Extract features for labelled and unlabelled data
6:   Compute adjacency matrix A according to Equation 3
7:   Θ ← fG(V, A) ▷ Train the GCN
8:   for i = 1 → b do
9:     DL = DL ∪ arg max_i |smargin − fG(v; DU)| ▷ Add nodes depending on the label confidence
10:  end for
11:  Label yU given the new DL
12: until Equation 2 is satisfied
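Steps 8 to 10 of Algorithm 1 amount to picking the b unlabelled nodes whose GCN confidence is furthest from smargin (Equation 6). A minimal sketch, where the function name and argument layout are our assumptions:

```python
import torch

def uncertaingcn_select(scores: torch.Tensor, unlabelled_idx: list,
                        budget: int, s_margin: float = 0.1) -> list:
    """Return the indices of the b unlabelled nodes maximising |s_margin - f_G(v_i)|."""
    margins = (s_margin - scores[unlabelled_idx]).abs()       # Equation 6 per unlabelled node
    picked = torch.topk(margins, k=budget).indices.tolist()   # b largest margins
    return [unlabelled_idx[i] for i in picked]
```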

CoreSet sampling for GCN. To integrate geometric information between the labelled and unlabelled graph representations, we adopt a CoreSet technique (Sener and Savarese 2018) in our sampling stage. This has shown better performance in comparison to uncertainty-based methods (Wu et al. 2019). (Sener and Savarese 2018) shows how bounding the difference between the loss of the unlabelled samples and that of the labelled ones is similar to the k-Centre minimisation problem stated in (Wolf 2011).

In this approach, the sampling is based on the l2 distances between the features extracted from the trained classifier. Instead of that, we make use of our GCN architecture by applying the CoreSet method on the features represented after the first layer of the graph. The sampling method is adapted to our mechanism for each of the b data points under the equation:

$$D_L = D_L \cup \operatorname*{arg\,max}_{i \in D_U}\, \min_{j \in D_L}\, \delta\big(f^1_{\mathcal{G}}(A, v_i; \Theta_1),\, f^1_{\mathcal{G}}(A, v_j; \Theta_1)\big), \qquad (7)$$

where δ is the Euclidean distance between the graph features of the unlabelled node vi and those of the labelled node vj. We define this method as CoreGCN.
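Equation 7 corresponds to a greedy k-Centre selection on the first-layer graph features. A minimal sketch under that reading, with function and variable names being our assumptions:

```python
import torch

def coregcn_select(graph_feats: torch.Tensor, labelled_idx: list,
                   unlabelled_idx: list, budget: int) -> list:
    """Greedily pick b unlabelled nodes maximising the distance to their nearest labelled node."""
    picked, centres, pool = [], list(labelled_idx), list(unlabelled_idx)
    for _ in range(budget):
        # Euclidean distance from every candidate node to its nearest centre.
        d = torch.cdist(graph_feats[pool], graph_feats[centres])
        nearest = d.min(dim=1).values
        best = pool[int(nearest.argmax())]
        picked.append(best)
        centres.append(best)          # the chosen node becomes a new centre
        pool.remove(best)
    return picked
```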

Finally, given the model-based mechanism, we claim that our acquisition functions are task-agnostic as long as the learner produces some form of feature representation. In the following section, we experimentally demonstrate the performance of our method quantitatively and qualitatively.

Experiments
Implementation details. For most of the experiments, we set ResNet-18 (He et al. 2016) as the learner due to its high accuracy and better training stability in comparison to other contemporary architectures. We use a batch size of 128. We optimise with Stochastic Gradient Descent (SGD) with a weight decay of 5 × 10^-4 and momentum of 0.9. At every selection stage, we train the model for 200 epochs, starting with a learning rate of 0.1 and decreasing it to 0.01 after 160 epochs. We keep these hyper-parameters the same throughout all experiments. Similarly, our acquisition function (sampler) is approximated by the GCN trained sequentially. We choose 2 layers for the GCN and set the dropout to 0.3 to avoid over-smoothing (Zhao and Akoglu 2020). The objective function is binary cross-entropy per node. We set the value of λ = 1.2 to give more importance to the unlabelled points. For optimisation, we chose Adam (Kingma and Lei Ba 2015) with a weight decay of 5 × 10^-4 and a learning rate of 10^-3. The nodes of the graph are initialised with the features of the images extracted from the learner network with the latest model parameters, and these features are projected to a dimension of 512. Such initialisation connects the learner with the acquisition function (sampler). We performed ablation studies to select these hyper-parameters, which we present shortly. In the uncertainty-based sampling method, according to Equation 6, we have the smargin variable to set. From our ablation studies, we observed that our method selects the most informative samples when smargin equals 0.1. This is intuitively correct, as unlabelled samples with scores further from 0.1 will be selected.
Compared methods. We compare the performance of our selection mechanism against the following comparable and state-of-the-art techniques.
Random: For each subset, we uniformly sample b points from the entire unlabelled data set. This is the most common practice.
CoreSet (Sener and Savarese 2018): The purpose of this method is to find b unlabelled samples that will act as new centres while the now reduced radius covers the entire feature space of the learner.

VAAL (Sinha, Ebrahimi, and Darrell 2019): We consider this work as one of the closest to ours. This method trains a VAE with both labelled and unlabelled samples to learn a discriminative subspace in an adversarial manner.
Learning Loss (Yoo and Kweon 2019): This is a state-of-the-art method to date. It proposes to minimise a learning loss as an additional loss in the learner.
FeatProp (Wu et al. 2019): We selected this GCN-based algorithm for comparison because it achieves the best performance on citation datasets against recent active learning frameworks for graphs (Cai, Zheng, and Chen-Chuan Chang 2017; Abel and Louzoun 2019; Gao et al. 2018). FeatProp first computes a distance matrix from the input features and then applies k-Means clustering on top of it. Consequently, it selects the unlabelled samples closest to the centroids. Because this method expects pre-defined input features, in our scenario we deploy it purely as a sampling technique. Furthermore, at every selection stage, we generate new features with the ResNet-18 model.
Datasets and experimental settings. We evaluated our method together with the others on four challenging image classification benchmarks: CIFAR-10 (Krizhevsky 2012), CIFAR-100 (Krizhevsky 2012), FashionMNIST (Xiao, Rasul, and Vollgraf 2017) and SVHN (Goodfellow et al. 2013). Each of the datasets has different properties and presents new challenges for the active learning framework. CIFAR-10 consists of 50,000 images for training and 10,000 for testing. There are 5,000 samples for each of the 10 object categories. CIFAR-100 is constructed in a similar fashion with the same size of training and testing set. The difference lies in the granularity of the data distribution, as 100 classes are categorised (500 images corresponding to each class). The SVHN dataset represents 10 digit classes in 73,257 train images and 26,032 test images. Finally, FashionMNIST contains training and testing sets of size 60,000 and 10,000, respectively, with annotations of 10 clothing designs.

For every dataset, we assume that all training images are unlabelled (DU) in the beginning. The initial labelled set DL is formed of 1,000 randomly sampled images for the CIFAR-10, SVHN and FashionMNIST experiments. Because of the fine-grained structure of CIFAR-100, 2,000 images are selected for labelling during the first active learning stage. We conduct our experiments over 10 stages, while at every stage the same number of queries is produced by the proposed acquisition functions (b = 1,000 points for the 10-class datasets and 2,000 for CIFAR-100). Similarly to the works in (Beluch Bcai, Nürnberger, and Bcai 2018; Yoo and Kweon 2019), before selection, we randomly create a subset DS ⊂ DU from the unlabelled images to avoid the redundancy common to those datasets. The size of DS is 10,000 for all the active learning stages in every experiment.
Evaluation Metrics. We report the mean average accuracy of 5 trials on the test sets for every dataset.
Quantitative comparisons. Firstly, we evaluated ResNet-18 for all the benchmarks on their entire data. We obtain for CIFAR-10 and CIFAR-100 a classification rate of 93.09% and 73.02%, respectively. For the other two datasets, our learner yielded 93.74% on FashionMNIST, whereas on SVHN we obtain 95.35%.

Figure 2: Quantitative comparison on CIFAR-10 (left) and CIFAR-100 (right); testing accuracy (mean of 5 trials) vs. number of labelled images for Random, UncertainGCN, VAAL, Learning Loss, CoreSet, CoreGCN and FeatProp. (Zoom in the view)

Figure 3: Quantitative comparison on FashionMNIST (left) and SVHN (right); testing accuracy (mean of 5 trials) vs. number of labelled images for the same methods. (Zoom in the view)

Figure 2 (left) shows the performance of UncertainGCN and CoreGCN against the other existing arts on the CIFAR-10 dataset under the specified experiment settings. The solid line of the representation is the mean average accuracy, while the faded colour shows the standard deviation over the 5 trials. Our GCN-based mechanism with the two sampling techniques surpasses random sampling and VAAL at every selection stage by at least 4%. After 10,000 labelled examples, CoreGCN achieves state-of-the-art with 90.7%, the highest reported value in the literature (Yoo and Kweon 2019; Sinha, Ebrahimi, and Darrell 2019) under this configuration. Reaching this performance with a fifth of DU raises an important question for the research community: whether labelling 40,000 more images is needed for a 2.4% gain. UncertainGCN comes close to CoreSet on the testing set, but it is limited in later stages due to selection from confused feature-space areas. This aspect is demonstrated in the qualitative analysis.

Following the quantitative study on CIFAR-100 (Figure 2, right), we show that our proposed methods can scale to a more diverse dataset while maintaining their state-of-the-art results. With 40% of the data labelled, we achieve 69% accuracy by applying CoreGCN (4% less than with the entire dataset). Once again, the GCN's capacity to pass information within the feature domain further helps the active learning sampling. Compared to CIFAR-10, VAAL shows a better trend against random sampling, but it still falls below our performance. The reason is that the VAE might require a larger batch of queries (1,000 more) to differentiate. On the other hand, although Learning Loss outperforms VAAL on the other datasets, for CIFAR-100 the mean average performance is similar.

Figure 4: Exploration comparison on CIFAR-10 between CoreSet and UncertainGCN; t-SNE plots of the CIFAR-10 feature distribution with the selected samples at the first and fourth selection stage. (Zoom in the view)

We continue the image classification analysis on FashionMNIST and SVHN by illustrating performance progressions in Figure 3. The challenge in these datasets lies in the high accuracy already obtained from the initial random set D0. This happens due to ResNet-18's discriminatory capability and because of the saturation within the data space. In the FashionMNIST experiment, apart from VAAL, Learning Loss and FeatProp, all of the selection methods follow similar curves, surpassing random sampling at every query stage. We show that there are no obvious differences between the UncertainGCN and CoreGCN performances for greyscale data, especially with low-complexity features. We notice the same behaviour on the SVHN dataset. The mean average accuracy plateaus after 6,000 labelled points at around 92.5% for CoreGCN, while CoreSet and UncertainGCN follow up in the next stages. In this set of experiments, we observe that our GCN selection approach exceeds existing model-based methods like VAAL and Learning Loss in terms of mean average accuracy.

Across all datasets, we observe that the FeatProp method, designed for GCN learners, is not well suited to image classification tasks. The accuracy shows promising results in the first selection stages, but then it gains minimal improvement on CIFAR-10, CIFAR-100 and FashionMNIST, falling behind random sampling. A similar plateauing effect can be noticed in (Wu et al. 2019) on the PubMed dataset. The overall comparisons demonstrate the clear superior performance of the proposed method compared to the recent arts in active learning.
Qualitative comparisons. We assume that, ideally, the selection method aims to sample uniformly from all the classes at the initial annotation stages. Once the learner becomes more confident, the acquisition function should explore more samples from uncertain areas. Because in our active learning framework the exploitation factor is fixed to either 1,000 or 2,000, we observe the exploration aspect through t-SNE (van der Maaten and Hinton 2008) representations. We compute embeddings of the ResNet-18 CIFAR-10 features from both labelled and unlabelled samples for the first and fourth query stage.

In Figure 4, the t-SNE algorithm clusters the images of the 10 categories for both CoreSet and UncertainGCN. From an exploration perspective, UncertainGCN targets out-of-distribution samples and areas where features are hardly distinguished. We select nodes with label confidence values higher than smargin = 0.1. The reason is to select the least confident unlabelled samples, which are found closer to 1. Furthermore, the uncertainty is inherited by the GCN binary classifier through its inability to correlate the multi-class nodes with the annotation information.

Figure 5: Exploration comparison on CIFAR-10 between CoreSet and CoreGCN; t-SNE plots of the CIFAR-10 feature distribution with the selected samples at the first and fourth selection stage. (Zoom in the view)

Figure 6: UncertainGCN tuning parameters; testing accuracy on CIFAR-10 (mean of 5 trials) when varying the confidence margin smargin (left: 0.05, 0.1, 0.2, 0.3, 0.4) and the GCN loss bias λ (right: 0.6, 0.8, 1, 1.2, 1.4). (Zoom in the view)

As mentioned in the quantitative part, Figure 4 (right) shows the concentrated uncertain areas of this method. Since at the first stages the features are concentric, the GCN sparsely spreads the messages between the nodes. In Figure 5, we continue the qualitative investigation for the CoreGCN acquisition method. We can observe that, starting from the first selection stage, CoreGCN avoids over-populated areas while tracking out-of-distribution unlabelled data. Compared to UncertainGCN, the geometric information of CoreGCN maintains sparsity throughout all the acquisition stages. Consequently, it preserves the message passing through the uncertain areas while CoreSet keeps sampling closer to cluster centres. As more samples become available, this brings a stronger balance between in-distribution and out-of-distribution selection than CoreSet.
Comparison with CoreSet (Sener and Savarese 2018). Concluding from the qualitative analysis, we can state that the features distributed through GCN training tend to get propagated into uncertain areas when new samples are selected with CoreGCN. However, this mechanism is not present in the CoreSet method, which makes it more sensitive to the feature representations. Therefore, the accuracy of CoreSet might decrease when changing the learner, whilst our method maintains robustness through GCN re-training (see Figure 7).
Ablation studies. In this part, we investigate the impact of the GCN hyper-parameters and its uncertainty-based selection variables on the performance. Furthermore, given the current active learning framework, we observe the effect of using different features by changing the learner's architecture from ResNet-18 to VGG-11 (Simonyan and Zisserman 2015).

While varying the architectural parameters of the GCN binary classifier, we encountered poorer selection when increasing the dropout rate from 0.3 to 0.5 or 0.8.

Figure 7: Learner comparison [ResNet-18 vs VGG-11] with 4,000 CIFAR-10 labelled images [mean average accuracy over 5 trials]

However, when changing the size of the hidden units to 256 or 512, the UncertainGCN sampling was not affected on CIFAR-10. This might require further optimisation for different datasets, although robustness has been shown.

The main principle of UncertainGCN is focused on uncertainty sampling. In the Method section, we defined the parameter smargin to be tuned according to the GCN binary classification. Given the expected scores of 0 for confident unlabelled samples, we are interested in the uncertain ones that yield scores closer to 1. Figure 6 (left) confirms this by showing the best accuracy when smargin is 0.1. Another key parameter in both the UncertainGCN and CoreGCN query functions is λ, the loss bias between the labelled and unlabelled data points. We observe in Figure 6 (right) that the mean average accuracy on CIFAR-10 slightly improves in the first stages when the bias is shifted towards the unlabelled loss by 20%.

In Figure 7, we modified the architecture of the learner in the CIFAR-10 experiment to VGG-11 to analyse how the acquisition functions are affected in terms of accuracy at the fourth sampling stage. In training the VGG-11 network, we kept the same hyper-parameters. We also had to trace the features after the first four Max Pooling layers for the Learning Loss baseline. Our proposed methods remain robust to this change even though the settings were left unchanged. Hence, they surpass all the state-of-the-art methods at this early stage. This also demonstrates how the batch size and the feature representation play an important role in the performance of the other baselines. A representation of all selection stages has been added to the supplementary material.
Sub-sampling of Synthetic Data. Unlike the previous experiments of sub-sampling real images, we applied our method to select synthetic examples obtained from StarGAN (Choi et al. 2018) trained on RaFD (Langner et al. 2010) face expression manipulations. Although Generative Adversarial Networks (Goodfellow et al. 2014) are closing the gap between real and synthetic data (Ravuri and Vinyals 2019), the synthetic data and their associated labels are not yet suitable to train a downstream discriminative model. A recent study (Bhattarai et al. 2020) recommends sub-sampling the synthetic data before augmenting the real data. Figure 8 shows the performance comparison of random selection vs our method over 5 trials.

Figure 8: Performance comparison on sub-sampling synthetic data to augment real data for expression classification; testing accuracy on RaFD (mean of 5 trials) vs. number of synthetic images for Random and UncertainGCN. (Zoom in the view)

From the experiments, we can observe that our method outperforms random selection by obtaining higher accuracy with less variance. The mean accuracy drops when the ratio of synthetic examples increases, due to a large portion of GAN-synthetic examples being noisy (Bhattarai et al. 2020).
Limitations. In terms of computational efficiency, we acknowledge that the parameters of the GCN grow quadratically with the number of nodes. This presents a limitation which may be tackled either through prior sub-sampling of the unlabelled pool (as in this work) or through inductive learning on sub-graphs (GraphSAGE (Hamilton, Ying, and Leskovec 2017), GraphSAINT (Zeng et al. 2020)). Although it is beyond the scope of this paper, we evaluated the number of parameters of VAAL and Learning Loss in comparison to ours for the first query stage, as they fall within the same active learning category. For selecting samples with VAAL, a variational auto-encoder and a discriminator need to be trained in an adversarial manner. Thus, the total number of parameters for their proposed architecture is 23,115,204. Although VAAL claims efficient sampling speed, our GCN sampling technique requires only 82,307 parameters. On the other hand, the Learning Loss module with 123,905 parameters has to be trained together with the learner, affecting the overall training time. Moreover, specific deep learning architectures have to be deployed for this method, and its complexity varies with the dimensions of the multi-level features. Conclusively, our method can be faster in certain scenarios when compared to the other model-based algorithms. All these active learning frameworks scale up in complexity with more labelled images.

Conclusions
We have presented a novel methodology for active learning in image classification using a Graph Convolutional Network. After systematic and comprehensive experiments, our adapted sampling techniques, UncertainGCN and CoreGCN, produced state-of-the-art results on four benchmarks. We have shown through qualitative distributions that our selection functions maximise informativeness within the data space. The design of our sampling mechanism permits integration into other learning tasks. Furthermore, this approach enables further investigation in this direction, where both uncertain and geometric methods could be combined.

Acknowledgements
This work is partially supported by Huawei Technologies Co. and by EPSRC Programme Grant FACER2VM (EP/N007743/1).

Supplementary Material
CIFAR-10 imbalanced dataset. In the experimental part, we quantitatively evaluated the active learning methods in a systematic manner over four image classification datasets. Although, before selection, we randomise the unlabelled samples into a subset, the dataset is still relatively balanced across the class distribution. However, this is not commonly the case when there is no prior information about the data space. Therefore, we simulate an imbalanced CIFAR-10 in a quantitative experiment. Beforehand, we considered the 50,000-image training set as unlabelled, with 5,000 samples for each of the 10 categories. We customise the dataset so that 5 of the 10 classes contain 10% of their original data (500 samples each). Therefore, the new unlabelled pool is composed of 27,500 images. The experiment architecture and settings are similar to those of the full-scale experiment.

Figure A: Quantitative results on the CIFAR-10 imbalanced dataset; testing accuracy (mean of 5 trials) vs. number of labelled images for Random, UncertainGCN, VAAL, Learning Loss, CoreSet and CoreGCN.

Figure A shows the progression of the acquisition functions presented. Our proposed methods, UncertainGCN and CoreGCN, once again outperform the other model-based selection methods such as VAAL and Learning Loss. UncertainGCN scores 2% more than those methods, with 80.05% mean average accuracy at 10,000 labelled samples. Meanwhile, CoreGCN achieves the top performance of 84.5% together with CoreSet. Thus, the geometric information is more useful in scenarios where the dataset is imbalanced.

Ablation study - GCN parameter search. Figure B shows the hyper-parameter studies. With the range of 443,520 hyper-parameters, the variance of the mean performance is at most 3.5% at 2,000 samples. Hence, we believe that our method is robust to changes in the hyper-parameters. From a practical perspective, setting up the proposed active learning framework becomes straightforward when following the minimal guidelines presented in the experimental part.

Figure B: Ablation studies - CIFAR-10 GCN hyper-parameter tuning; testing accuracy (mean of 5 trials) vs. number of labelled images for drop rates of 0.3, 0.5 and 0.8 with 128 hidden units, and for 256 and 512 hidden units with a drop rate of 0.5.

VGG-11 learner for CIFAR-10 image classification over 3 selection stages. The results are shown in Figure C.

Figure C: CIFAR-10 with a VGG-11 learner over 3 selection stages; testing accuracy (mean of 5 trials) vs. number of labelled images for Random, UncertainGCN, VAAL, Learning Loss, CoreSet and CoreGCN.

References
Abel, R.; and Louzoun, Y. 2019. Regional based query in graph active learning. 1906.08541v1.

Beluch Bcai, W. H.; Nürnberger, A.; and Bcai, J. M. K. 2018. The power of ensembles for active learning in image classification. In CVPR.

Bhattarai, B.; Baek, S.; Bodur, R.; and Kim, T.-K. 2020. Sampling Strategies for GAN Synthetic Data. In ICASSP.

Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 18–42.

Cai, H.; Zheng, V. W.; and Chen-Chuan Chang, K. 2017. Active Learning for Graph Embedding. 1705.05085v1.

Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; and Choo, J. 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.

Cortes, C.; and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3): 273–297.

Gal, Y.; and Ghahramani, Z. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML.

Gal, Y.; Islam, R.; and Ghahramani, Z. 2017. Deep Bayesian Active Learning with Image Data. In ICML.

Gao, L.; Yang, H.; Zhou, C.; Wu, J.; Pan, S.; and Hu, Y. 2018. Active Discriminative Network Representation Learning. In IJCAI, 2142–2148.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NeurIPS.

Goodfellow, I. J.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. 2013. Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks. 1312.6082v4.

Hamilton, W. L.; Ying, R.; and Leskovec, J. 2017. Inductive Representation Learning on Large Graphs. In NeurIPS.

Har-Peled, S.; and Kushal, A. 2005. Smaller Coresets for k-Median and k-Means Clustering. In SCG, 126–134.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.

Houlsby, N.; Huszár, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian Active Learning for Classification and Preference Learning. 1112.5745v1.

Kingma, D. P.; and Lei Ba, J. 2015. ADAM: A Method for Stochastic Optimization. In ICLR.

Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

Kirsch, A.; Van Amersfoort, J.; and Gal, Y. 2019. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning. In NeurIPS.

Kontorovich, A.; Sabato, S.; and Urner, R. 2016. Active Nearest-Neighbor Learning in Metric Spaces. In NeurIPS.

Krizhevsky, A. 2012. Learning Multiple Layers of Features from Tiny Images. University of Toronto.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS.

Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D. H.; Hawk, S. T.; and Van Knippenberg, A. 2010. Presentation and validation of the Radboud Faces Database. Cognition and Emotion 24(8): 1377–1388.

Lewis, D. D.; and Gale, W. A. 1994. A Sequential Algorithm for Training Text Classifiers. In SIGIR, 3–12.

Nair, V.; and Hinton, G. E. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML.

Page, L.; Brin, S.; Motwani, R.; and Winograd, T. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab.

Pinsler, R.; Gordon, J.; Nalisnick, E.; and Miguel Hernandez-Lobato, J. 2019. Bayesian Batch Active Learning as Sparse Subset Approximation. In NeurIPS.

Pu, Y.; Gan, Z.; Henao, R.; Yuan, X.; Li, C.; Stevens, A.; and Carin, L. 2016. Variational autoencoder for deep learning of images, labels and captions. In NeurIPS.

Ravuri, S.; and Vinyals, O. 2019. Seeing is not necessarily believing: Limitations of BigGANs for data augmentation. In ICLR.

Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-set approach. In ICLR.

Simonyan, K.; and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-scale image recognition. In ICLR.

Sinha, S.; Ebrahimi, S.; and Darrell, T. 2019. Variational Adversarial Active Learning. In ICCV.

Tsang, I. W.; Kwok, J. T.; and Cheung, P.-M. 2005. Core Vector Machines: Fast SVM Training on Very Large Data Sets. J. Mach. Learn. Res. 6: 363–392. ISSN 1532-4435.

van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. JMLR.

Wolf, G. 2011. Facility location: concepts, models, algorithms and case studies. In Contributions to Management Science, 331–333.

Wu, Y.; Xu, Y.; Singh, A.; Yang, Y.; and Dubrawski, A. 2019. Active Learning for Graph Neural Networks via Node Feature Propagation. 1910.07567v1.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. 1708.07747v2.

Yoo, D.; and Kweon, I. S. 2019. Learning Loss for Active Learning. In CVPR.

Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. K. 2020. GraphSAINT: Graph Sampling Based Inductive Learning Method. In ICLR.

Zhao, L.; and Akoglu, L. 2020. PairNorm: Tackling Oversmoothing in GNNs. In ICLR.
