

Proceedings of the 11th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 117–128, April 19, 2021. ©2021 Association for Computational Linguistics


Analyzing Curriculum Learning for Sentiment Analysis along Task Difficulty, Pacing and Visualization Axes

Anvesh Rao Vijjini*    Kaveri Anuranjana*    Radhika Mamidi
International Institute of Information Technology - Hyderabad
{vijjinianvesh.rao,kaveri.anuranjana}@research.iiit.ac.in
[email protected]

Abstract

While Curriculum Learning (CL) has recently gained traction in Natural Language Processing tasks, it is still not adequately analyzed. Previous works only show its effectiveness but fall short of fully explaining and interpreting the internal workings. In this paper, we analyze curriculum learning in sentiment analysis along multiple axes. Some of these axes have been proposed by earlier works and need more in-depth study. Such analysis requires understanding where curriculum learning works and where it does not. Our axes of analysis include the effect of task difficulty on CL, a comparison of CL pacing techniques, and qualitative analysis by visualizing the movement of attention scores in the model as curriculum phases progress. We find that curriculum learning works best for difficult tasks and may even lead to a decrement in performance for tasks that already perform well without curriculum learning. We also see that One-Pass curriculum strategies suffer from catastrophic forgetting, and attention movement visualization within curriculum pacing shows that curriculum learning breaks down the challenging main task into easier sub-tasks solved sequentially.

1 Introduction

Learning in humans has always been a systematic approach of handling the fundamentals first and then learning incrementally harder concepts. Cognitive Science has established that this leads to a clearer, more robust understanding and the most efficient learning (Krueger and Dayan, 2009; Avrahami et al., 1997). Indeed, something similar can be applied while training neural networks. (Bengio et al., 2009) show that Curriculum Learning (CL), sampling data in increasing order of difficulty, leads to quicker generalization. (Weinshall et al., 2018) also demonstrate that CL increases the rate of convergence at the beginning of training. Their CL strategy involved sorting the training data based on transfer learning from another network trained on a larger dataset. The idea of reordering samples has been explored in various approaches. In this paper, we evaluate "easiness" with one network and train the samples on another network. Hence, even in our case, we pick easier points with respect to a target hypothesis and then train another network that optimizes its current hypothesis. This idea has been suggested by previous works as well (Hacohen and Weinshall, 2019; Weinshall et al., 2018).

*The authors contributed equally to the work.

(Cirik et al., 2016) proposed Baby Steps and One Pass curriculum techniques using sentence length as a curriculum strategy for training LSTM (Hochreiter and Schmidhuber, 1997) on Sentiment Analysis. A tree-structured curriculum ordering based on semantic similarity is proposed by (Han and Myaeng, 2017). (Rao et al., 2020) propose an auxiliary network that is first trained on the dataset and used to calculate difficulty scores for the curriculum order.

CL is also used in NLP within tasks such as Question Answering (Sachan and Xing, 2016, 2018) and NLG for Answer Generation (Liu et al., 2018). For Sentiment Analysis, (Cirik et al., 2016) propose a strategy derived from sentence length, where shorter sentences are considered easier and are provided first. (Han and Myaeng, 2017) provide a tree-structured curriculum based on semantic similarity between new samples and samples already trained on. (Tsvetkov et al., 2016) suggest a curriculum based on handcrafted semantic, linguistic, and syntactic features for word representation learning.

Some of these works (Cirik et al., 2016; Han and Myaeng, 2017; Rao et al., 2020) have suggested that Baby Steps performs better than One Pass. We perform experiments using both techniques. While the idea of curriculum remains the same across these works, the strategy itself to decide sample ordering is often diverse.

2 Axis I: Curriculum Learning: One Pass and Baby Steps

While Curriculum Learning as defined by (Bengio et al., 2009) is not constrained by a strict description, later related works (Cirik et al., 2016; Han and Myaeng, 2017; Spitkovsky et al., 2010; Rao et al., 2020) make a distinction between the Baby Steps curriculum and the One-Pass curriculum. Most of these previous works have also shown the dominance of Baby Steps over One-Pass. Baby Steps and One Pass curriculum can be defined as follows. For every sentence si ∈ D, its sentiment is described as yi ∈ {0, 1, 2, 3, 4}, where i ∈ {1, 2, ..., n} for n data points in D. For a model fw, its prediction based on si will be fw(si). Loss L is defined on the model prediction and actual output as L(yi, fw(si)), and the Cost defining the task, C(D, fw), as

C(D, fw) = (1/n) Σi L(yi, fw(si))    (1)

Here, the curriculum strategy S(si) defines an "easiness"/"difficulty" quotient of sample si. Furthermore, One Pass makes distinct, mutually exclusive sets of the training data and trains on each one of these sets one by one. This makes it faster as compared to Baby Steps, where data cumulatively increases in each pass. This implies that the model is trained on the previous data as well as the additional harder data.
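As a concrete illustration of the distinction, the following is a minimal Python sketch (not the authors' released code) of the two pacing schemes once a difficulty score S(si) is available. The names score_fn, train_on and k are placeholders for the curriculum strategy, one training pass over a list of samples, and the number of phases respectively.

```python
# Minimal sketch contrasting One Pass and Baby Steps pacing.
import numpy as np

def make_phases(samples, score_fn, k=5):
    """Sort samples by difficulty score and split them into k phases."""
    order = np.argsort([score_fn(s) for s in samples])
    ordered = [samples[i] for i in order]
    return np.array_split(np.array(ordered, dtype=object), k)

def one_pass(model, samples, score_fn, train_on, k=5):
    # Disjoint subsets: each phase trains only on its own (harder) slice.
    for phase in make_phases(samples, score_fn, k):
        train_on(model, list(phase))

def baby_steps(model, samples, score_fn, train_on, k=5):
    # Cumulative: every phase re-trains on all easier data seen so far.
    seen = []
    for phase in make_phases(samples, score_fn, k):
        seen.extend(phase)
        train_on(model, seen)
```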

To analyze the two methods for executing CL, we choose two curriculum strategies (difficulty scoring functions). Furthermore, we also experiment with an Individual setting, explained in the following sections.

2.1 Dataset

Following previous works in curriculum-driven sentiment analysis (Cirik et al., 2016; Han and Myaeng, 2017; Tsvetkov et al., 2016; Rao et al., 2020), we use the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013). Unlike most sentiment analysis datasets with binary labels, SST is a 5-class classification task consisting of 8544/1101/2210 samples in the train, development, and test sets. We use this standard split, with reported results averaged over 5 runs.

Our dataset has 5 labels. https://nlp.stanford.edu/sentiment/

2.2 Model Details: BERT

We use the popular transformer model BERT (Devlin et al., 2019) for our experiments due to how ubiquitous it is across natural language processing tasks. Bidirectional Encoder Representations from Transformers (BERT) is a masked language model trained on large corpora. A sentence is prepended with a special token (CLS) and is passed into the pre-trained BERT model. It tokenizes the sentence with a maximum length of 512 and outputs a contextual representation for each of the tokenized words. There are variants of pre-trained BERT depending upon the hyper-parameters of the model. BERT-Base Uncased consists of 12 transformer encoders, and the output for each token is a 768-dimension embedding. We apply a softmax layer over the output at the first token position (CLS) of the BERT model for sentiment analysis.
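The setup described above can be sketched as follows, assuming the HuggingFace transformers library; the class name, hyper-parameters and example sentence are illustrative only, not the authors' exact configuration.

```python
# Minimal sketch: BERT-Base Uncased with a softmax over the (CLS) output.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSentimentClassifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(768, num_classes)  # 768-dim token embeddings

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # representation of (CLS)
        return torch.softmax(self.classifier(cls), dim=-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("a majestic achievement", return_tensors="pt",
                truncation=True, max_length=512)
probs = BertSentimentClassifier()(enc["input_ids"], enc["attention_mask"])
```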

2.3 Curriculum Strategies

2.3.1 Auxiliary Model Strategy

The Auxiliary Model Strategy is based on previous works (Weinshall et al., 2018; Hacohen and Weinshall, 2019; Rao et al., 2020) which propose a difficulty scoring function transfer-learned from another network. We first train an auxiliary model Aux for sentiment analysis on the same dataset. This Aux model architecture is the same as the model finally used for CL. This allows us to find out which training samples are actually difficult: we learn from this model which samples are the hardest to classify and which are the easiest. For all training samples of D, we define the curriculum score as follows:

S(si) = Σj=1..c (Aux(si)^j − yi^j)^2    (2)

where Aux(si)^j is the prediction of the auxiliary model Aux on sentence si and j is the iterator over the number of classes c = 5. In essence, we find the mean squared error between the prediction and the sentence's true labels. If S(si) is high, it implies the sentence is hard to classify; if low, the sentence is easy. Because the scores are obtained from an auxiliary model trained on BERT features, we get an easiness-difficulty score purely from the perspective of sentiment analysis. Section A gives some examples of difficult and easy samples according to this curriculum.

Figure 1: Performance comparison shows the weakness of One Pass.

2.3.2 Sentence Length

This simple strategy posits that architectures, especially LSTMs, find it difficult to classify longer sentences. Hence, longer sentences are difficult and should be presented later. Conversely, shorter sentences are easier and should be trained on first. This strategy is prevalent and has been used not only in sentiment analysis (Cirik et al., 2016) but also in dependency parsing (Spitkovsky et al., 2010). This is why it becomes a strong comparison metric, especially to evaluate the distinction between One Pass and Baby Steps.
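Under this strategy the difficulty score is simply the token count, which can be plugged into the hypothetical make_phases helper from the earlier sketch:

```python
def length_score(sentence):
    # Longer sentence -> higher difficulty; shorter sentences are trained on first.
    return len(sentence.split())
```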

2.3.3 Individual

In this strategy, we report accuracy on the test set when training on D1, D2, and so on individually. This is by no means a curriculum strategy, since no model ever sees the complete training data. The Individual experiment can be thought of as One Pass, but with the weights reset after every training phase.

2.4 Results and Discussion

Table 1 and Figure 1 give us an insight into what is happening with One Pass. Firstly, Table 1 clearly shows that One Pass underperforms both No Curriculum and Baby Steps for both curriculum strategies. This is in line with previous experiments as well (Cirik et al., 2016; Han and Myaeng, 2017; Rao et al., 2020). One Pass is less time-consuming since it only observes a sample once, unlike Baby Steps, which repeatedly sees samples from the previous and the current phase. Furthermore, we see that the Auxiliary strategy outperforms Sentence Length. Finally, Figure 1 illustrates the reason for One Pass's weakness. We observe that on successive phases, One Pass closely follows the curve of the proposed Individual experiment. However, unlike One Pass, Individual has no memory of samples at previous stages. This shows that in every phase of One Pass, the model forgets the previous stage's samples and effectively behaves like Individual, hence catastrophically forgetting previous phases. Furthermore, this is especially a major issue for the myriad of methods involving the Language Model pre-training and downstream task fine-tuning paradigm. Language-Model-backed transformers (Vaswani et al., 2017) are now ubiquitous for tasks across Natural Language Understanding (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Lan et al., 2020; Raffel et al., 2020; Choudhary et al., 2020). These architectures are trained on a language model task first, followed by fine-tuning on the downstream task. In this regard, they are similar to the One Pass strategy since they successively train on disjoint datasets. Hence, problems with One Pass, such as catastrophic forgetting, are likely to occur with these architectures as well. Additionally, while Baby Steps addresses the catastrophic forgetting in One Pass, it will be harder to address catastrophic forgetting in Language Models: the downstream task objective in the CL setting is different from the Language Model objective, making joint training harder, unlike Baby Steps.

(Cirik et al., 2016) have also performed CL on SST. However, our numbers do not match because they use the much larger phrase dataset.


Model                    | Baby Steps | One Pass | No Curriculum
Auxiliary Strategy       | 52.3       | 24.15    | 50.03
Sentence Length Strategy | 51.2       | 23.64    | 50.03

Table 1: Accuracy scores in percentage of the two curriculum strategies across Baby Steps and One Pass as compared to no curriculum on the SST-5 dataset.

2.4.1 Visualizing Catastrophic Forgetting

To visualize catastrophic forgetting, Figure 2 plots the BERT model's performance. The figure illustrates the match or mismatch between the model prediction and the ground truth rather than just the model prediction. In this figure, the correctness of the model prediction across all test samples is illustrated for every phase of the One Pass and Baby Steps methods. The samples are vertically stacked along the y-axis in an order based on the number of phases in which a sample is misclassified. The consecutive phases of the curriculum training are indicated on the x-axis (as discussed earlier, we take k = 5; hence we end up with 5 phases on the x-axis). A darker color (value of "0") indicates misclassification, and a brighter or lighter color (value of "1") indicates that the sample is correctly classified. Note that the classification task itself is not a binary task but a multi-class classification problem; for SST-5, this is a five-class classification task. Figure 2 illustrates the following points (a minimal sketch of how such a correctness matrix can be assembled is given after the list):

• Easy and Hard Samples: In both figures, some sections are always dark or always light in color. The difficulty of certain samples is consistent irrespective of model training. The accuracy is actually determined by the samples of intermediate difficulty, whose prediction can fluctuate as the model observes more data.

• Memory in Baby Steps: The figure for Baby Steps shows the model remembering fairly well the concepts it is trained on. Here, once the model prediction is corrected, it mostly stays corrected, implying a "memory" in the model.

• Catastrophic Forgetting in One Pass: Contrary to Baby Steps, One Pass heavily suffers from catastrophic forgetting, which is observed every time the color of a sample changes from lighter to darker. In One Pass, in successive phases, the model makes more correct predictions in new regions. However, at the same time, samples become darker in the regions where they were earlier lighter. Model performance decreases to its lowest in the final stages because the model is the farthest from all previous learning phases at this point.

• Visualizing Test Performance: This visualization is done on the unseen test set. Our hypotheses above would still be natural to understand if visualized on the train set; however, the samples in this visualization are always unseen during training. This implies that the model forgets or remembers not only the training samples in the corresponding phases but the associated concepts as well.
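To make the construction of Figure 2 concrete, the following is a minimal sketch (not the authors' plotting code) of how such a correctness matrix can be assembled and displayed; matplotlib is assumed and the predictions are dummy placeholders.

```python
# Rows are test samples, columns are curriculum phases; a cell is 1 if the
# phase-i model classified that sample correctly, else 0.
import numpy as np
import matplotlib.pyplot as plt

def correctness_matrix(phase_predictions, labels):
    """phase_predictions: list of length c, each an array of predicted labels."""
    mat = np.stack([preds == labels for preds in phase_predictions], axis=1)
    # Order rows by the number of phases in which the sample is classified correctly.
    order = np.argsort(mat.sum(axis=1))
    return mat[order].astype(int)

# Dummy example with c = 5 phases and 100 test samples:
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=100)
preds = [rng.integers(0, 5, size=100) for _ in range(5)]
plt.imshow(correctness_matrix(preds, labels), aspect="auto", cmap="gray")
plt.xlabel("curriculum phase"); plt.ylabel("test samples"); plt.show()
```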

3 Axis II: Curriculum Learning only helps Difficult Tasks

(Hacohen and Weinshall, 2019) and (Xu et al., 2020) propose that CL might help harder tasks more than easier ones. They observe that for tasks with low performance, the difference in performance caused by introducing curriculum training is greater than for tasks which already have higher performance without a curriculum. We call this the Task Difficulty Hypothesis.

Figure 2: Catastrophic Forgetting in One Pass Visualized. In the above images, the model performance on each individual test set sample is illustrated across all curriculum phases. These phases correspond with how much data the model has observed so far.

(Hacohen and Weinshall, 2019) perform an image classification experiment where they enable curriculum on the CIFAR dataset (Krizhevsky et al., 2009) with 100 classes (CIFAR-100) and with 10 classes (CIFAR-10) as two separate experiments. They use the VGG network (Simonyan and Zisserman, 2014) as the common neural network for these experiments. Naturally, they report performance on CIFAR-10 in the range of 90 to 100% (hence, an easy task) and on CIFAR-100 in the range of 60 to 70% (hence, a hard task). On adding curriculum training to these datasets, they observe that the increment in performance on CIFAR-100 is almost twice the increment in performance on CIFAR-10 while using the same VGG network for training. They argue that this might be because in easier tasks such as CIFAR-10, there are already enough easy samples observed during training without CL, and hence the improvement caused by CL is subdued.

(Xu et al., 2020) enable their CL across the range of tasks in GLUE (Wang et al., 2018). GLUE encompasses a wide range of tasks in natural language understanding, with varying performances. For example, a task such as RTE or CoLA is considered harder than SST-2 or QNLI; the BERT (Devlin et al., 2019) model's performances on these, as reported in the original paper, are 70.1, 60.5, 94.9, and 91.1 respectively.

We build our experiments upon these works. The proposed work differs from (Hacohen and Weinshall, 2019) in that our work focuses on sentiment analysis in NLP, whereas (Hacohen and Weinshall, 2019) experimented strictly with image processing tasks. Furthermore, since the apparent relation between task difficulty and improvement due to CL was not the main focus of that work, the experimentation was not enough to fully establish the correlation. While (Xu et al., 2020) perform experiments within the purview of NLU and text classification, the experiments themselves are not consistent across datasets and the nature of the task. Specifically, we believe it is hard to conclude from experiments on tasks as disparate as Linguistic Acceptability (Warstadt et al., 2019) and Sentiment Analysis (Socher et al., 2013) to establish the relationship between CL and task difficulty. Such an experiment inadvertently raises more questions about whether the difference in improvement due to CL is due to the nature of the tasks (sentiment identification as opposed to linguistic soundness detection) or the nature of the datasets being different (sentence lengths or vocabulary). We avoid this concern by performing experiments within the same task (Sentiment Analysis) and the same dataset (SST-5); instead, we introduce change in task difficulty by using the fine-grained labels provided by this dataset. This experimental setup is explained in more detail in the following sections.

3.1 Dataset

To ensure no effect of the nature of the task or dataset on our experiments, we utilize just a single dataset but under different conditions and sampling to simulate difficulty. All these meta-datasets are generated from the SST-5 (Socher et al., 2013) dataset. It is important to note that we do not use the phrase data for training, which is why our scores may fall short of earlier reported performances of BERT on SST-5 and SST-derived datasets (Devlin et al., 2019; Liu et al., 2019; Xu et al., 2020). The fine-grained 5 classes of this dataset enable us to create easier variants of the same dataset without any significant change in the problem statement definition and the dataset statistics. We generate four datasets from SST-5: SST-2, SST-3, SST-4 and SST-5 itself. In SST-3 and SST-5 the neutral label is preserved; otherwise it is dropped. The two negative labels and the two positive labels are clubbed together in SST-3 and SST-2.

https://nlp.stanford.edu/sentiment/

Dataset | With Curriculum | No Curriculum | Difference
SST-2   | 89.95           | 91.11         | −1.6
SST-3   | 75.47           | 75.52         | −0.05
SST-4   | 62.04           | 61.72         | +0.32
SST-5   | 52.3            | 50.03         | +2.0

Table 2: Accuracy scores in percentage on all the created meta-data with and without curriculum. Results are obtained by averaging over 5 runs.

In each of the SST-x datasets, the train, test and dev sets are all converted accordingly. The model in all cases is strictly trained on the train data, with the development set used for validation, and results are reported on the test set. This way the comparison is fair, since dataset-specific effects on the hypothesis would not be pertinent.
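A minimal sketch of one plausible label mapping consistent with the description above; the exact grouping used by the authors is assumed, with SST-5 labels 0 (very negative) through 4 (very positive).

```python
def relabel(sst5_label, variant):
    # Map an SST-5 label to the requested meta-dataset; None means the sample
    # is dropped from that variant (assumed grouping, not the authors' code).
    if variant == "SST-5":
        return sst5_label
    if variant == "SST-4":                       # drop neutral, keep 4 classes
        return None if sst5_label == 2 else {0: 0, 1: 1, 3: 2, 4: 3}[sst5_label]
    if variant == "SST-3":                       # club polarities, keep neutral
        return {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}[sst5_label]
    if variant == "SST-2":                       # drop neutral, binary labels
        return None if sst5_label == 2 else int(sst5_label > 2)
    raise ValueError(variant)

assert relabel(4, "SST-2") == 1 and relabel(2, "SST-2") is None
```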

Humans are likely to observe the sentiment polarity of a natural language sentence on a continuous scale rather than as distinct classes. Hence, it makes natural sense that making a distinction between Very Positive and Positive is a harder task than just making a distinction between positive and negative.

3.2 Curriculum Training

We use the Auxiliary Model Strategy (Section 2.3.1) coupled with Baby Steps (Section 2) as our curriculum training technique and BERT as the model architecture. We use Baby Steps (Cirik et al., 2016) as the curriculum pacing since we observed it to have the best performance.

3.3 Results and Discussions

Our experiments are illustrated in Figure 3 and Table 2. The following are the major points to be noted:

Figure 3: Accuracy scores in percentage on all the created meta-data with and without curriculum. The difference is plotted on the secondary y-axis (right).

• Number of classes is detrimental to model performance: We observe that in both the curriculum and no-curriculum settings, model performance varies tremendously between SST-5 and SST-2. This is also evident in previous works, which have shown performance on SST to lie in the range of 45-50% (Cirik et al., 2016) for the 5-class problem and in the order of 90% (Lan et al., 2020; Devlin et al., 2019; Xu et al., 2020) for the binary class problem. Furthermore, it is interesting to note that the performance gap between SST-i and SST-(i+1) decreases as the number of classes increases. This implies that while adding classes makes the task harder, the added hardness matters less when there are already many classes.

• Curriculum has adverse effects on high-performing models: The most striking point to be observed is that curriculum actually does not help in certain cases. In SST-2 and SST-3, the model is already performing quite well, with accuracies of 91.11% and 75.52% respectively. In these cases the model observes decrements of 1.6% and 0.05% respectively due to CL. The performance on SST-2 is in line with previous work (Xu et al., 2020), who also observe a slight decrement to no increment on this dataset for their own curriculum technique. Tasks with high performance, in essence, possess a higher number of easy samples than tougher tasks. In such a condition, a model observing these easy samples again and again (as we saw earlier in Baby Steps) might lead to overfitting. We believe this to be the reason behind the apparent performance degradation.

• Curriculum has positive effects on low-performing models: Despite the poor showing on SST-2, the resourcefulness of CL is felt on SST-5 and SST-4, where there is a positive difference. These tasks are harder in nature, with No Curriculum performance below 65%, and CL is able to improve the scores significantly, with an average gain of +1.16%. As we saw in the previous section, hard samples are catastrophically forgotten all the time; hence training on these samples again and again leads to an improvement. Unlike SST-2 and SST-3, which have a huge proportion of easy samples, hard tasks accordingly have harder samples, and hence repeated training in a systematic ordering does not have an adverse effect.

Furthermore, (Xu et al., 2020) suggest a similar reasoning for the Task Difficulty Hypothesis. They suggest that when learning without a curriculum, in the case of harder tasks, the model is confounded and overwhelmed by the presence of hard samples. Hence, laying out the training so that the model observes easier samples first naturally improves performance. However, this reasoning does not explain well why there is an apparent decrement in performance for CL on high-performing tasks.

4 Axis III: Attention Movement Visualization

Visualization of attention (Vaswani et al., 2017) is an important and popular method for interpreting the predictions of these models. Previous works such as (Clark et al., 2019) have visualized attention to identify where BERT looks. (Vaswani et al., 2017) also show with attention visualization that various heads of the transformer look at various linguistic details. In this section we use attention visualization to qualitatively analyze and explain how the model's focus on the sentence changes as the various stages of CL (when the model encounters new and harder data) progress.

It is important to note that BERT has N layers and H heads. Within each of these N×H heads there lies a T×T self-attention from each input time stamp to every other time stamp, where T is the maximum sentence length. Thereby, there are a total of N×H two-dimensional T×T attention visualizations observed in BERT.
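These per-phase attention maps can be extracted, for example, with the HuggingFace transformers library, as in the following minimal sketch (the example sentence is one of the test sentences discussed below; this is not the authors' code):

```python
# Extract the N x H attention maps of size T x T for one sentence.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

enc = tokenizer("static , repetitive , muddy and blurry , hey arnold !",
                return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions is a tuple of N tensors, each of shape (batch, H, T, T).
attn = torch.stack(out.attentions).squeeze(1)   # (N, H, T, T)
print(attn.shape)
```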

4.1 Experiments

To particularly identify how the focus of the BERT model changes across the multiple stages or phases of the curriculum, we devise an attention movement visualization. We define the Attention Movement index as

M(i, h, n) = A(i, h, n) − A(i−1, h, n),  ∀ i ∈ {1, 2, 3, ..., c}    (3)

where M is the proposed Movement index, A(i, h, n) is the attention visualization of the h-th head of the n-th layer after training on the i-th phase of curriculum training, and c is the total number of phases in the curriculum training. We can see that a positive M(i) indicates that, across subsequent stages of the curriculum, the model has added attention or increased focus in the area.

Conversely, a negative M(i) indicates that after completing phase i, the model has decided to attend to a certain area or region less than before that phase's training. Essentially, the Movement index indicates relative change in attention rather than attention itself. This is significant because, unlike typical visualization where we observe a singular model in isolation, here we are attempting to analyze how the model behaves in subsequent stages of training in a curriculum fashion and, furthermore, how this change affects the prediction label.

Additionally, a point to note is that while previous works such as (Clark et al., 2019) establish that BERT's distinct heads attend to linguistic notions of syntax among other linguistic ideas, we do not visualize these attention heads individually. In this experiment, we are looking for an overall perspective of where there is a positive or negative change in the attention focus rather than understanding the individual linguistic nuances of individual heads. For this reason we further propose an Averaged Movement index as follows:

Mavg(i) = Σh=1..H Σn=1..N M(i, h, n)    (4)

where Mavg(i) is the Averaged Movement index across all heads and layers after the i-th curriculum training phase, and h and n are iterators over the total number of heads H and the total number of layers N in the BERT Transformer. The total number of curriculum phases is c. Since M is a difference between subsequent phases, there are a total of c − 1 Averaged Movement index visualizations of size T×T.
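A minimal sketch of Equations (3) and (4), assuming per-phase attention tensors of shape (N, H, T, T) have already been collected (e.g. as in the earlier extraction snippet); the dummy shapes are illustrative.

```python
import numpy as np

def movement_index(attn_by_phase):
    """attn_by_phase: list of c arrays, each (N, H, T, T). Returns c-1 T x T maps."""
    m = [attn_by_phase[i] - attn_by_phase[i - 1]          # Eq. (3)
         for i in range(1, len(attn_by_phase))]
    return [phase.sum(axis=(0, 1)) for phase in m]        # Eq. (4): sum over heads/layers

# Dummy example with c = 5 phases, N = 12 layers, H = 12 heads, T = 16 tokens:
phases = [np.random.rand(12, 12, 16, 16) for _ in range(5)]
m_avg = movement_index(phases)
print(len(m_avg), m_avg[0].shape)   # 4 (16, 16)
```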

Other experimental details, such as the dataset, curriculum strategy, Baby Steps, and the model, are all the same as in the experiments explained in Section 3. To recapitulate, we use the SST-5 dataset with the Auxiliary Model technique in the Baby Steps training procedure. The c in Baby Steps training is 5 in our experiments.

Figure 4: Attention Movement Index (Mavg(i)) visualization for the first and second sentences. The matrices show how the model's focus on the input sentence changes between phases. Blue implies attention was added during the phase transition, while red implies attention diminished for those words.

4.2 Results and Discussions

Figure 4 gives examples of Mavg scores for two of the three sentences we analyze (the third is discussed in Appendix B). There are a total of five phases in our training and hence four Mavg scores per sentence, with each score being T×T in size, where T is the maximum sentence length. In these images, a blue color implies there was an addition of attention or focus in the region, and red, or a negative value, implies there was a reduction in the model's focus on the area. Almost no color, or white, indicates no change in attention in the region. The movement index visualizations are shown along the direction of the arrow. All the following examples are from the test set; after the curriculum training, the intermediate phase-wise models were extracted and made to predict on the individual test sentences. Furthermore, please note that for each of the T×T attentions, the y-axis denotes the input positions and the x-axis denotes the output positions of the self-attention. Our analysis can be listed for each sentence as follows:

• In the sentence (True Label: Negative) "static , repetitive , muddy and blurry , hey arnold !", the model predicts incorrectly in the beginning and continues to mislabel until the third phase of training. We observe that on finishing the third phase of training, the movement index (second image from the top) shows a motion of higher focus towards "static repetitive muddy" and lower focus on redundant words. In the first and second phases, when the model had observed just a sample of easy data, the movement of attention was not in any direction that could help the classification. The motion in the third phase, however, shows the addition of focus on words which could help the model classify the sentence as negative. Furthermore, up to the last phase we also see an addition of focus on the terms "hey arnold !"; this could be the model's way of focusing on some neutral words to avoid predicting Very Negative as opposed to Negative. Essentially, CL has divided up the task of predicting sentiment into multiple sections, where the model first learnt the neutrality from "hey arnold" followed by focusing on the polarity terms to predict just Negative. This makes the prediction easier and hence more effective than no curriculum.


• In the sentence (True Label: Positive) "it 's a worthwhile tutorial in quantum physics and slash-dash", the model predicts incorrectly until the completion of phase four training. In phases two and three we see that the model adds weight to some neutral words such as "it 's a" and "slash-dash", and hence ends up predicting Neutral, which is an incorrect prediction. However, after observing the phase four samples, the model finally increases attention on the actual polarity-indicating word in the sentence, "worthwhile", and hence shifts its prediction appropriately to Positive. There is a possibility that, since "worthwhile" does not exist in BERT's vocabulary, the model found it hard to map the word to positive polarity until enough training data was observed in subsequent phases.

In all the above examples, there is a common pattern: CL spreads out the sentiment prediction process. Training on the initial, easier samples rules out incorrect predictions from parts of the sentence which do not contribute to the sentiment.

5 Conclusion

In this paper we conduct experiments to analyze CL. We first hypothesize the failure of One Pass for curriculum learning in text classification. We conduct an Individual, non-curriculum experiment to show that One Pass heavily suffers from catastrophic forgetting. Unless this forgetting is tackled, One Pass will continue to underperform even no-curriculum settings. Then, we analyze CL along the Task Difficulty Hypothesis to establish firmly that CL only helps when coupled with difficult tasks; the scope of improvement may diminish or even be negative for tasks which are already easy. Finally, we propose movement visualizations to analyze CL. We observe that for hard examples, CL breaks down the task into multiple phases. These phases, in a way, rule out certain sections of the sentence for sentiment prediction first, such that when the model encounters harder data in later phases, it is clear on what not to focus on. Furthermore, after both the experiments, we can conclude that the reason CL might be deleterious to performance in easier tasks could be that in these tasks, multiple phases are unnecessary. If the model prediction was already correct in the first couple of phases, then further phases may only move the prediction away, hence leading to poor performance.

References

Judith Avrahami, Yaakov Kareev, Yonatan Bogot, Ruth Caspi, Salomka Dunaevsky, and Sharon Lerner. 1997. Teaching by examples: Implications for the process of category acquisition. The Quarterly Journal of Experimental Psychology Section A, 50(3):586–606.

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K Reddy. 2020. Self-supervised hyperboloid representations from logical queries over knowledge graphs. arXiv preprint arXiv:2012.13023.

Volkan Cirik, Eduard Hovy, and Louis-Philippe Morency. 2016. Visualizing and understanding curriculum learning for long short-term memory networks. arXiv preprint arXiv:1611.06204.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Guy Hacohen and Daphna Weinshall. 2019. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pages 2535–2544. PMLR.

Sanggyu Han and Sung-Hyon Myaeng. 2017. Tree-structured curriculum learning based on semantic similarity of text. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 971–976. IEEE.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.

Kai A Krueger and Peter Dayan. 2009. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Cao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2018. Curriculum learning for natural answer generation. In IJCAI, pages 4223–4229.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Vijjini Anvesh Rao, Kaveri Anuranjana, and Radhika Mamidi. 2020. A SentiWordNet strategy for curriculum learning in sentiment analysis. In International Conference on Applications of Natural Language to Information Systems, pages 170–178. Springer.

Mrinmaya Sachan and Eric Xing. 2016. Easy questions first? A case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 453–463.

Mrinmaya Sachan and Eric Xing. 2018. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 629–640.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Valentin I Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010. From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 751–759. Association for Computational Linguistics.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. 2016. Learning the curriculum with Bayesian optimization for task-specific word representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 130–139.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Daphna Weinshall, Gad Cohen, and Dan Amir. 2018. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, pages 5238–5246. PMLR.

Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. 2020. Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6095–6104.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

A Difficult and Easy Samples according to the Auxiliary Strategy

Table 3 shows some examples of difficult and easy samples according to the Auxiliary Model curriculum strategy explained in Section 2.3.1. As we can observe, this curriculum is effective because it considers a sample difficult in ways similar to how a human would discern the sentiment.

B Additional Explanation for Attention Movement Visualization

In the sentence from Figure 5 (True Label: Negative), "if this dud had been made in the '70s , it would have been called the hills have antlers and played for about three weeks in drive-ins .", the model predicts incorrectly until the fourth phase. After the second phase, the model predicts Positive by increasing focus on parts of the sentence such as "the hills" and "played for", which could have a slightly positive (as opposed to very positive) connotation in a movie review context. After the third phase, the motion is more towards "the hills" and "dud", which have neutral and negative connotations respectively, leading to an overall sentiment of Neutral. Finally, after the fourth phase, the movement is most towards the only strongly negative word, "dud". Like the previous example, "dud" is not part of BERT's vocabulary, and the only reason the model chose to increase its attention here must be that it had already explored other regions of the sentence in previous phases. Hence, without this specific ordering, the model would have a lower probability of focusing on the broken subwords "du ##d".

In all the examples, there is a common pattern: curriculum learning spreads out the sentiment prediction process. Training on the initial, easier samples rules out incorrect predictions from parts of the sentence which do not contribute to the sentiment. This helps in later phases, when the model receives information to focus on the right segment of the sentence. At this stage, the model is more confident in its prediction, since other regions of the sentence were already explored and excluded from the prediction process. In essence, CL lowers the probability of an incorrect prediction by revealing the right data at the right time in the spread-out, curriculum form of training.

Figure 5: Attention Movement Index visualizations for the sentence. The matrices show how the model's focus on the input sentence changes between phases. Blue implies attention was added during the phase transition, while red implies attention diminished for those words.


Easy (from the first 50 samples):
1. meeting , even exceeding expectations , it 's the best sequel since the empire strikes back ... a majestic achievement , an epic of astonishing grandeur and surprising emotional depth .
2. the most wondrous love story in years , it is a great film .
3. one of the best looking and stylish animated movies in quite a while ...

Hard (from the last 50 samples):
1. if the predictability of bland comfort food appeals to you , then the film is a pleasant enough dish .
2. it is a testament of quiet endurance , of common concern , of reconciled survival .
3. this movie is so bad , that it 's almost worth seeing because it 's so bad .

Table 3: Examples of Difficult and Easy samples according to the Auxiliary Model Strategy.

Table 3: Examples of Difficult and Easy samples according to the Auxiliary Model Strategy