
Proceedings of the 15th Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, July 10, 2020. ©2020 Association for Computational Linguistics


Should You Fine-Tune BERT for Automated Essay Scoring?

Elijah Mayfield and Alan W Black
Language Technologies Institute
Carnegie Mellon University
[email protected], [email protected]

Abstract

Most natural language processing research now recommends large Transformer-based models with fine-tuning for supervised classification tasks; older strategies like bag-of-words features and linear models have fallen out of favor. Here we investigate whether, in automated essay scoring (AES) research, deep neural models are an appropriate technological choice. We find that fine-tuning BERT produces similar performance to classical models at significant additional cost. We argue that while state-of-the-art strategies do match existing best results, they come with opportunity costs in computational resources. We conclude with a review of promising areas for research on student essays where the unique characteristics of Transformers may provide benefits over classical methods to justify the costs.

1 Introduction

Automated essay scoring (AES) mimics the judgment of educators evaluating the quality of student writing. Originally used for summative purposes in standardized testing and the GRE (Chen et al., 2016), these systems are now frequently found in classrooms (Wilson and Roscoe, 2019), typically enabled by training data scored on reliable rubrics to give consistent and clear goals for writers (Reddy and Andrade, 2010).

More broadly, the natural language processing (NLP) research community in recent years has been dominated by deep neural network research, in particular, the Transformer architecture popularized by BERT (Devlin et al., 2019). These models use large volumes of existing text data to pre-train multilayer neural networks with context-sensitive meaning of, and relations between, words. The models, which often consist of over 100 million parameters, are then fine-tuned to a specific new labeled dataset and used for classification, generation, or structured prediction.

Research in AES, though, has tended to prioritize simpler models, usually multivariate regression using a small set of justifiable variables chosen by psychometricians (Attali and Burstein, 2004). This produces models that retain direct mappings between variables and recognizable characteristics of writing, like coherence or lexical sophistication (Yannakoudakis and Briscoe, 2012; Vajjala, 2018). In psychometrics more generally, this focus on features as valid "constructs" leans on a rigorous and well-defined set of principles (Attali, 2013). This approach is at odds with Transformer-based research, and so our core question for this work is: for AES specifically, is a move to deep neural models worth the cost?

The chief technical contribution of this work is to measure results for BERT when fine-tuned for AES. In section 3 we describe an experimental setup with multiple levels of technical difficulty, from bag-of-words models to fine-tuned Transformers, and in section 5 we show that the approaches perform similarly. In AES, human inter-rater reliability creates a ceiling for scoring model accuracy. While Transformers match state-of-the-art accuracy, they do so with significant tradeoffs; we show that this includes a slowdown in training time of up to 100x. Our data shows that these Transformer models improve on N-gram baselines by no more than 5%. Given this result, in section 6 we describe areas of contemporary research on Transformers that show both promising early results and a potential alignment to educational pedagogy beyond reliable scoring.

2 Background

In AES, student essays are scored either on a single holistic scale, or analytically following a rubric that breaks out subscores based on "traits." These scores are almost always integer-valued, and almost universally have fewer than 10 possible score points, though some research has used scales with as many as 60 points (Shermis, 2014). In most contexts, students respond to "prompts," a specific writing activity with predefined content. Work in natural language processing and speech evaluation has used advanced features like discourse coherence (Wang et al., 2013) and argument extraction (Nguyen and Litman, 2018); for proficient writers in professional settings, automated scaffolds like grammatical error detection and correction also exist (Ng et al., 2014).

Natural language processing has historically used n-gram bag-of-words features to predict labels for documents. These were the standard representation of text data for decades and are still in widespread use (Jurafsky and Martin, 2014). In the last decade, the field moved to word embeddings, where words are represented not as a single feature but as dense vectors learned from large unsupervised corpora. While early approaches to dense representations using latent semantic analysis have been a major part of the literature on AES (Foltz et al., 2000; Miller, 2003), these were corpus-specific representations. In contrast, recent work is general-purpose, resulting in off-the-shelf representations like GloVe (Pennington et al., 2014). This allows similar words to have approximately similar representations, effectively managing lexical sparsity.

But the greatest recent innovation has been contextual word embeddings, based on deep neural networks and in particular, Transformers. Rather than encoding a word's semantics as a static vector, these models adjust the representation of words based on their context in new documents. With multiple layers and sophisticated attention mechanisms (Bahdanau et al., 2015), these newer models have outperformed the state-of-the-art on numerous tasks, and are currently the most accurate models on a very wide range of tasks (Vaswani et al., 2017; Dai et al., 2019). The most popular architecture, BERT, produces a 768-dimensional final embedding based on a network with over 100 million total parameters in 12 layers; pre-trained models are available for open source use (Devlin et al., 2019).

For document classification, BERT is "fine-tuned" by adding a final layer at the end of the Transformer architecture, with one output neuron per class label. When learning from a new set of labeled training data, BERT evaluates the training set multiple times (each pass is called an epoch).

A loss function, propagating backward to the network, allows the model to learn relationships between the class labels in the new data and the contextual meaning of the words in the text. A learning rate determines the amount of change to a model's parameters. Extensive results have shown that careful control of the learning rate in a curriculum can produce an effective fine-tuning process (Smith, 2018). While remarkably effective, our community is only just beginning to identify exactly what is learned in this process; research in "BERT-ology" is ongoing (Kovaleva et al., 2019; Jawahar et al., 2019; Tenney et al., 2019).
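The mechanics of this fine-tuning loop can be illustrated with a short sketch. This is not the Fast.ai pipeline used later in Section 3.3; it assumes a recent version of the Hugging Face transformers library, and the essays, scores, and 4-point num_labels values are placeholders.

    # A minimal sketch of fine-tuning for classification: a head with one output
    # neuron per score point is added on top of a pretrained BERT encoder, and the
    # loss from labeled essays backpropagates through the whole network.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=4)  # e.g., a 4-point rubric (placeholder)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()

    def training_step(essays, scores):
        """One gradient update on a batch of essays and their integer scores."""
        batch = tokenizer(essays, truncation=True, max_length=512,
                          padding=True, return_tensors="pt")
        outputs = model(**batch, labels=torch.tensor(scores))
        outputs.loss.backward()   # cross-entropy over the possible score points
        optimizer.step()
        optimizer.zero_grad()
        return outputs.loss.item()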

These neural models are just starting to be used in machine learning for AES, especially as an intermediate representation for automated essay feedback (Fiacco et al., 2019; Nadeem et al., 2019). End-to-end neural AES models are in their infancy and have only seen exploratory studies like Rodriguez et al. (2019); to our knowledge, no commercial vendor yet uses Transformers as the representation for high-stakes automated scoring.

3 NLP for Automated Essay Scoring

To date, there are no best practices on fine-tuning Transformers for AES; in this section we present options. We begin with a classical baseline of traditional bag-of-words approaches and non-contextual word embeddings, used with Naïve Bayes and logistic regression classifiers, respectively. We then describe three curriculum learning options for fine-tuning BERT using AES data, based on broader best practices. We end with two approaches based on BERT but without fine-tuning, with reduced hardware requirements.

3.1 Bag-of-Words Representations

The simplest features for document classification tasks, "bag-of-words," extract surface N-grams of length 1-2 with "one-hot" binary values indicating presence or absence in a document. In prior AES results, this representation is surprisingly effective, and can be improved with simple extensions: N-grams based on part-of-speech tags (of length 2-3) to capture syntax independent of content, and character-level N-grams of length 3-4, to provide robustness to misspellings (Woods et al., 2017; Riordan et al., 2019). This high-dimensional representation typically has a cutoff threshold where rare tokens are excluded: in our implementation, we exclude N-grams without at least 5 occurrences in training data. Even after this reduction, this is a sparse feature space with thousands of dimensions. For learning with bag-of-words, we use a Naïve Bayes classifier with Laplace smoothing from Scikit-learn (Pedregosa et al., 2011), with part-of-speech tagging from SpaCy (Honnibal and Montani, 2017).
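A minimal sketch of this baseline, assuming Scikit-learn: binary word N-grams of length 1-2 and character N-grams of length 3-4 feed a Naïve Bayes classifier with Laplace smoothing. The part-of-speech N-grams from SpaCy are omitted for brevity, and min_df only approximates the five-occurrence cutoff (it counts documents rather than tokens).

    # Bag-of-words baseline sketch: binary N-gram features with a rarity cutoff,
    # classified by multinomial Naive Bayes with Laplace smoothing (alpha=1.0).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline, make_union
    from sklearn.naive_bayes import MultinomialNB

    features = make_union(
        CountVectorizer(ngram_range=(1, 2), min_df=5, binary=True),          # word 1-2 grams
        CountVectorizer(analyzer="char_wb", ngram_range=(3, 4), min_df=5,
                        binary=True),                                         # char 3-4 grams
    )
    scorer = make_pipeline(features, MultinomialNB(alpha=1.0))

    # scorer.fit(train_essays, train_scores)
    # predictions = scorer.predict(test_essays)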

3.2 Word Embeddings

A more modern representation of text uses word-level embeddings. This produces a vector, typically of up to 300 dimensions, representing each word in a document. In our implementation, we represent each document as the term-frequency-weighted mean of word-level embedding vectors from GloVe (Pennington et al., 2014). Unlike one-hot bag-of-words features, embeddings have dense real-valued features and Naïve Bayes models are inappropriate; we instead train a logistic regression classifier with the LibLinear solver (Fan et al., 2008) and L2 regularization from Scikit-learn.
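The sketch below illustrates this representation under the assumption that a plain-text GloVe file (e.g., glove.6B.300d.txt) is available locally; tokenization is simplified to whitespace splitting.

    # Embedding baseline sketch: each essay becomes the term-frequency-weighted mean
    # of its GloVe vectors, classified with L2-regularized logistic regression.
    from collections import Counter
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def load_glove(path):
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.array(values, dtype=np.float32)
        return vectors

    def embed(essay, vectors, dim=300):
        counts = Counter(tok for tok in essay.lower().split() if tok in vectors)
        if not counts:
            return np.zeros(dim, dtype=np.float32)
        weighted = sum(freq * vectors[tok] for tok, freq in counts.items())
        return weighted / sum(counts.values())   # term-frequency-weighted mean

    # glove = load_glove("glove.6B.300d.txt")
    # X_train = np.stack([embed(e, glove) for e in train_essays])
    # clf = LogisticRegression(penalty="l2", solver="liblinear").fit(X_train, train_scores)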

3.3 Fine-Tuning BERT

Moving to neural models, we fine-tune an uncased BERT model using the Fast.ai library. This library's visibility to first-time users of deep learning and accessible online learning materials (https://course.fast.ai/) mean their default choices are the most accessible route for practitioners.

Fast.ai recommends use of cyclical learning rate curricula for fine-tuning. In this policy, an upper and lower bound on learning rates are established. lr_max is a hyperparameter defining the maximum learning rate in one epoch of learning. In cyclical learning, the learning rate for fine-tuning begins at the lower bound, rises to the upper bound, then descends back to the lower bound. A high learning rate midway through training acts as regularization, allowing the model to avoid overfitting and avoiding local optima. Lower learning rates at the beginning and end of cycles allow for optimization within a local optimum, giving the model an opportunity to discover fine-grained new information again. In our work, we set lr_max = 0.00001. A lower bound is then derived from the upper bound, lr_min = 0.04 * lr_max; this again is default behavior in the Fast.ai library.
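The shape of one such cycle can be written down directly; the function below is only an illustration of the schedule (Fast.ai computes this internally), using the lr_max and lr_min values above.

    # Sketch of one triangular cycle of a cyclical learning rate policy: the rate
    # climbs linearly from lr_min to lr_max over the first half of the cycle,
    # then descends back over the second half.
    def cyclical_lr(step, steps_per_cycle, lr_max=1e-5, lr_min_ratio=0.04):
        lr_min = lr_min_ratio * lr_max
        position = (step % steps_per_cycle) / steps_per_cycle   # 0.0 -> 1.0 within a cycle
        if position < 0.5:
            fraction = position / 0.5            # rising half
        else:
            fraction = (1.0 - position) / 0.5    # falling half
        return lr_min + fraction * (lr_max - lr_min)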

Figure 1: Illustration of cyclical (top), two-period cyclical (middle, log y-scale), and 1-cycle (bottom) learning rate curricula over N epochs.

We assess three different curricula for cyclical learning rates, visualized in Figure 1. In the default approach, a maximum learning rate is set and cycles are repeated until reaching a threshold; for a halting criterion, we measure validation set accuracy. Because of noise in deep learning training, halting at any decrease can lead to premature stops; it is preferable to allow some occasional, small drop in performance. In our implementation we halt when accuracy on a validation set, measured in quadratic weighted kappa, decreases by over 0.01. In the second, "two-rate" approach (Smith, 2018), we follow this algorithm, but when we would halt, we instead backtrack by one epoch to a saved version of the network and restart training with a learning rate of 1 × 10⁻⁶ (one order of magnitude smaller). Finally, in the "1-cycle" policy, training is condensed into a single rise-and-fall pattern, spread over N epochs. Defining the exact training time N is a hyperparameter tuned on validation data. Finally, while BERT is optimized for sentence encoding, it is able to process documents up to 512 words long. In our data, we truncate a small number of essays longer than this maximum, mostly in ASAP dataset #2.
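The halting criterion for the default curriculum can be sketched as follows; train_one_cycle and validation_qwk are hypothetical callables standing in for one learning rate cycle and a validation QWK evaluation, respectively.

    # Sketch of the default curriculum's stopping rule: keep running cycles and
    # halt once validation QWK drops by more than the tolerance (0.01 here),
    # which allows small, noisy dips without stopping prematurely.
    def fine_tune_with_halting(train_one_cycle, validation_qwk, tolerance=0.01):
        best_qwk = float("-inf")
        history = []
        while True:
            train_one_cycle()                 # one full rise-and-fall learning rate cycle
            qwk = validation_qwk()            # quadratic weighted kappa on held-out essays
            history.append(qwk)
            if qwk < best_qwk - tolerance:    # halt only on a drop larger than the tolerance
                break
            best_qwk = max(best_qwk, qwk)
        return history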

3.4 Feature Extraction from BERT

Fine-tuning is computationally expensive and can only run on GPU-enabled devices. Many practitioners in low-resource settings may not have access to appropriate cloud computing environments for these techniques. Previous work has described a compromise approach for using Transformer models without fine-tuning. In Peters et al. (2019), the authors describe a new pipeline. Document texts are processed with an untuned BERT model; the final activations from the network on the [CLS] token are then used directly as contextual word embeddings. This 768-dimensional feature vector represents the full document, and is used as input for a linear classifier. In the education context, a similar approach was described in Nadeem et al. (2019) as a baseline for evaluation of language-learner essays. This process allows us to use the world knowledge embedded in BERT without requiring fine-tuning of the model itself, and without need for GPUs at training or prediction time. For our work, we train a logistic regression classifier as described in Section 3.2.
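A minimal sketch of this compromise pipeline, assuming the Hugging Face transformers library: essays pass through an untuned BERT model, the final-layer activation at the [CLS] position is kept as a 768-dimensional document vector, and a logistic regression classifier is trained on top.

    # Feature extraction without fine-tuning: no gradient updates to BERT; only the
    # [CLS] activation from the final layer is used as a fixed document vector.
    import torch
    from transformers import BertTokenizer, BertModel
    from sklearn.linear_model import LogisticRegression

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()

    def cls_features(essays):
        vectors = []
        with torch.no_grad():
            for essay in essays:
                batch = tokenizer(essay, truncation=True, max_length=512,
                                  return_tensors="pt")
                hidden = bert(**batch).last_hidden_state        # (1, seq_len, 768)
                vectors.append(hidden[:, 0, :].squeeze(0).numpy())  # [CLS] is position 0
        return vectors

    # clf = LogisticRegression(solver="liblinear").fit(cls_features(train_essays), train_scores)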

3.5 DistilBERT

Recent work has highlighted the extreme carbon costs of full Transformer fine-tuning (Strubell et al., 2019) and the desire for Transformer-based prediction on-device without access to cloud compute. In response to these concerns, Sanh et al. (2019) introduce DistilBERT, which they argue is equivalent to BERT in most practical aspects while reducing parameter size by 40% to 66 million, and decreasing model inference time by 60%. This is accomplished using a distillation method (Hinton et al., 2015) in which a new, smaller "student" network is trained to reproduce the behavior of a pretrained "teacher" network. Once the smaller model is pretrained, interacting with it for the purposes of fine-tuning is identical to interacting with BERT directly. DistilBERT is intended for use cases where compute resources are a constraint, sacrificing a small amount of accuracy for a drastic shrinking of network size. Because of this intended use case, we only present results for DistilBERT with the "1-cycle" learning rate policy, which is drastically faster to fine-tune.

4 Experiments

To test the overall impact of fine-tuning in the AES domain, we use five datasets from the ASAP competition, jointly hosted by the Hewlett Foundation and Kaggle.com (Shermis, 2014). This set of essay prompts was the subject of intense public attention and scrutiny in 2012 and its public release has shaped the discourse on AES ever since. For our experiments, we use the original, deanonymized data from Shermis and Hamner (2012); an anonymized version of these datasets is available online (https://www.kaggle.com/c/asap-aes). In all cases, human inter-rater reliability (IRR) is an approximate upper bound on performance, but reliability above human IRR is possible, as all models are trained on resolved scores that represent two scores plus a resolution process for disagreements between annotators.

We focus our analysis on the five datasets that most closely resemble standard AES rubrics, discarding three datasets - prompts #1, 7, and 8 - with a scale of 10 or more possible points. Results on these datasets are not representative of overall performance and can skew reported results due to rubric idiosyncrasies, making comparison to other published work impossible (see for example Alikaniotis et al. (2016), which groups 60-point and 4-point rubrics into a single dataset and therefore produced correlations that cannot be aligned to results from any other published work). Prompts 2-6 are scored on smaller rubric scales with 4-6 points, and are thus generalizable to more AES contexts. Note that nevertheless, each dataset has its own idiosyncrasies; for instance, essays in dataset #5 were written by younger students in 7th and 8th grade, while dataset #4 contains writing from high school seniors; datasets #3 and 4 were responses to specific texts while others were open-ended; and dataset #2 was actually scored on two separate traits, the second of which is often discarded in followup work (as it is here). Our work here does not specifically isolate effects of these differences that would lead to discrepancies in performance or in modeling behavior.

4.1 Metrics and Baselines

For measuring reliability of automated assessments, we use a variant of Cohen's κ, with quadratic weights for "near-miss" predictions on an ordinal scale (QWK). This metric is standard in the AES community (Williamson et al., 2012). High-stakes testing organizations differ on exact cutoffs for acceptable performance, but threshold values between 0.6 and 0.8 QWK are typically used as a floor for testing purposes; human reliability below this threshold is generally not fit for summative student assessment.
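QWK can be computed directly as Cohen's kappa with quadratic weights, for example with Scikit-learn:

    # Quadratic weighted kappa via scikit-learn: near-misses on the ordinal scale
    # are penalized less than distant misses.
    from sklearn.metrics import cohen_kappa_score

    def qwk(human_scores, predicted_scores):
        return cohen_kappa_score(human_scores, predicted_scores, weights="quadratic")

    # Example: perfect agreement yields 1.0.
    # qwk([1, 2, 3, 4], [1, 2, 3, 4])  -> 1.0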



In addition to measuring reliability, we also measure training and prediction time, in seconds. As this work seeks to evaluate the practical tradeoffs of the move to deep neural methods, this is an important secondary metric. For all experiments, training was performed on Google Colab Pro cloud servers with 32 GB of RAM and an NVIDIA Tesla P100 GPU.

We compare the results of BERT against several previously published benchmarks and results.

• Human IRR as initially reported in the Hewlett Foundation study (Shermis, 2014).

• Industry best performance, as reported by eight commercial vendors and one open-source research team in the initial release of the ASAP study (Shermis, 2014).

• An early deep learning approach using a combination CNN+LSTM architecture that outperformed most reported results at that time (Taghipour and Ng, 2016).

• Two recent results using traditional non-neural models: Woods et al. (2017), which uses n-gram features in an ordinal logistic regression, and Cozma et al. (2018), which uses a mix of string kernels and word2vec embeddings in a support vector regression.

• Rodriguez et al. (2019), the one previously-published work that attempts AES with a variety of pretrained neural models, including BERT and the similar XLNet (Yang et al., 2019), with numerous alternate configurations and training methods. We report their result with a baseline BERT fine-tuning process, as well as their best-tuned model after extensive optimization.

4.2 Experimental Setup

Following past publications, we train a separate model on each dataset, and evaluate all dataset-specific models using 5-fold cross-validation. Each of the five datasets contains approximately 1,800 essays, resulting in folds of 360 essays each. Additionally, for measuring loss when fine-tuning BERT, we hold out an additional 20% of each training fold as a validation set, meaning that each fold has approximately 1,150 essays used for training and 300 essays used for validation. We report mean QWK across the five folds. For measurement of training and prediction time, we report the sum of training time across all five folds and all datasets. For slow-running feature extraction, like N-gram part-of-speech features and word embedding-based features, we tag each sentence in the dataset only once and cache the results, rather than re-tagging each sentence on each fold. Finally, for models where distinguishing extraction from training time is meaningful, we present those times separately.
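A sketch of this protocol, assuming Scikit-learn for the fold splits; build_and_score_fold is a hypothetical callable that trains one model, monitors the validation split, and returns QWK on the held-out test fold.

    # Per-dataset evaluation sketch: 5-fold cross-validation with a further 20%
    # validation holdout inside each training fold, reporting mean QWK across folds.
    import numpy as np
    from sklearn.model_selection import KFold, train_test_split

    def evaluate(essays, scores, build_and_score_fold, seed=0):
        essays, scores = np.array(essays), np.array(scores)
        fold_qwks = []
        folds = KFold(n_splits=5, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(essays):
            tr_x, val_x, tr_y, val_y = train_test_split(
                essays[train_idx], scores[train_idx], test_size=0.2, random_state=seed)
            fold_qwks.append(build_and_score_fold(tr_x, tr_y, val_x, val_y,
                                                  essays[test_idx], scores[test_idx]))
        return float(np.mean(fold_qwks))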

5 Results

5.1 Accuracy Evaluation

Our primary results are presented in Table 1. We find, broadly, that all approaches to machine learning replicate human-level IRR as measured by QWK. Nearly eight years after the publication of the original study, no published results have exceeded vendor performance on three of the five prompt datasets; in all cases, a naive N-gram approach underperforms the state-of-the-art in industry and academia by 0.03-0.06 QWK.

Of particular note is the low performance of GloVe embeddings relative to either neural or N-gram representations. This is surprising: while word embeddings are less popular now than deep neural methods, they still perform well on a wide range of tasks (Baroni et al., 2014). Few publications have noted this negative result for GloVe in the AES domain; only Dong et al. (2017) uses GloVe as the primary representation of ASAP texts in an LSTM model, reporting lower QWK results than any baseline we presented here. One simple explanation for this may be that individual keywords matter a great deal for model performance. It is well established that vocabulary-based approaches are effective in AES tasks (Higgins et al., 2014) and the lack of access to specific word-based features may hinder semantic vector representation. Indeed, only one competitive recent paper on AES uses non-contextual word vectors: Cozma et al. (2018). In this implementation, they do use word2vec, but rather than use word embeddings directly they first cluster words into a set of 500 "embedding clusters." Words that appear in texts are then counted in the feature vector as the centroid of that cluster - in effect, creating a 500-dimensional bag-of-words model.

Table 1: Performance on each of ASAP datasets 2-6, in QWK. The final row shows the gap in QWK between the best neural model and the N-gram baseline.

Model              2     3     4     5     6
Human IRR         .80   .77   .85   .74   .74
Hewlett           .74   .75   .82   .83   .78
Taghipour         .69   .69   .81   .81   .82
Woods             .71   .71   .81   .82   .83
Cozma             .73   .68   .83   .83   .83
Rodriguez (BERT)  .68   .72   .80   .81   .81
Rodriguez (best)  .70   .72   .82   .82   .82
N-Grams           .71   .71   .78   .80   .79
Embeddings        .42   .41   .60   .49   .36
BERT-CLR          .66   .70   .80   .80   .79
BERT-1CYC         .64   .71   .82   .81   .79
BERT Features     .61   .59   .75   .75   .74
DistilBERT        .65   .70   .82   .81   .79
N-Gram Gap       -.05   .00   .04   .01   .00

Our results would suggest that fine-tuning with BERT also reaches approximately the same level of performance as other methods, slightly underperforming previous published results. This is likely an underestimate, due to the complexity of hyperparameter optimization and curriculum learning for Transformers. Rodriguez et al. (2019) demonstrate that it is, in fact, possible to improve the performance of neural models to more closely approach (but not exceed) the state-of-the-art using neural models. Sophisticated approaches like gradual unfreezing, discriminative fine-tuning, or greater parameterization through newer deep learning models in their work consistently produce improvements of 0.01-0.02 QWK compared to the default BERT implementation. But this result emphasizes our concern: we do not claim our results are the best that could be achieved with BERT fine-tuning. We are, in fact, confident that they can be improved through optimization. What the results demonstrate instead is that the ceiling of results for AES tasks lessens the value of that intensive optimization effort.

5.2 Runtime Evaluation

Table 2: Cumulative experiment runtime, in seconds, of feature extraction (F), model training (T), and predicting on test sets (P), for ASAP datasets 2-6 with 5-fold cross-validation. Models with 1-cycle fine-tuning are measured at 5 epochs.

Model           F      T       P     Total
Embeddings      93     6       1     100
N-Grams         82     27      2     111
BERT Features   213    10      1     224
DistilBERT      -      1,972   108   2,080
BERT-1CYC       -      2,956   192   3,148
BERT-CLR        -      11,309  210   11,519

Our secondary evaluation of models is based on training time and resource usage; those results are reported in Table 2. Here, we see that deep learning approaches on GPU-enabled cloud compute produce an approximately 30-100 fold increase in end-to-end training time compared to a naive approach. In fact, this understates the gap, as approximately 75% of feature extraction and model training time in the naive approach is due to part-of-speech tagging rather than learning. Using BERT features as inputs to a linear classifier is an interesting compromise option, producing slightly lower performance on these datasets but with only a 2x slowdown at training time, all in feature extraction, and potentially retaining some of the semantic knowledge of the full BERT model. Further investigation should test whether additional features for intermediate layers, as explored in Peters et al. (2019), is merited for AES.

We can look at this gap in training runtime more closely in Figure 2. Essays in the prompt 2 dataset are longer persuasive essays and are on average 378 words long, while datasets 3-6 correspond to shorter, source-based content knowledge prompts and are on average 98-152 words long. The need for truncation in dataset #2 for BERT, but not for other approaches, may explain the underperformance of the model in that dataset. Additionally, differences across datasets highlight two key differences for fine-tuning a BERT model:

• Training time increases linearly with number of epochs and with average document length. As seen in Figure 2, this leads to a longer training for the longer essays of dataset 2, nearly as long as the other datasets combined.

• Performance converges on human inter-rater reliability more quickly for short content-based prompts, and performance begins to decrease due to overfitting in as few as 4 epochs. By comparison, in the longer, persuasive arguments of dataset 2, very small performance gains on held-out data continued even at the end of our experiments.

Figure 2: QWK (top) and training time (bottom, in seconds) for 5-fold cross-validation of 1-cycle neural fine-tuning on ASAP datasets 2-6, for BERT (left) and DistilBERT (right).

Figure 2 also presents results for DistilBERT. Our work verifies prior published claims of speed improvements both in fine-tuning and at prediction time, relative to the baseline BERT model: training time was reduced by 33% and prediction time was reduced by 44%. This still represents at least a 20x increase in runtime relative to N-gram baselines both for training and prediction.

6 Discussion

For scoring essays with reliably scored, prompt-specific training sets, both classical and neural approaches produce similar reliability, at approximately identical levels to human inter-rater reliability. There is a substantial increase in technical overhead required to implement Transformers and fine-tune them to reach this performance, with minimal gain compared to baselines. The policy lesson for NLP researchers is that using deep learning for scoring alone is unlikely to be justifiable, given the slowdowns at both training and inference time, and the additional hardware requirements. For scoring, at least, Transformer architectures are a hammer in search of a nail.

But it's hardly the case that automated writing evaluation is limited to scoring. In this section we cover major open topics for technical researchers in AES to explore, focusing on areas where neural models have proven strengths above baselines in other domains. We prioritize three major areas: domain transfer, style, and fairness. In each we cite specific past work that indicates a plausible path forward for research.

6.1 Domain Transfer

A major challenge in AES is the inability of prompt-specific models to generalize to new essay topics (Attali et al., 2010; Lee, 2016). Collection of new prompt-specific training sets, with reliable scores, continues to be one of the major stumbling blocks to expansion of AES systems in curricula (Woods et al., 2017). Relatively few researchers have made progress on generic essay scoring: Phandi et al. (2015) introduces a Bayesian regression approach that extracts N-gram features then capitalizes on correlated features across prompts. Jin et al. (2018) shows promising prompt-independent results using an LSTM architecture with surface and part-of-speech N-gram inputs, underperforming prompt-specific models by relatively small margins across all ASAP datasets. But in implementations, much of the work of practitioners is based on workarounds for prompt-specific models; Wilson et al. (2019), for instance, describes psychometric techniques for measuring generic writing ability across a small sample of known prompts.

While Transformers are sensitive to the data they were pretrained on, they are well-suited to transfer tasks in mostly unseen domains, as evidenced by part-of-speech tagging for historical texts (Han and Eisenstein, 2019), sentiment classification on out-of-domain reviews (Myagmar et al., 2019), and question answering in new contexts (Houlsby et al., 2019). This last result is promising for content-based short essay prompts, in particular. Our field's open challenge in scoring is to train AES models that can meaningfully evaluate short response texts for correctness based on world knowledge and domain transfer, rather than memorizing the vocabulary of correct, in-domain answers. Promising early results show that relevant world knowledge is already embedded in BERT's pretrained model (Petroni et al., 2019). This means that BERT opens up a potentially tractable path to success that was simply not possible with N-gram models.

6.2 Style and Voice

Skepticism toward AES in the classroom comes from rhetoric and composition scholars, who express concerns about its role in writing pedagogy (NCTE, 2013; Warner, 2018). Indeed, the relatively "solved" nature of summative scoring that we highlight here is of particular concern to these experts, noting the high correlation between scores and features like word count (Perelman, 2014).

Modern classroom use of AES beyond high-stakes scoring, like Project Essay Grade (Wilson and Roscoe, 2019) or Turnitin Revision Assistant (Mayfield and Butler, 2018), makes claims of supporting student agency and growth; here, adapting to writer individuality is a major current gap. Dixon-Roman et al. (2019) raises a host of questions about these topics specifically in the context of AES, asking how algorithmic intervention can produce strong writers rather than merely good essays: "revision, as adjudicated by the platform, is [...] a re-direction toward the predetermined shape of the ideal written form [...] a puzzle-doer recursively consulting the image on the puzzle-box, not that of author returning to their words to make them more lucid, descriptive, or forceful."

This critique is valid: research on machine translation, for instance, has shown that writer style is not preserved across languages (Rabinovich et al., 2017). There is uncharted territory for AES to adapt to individual writer styles and give feedback based on individual writing rather than prompt-specific exemplars. Natural language understanding researchers now argue that "...style is formed by a complex combination of different stylistic factors" (Kang and Hovy, 2019); style-specific natural language generation has shown promise in other domains (Hu et al., 2017; Prabhumoye et al., 2018) and has been extended not just to individual preferences but also to overlapping identities based on attitudes like sentiment and personal attributes like gender (Subramanian et al.). Early work suggests that style-specific models do see major improvements when shifting to high-dimensionality Transformer architectures (Keskar et al., 2019). This topic bridges an important gap: for assessment, research has shown that "authorial voice" has measurable outcomes on writing impact (Matsuda and Tardy, 2007), while individual expression is central to decades of pedagogy (Elbow, 1987). Moving the field toward individual expression and away from prompt-specific datasets may be a path to lending legitimacy to AES, and Transformers may be the technical leap necessary to make these tasks work.

6.3 Fairness

Years ago, researchers suggested that demographic bias is worth checking in AES systems (Williamson et al., 2012). But years later, the field has primarily reported fairness experiments on simulated data, and shared toolkits for measuring bias, rather than results on real-world AES implementations or high-stakes data (Madnani et al., 2017; Loukina et al., 2019).

Prompted by social scientists (Noble, 2018), NLP researchers have seen a renaissance of fairness research based on the flaws in default implementations of Transformers (Bolukbasi et al., 2016; Zhao et al., 2017, 2018). These works typically seek to reduce the amplification of bias in pretrained models, starting with easy-to-measure proof that demographic bias can be "removed" from word embedding spaces. But iterating on inputs to algorithmic classifiers – precisely the intended use case of formative feedback for writers! – can reduce the efficacy of "de-biasing" (Liu et al., 2018; Dwork and Ilvento, 2019). More recent research has shown that bias may simply be masked by these approaches, rather than resolved (Gonen and Goldberg, 2019).

What these questions offer, though, is a wellspring of new and innovative technical research. Developers of learning analytics software, including AES, are currently encouraged to focus on scalable experimental evidence of efficacy for learning outcomes (Saxberg, 2017), rather than focus on specific racial or gender bias, or other equity outcomes that are more difficult to achieve through engineering. But Transformer architectures are nuanced enough to capture immense world knowledge, producing a rapid increase in explainability in NLP (Rogers et al., 2020).

Meanwhile, in the field of learning analytics, a burgeoning new field of fairness studies is learning how to investigate these issues in algorithmic educational systems (Mayfield et al., 2019; Holstein and Doroudi, 2019). Outside of technology applications but in writing assessment more broadly, fairness is also a rich topic with a history of literature to learn from (Poe and Elliot, 2019). Researchers at the intersection of both these fields have an enormous open opportunity to better understand AES in the context of fairness, using the latest tools not just to build reliable scoring but to advance social change.


References

Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic text scoring using neural networks. In Proceedings of the Association for Computational Linguistics, pages 715–725.

Yigal Attali. 2013. Validity and reliability of automated essay scoring. Handbook of Automated Essay Evaluation: Current Applications and New Directions, page 181.

Yigal Attali, Brent Bridgeman, and Catherine Trapani. 2010. Performance of a generic approach in automated essay scoring. The Journal of Technology, Learning and Assessment, 10(3).

Yigal Attali and Jill Burstein. 2004. Automated essay scoring with e-rater® v. 2.0. ETS Research Report Series, (2).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the Association for Computational Linguistics, pages 238–247.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.

Jing Chen, James H Fife, Isaac I Bejar, and Andre A Rupp. 2016. Building e-rater® scoring models using machine learning methods. ETS Research Report Series, 2016(1):1–12.

Madalina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL HLT Conference.

Ezekiel Dixon-Roman, T. Philip Nichols, and Ama Nyame-Mensah. 2019. The racializing forces of/in AI educational technologies. Learning, Media & Technology Special Issue on AI and Education: Critical Perspectives and Alternative Futures.

Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the Conference on Computational Natural Language Learning, pages 153–162.

Cynthia Dwork and Christina Ilvento. 2019. Fairness under composition. In Innovations in Theoretical Computer Science Conference. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Peter Elbow. 1987. Closing my eyes as I speak: An argument for ignoring audience. College English, 49(1):50–69.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.

James Fiacco, Elena Cotos, and Carolyn Rose. 2019. Towards enabling feedback on rhetorical structure with neural sequence models. In Proceedings of the International Conference on Learning Analytics & Knowledge, pages 310–319. ACM.

Peter W Foltz, Sara Gilliam, and Scott Kendall. 2000. Supporting content-based feedback in on-line writing evaluation with LSA. Interactive Learning Environments, 8(2):111–127.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. Proceedings of the Association for Computational Linguistics.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings: A case study in Early Modern English. arXiv preprint arXiv:1904.02817.

Derrick Higgins, Chris Brew, Michael Heilman, Ramon Ziai, Lei Chen, Aoife Cahill, Michael Flor, Nitin Madnani, Joel Tetreault, Daniel Blanchard, et al. 2014. Is getting the right answer just about choosing the right words? The role of syntactically-informed features in short answer scoring. arXiv preprint arXiv:1403.0801.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. stat, 1050:9.

Kenneth Holstein and Shayan Doroudi. 2019. Fairness and equity in learning analytics systems (FairLAK). In Companion Proceedings of the International Learning Analytics & Knowledge Conference (LAK 2019).


Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, pages 2790–2799.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning, pages 1587–1596.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the Association for Computational Linguistics, pages 3651–3657.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the Association for Computational Linguistics, pages 1088–1097.

Dan Jurafsky and James H Martin. 2014. Speech and Language Processing, volume 3. Pearson London.

Dongyeop Kang and Eduard Hovy. 2019. xSLUE: A benchmark and analysis platform for cross-style language understanding and evaluation. arXiv preprint arXiv:1911.03663.

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of Empirical Methods in Natural Language Processing, volume 1, pages 2465–2475.

Yong-Won Lee. 2016. Investigating the feasibility of generic scoring models of e-rater for TOEFL iBT independent writing tasks. English Language Teaching, 71.

Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. 2018. Delayed impact of fair machine learning. In Proceedings of the International Conference on Machine Learning.

Anastassia Loukina, Nitin Madnani, and Klaus Zechner. 2019. The many dimensions of algorithmic fairness in educational applications. In Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications.

Nitin Madnani, Anastassia Loukina, Alina von Davier, Jill Burstein, and Aoife Cahill. 2017. Building better open-source tools to support fairness in automated scoring. In Proceedings of the ACL Workshop on Ethics in Natural Language Processing, pages 41–52.

Paul Kei Matsuda and Christine M Tardy. 2007. Voice in academic writing: The rhetorical construction of author identity in blind manuscript review. English for Specific Purposes, 26(2):235–249.

Elijah Mayfield and Stephanie Butler. 2018. Districtwide implementations outperform isolated use of automated feedback in high school writing. In Proceedings of the International Conference of the Learning Sciences (Industry and Commercial Track).

Elijah Mayfield, Michael Madaio, Shrimai Prabhumoye, David Gerritsen, Brittany McLaughlin, Ezekiel Dixon-Roman, and Alan W Black. 2019. Equity beyond bias in language technologies for education. In Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications.

Tristan Miller. 2003. Essay assessment with latent semantic analysis. Journal of Educational Computing Research, 29(4):495–512.

Batsergelen Myagmar, Jie Li, and Shigetomo Kimura. 2019. Transferable high-level representations of BERT for cross-domain sentiment classification. In Proceedings on the International Conference on Artificial Intelligence (ICAI), pages 135–141.

Farah Nadeem, Huy Nguyen, Yang Liu, and Mari Ostendorf. 2019. Automated essay scoring with discourse-aware neural models. In Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 484–493.

NCTE. 2013. Position statement on machine scoring. National Council of Teachers of English. Accessed 2019-09-24.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Conference on Computational Natural Language Learning: Shared Task, pages 1–14.

Huy V Nguyen and Diane J Litman. 2018. Argument mining for improving the automated scoring of persuasive essays. In Proceedings of the AAAI Conference on Artificial Intelligence.

Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, pages 1532–1543.

Les Perelman. 2014. When "the state of the art" is counting words. Assessing Writing, 21:104–111.

Matthew E Peters, Sebastian Ruder, and Noah A Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of Empirical Methods in Natural Language Processing, pages 2463–2473.

Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of Empirical Methods in Natural Language Processing, pages 431–439.

Mya Poe and Norbert Elliot. 2019. Evidence of fairness: Twenty-five years of research in assessing writing. Assessing Writing, 42:100418.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In Proceedings of the Association for Computational Linguistics, pages 866–876.

Ella Rabinovich, Raj Nath Patel, Shachar Mirkin, Lucia Specia, and Shuly Wintner. 2017. Personalized machine translation: Preserving original author traits. In Proceedings of the European Chapter of the Association for Computational Linguistics, pages 1074–1084.

Y Malini Reddy and Heidi Andrade. 2010. A review of rubric use in higher education. Assessment & Evaluation in Higher Education, 35(4):435–448.

Brian Riordan, Michael Flor, and Robert Pugh. 2019. How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models. In Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 116–126.

Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. 2019. Language models and automated essay scoring. arXiv preprint arXiv:1909.09482.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Bror Saxberg. 2017. Learning engineering: The art of applying learning science at scale. In Proceedings of the ACM Conference on Learning@Scale. ACM.

Mark D Shermis. 2014. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20:53–76.

Mark D Shermis and Ben Hamner. 2012. Contrasting state-of-the-art automated scoring of essays: Analysis. In Annual National Council on Measurement in Education Meeting, pages 14–16.

Leslie N Smith. 2018. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. Proceedings of the Association for Computational Linguistics.

Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. Multiple-attribute text style transfer. Age, 18(24):65.

Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of Empirical Methods in Natural Language Processing, pages 1882–1891.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the Association for Computational Linguistics.

Sowmya Vajjala. 2018. Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28(1):79–105.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Xinhao Wang, Keelan Evanini, and Klaus Zechner. 2013. Coherence modeling for the automated assessment of spontaneous spoken responses. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 814–819.

John Warner. 2018. Why They Can't Write: Killing the Five-Paragraph Essay and Other Necessities. JHU Press.


David M Williamson, Xiaoming Xi, and F Jay Breyer. 2012. A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1):2–13.

Joshua Wilson, Dandan Chen, Micheal P Sandbank, and Michael Hebert. 2019. Generalizability of automated scores of writing quality in grades 3–5. Journal of Educational Psychology, 111(4):619.

Joshua Wilson and Rod D Roscoe. 2019. Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research.

Bronwyn Woods, David Adamson, Shayne Miel, and Elijah Mayfield. 2017. Formative essay feedback using predictive scoring models. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, pages 2071–2080. ACM.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Helen Yannakoudakis and Ted Briscoe. 2012. Modeling coherence in ESOL learner texts. In Proceedings of the ACL Workshop on Building Educational Applications Using NLP, pages 33–43. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of Empirical Methods in Natural Language Processing, pages 2979–2989.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proceedings of Empirical Methods in Natural Language Processing, pages 4847–4853.