Tuning a Grammar Correction System for Increased Precision

Anoop Kunchukuttan∗, Sriram Chaudhury†, Pushpak Bhattacharyya∗

∗ Department of Computer Science and Engineering, IIT Bombay, India

{anoopk,pb}@cse.iitb.ac.in

† Crimson Interactive Pvt. Limited, Mumbai, [email protected]

Abstract

In this paper, we propose two enhancements to a statistical machine translation based approach to grammar correction for correcting all error categories. First, we propose tuning the SMT systems to optimize a metric more suited to the grammar correction task (F-β score) rather than the traditional BLEU metric used for tuning language translation tasks. Since the F-β score, with β < 1, favours higher precision, tuning to this score can potentially improve precision. While the results do not indicate improvement due to tuning with the new metric, we believe this could be due to the small number of grammatical errors in the tuning corpus, and further investigation is required to answer the question conclusively. We also explore the combination of custom-engineered grammar correction techniques, which are targeted to specific error categories, with the SMT based method. Our simple ensemble methods yield improvements in recall but decrease the precision. Tuning the custom-built techniques can also help in increasing the overall accuracy.

1 Introduction

Grammatical Error Correction (GEC) is an interesting and challenging problem and the existing methods that attempt to solve this problem take recourse to deep linguistic and statistical analysis. In general, GEC may partly assist in solving natural language processing (NLP) tasks like Machine Translation, Natural Language Generation etc. However, a more evident application of GEC is in building automated grammar checkers, thereby helping non-native speakers of a language. The goal is to have automated tools to help non-native speakers to generate good content by correcting grammatical errors made by them.

The CoNLL-2013 Shared Task (Ng et al., 2013) was focussed towards correcting some of the most frequent categories of grammatical errors. In contrast, the CoNLL-2014 Shared Task (Ng et al., 2014) set the goal of correcting all grammatical errors in the text. For correcting specific error categories, custom methods are generally developed, which exploit deep knowledge of the problem to perform the correction (Han et al., 2006; Kunchukuttan et al., 2013; De Felice and Pulman, 2008). These methods are generally the state-of-the-art for the concerned error categories, but a lot of engineering and research effort is required for correcting each error category. So, the custom development approach is infeasible for correcting a large number of error categories.

Hence, for correction of all the error categories, generic methods have been investigated, generally using language models or statistical machine translation (SMT) systems. The language model based method (Lee and Seneff, 2006; Kao et al., 2013) scores sentences based on a language model or count ratios of n-grams obtained from a large native text corpus. But this method still needs a candidate generation mechanism for each error category. On the other hand, the SMT based method (Brockett et al., 2006) formulates the grammar correction problem as a problem of translation of incorrect sentences to correct sentences. SMT provides a natural unsupervised method for identifying candidate corrections in the form of the translation model, and a method for scoring them with a variety of measures including the language model score. However, the SMT method requires a lot of parallel non-native learner corpora. In addition, the machinery in phrase based SMT is optimized towards solving the language translation problem. Therefore, the community has explored approaches to adapt the SMT method for grammar correction (Buys and van der Merwe, 2013; Yuan and Felice, 2013). These include use of factored SMT, syntax based SMT, pruning of the phrase table, disabling reordering, etc. The generic SMT approach has performed badly as compared to the specific custom made approaches (Yuan and Felice, 2013).

Our system also builds upon the SMT methods and tries to address the above mentioned lacunae in two ways:

• Tuning the SMT model to a metric suitable for grammar correction (i.e., the F-β metric), instead of the BLEU metric.

• Combination of custom-engineered methods and SMT based methods, by using classifier based methods for some error categories.

Section 2 describes our method for tuning the SMT system to optimize the F-β metric. Section 3 explains the combination of the classifier based method with the SMT method. Section 4 lists our experimental setup. Section 5 analyzes the results of our experiments.

2 Tuning SMT system for F-β score

We model our grammar correction system as a phrase based SMT system which translates grammatically incorrect sentences to grammatically correct sentences. The phrase based SMT system selects the best translation for a source sentence by searching for a candidate translation which maximizes the score defined by the maximum entropy model for phrase based SMT defined below:

$$P(e, a \mid f) = \exp \sum_i \lambda_i h_i(e, a, f)$$

where,
$h_i$: feature function for the $i$-th feature. These are generally features like the phrase/lexical translation probability, language model score, etc.
$\lambda_i$: the weight parameter for the $i$-th feature.
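As an illustration of the scoring rule above, the following Python sketch scores two candidate corrections with a weighted sum of feature values and picks the higher-scoring one. It assumes that f denotes the source sentence, e a candidate output and a its alignment, as is usual in phrase based SMT. The feature names `tm` and `lm`, their values and the weights are invented for illustration; a real decoder such as Moses searches a far larger hypothesis space and uses many more features.

```python
import math

def loglinear_score(features, weights):
    """Score exp(sum_i lambda_i * h_i) for one candidate."""
    return math.exp(sum(weights[name] * value for name, value in features.items()))

def best_candidate(candidates, weights):
    """Return the text of the candidate with the highest model score."""
    best = max(candidates, key=lambda c: loglinear_score(c["features"], weights))
    return best["text"]

# Hypothetical log-probability features for two candidate corrections:
# "tm" stands in for a translation model score, "lm" for a language model score.
candidates = [
    {"text": "He go to school .", "features": {"tm": -0.2, "lm": -9.0}},
    {"text": "He goes to school .", "features": {"tm": -1.0, "lm": -6.5}},
]

weights = {"tm": 1.0, "lm": 0.5}
print(best_candidate(candidates, weights))
```

With these toy numbers the choice flips if the language model weight is lowered far enough, which is why the values of the weights matter and have to be learnt.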

The weight parameters ($\lambda_i$) define the relative weights given to each feature. These parameter weights are learnt during a process referred to as tuning. During tuning, a search over the parameter space is done to identify the parameter values which maximize a measure of translation quality over a held-out dataset (referred to as the tuning set). One of the most widely used metrics for tuning is the BLEU score (Papineni et al., 2002), tuned using the Minimum Error Rate Training (MERT) algorithm (Och, 2003). Since BLEU is a form of weighted precision, along with a brevity penalty to factor in recall, it is suitable in the language translation scenario, where fidelity of the translation is an important factor in the evaluation of the translation. Tuning to BLEU ensures that the parameter weights are set such that the fidelity of translations is high.
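The idea of tuning can be sketched in a few lines of Python. The sketch below is not MERT: it replaces Och's line search with an exhaustive search over a small grid of weights, and uses a toy exact-match metric in place of BLEU. The n-best lists, feature values and grid are all invented for illustration. The point it shows is only that the metric is a pluggable function, so a different metric can be substituted without changing the search.

```python
import itertools

def decode(nbest, weights):
    """Pick the highest-scoring candidate from one n-best list."""
    return max(nbest, key=lambda c: sum(w * h for w, h in zip(weights, c["h"])))

def tune(nbest_lists, references, metric, grid):
    """Try every weight vector built from `grid` and keep the one whose
    1-best outputs score highest under `metric` on the tuning set."""
    n_features = len(nbest_lists[0][0]["h"])
    best_weights, best_score = None, float("-inf")
    for weights in itertools.product(grid, repeat=n_features):
        outputs = [decode(nbest, weights)["text"] for nbest in nbest_lists]
        score = metric(outputs, references)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

def exact_match(outputs, references):
    """Toy stand-in metric: fraction of outputs identical to the reference."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

# Hypothetical tuning set: two source sentences, two candidates each,
# with feature vectors (translation model score, language model score).
nbest_lists = [
    [{"text": "He go to school .", "h": (-0.2, -9.0)},
     {"text": "He goes to school .", "h": (-1.0, -6.5)}],
    [{"text": "She like apples .", "h": (-0.3, -8.0)},
     {"text": "She likes apples .", "h": (-0.9, -6.0)}],
]
references = ["He goes to school .", "She likes apples ."]

weights, score = tune(nbest_lists, references, exact_match, grid=(0.0, 0.5, 1.0))
print(weights, score)
```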

However, ensuring fidelity is not the major challenge in grammar correction since the meaning of most input sentences is clear and most don't have any grammatical errors. The metric to be tuned must ensure that weights are learnt such that the features most relevant to correcting the grammar errors are given due importance and that the tuning focuses on the grammatically incorrect parts of the sentences. The F-β score, as defined for the CoNLL shared task, is the most obvious metric to measure the accuracy of grammar correction on the tuning set. We choose the F-β metric as a score to be optimized using MERT for the SMT based grammar correction model. By choosing an appropriate value of β, it is possible to tune the system to favour increased recall/precision or a balance of both.
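For reference, the standard definition of the F-β score in terms of precision $P$ and recall $R$ is given below. We assume here that precision and recall are computed over edits, i.e. by comparing the set of corrections proposed by the system with the set of gold-standard corrections; the set names are our own notation.

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
P = \frac{|\mathrm{proposed} \cap \mathrm{gold}|}{|\mathrm{proposed}|},
\qquad
R = \frac{|\mathrm{proposed} \cap \mathrm{gold}|}{|\mathrm{gold}|}
```

Setting $\beta = 0.5$ gives $F_{0.5} = 1.25\,PR / (0.25\,P + R)$, which weights precision more heavily than recall; $\beta = 1$ weights them equally, and $\beta > 1$ favours recall.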

3 Integrating SMT based and error-category specific systems

As discussed in Section 1, the generic SMT based correction systems are inferior in their correction capabilities compared to the error-category specific correction systems which have been custom engineered for the task. A reasonable solution to make optimum use of both the approaches is to develop custom modules for correcting high impact and the most frequent error categories, while relying on the SMT method for correcting other error categories. We experiment with two approaches for integrating the SMT based and error-category specific systems, and compare both with the baseline SMT approach:

• Correct all error categories using the SMT method, followed by correction using the custom modules.

• Correct only the error categories not handled by the custom modules using the SMT method, followed by correction using the custom modules.

The error categories for which we built custom modules are noun number, determiner and subject-verb agreement (SVA) errors. These errors are amongst the most common errors made by non-native speakers. The noun number and determiner errors are corrected using the classification model proposed by Rozovskaya and Roth (2013), where the label space is a cross-product of the label spaces of the possible noun number and determiners. We use the feature-set proposed by Kunchukuttan et al. (2013). SVA correction is done using a prioritized, conditional rule based system described by Kunchukuttan et al. (2013).
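The two integration strategies amount to running the correctors in sequence, with each stage operating on the output of the previous one. The Python sketch below shows this composition. The three correctors are toy string-replacement stand-ins invented for illustration, not the actual SMT system or custom modules: `smt_all` plays the role of an SMT system correcting all error categories, `smt_rest` one that corrects only the categories not handled by the custom modules, and `custom` the custom modules.

```python
from typing import Callable

Corrector = Callable[[str], str]

def chain(*stages: Corrector) -> Corrector:
    """Compose correctors so each stage sees the previous stage's output."""
    def run(sentence: str) -> str:
        for stage in stages:
            sentence = stage(sentence)
        return sentence
    return run

# Toy stand-ins (hypothetical): real systems are statistical, not string rules.
def smt_all(s: str) -> str:
    # Fixes a preposition error and a determiner error.
    return s.replace("depend of", "depend on").replace("a umbrella", "an umbrella")

def smt_rest(s: str) -> str:
    # Fixes only the preposition error, leaving determiner/SVA to `custom`.
    return s.replace(" of ", " on ")

def custom(s: str) -> str:
    # Fixes determiner and subject-verb agreement errors.
    return s.replace("a umbrella", "an umbrella").replace("He depend ", "He depends ")

strategy_1 = chain(smt_all, custom)   # SMT on all categories, then custom modules
strategy_2 = chain(smt_rest, custom)  # SMT on remaining categories, then custom modules

sentence = "He depend of a umbrella ."
print(strategy_1(sentence))
print(strategy_2(sentence))
```

In this toy case both strategies reach the same output. In practice they can differ, since in the first strategy the custom modules see text in which the SMT system has already attempted the same error categories.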

4 Experimental Setup

We used the NUCLE Corpus v3.1 to build a phrase based SMT system for grammar correction. The NUCLE Corpus contains 28 error categories, whose details are documented in Dahlmeier et al. (2013). We split the corpus into training, tuning and test sets as shown in Table 1.

Set     Document Count   Sentence Count
train   1330             54284
tune    20               854
test    47               2013

Table 1: Details of data split for SMT training

The phrase based system was trained using the Moses¹ system, with the grow-diag-final-and heuristic for extracting phrases and the msd-bidirectional-fe model for lexicalized reordering. We tuned the trained models using Minimum Error Rate Training (MERT) with default parameters (100 best list, max 25 iterations). Instead of BLEU, the tuning metric was the F-0.5 metric. We trained 5-gram language models on all the sentences from the NUCLE corpus using the Kneser-Ney smoothing algorithm with SRILM².

The classifier for noun number and article correction is a Maximum Entropy model trained on the NUCLE v2.2 corpus using the MALLET toolkit. Details about the resources and tools used for feature extraction are documented in Kunchukuttan et al. (2013).

¹ http://www.statmt.org/moses/
² http://goo.gl/4wfLVw

5 Results and Analysis

Table 2 shows the results on the development set for different experimental configurations generated by varying the tuning metrics, and the method of combining the SMT model and custom correction modules. Table 3 shows the same results on the official CoNLL 2014 dataset without alternative answers.

5.1 Effect of tuning with F-0.5 score

We observe that both precision and recall drop sharply when the SMT model is tuned with the F-0.5 metric (system S2), as compared to tuning with the traditional BLEU metric (system S1). We observe that system S2 proposes very few corrections (82) as compared to system S1 (188), which contributes to the low recall of system S2. There are very few errors in the tuning set (202) which may not be sufficient to reliably tune the system to the F-0.5 score. It would be worth investigating the effect of number of errors in the tuning set on the accuracy of the system.

5.2 Effect of integrating the SMT and custom modules

Comparing the results of systems S1, S3 and S5 on the development set (Table 2), it is clear that using the SMT method alone gives the highest F-0.5 score. However, the recall is higher for systems which use the custom modules for some error categories. The recall is highest when custom modules as well as the SMT method are used for the high impact error categories. The above observation is a consequence of the fact that the custom modules have higher recall for certain error categories compared to the SMT method. The lower precision of custom modules is due to the large number of false positives. If the custom modules are optimized for higher precision, then the overall ensemble can also achieve higher precision and consequently higher F-0.5 score. Thus, the integration of the SMT method and custom modules can be beneficial in improving the overall accuracy of the SMT system.
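The trade-off described above can be checked numerically. The Python sketch below recomputes the F-0.5 score from the precision and recall values reported for systems S1, S3 and S5 on the development set (Table 2), using the usual definition (1 + β²)PR / (β²P + R). The function name and tolerance are our own; small differences from the reported scores are expected because the reported precision and recall are rounded.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta from precision and recall (result is in the same units)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (%P, %R, reported %F-0.5) on the development set, copied from Table 2.
dev = {
    "S1": (62.23, 11.53, 33.12),
    "S3": (10.99, 26.33, 12.44),
    "S5": (10.15, 23.96, 11.47),
}

for name, (p, r, reported) in dev.items():
    print(name, round(f_beta(p, r), 2), "reported:", reported)
```

Because β = 0.5 weights precision more heavily, the drop in precision from S1 to S3 outweighs the more than twofold gain in recall, which is why S1 has the highest F-0.5 score on this set.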

Id   SMT Data                               Custom Modules   Tuning Metric   %P      %R      %F-0.5
S1   All errors                             No               BLEU            62.23   11.53   33.12
S2   All errors                             No               F-0.5           55.32   5.13    18.71
S3   All errors                             Yes              BLEU            10.99   26.33   12.44
S4   All errors                             Yes              F-0.5           9.80    22.98   11.07
S5   All errors, except Nn, ArtOrDet, SVA   Yes              BLEU            10.15   23.96   11.47

Table 2: Experimental Results for various configurations on the development set

Id   SMT Data                               Custom Modules   Tuning Metric   %P      %R      %F-0.5
S1   All errors                             No               BLEU            38.81   4.15    14.53
S2   All errors                             No               F-0.5           30.77   1.39    5.90
S3   All errors                             Yes              BLEU            29.02   17.98   25.85
S4   All errors                             Yes              F-0.5           28.23   16.72   24.81
S5   All errors, except Nn, ArtOrDet, SVA   Yes              BLEU            28.67   17.29   25.34

Table 3: Experimental Results for various configurations on the CoNLL-2014 test set without alternatives

6 Conclusion

We explored two approaches to adapting the SMT method for the problem of grammatical correction. Tuning the SMT system to the F-β metric did not improve performance over the BLEU-based tuning. However, we plan to further investigate to understand the reasons for this behaviour. We also plan to explore tuning for recall and other alternative metrics which could be useful in some scenarios. An ensemble of the SMT method and custom methods for some high impact error categories was shown to increase the recall of the system, and with proper optimization of the system can also improve the overall accuracy of the correction system.

References

Chris Brockett, William B Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics.

Jan Buys and Brink van der Merwe. 2013. A Tree Transducer Model for Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better Evaluation for Grammatical Error Correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. To appear in Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications.

Rachele De Felice and Stephen G Pulman. 2008. A classifier-based approach to preposition and determiner error correction in L2 English. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1.

Na-Rae Han, Martin Chodorow, and Claudia Leacock. 2006. Detecting errors in English article usage by non-native speakers. Natural Language Engineering.

Ting-hui Kao, Yu-wei Chang, Hsun-wen Chiu, Tzu-Hsi Yen, Joanne Boisson, Jian-cheng Wu, and Jason S. Chang. 2013. CoNLL-2013 Shared Task: Grammatical Error Correction NTHU System Description. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task.

Anoop Kunchukuttan, Ritesh Shah, and Pushpak Bhattacharyya. 2013. IITB System for CoNLL 2013 Shared Task: A Hybrid Approach to Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning.

J. Lee and S. Seneff. 2006. Automatic grammar correction for second-language learners. In Proceedings of Interspeech, pages 1978–1981.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 Shared Task on Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 Shared Task on Grammatical Error Correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics.

A. Rozovskaya and D. Roth. 2013. Joint Learning and Inference for Grammatical Error Correction. In EMNLP.

Zheng Yuan and Mariano Felice. 2013. Constrained Grammatical Error Correction using Statistical Machine Translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task.
