GRAMMATICAL ERROR CORRECTION
B.Tech Project Stage 1 Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology (Honors)
by
G. Krishna Chaitanya(140050038)
Supervisor: Prof. Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai 400076 (India)
14 November 2017
Abstract
Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. Earlier attempts at grammatical error correction involved rule-based and classifier approaches, which are limited to correcting only particular types of errors in a sentence. As sentences may contain multiple errors of different types, a practical error correction system should be able to detect and correct all of them.
In this report, we investigate GEC as a translation task from incorrect to correct English and explore some machine translation approaches for developing end-to-end GEC systems for all error types. We apply Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) approaches to GEC and show that, unlike the earlier methods which focus on individual errors, they can correct multiple errors of different types in a sentence. We also discuss some of the weaknesses of machine translation approaches. Finally, we experiment with a candidate re-ranking technique to re-rank the hypotheses generated by machine translation systems: with regression models, we predict a grammaticality score for each candidate hypothesis and re-rank the candidates according to that score.
Acknowledgement
I would like to extend my heartfelt gratitude to my guide, Prof. Pushpak Bhattacharyya
and my co-guide, Abhijit Mishra for their constant guidance and support throughout the
project. I am extremely grateful to them for spending their valuable time clarifying my
doubts whenever I approached them.
G. Krishna Chaitanya
IIT Bombay
14 November 2017
Contents
Abstract

Acknowledgement

1 Introduction
   1.1 Outline

2 Overview of Grammatical Error Correction
   2.1 Types of grammatical errors
   2.2 Properties of an ideal GEC system
   2.3 Evaluation Metrics
       2.3.1 BLEU
       2.3.2 GLEU
       2.3.3 Max-Match (M2) Scorer

3 Literature Survey
   3.1 Rule based approach
   3.2 Classifier based approach
   3.3 Machine Translation approach
       3.3.1 Statistical Machine Translation
       3.3.2 Neural Machine Translation

4 Datasets
   4.1 Natural data
   4.2 Synthetic data

5 Statistical Machine Translation
   5.1 Components of an SMT model
       5.1.1 Language model
       5.1.2 Translation model
       5.1.3 Decoder
   5.2 Training an SMT model for GEC

6 Neural Machine Translation
   6.1 Architecture of NMT model
   6.2 Training an NMT model for GEC
       6.2.1 Word level translation
       6.2.2 Sub-word level translation
             6.2.2.1 Byte Pair Encoding (BPE)
             6.2.2.2 Training NMT using BPE

7 Scoring grammaticality of machine translation hypotheses
   7.1 Generating training data
   7.2 Regression models
       7.2.1 Feature based regression
       7.2.2 Deep-representation based regression

8 Results and Discussions

9 Future work

10 Conclusion

References
Chapter 1
Introduction
Today, millions of people around the world are learning English. In fact, non-native
English speakers currently outnumber native speakers and their numbers will keep
increasing in the future. Non-native English speakers usually make errors in text, and
these errors may belong to different error types (see section 2.1) and also vary in their
complexity. A practical grammatical error correction (GEC) system to correct errors
in English text promises to benefit millions of learners around the world. From a
commercial perspective, there is great potential for many practical applications, such as
proofreading tools that help non-native speakers identify and correct their writing errors
without human intervention or educational software for automated language learning and
assessment.
Publicly usable Web services for assisting second language learning have been growing
in recent years. For example, there are language learning social networking services
such as Lang-8 and English grammar checkers such as Ginger and Grammarly. However,
popular commercial proofreading tools only target a few error types that are easy to
correct, such as spelling mistakes (*baeutiful/beautiful) or wrong past participle forms
of irregular verbs (*runned/run), and do not include those aspects of English that are
harder to learn. An error correction system that can only correct one or a few types
of errors will be of limited use to learners. Instead, a good system should be able to
correct a variety of error types and corrections should be performed at a global rather
than local level, taking interacting errors into account. GEC models can also be added
to the pipeline of several Natural Language Generation (NLG) systems, such as machine
translation, question answering and speech-to-text, to enhance the quality of the candidate
hypotheses predicted by these systems.
1.1 Outline
Chapter 2 gives an overview of grammatical error correction. It discusses the types of
grammatical errors that can occur in a sentence, properties of an ideal grammatical er-
ror correction system and popular evaluation metrics like BLEU, GLEU and Max-Match
Scorer (F-score). In chapter 3, we present a survey of traditional approaches to gram-
matical error correction. Chapter 4 describes the popular datasets used for training GEC
models. We also describe the genERRate tool used to introduce grammatical errors into
sentences. In chapter 5, we describe the components of an SMT system and their training
to generate a baseline SMT model for GEC. In chapter 6, we discuss the architecture
and implementation of an NMT model for GEC using word-level and sub-word (Byte
Pair Encoding) level translation units. In chapter 7, we explore a candidate re-ranking
technique to re-rank the candidate hypotheses generated by an NLG system. In chapter
8, we present the results of our experiments, compare them and discuss general
observations. In chapter 9, we look into possible future steps for GEC. Finally, we
conclude the report in chapter 10.
Chapter 2
Overview of Grammatical Error Correction
Grammatical error correction (GEC) is the task of automatically correcting grammati-
cal errors in text. More specifically, the task is to build a system that takes an input text,
analyses the context of the text to identify and correct any grammatical errors, and re-
turns a corrected version that retains the original meaning. There are different kinds of
grammatical errors that can occur in a text, and some of them are difficult to correct. The
following sections describe some of the common errors, the difficulties in correcting them,
and the metrics used for evaluating a GEC system.
2.1 Types of grammatical errors
Many different types of grammatical errors can occur in a text. Some are very frequent,
some are less frequent; some adversely affect readability and some do not affect it
much. The following are some of the common error types.
• Subject-Verb agreement: These errors occur when the subjects and verbs of a
sentence do not agree in person or in number.
Ex: Incorrect: He walk to college.
Correct: He walks to college.
• Verb tense: These errors occur when an incorrect verb tense is used in a sentence.
Ex: Incorrect: I have seen him yesterday.
Correct: I saw him yesterday.
• Noun Agreement: Nouns must agree in number with the nouns they are referencing.
This means that singular nouns must be used to refer to singular nouns and plural
nouns must be used to refer to plural nouns.
Ex: Incorrect: There are a lot of restaurant in the college.
Correct: There are a lot of restaurants in the college.
• Pronoun: Like subjects that agree with verbs and nouns that agree with other
nouns, pronouns must agree in gender, person and number with their antecedents.
Ex: Incorrect: The girls won her game.
Correct: The girls won their game.
• Word form: Word form errors occur when the correct word is chosen but an incorrect
form (part of speech) of the word is used.
Ex: Incorrect: The danger (noun) tiger ran through the village.
Correct: The dangerous (adjective) tiger ran through the village.
• Word order: These errors occur when the words of a sentence are not in the proper
order, so the sentence does not convey the intended meaning.
Ex: Incorrect: He played yesterday football.
Correct: He played football yesterday.
• Preposition: These errors occur due to a missing or incorrectly used preposition in
a sentence.
Ex: Incorrect: The train will arrive within five minutes.
Correct: The train will arrive in five minutes.
• Article: These errors occur due to a missing or incorrectly used article in a sen-
tence.
Ex: Incorrect: The Paris is big city.
Correct: Paris is a big city.
• Double negatives: Double negatives are two negative words used in the same sen-
tence. Using two negatives turns the thought or sentence into a positive one.
Ex: Incorrect: I can’t hardly believe.
Correct: I can hardly believe.
2.2 Properties of an ideal GEC system
This section describes some of the key properties to be satisfied by an ideal grammatical
error correction system.
• Error coverage denotes the ability of a system to identify and correct a variety of
error types.
• Error complexity indicates the capacity of a system to address complex errors such
as those where multiple errors interact. An ideal GEC system should also correct
errors which depend on long range contextual information.
• Generalizability refers to the ability of a system to identify errors in new, unseen
contexts and propose corrections beyond those observed in training data.
2.3 Evaluation Metrics
Evaluation metrics allow one to measure the performance of the system. When evaluat-
ing a GEC system, the system’s output is compared to gold-standard references provided
by human experts. There are several metrics that have been proposed and used to evaluate
GEC systems. In this section, we describe the three commonly used metrics.
2.3.1 BLEU
BLEU was first proposed by Papineni et al. (2002) and is now the dominant method
for MT evaluation. It scores the text produced by an MT system by its similarity to
professional human translations: the closer the output, the higher the score.
Unlike metrics which rely on references with explicitly labelled error annotations,
BLEU only requires corrected references. Since both the original and corrected sentences
are in the same language (i.e. English) and most words in the sentence do not need
changing, BLEU scores for GEC systems are relatively high compared with standard MT
tasks. BLEU also allows multiple references, which is useful for errors with multiple
alternative corrections, but it does not provide a per-error-type breakdown of system
performance.
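The clipped (modified) n-gram precision at the heart of BLEU can be sketched in a few lines of Python. This is an illustrative single-sentence snippet only; it omits BLEU's brevity penalty and the geometric averaging over n-gram orders, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the best-matching reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

refs = ["he walks to college".split()]
print(modified_precision("he walk to college".split(), refs, 1))  # 0.75
```

Because source and reference share most words in GEC, even an uncorrected sentence scores 0.75 here, which illustrates why BLEU scores for GEC are relatively high.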
2.3 Evaluation Metrics 6
2.3.2 GLEU
Generalized Language Evaluation Understanding (GLEU), proposed by Napoles
et al. (2015), is a simple variant of BLEU for GEC which takes the original source also
into account. GLEU modifies the n-gram precision in BLEU to assign extra weight to
n-grams present in the candidate that overlap with the reference but not the source, and
penalize those in the candidate that are in the source but not the reference.
Like BLEU, GLEU allows multiple references at the corpus level, and likewise does not
provide a per-error-type breakdown of system performance.
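To make the source-aware weighting concrete, here is a deliberately simplified, single-sentence, single-order sketch of the GLEU idea. The actual metric of Napoles et al. (2015) combines several n-gram orders at the corpus level; the penalty weight here is an assumption for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu_sketch(source, candidate, reference, n=1, penalty=1.0):
    """Reward candidate n-grams matching the reference; penalize candidate
    n-grams that the reference removed from the source."""
    src, cand, ref = (ngram_counts(t.split(), n) for t in (source, candidate, reference))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    # n-grams the system kept from the source although the reference removed them
    kept_errors = sum(min(c, src[g] - ref[g]) for g, c in cand.items() if src[g] > ref[g])
    total = sum(cand.values())
    return max(0.0, matched - penalty * kept_errors) / total if total else 0.0

src, ref = "he walk to college", "he walks to college"
print(gleu_sketch(src, src, ref))  # 0.5 : leaving the error uncorrected is penalized
print(gleu_sketch(src, ref, ref))  # 1.0 : the full correction scores highest
```

Unlike plain n-gram precision, the uncorrected source no longer scores well, which is exactly the behaviour GLEU was designed for.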
2.3.3 Max-Match (M2) Scorer
The M2 scorer, proposed by Dahlmeier and Ng (2012b), evaluates system performance
by how well its proposed corrections, or edits (e_i), match the gold-standard edits (g_i).
It computes the sequence of phrase-level edits between a source sentence and a system's
candidate that achieves the highest overlap with the gold standard annotation. The
system is evaluated by computing precision (P), recall (R) and F-score.
P = Σ_{i=1}^{n} |e_i ∩ g_i| / Σ_{i=1}^{n} |e_i|    (2.1)

R = Σ_{i=1}^{n} |e_i ∩ g_i| / Σ_{i=1}^{n} |g_i|    (2.2)

F_β = (1 + β²) · P · R / (β² · P + R)    (2.3)
The M2 scorer was the official scorer in the CoNLL 2013 and 2014 shared tasks on
GEC, where F1 was used in CoNLL-2013 and F0.5 was used in CoNLL-2014.
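Given the per-sentence edit sets, equations (2.1)-(2.3) are straightforward to compute. The sketch below assumes the system and gold edits have already been extracted and aligned; the real M2 scorer's main complexity, finding the phrase-level edit sequence with maximal overlap, is omitted, and the edit tuples are invented for the example.

```python
def m2_scores(system_edits, gold_edits, beta=0.5):
    """Corpus-level precision, recall and F_beta over per-sentence edit sets.
    Each edit is represented as a (start, end, correction) tuple."""
    tp = sum(len(e & g) for e, g in zip(system_edits, gold_edits))
    proposed = sum(len(e) for e in system_edits)
    gold = sum(len(g) for g in gold_edits)
    p = tp / proposed if proposed else 1.0
    r = tp / gold if gold else 1.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# One sentence: the system proposes two edits, one of which matches the gold edit.
system = [{(1, 2, "walks"), (0, 1, "The")}]
gold = [{(1, 2, "walks")}]
p, r, f = m2_scores(system, gold)
print(p, r, round(f, 4))  # 0.5 1.0 0.5556
```

With β = 0.5, precision is weighted more heavily than recall, matching the CoNLL-2014 setting.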
Chapter 3
Literature Survey
This chapter presents a survey of work done on grammatical error correction. It dis-
cusses the earlier rule-based and machine learning classifier approaches and the more
recent machine translation approaches to GEC.
3.1 Rule based approach
Early attempts at grammatical error correction employed hand-coded rules (Heidorn
et al. (1982); Bustamante and León (1996)). Initial rule-based systems were based on
simple pattern matching and string replacement. Gradually, rule-based systems also
incorporated syntactic analysis (i.e. POS tagging and tree parsing) and manually developed
grammar rules. In this approach, a set of rules is matched against a text that has at
least been POS tagged. Rule-based systems are generally easy to implement for
certain types of errors and can be very effective for them. However, many errors are complex,
and rule-based systems fail to correct them. Because language can be used in limitless
ways, it is impossible to define rules for every possible error. As a result, most
current GEC systems do not employ rule-based mechanisms.
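As a toy illustration of the pattern-matching style of these early systems, the snippet below fixes a simple subject-verb agreement pattern. The word lists and the rule are invented for this example; real rule-based checkers drove such rules from POS tags and parse trees.

```python
# After a third-person singular pronoun, a base-form verb from a small
# hand-coded lexicon takes an "-s"/"-es" ending.  The lexicon is illustrative.
SINGULAR_SUBJECTS = {"he", "she", "it"}
BASE_VERBS = {"walk", "go", "run", "play"}

def correct_sv_agreement(sentence):
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in SINGULAR_SUBJECTS and tokens[i + 1] in BASE_VERBS:
            verb = tokens[i + 1]
            tokens[i + 1] = verb + ("es" if verb.endswith("o") else "s")
    return " ".join(tokens)

print(correct_sv_agreement("He walk to college ."))  # He walks to college .
```

The brittleness is visible immediately: any subject or verb outside the hand-coded lists is silently missed, which is exactly the coverage problem described above.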
3.2 Classifier based approach
With the advent of large-scale annotated data, several data-driven approaches were
developed and machine learning algorithms were applied to build classifiers to correct
specific error types (Han et al. (2004); Rozovskaya and Roth (2011)). The classifier
approach to error correction has been prominent for a long time before MT, since building
a classifier does not require having annotated learner data.
In the classifier approach, GEC is cast as a multi-class classification problem.
For correcting the errors, a finite candidate set containing all the possible correction
candidates, such as a list of prepositions, is considered as the set of class labels. Features
for the classifier include surrounding n-grams, part-of-speech (POS) tags, grammatical
relations, parse trees, etc. Since the most useful features depend on the error type,
classifiers are trained individually to handle a specific error type. The linguistic features
are embedded into vectors and the classifier is trained using various machine learning
algorithms. New errors can be detected and corrected using the trained classifier by
comparing the original word in the text with the most likely candidate predicted by the
classifier. Earlier classifier approaches were mainly used to correct article and preposition
errors, since these are common errors and can be tackled easily with machine learning
approaches. Han et al. (2004) trained a maximum entropy classifier to detect article
errors and achieved an accuracy of 88%. Tetreault and Chodorow (2008) used maxi-
mum entropy models to correct errors for 34 common English prepositions in learner text.
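As a minimal illustration of this candidate-set formulation, the toy model below predicts a preposition from its immediate (previous word, next word) context by simple counting. The cited systems instead train maximum entropy classifiers over much richer features; this sketch, with its invented training sentences, only shows the problem setup.

```python
from collections import Counter, defaultdict

# The finite candidate set acts as the class-label set.
PREPOSITIONS = {"in", "on", "at", "of", "to"}

def train(sentences):
    """Count (previous word, next word) contexts observed for each preposition."""
    model = defaultdict(Counter)
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok in PREPOSITIONS and 0 < i < len(toks) - 1:
                model[(toks[i - 1], toks[i + 1])][tok] += 1
    return model

def predict(model, prev_word, next_word):
    """Return the most frequent preposition seen in this context, if any."""
    ctx = model.get((prev_word, next_word))
    return ctx.most_common(1)[0][0] if ctx else None

model = train([
    "the train will arrive in five minutes",
    "the bus will arrive in ten minutes",
    "he is good at chess",
])
print(predict(model, "arrive", "five"))  # in
```

Note that the prediction is conditioned on the surrounding words being correct, which is precisely the assumption criticized in the next paragraph.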
Each classifier corrects a single word for a specific error category individually. This
ignores dependencies between the words in a sentence. Also, by conditioning on the
surrounding context, the classifier implicitly assumes that the surrounding context is free
of grammatical errors, which is often not the case. In many real-world cases, sentences
contain multiple errors of different types, and these errors may depend on each other,
so a GEC system that can correct only one particular type will be of very limited use.
Over time, researchers have come up with multiple solutions to the problem of
correcting multiple errors in a sentence. One commonly used approach is to build
multiple classifiers, one for each error type, and cascade them into a pipeline. Rozovskaya
et al. (2013) proposed a combination of rule-based and classifier models to build GEC
systems which can correct multiple errors. However, these approaches are useful
only when the errors are independent of each other and cannot solve the problem
of dependent errors. A typical example of predictions made by the Rozovskaya et al.
(2013) system of multiple classifiers for a sentence containing dependent errors is shown
below.
Example: The dogs is barking in the street.
System predicted output: The dog are barking in the street.
Several approaches have been proposed to address the problem of interacting er-
rors. Dahlmeier and Ng (2012a) developed a beam-search decoder for correcting inter-
acting errors. The decoder iteratively generates new hypothesis corrections from current
hypotheses and scores them based on features of grammatical correctness and fluency.
These features include scores from discriminative classifiers for specific error categories,
such as articles and prepositions, and a language model (LM). This decoder approach
outperformed a pipeline system of individual classifiers and rule-based models.
3.3 Machine Translation approach
A practical GEC system should be able to correct various types of errors. In more
recent research, MT techniques have been used to successfully correct a broader set of
errors. Machine translation systems automatically translate text from a source language
into a target language. Grammatical error correction can thus be treated as a translation
problem from ungrammatical sentences into grammatical sentences.
3.3.1 Statistical Machine Translation
SMT has been the dominant MT approach for the past two decades. The model consists
of two components: a language model assigning a probability p(e) for any target sentence
e, and a translation model that assigns a conditional probability p(f | e). The language
model is learned using a monolingual corpus in the target language. The parameters of
the translation model are estimated from a parallel corpus, i.e. the set of foreign sentences
and their corresponding translations into the target language. Brockett et al. (2006) first
proposed the use of an SMT model for correcting a set of 14 countable/uncountable
noun errors made by learners of English. Their training data consisted of a large corpus
of sentences extracted from news articles, deliberately modified to include typical
countability errors. Artificial errors were introduced in a deterministic manner
using hand-coded rules, including operations such as changing quantifiers (much →
many), generating plurals (advice → advices) or inserting unnecessary determiners.
Experiments showed their SMT system was generally able to beat the standard Microsoft
Word 2003 grammar checker, although it produced a relatively higher rate of erroneous
corrections.
Mizumoto et al. (2011) used a similar SMT approach, focusing on correcting
grammatical errors made by learners of Japanese. However, their training corpus comprised
authentic learner sentences together with corrections made by native speakers on the social
learning network website Lang-8. Their results show that the approach is a viable way
of obtaining very high performance at a relatively low cost provided a large amount of
training data is available. Mizumoto et al. (2012) investigated the effect of training corpus
size on various types of grammatical errors in the English language. Their results showed
that a phrase-based SMT system is effective at correcting errors that can be identified
by a local context, but less effective at correcting errors that need long-range contextual
information. Yuan and Felice (2013) trained a POS-factored SMT system to correct five
types of errors in text for the CoNLL-2013 shared task on GEC. These five error types
involve articles, prepositions, noun number, verb form, and subject-verb agreement. The
limited training data provided for the task was not sufficient for training an effective SMT
system so they also explored alternative ways of generating pairs of incorrect and correct
sentences automatically from other existing learner corpora.
In the CoNLL-2014 shared task on GEC, several top performing systems employed
hybrid approaches. Felice et al. (2014) proposed a pipeline of a rule-based system and
a phrase-based SMT system augmented by a large web-based language model. The
generated candidates are ranked using the language model (LM), with the most probable
candidate being selected as the final corrected version. Susanto (2015) made an attempt at
combining MT and classifier models, using CoNLL-train and Lang-8 as non-native
data and English Wikipedia as native data. Junczys-Dowmunt and Grundkiewicz (2014)
also employed the phrase-based SMT approach. They used the word-level Levenshtein
distance between source and target as a translation model feature. To increase
precision, they tuned the feature weights for F-score using the k-best Margin Infused
Relaxed Algorithm (MIRA) of Cherry and Foster (2012) and Minimum Error Rate Training
(MERT) of Och (2003). Kunchukuttan et al. (2014) subsequently found that tuning for
F-score to increase precision yielded worse performance.
More recently, Napoles and Callison-Burch (2017) proposed a lightweight approach
to GEC called Specialized Machine translation for Error Correction (SMEC), a
single model that handles morphological changes, spelling corrections, and
phrasal substitutions. This model is developed by examining different aspects of the
SMT pipeline, identifying and applying modifications tailored for GEC, introducing
artificial data, and evaluating how each of these specializations contributes to the overall
performance. The analysis provided in this work will help improve future efforts in GEC,
and can be used to inform approaches rooted in both neural and statistical MT.
Other approaches using machine translation for error correction are not aimed at
training SMT systems but rather at using them as auxiliary tools for producing round-trip
translations (i.e. translations into a pivot foreign language and back into English). Hermet
and Desilets (2009) focused on sentences containing preposition errors and generated a
round-trip translation via French and their model was able to correct 66.4% of errors.
Madnani et al. (2012) used round-trip translations obtained from the Google Translate
API via 8 different pivot languages for an all-errors task.
3.3.2 Neural Machine Translation
Recently, neural machine translation (NMT) systems have achieved substantial
improvements in translation quality over phrase-based MT systems (Sutskever et al.
(2014); Bahdanau et al. (2014)). Thus, there is growing interest in applying neural
systems to GEC. Sun et al. (2015) employed a Convolutional Neural Network (CNN) for
article error correction. Instead of building classifiers using pre-defined syntactic and/or
semantic features, a CNN model is trained from surrounding words with pre-trained word
embeddings. Yuan and Briscoe (2016) proposed a neural machine translation approach
for GEC, using recurrent neural networks to perform sequence-to-sequence mapping from
erroneous to well-formed sentences.
The core component of most NMT systems is a sequence-to-sequence (seq2seq)
model which consists of an encoder and a decoder. An encoder encodes a sequence of
source words into a vector and then a decoder generates a sequence of target words from
the vector. Different network architectures have been proposed for NMT. Sutskever et al.
(2014) and Cho et al. (2014) used RNNs for both encoding and decoding. Sutskever et al.
(2014) used a multilayer Long Short-Term Memory (LSTM) to encode a source sentence
into a fixed-sized vector, and another LSTM to decode a target sentence from the vector
whereas Cho et al. (2014) used two Gated Recurrent Unit (GRU) models, one as the
encoder and another as the decoder.
Unlike the phrase-based SMT models, the seq2seq model can capture long-distance,
or even global, word dependencies, which are crucial to correcting global grammatical
errors. In order to achieve better performance on GEC, a seq2seq model has to address
several task-specific challenges like dealing with an extremely large vocabulary size, and
capturing structure at different levels of granularity in order to correct errors of different
types. For example, while correcting spelling and local grammar errors requires only
word-level or sub-word-level information, e.g., violets → violates (spelling) or violate →
violates (verb form), correcting errors in word order or usage requires global semantic
relationships among phrases and words. Yuan and Briscoe (2016) addressed the large
vocabulary problem by restricting the vocabulary to a limited number of high-frequency
words and resorting to standard word translation dictionaries to provide translations for
the words that are out of the vocabulary (OOV). However, this approach often fails to
take into account the OOVs in context for making correction decisions, and does not
generalize well to correcting words that are unseen in the parallel training data. An
alternative approach, proposed by Xie et al. (2016), applies a character-level sequence
to sequence neural model. Although the model eliminates the OOV issue, it cannot
effectively leverage word-level information for GEC, even if it is used together with a
separate word-based language model.
Bahdanau et al. (2014) and Luong et al. (2015) used attention mechanisms in NMT
and showed that attention-based models are better than non-attentional ones in handling
long sentences. Ji et al. (2017) proposed a hybrid neural model with nested attention
layers for GEC. It consists of a word-based seq2seq model as a back-bone which closely
follows the basic neural seq2seq architecture with attention as proposed by Bahdanau
et al. (2014), and additional character-level encoder, decoder, and attention components,
which focus on words that are outside the word-level model vocabulary. This nested atten-
tion model is shown to be very effective at correcting global word errors and significantly
outperforms previous neural models for GEC as measured on the standard CoNLL-14
benchmark dataset.
Chapter 4
Datasets
Learner corpora for GEC consist of text produced by non-native English speakers. There
are two broad categories of such data for GEC. The first is error-coded text, in which
annotators have coded the spans of learner text containing an error. The second is parallel
datasets, which contain the original text and a corrected version of the text, without
explicitly coded error corrections.
Synthetic learner corpora for GEC have also been developed by artificially introducing
errors into grammatically correct sentences.
4.1 Natural data
There are many publicly available datasets of both kinds, error-annotated corpora and parallel
corpora. Some commonly used error-annotated corpora include the NUS Corpus of
Learner English (NUCLE; 57k sentence pairs) (Dahlmeier et al. (2013)), the Cambridge
Learner Corpus (CLC; 1.9M pairs) (Nicholls (2003)), and a subset of the CLC, the First
Certificate in English (FCE; 34k pairs) (Yannakoudakis et al. (2011)). MT systems are
trained on parallel text, which can be extracted from error-coded corpora by applying the
annotated corrections, resulting in a clean corpus with nearly-perfect word and sentence
alignments.
Two popular parallel corpora are the Automatic Evaluation of Scientific Writing
corpus, with more than 1 million sentences of scientific writing corrected by professional
proofreaders (Daudaravicius et al. (2016)), and the Lang-8 Corpus of Learner English,
which contains 1 million sentence pairs scraped from an online forum for language
learners, with corrections made by other members of the lang-8.com online community.
In our experiments, we used the Lang-8 parallel corpus and the synthetic Brown corpus
(see section 4.2) for training an MT system for GEC.
4.2 Synthetic data
Artificial errors have been employed previously in targeted error detection. Sjöbergh
and Knutsson (2005) introduced split compound errors and word order errors into
Swedish texts and used the resulting artificial data to train their error detection system.
Brockett et al. (2006) introduced errors involving mass/count noun confusions into En-
glish news wire text and then used the resulting parallel corpus to train a phrasal SMT
system to perform error correction.
We generated synthetic data by introducing errors into the Brown corpus using the
GenERRate error generation tool (Foster and Andersen (2009)). This tool takes as input
a corpus and an error analysis file consisting of a list of errors, and produces an error-
tagged corpus of ungrammatical sentences. The errors are introduced by inserting,
deleting, moving and substituting POS-tagged words in a sentence, as specified in the
error configuration file.
Example:
The couple was married Aug. 2 , 1913 .
↓
GenERRate
↓
The couples was married Aug. 2 , 1913 .
The couple was marrying Aug. 2 , 1913 .
The couple was marries Aug. 2 , 1913 .
The couple were married Aug. 2 , 1913 .
Using this tool, we generated a parallel corpus containing approximately 900k sentence
pairs and used them for training an MT system for GEC.
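The insert/delete/move/substitute operations can be illustrated with a short script. This is not the GenERRate tool itself, but a toy injector showing how ungrammatical variants can be derived from a clean sentence; the operation set and seed are arbitrary, and the real tool drives its operations from POS tags and the error configuration file.

```python
import random

def inject_errors(sentence, n_variants=3, seed=0):
    """Produce ungrammatical variants of a sentence by deleting,
    duplicating or swapping adjacent tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = set()
    while len(variants) < n_variants:
        toks = tokens[:]
        op = rng.choice(("delete", "duplicate", "swap"))
        i = rng.randrange(len(toks) - 1)
        if op == "delete":
            del toks[i]
        elif op == "duplicate":
            toks.insert(i, toks[i])
        else:
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
        variant = " ".join(toks)
        if variant != sentence:
            variants.add(variant)
    return sorted(variants)

for v in inject_errors("The couple was married Aug. 2 , 1913 ."):
    print(v)
```

Pairing each variant with the original sentence yields (incorrect, correct) training pairs of the kind used above.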
Chapter 5
Statistical Machine Translation
In this chapter, we describe the structure of the SMT model and apply it to GEC
by training the model on the Lang-8 and synthetic Brown data (see section 4.2).
The statistical machine translation approach is based on the noisy-channel model.
The best translation e* for a foreign sentence f is:

e* = argmax_e p(e|f) = argmax_e p(f|e) p(e)    (5.1)

There are three main components in an SMT model, computing the quantities p(e),
p(f|e) and e*. Each component is described in the sections below.
5.1 Components of an SMT model
The three components of SMT model are language model (LM), translation model
(TM) and decoder.
5.1.1 Language model
A language model (LM) is a function that takes an English sentence as input and
returns the probability that it is a valid English sentence. It computes the parameter p(e)
in eq(5.1). N-gram language models are commonly used in SMT systems.
For a sentence e = {w1, w2, ...., wm}, the N-gram LM computes the probability p(e)
from the conditional probabilities of each word wi in the sentence:

p(e) = p(w1, w2, ...., wm) = ∏_{i=1}^{m} p(wi | wi−n+1, ...., wi−1)    (5.2)

where

p(wi | wi−n+1, ...., wi−1) = #(wi−n+1, ...., wi) / #(wi−n+1, ...., wi−1)    (5.3)
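Eq. (5.3) is simply a ratio of n-gram counts. A minimal bigram (n=2) version, without the smoothing that a real LM toolkit such as KenLM would apply, can be sketched as:

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]  # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """p(e) as a product of bigram conditional probabilities, eq. (5.2)/(5.3).

    Note: an unseen bigram yields probability 0 here; a real LM smooths this.
    """
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # #(w_{i-1}, w_i) / #(w_{i-1})
    return p

corpus = [["he", "is", "going"], ["she", "is", "going"], ["he", "is", "here"]]
uni, bi = train_bigram_counts(corpus)
print(sentence_prob(["she", "is", "here"], uni, bi))  # 1/9 ≈ 0.111
```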
5.1.2 Translation model
A translation model (TM) estimates the lexical correspondence between the two
languages. It computes the parameter p(f|e) in eq. (5.1). Translation models are trained
on parallel corpora, which are sentence-aligned and then word-aligned using a set of
statistical models developed at IBM in the 1980s. These word alignments are used to
extract phrase-phrase translations, or hierarchical rules as required, and corpus-wide
statistics on these rules are used to estimate probabilities.
5.1.3 Decoder
The decoder chooses the best translation e∗ from the pool of all possible candidate
translations, i.e., it selects the hypothesis e that maximizes p(f|e)p(e). Beam search is
one of the most popular decoding algorithms used in SMT systems.
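A toy beam search over per-position candidate words can illustrate the idea. The scoring function here is a hypothetical stand-in for the combined model score p(f|e)p(e) that a real decoder would use:

```python
def beam_search(candidates_per_position, score_fn, beam_size=2):
    """Keep the beam_size best partial hypotheses at each output position.

    candidates_per_position: list of candidate-word lists, one per position.
    score_fn: maps a word sequence to a score (higher is better).
    """
    beam = [([], 0.0)]
    for candidates in candidates_per_position:
        expanded = []
        for hyp, _ in beam:
            for word in candidates:
                new_hyp = hyp + [word]
                expanded.append((new_hyp, score_fn(new_hyp)))
        # prune: keep only the beam_size highest-scoring hypotheses
        beam = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
    return beam[0][0]

# Stand-in scorer that prefers the grammatical words; a real decoder would
# combine translation-model and language-model log probabilities instead.
PREFERRED = {"He", "is", "going"}
score = lambda hyp: sum(1.0 for w in hyp if w in PREFERRED)

print(beam_search([["He"], ["is", "are"], ["going", "go"]], score))
# -> ['He', 'is', 'going']
```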
5.2 Training an SMT model for GEC
In our experiments, the SMT-based GEC system is built using Moses, an open-source
SMT toolkit developed by Koehn et al. (2007). We used the Lang-8 and synthetic Brown
data (see section 4.2) for training, tuning and testing the SMT model.
                  #sentence pairs
Lang-8                     938495
Synthetic brown            815921
Total                     1754416
Train                     1300000
Tune                       100000
Test                       354416

Table 5.1: Train/tune/test split for SMT
Word alignments are generated using MGIZA++, a multi-threaded implementation of
GIZA++ (Och and Ney (2003)), and are used to construct phrase tables. Translation
models are generated from the train split containing 1300k sentence pairs. For generating
the LMs (up to 5-gram), we used the KenLM toolkit (Heafield (2011)) and the
grammatically correct sentences in both the train and tune splits. Results of this
experiment are discussed in chapter 8.
Chapter 6
Neural Machine Translation
NMT as an MT approach to GEC has shown promising results (Sutskever et al. (2014);
Bahdanau et al. (2014)). Compared with conventional SMT, NMT has several advantages.
First, unlike SMT which consists of components that are trained separately and combined
during decoding (see section 5.1), NMT learns a single large neural network which inputs
a source sentence and outputs a translation. As a result, training NMT systems for
end-to-end translation tasks is much simpler than building SMT systems, which require multiple
processing steps. Second, an NMT system can learn translation regularities directly from
the training data, without the need to explicitly design features to capture them, which
is quite difficult in SMT. The typical architecture of an NMT model is described in the
section below.
6.1 Architecture of NMT model
NMT applies an encoder-decoder framework. An encoder first reads a variable length
input sentence and encodes all (or parts) of the input sentence into a vector representation.
A decoder then outputs a translation for the input sentence from the vector representation.
Given a source sentence X = { x1,x2,...,xm}, and a target sentence Y = { y1,y2,...,yn},
NMT models the conditional probability of translating the source sentence X to the target
sentence Y as:
p(y1, y2..., yn|x1, x2, ..., xm) (6.1)
In this section, we describe the OpenNMT (Klein et al. (2017)) architecture (see fig
6.1) we used in our experiments.
In our experiments, we used a Bidirectional LSTM (Bi-LSTM) encoder, which consists
of a forward LSTM layer and a backward LSTM layer. The forward LSTM reads the
input sentence from the first word to the last word (from x1 to xm), and the backward
LSTM reads the input sentence in reverse order (from xm to x1). By using a Bi-LSTM,
both historical and future information is captured. The intermediate vector is generated
by concatenating the outputs of the forward and backward LSTM layers and then applying
a global attention model over the entire source sequence (see fig 6.2). The attention
mechanism helps the decoder focus on the most relevant information in the input
sentence, instead of remembering the entire input sentence. A simple LSTM layer is used
for decoding the target sentence from the intermediate vector.

Figure 6.1: OpenNMT architecture, an encoder-decoder model with Attention mechanism
Figure 6.2: Attention mechanism in OpenNMT model
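The global attention step can be sketched in plain Python: dot-product alignment scores between the decoder state and each encoder output are softmaxed into weights, and the context vector is the weighted sum of encoder outputs. Dimensions and the scoring function are simplified relative to OpenNMT's actual implementation:

```python
import math

def attend(decoder_state, encoder_outputs):
    """Return attention weights and context vector (global dot attention)."""
    # alignment scores: dot product of the decoder state with each encoder output
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_outputs]
    # softmax the scores into a distribution over source positions
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # context vector: attention-weighted sum of the encoder outputs
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

# Toy 2-dimensional encoder outputs for a 3-token source sentence.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w, c = attend([1.0, 0.0], enc)
print([round(x, 3) for x in w])
```

The decoder uses the context vector, rather than a single fixed sentence encoding, at each output step.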
6.2 Training an NMT model for GEC
Given a corpus of parallel sentences, an NMT system is trained to maximize the
log-likelihood:
max_θ Σ_{n=1}^{N} log P(Y^n | X^n, θ) = max_θ Σ_{n=1}^{N} Σ_{t=1}^{T′} log P(y^n_t | y^n_1, y^n_2, ..., y^n_{t−1}, X^n, θ)    (6.2)
where θ = [θenc, θdec] represents all the parameters, N is the total number of training
examples in the corpus and (Xn, Yn) is the nth pair. We maximize the log-likelihood
using Stochastic Gradient Descent (SGD) with decaying learning rate. We performed
experiments using word level and sub-word level translation units, which are described in
the following sections. For training the NMT models, we used the Torch implementation
of OpenNMT framework. Results of these experiments are discussed in chapter 8.
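Given the per-token conditional probabilities produced by the model, the objective in eq. (6.2) reduces to a sum of token log-probabilities over all sentences; for example:

```python
import math

def corpus_log_likelihood(token_probs_per_sentence):
    """Sum of log P(y_t | y_<t, X, θ) over all sentences and timesteps, eq. (6.2)."""
    return sum(math.log(p)
               for sent in token_probs_per_sentence
               for p in sent)

# Hypothetical per-token probabilities for two target sentences.
probs = [[0.9, 0.8, 0.95], [0.7, 0.85]]
print(corpus_log_likelihood(probs))
```

Training maximizes this quantity; SGD updates θ in the direction of its gradient.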
6.2.1 Word level translation
During preprocessing, sentences are tokenized into sequences of tokens. In word-level
translation, each token is a word in the sentence.
We trained two NMT models, one using only the natural Lang-8 data and another
using both Lang-8 and synthetic brown data.
6.2.2 Sub-word level translation
In sub-word-level translation, tokenization generates tokens that are character
n-grams. Common sub-word translation units include character-level units, syllable-level
units and Byte Pair Encoding (BPE) units. We trained our NMT models using BPE units.
6.2.2.1 Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) (Shibata et al. (1999)) is a simple data compression
technique that iteratively replaces the most frequent pair of bytes in a sequence with a
single, unused byte. We adapt this algorithm for word segmentation. Instead of merging
frequent pairs of bytes, we merge characters or character sequences.
First, we initialize the symbol vocabulary with the character vocabulary and represent
each word as a sequence of characters, plus a special end-of-word symbol, which
allows us to restore the original tokenization after translation. We iteratively count all
symbol pairs and replace each occurrence of the most frequent pair (’A’, ’B’) with a new
symbol ’AB’. Each merge operation produces a new symbol which represents a character
n-gram. Frequent character n-grams (or whole words) are eventually merged into a single
symbol. For efficiency, we do not consider pairs that cross word boundaries. So, the algo-
rithm can be run on the dictionary extracted from a text, with each word being weighted
by its frequency.
Example 6.1 Sample BPE generation example
Vocabulary : ’low’, ’lowest’, ’newer’, ’wider’
BPE units (’.’ is end-of-word symbol ) :
r . → r.
l o→ lo
lo w→ low
e r. → er.
BPE-level translation models have been shown to perform well for translation between
related languages. We can cast GEC as translation between two closely related languages
that share a vocabulary but differ in grammatical structure.
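The merge loop described above can be sketched following the standard BPE word-segmentation algorithm; the word frequencies below are hypothetical, so the resulting merge order differs from the small example given earlier:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighting each word by its frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated characters with '.' as the end-of-word symbol.
vocab = {"l o w .": 5, "l o w e s t .": 2, "n e w e r .": 6, "w i d e r .": 3}
merges = []
for _ in range(4):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent symbol pair
    merges.append(best)
    vocab = merge_vocab(best, vocab)
print(merges)
```

Each recorded merge becomes one entry in the BPE codes file; applying the merges in order segments unseen words into learned sub-word units.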
6.2.2.2 Training NMT using BPE
We trained an NMT model using BPE units, with both the Lang-8 and synthetic Brown
data used for training. The OpenNMT framework supports sub-word-level tokenization
in addition to word-level tokenization and generates the BPE units from the parallel data.
For our experiment, BPE units were generated using 30k merge operations and the
models were trained with the same parameters as the word-level models.
Chapter 7
Scoring grammaticality of machine translation hypotheses
Over the last decade, re-ranking techniques, especially discriminative re-ranking, have
shown significant improvements in MT systems, not only for GEC but also for other
translation tasks (Hoang et al. (2016); Yuan et al. (2016)). For each source sentence,
rather than directly outputting the candidate with the highest probability, an n-best list
of candidate translations is collected from an MT system and then re-ranked using
re-ranking algorithms. Various machine learning algorithms have been adapted to these
re-ranking tasks, including boosting, perceptrons and SVMs.
In this chapter, we discuss our experiments on a re-ranking technique that re-ranks the
candidate hypotheses of any NLG system by assigning a grammaticality score to each
candidate hypothesis. The lower the score, the more grammatical the candidate. Our
scoring system is a supervised regressor trained on a large number of sentences of
varying grammaticality. In the following sections, we discuss the types of regression
models we experimented with and the process of generating training data for them.
7.1 Generating training data
To make our system as generic as possible, we wanted our training data to be as
diverse as possible, encompassing different levels of grammatical errors. We therefore
used the sentences from the test split of the SMT system (see section 5.2) and their
3-best candidate translations from our SMT model for GEC (see chapter 5), giving a
total of approximately 1200k sentences.
For each instance in our training data, we obtained the grammaticality score by
computing an edit measure that indicates how much effort it would take to edit the
incorrect sentence into its correct form; the higher the effort, the lower the
grammaticality. We improvise the edit scores by assigning a score to each word on the
source side (phrase from the source sentence) and target side (phrase from the target
sentence) of an edit based on its POS tag, and then taking the maximum of the sums of
scores on the two sides of the edit.
Example 7.1 Improvised edit score for a sample sentence
Source : He are a going by the office .
Target : He is going to the office .
edit 1 : (’are’ 3.0; ’a’ 1.0→ ’is’ 3.0)
edit 1 score = max(3.0+1.0, 3.0) = 4.0
edit 2 : (’by’ 2.0→ ’to’ 2.0)
edit 2 score= max(2.0, 2.0) = 2.0
Grammaticality score = 4.0 + 2.0 = 6.0
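This improvised scoring can be sketched as follows. The POS weight table is hypothetical, and the edits are given pre-extracted; a real implementation would derive them from an alignment between the source and target sentences:

```python
# Hypothetical weights per POS tag; unlisted tags default to 1.0.
POS_WEIGHTS = {"VERB": 3.0, "ADP": 2.0, "DET": 1.0}

def edit_score(source_words, target_words):
    """max(sum of source-side weights, sum of target-side weights) for one edit."""
    src = sum(POS_WEIGHTS.get(tag, 1.0) for _, tag in source_words)
    tgt = sum(POS_WEIGHTS.get(tag, 1.0) for _, tag in target_words)
    return max(src, tgt)

def grammaticality_score(edits):
    """Sum of per-edit scores; a higher score means more editing effort."""
    return sum(edit_score(s, t) for s, t in edits)

# Edits for "He are a going by the office ." -> "He is going to the office ."
edits = [
    ([("are", "VERB"), ("a", "DET")], [("is", "VERB")]),  # edit 1: score 4.0
    ([("by", "ADP")], [("to", "ADP")]),                   # edit 2: score 2.0
]
print(grammaticality_score(edits))  # 6.0, as in Example 7.1
```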
We then explore various traditional feature-based and deep-representation-based
regression techniques for training the model.
7.2 Regression models
Once the training data is created, we explore a few regression frameworks following
the traditional feature-based paradigm, namely (a) Linear Regression (LR), (b) Support
Vector Regression (SVR) and (c) regression using a Multilayer Perceptron (MLP), as
well as a deep-representation-learning paradigm based on long short-term memory
(LSTM). For training the models, we used an 80:20% train:test split of the generated
training data. We used the Pearson correlation coefficient as the evaluation metric,
measuring the correlation between the true and predicted values.
7.2.1 Feature based regression
For implementing the feature-based regressors (SVR, LR, MLP), the features we used
are: (a) averaged-out word embeddings (of dimension 300) of the input text obtained
using SpaCy (an open-source library for NLP) and (b) the log-likelihood of the sentence
obtained from a 5-gram language model generated while training the SMT system (see
section 5.1.1). For SVR, we tried Linear, Polynomial (degree=2) and RBF kernels with
default parameters. For MLP, the hidden layer size was empirically set to 8.
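The feature vector fed to these regressors can be sketched as an averaged embedding with the LM log-likelihood appended. The toy 3-dimensional embeddings below stand in for SpaCy's 300-dimensional vectors, and the LM score is a supplied number rather than an actual language-model query:

```python
def build_features(tokens, embeddings, lm_log_likelihood):
    """Average the token embeddings and append the LM log-likelihood."""
    dim = len(next(iter(embeddings.values())))
    zero = [0.0] * dim  # fallback for out-of-vocabulary tokens
    vecs = [embeddings.get(t, zero) for t in tokens]
    avg = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return avg + [lm_log_likelihood]

# Toy 3-d embeddings standing in for SpaCy's 300-d word vectors.
emb = {"he": [1.0, 0.0, 0.0], "is": [0.0, 1.0, 0.0], "going": [0.0, 0.0, 1.0]}
feats = build_features(["he", "is", "going"], emb, lm_log_likelihood=-4.2)
print(feats)
```

Because the averaging collapses word order, two hypotheses differing by one word produce nearly identical feature vectors, which foreshadows the weak correlations reported in chapter 8.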
7.2.2 Deep-representation based regression
For deep-representation-based regression, we implemented several recurrent neural
network (RNN) variants using the LSTM as the basic unit. The first variant uses a
stack of 2 LSTM layers on top of an input embedding layer, followed by a dense
layer with linear activation. The second variant is the same as the first, except that it
uses a single bidirectional LSTM layer, trying to capture contextual information from
both directions.

For all the above configurations, we used a dropout of 0.25 at the penultimate
layer, mean squared logarithmic error as the loss function and Adam as the optimizer.
Chapter 8
Results and Discussions
In this chapter, we discuss the results of our baseline SMT, word-level NMT and
sub-word-level NMT models, as well as the results of the grammaticality scoring models
described in chapter 7.

We use the BLEU, GLEU and M2 scorer (F0.5) evaluation metrics to evaluate our MT
models. Table 8.1 and Table 8.2 show the results of our MT-based GEC models on the
validation data used in our experiments and on the CoNLL-2014 test data.
Model BLEU GLEU
SMT 83.2 73.54
Word level NMT (no synthetic data) 75.97 67.99
Word level NMT (using synthetic data) 79.63 72.71
Sub-word level NMT 52.7 47.26
Table 8.1: Performance of MT models on validation data
Model BLEU GLEU F0.5 score
Our Models
SMT 84.96 56.84 22.63
Word level NMT (no synthetic data) 83.47 56.89 15.71
Word level NMT (using synthetic data) 86.52 58.11 21.2
Sub-word level NMT 79.45 55.7 16.74
Top 3 systems in CoNLL-2014 shared task
CAMB - - 37.33
CUUI - - 36.79
AMU - - 35.01
Table 8.2: Performance of MT models on CoNLL-2014 test data
From Table 8.1, we can see that the SMT system performs well on the validation data
compared to the NMT models; it may have seen many of the necessary correction
mappings for the validation data during training. Table 8.2 shows that it also performs
well on the CoNLL-2014 test data in terms of F-score, whereas the NMT model trained
using both Lang-8 and synthetic Brown data gives higher BLEU and GLEU values.

In the examples below, we refer to the NMT model trained using only Lang-8 data as
NMT1, the model trained using both synthetic and Lang-8 data as NMT2, and the
sub-word-level NMT model as NMT3.
Example 8.1 Correcting errors that depend on local contextual information
Source : I going the to college .
SMT : I go to college .
NMT1 : I ’m going to college .
NMT2 : I am going to college .
NMT3 : I am going to college .
Reference: I am / ’m going to the college .
Example 8.2 Correcting errors that depend on global contextual information
Source : The player in the ground are talking to each other .
SMT : The player in the ground are talking to each other .
NMT1 : The player in the ground is talking to each other .
NMT2 : The player in the ground are talking to each other .
NMT3 : The players in the ground are talking to each other .
Reference: The players in the ground are talking to each other .
From the above examples, we can see that our models are able to correct errors that
depend on local context, but they have difficulty correcting global errors.
Table 8.3 shows the performance of the scoring models described in chapter 7 on
validation data, measured by the Pearson correlation coefficient. The scores predicted
by our models do not correlate well with the improvised edit scores (see section 7.1).
The models predict a score from the whole context of the sentence, and the averaged
word embeddings that we input to the feature-based regression models will be very
similar for all the candidate hypotheses an NLG system produces for a given input
sentence. It is therefore likely that the feature-based regression models cannot capture
the minor differences between the candidate hypotheses.
Model Pearson
Feature based regression models
SVR (RBF kernel) 0.0033
SVR (Polynomial kernel) 0.006
SVR (Linear kernel) 0.019
Linear Regression 0.12
MLP 0.13
Deep-representation based regression models
2 LSTM layers 0.024
Bi-LSTM layer 0.032
Table 8.3: Performance of our scoring models on validation data
This problem also persists in the deep-representation models; moreover, the presence of
additional context in a sentence adds some amount of bias to the score (see Example 8.3).
Example 8.3 Bias due to the additional context
Sentence A : I is going to the school . → Score = 2.33
Sentence B : I is going to the school on foot . → Score = 4.2
Chapter 9
Future work
In our future work, we will build on top of our baseline MT models to improve the
performance of the GEC system in correcting complex errors. We propose a few ideas
which can improve the quality of the GEC system.

• Using GEC evaluation metrics to train NMT models:
Currently, NMT models are trained to maximize the log-likelihood (see section
6.2). We propose to train the NMT models to maximize GEC evaluation metrics
such as the F0.5 score and GLEU instead of the log-likelihood. Since these metrics
are not differentiable, we have to train the models using policy gradient methods.

• Multi-task learning:
In GEC, most errors are interacting errors and may depend on long-range
contextual information. MT models find it difficult to learn the global contextual
information needed to correct these errors. Training the model on multiple tasks,
such as GEC and POS tagging, simultaneously might improve the performance of
the GEC system. Through multi-task learning, the system can learn better global
features, which aids in correcting complex errors.
Chapter 10
Conclusion
In this report, we discussed grammatical error correction (GEC) in non-native English
text. We treated GEC as a translation task from ungrammatical to grammatical English.
We presented a survey of traditional approaches to GEC, i.e., rule-based, classifier-based
and machine translation approaches. We also discussed the GenERRate tool, used to
generate synthetic parallel data from the Brown corpus.

We implemented a baseline SMT model using the Moses framework and both word-level
and sub-word-level NMT models using the OpenNMT framework. We trained our
models using the Lang-8 and synthetic Brown datasets. We discussed Byte Pair
Encoding and its application to sub-word-level NMT models. We have shown that our
baseline MT models find it difficult to correct global errors, and we compared them with
the top 3 systems of the CoNLL-2014 shared task on GEC.

We experimented with a candidate re-ranking technique to re-rank the candidate
hypotheses of any NLG system by assigning a grammaticality score, but this technique
did not give satisfactory results. Finally, we concluded with possible future work in
GEC using MT models and multi-task learning.
References
Bahdanau, D., Cho, K., and Bengio, Y., 2014, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473
Brockett, C., Dolan, W. B., and Gamon, M., 2006, “Correcting esl errors using phrasal
smt techniques,” in Proceedings of the 21st International Conference on Computational
Linguistics and the 44th annual meeting of the Association for Computational Linguis-
tics (Association for Computational Linguistics). pp. 249–256.
Bustamante, F. R., and León, F. S., 1996, “Gramcheck: A grammar and style checker,” in
Proceedings of the 16th conference on Computational linguistics-Volume 1 (Associa-
tion for Computational Linguistics). pp. 175–181.
Cherry, C., and Foster, G., 2012, “Batch tuning strategies for statistical machine trans-
lation,” in Proceedings of the 2012 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (Associa-
tion for Computational Linguistics). pp. 427–436.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y., 2014, “Learning phrase representations using rnn encoder-decoder for
statistical machine translation,” arXiv preprint arXiv:1406.1078
Dahlmeier, D., and Ng, H. T., 2012a, “A beam-search decoder for grammatical error cor-
rection,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (Association for
Computational Linguistics). pp. 568–578.
Dahlmeier, D., and Ng, H. T., 2012b, “Better evaluation for grammatical error correction,”
in Proceedings of the 2012 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies (Association for
Computational Linguistics). pp. 568–572.
Dahlmeier, D., Ng, H. T., and Wu, S. M., 2013, “Building a large annotated corpus of
learner english: The nus corpus of learner english..” in BEA@ NAACL-HLT, pp. 22–
31.
Daudaravicius, V., Banchs, R. E., Volodina, E., and Napoles, C., 2016, “A report on the
automatic evaluation of scientific writing shared task..” in BEA@ NAACL-HLT, pp.
53–62.
Felice, M., Yuan, Z., Andersen, Ø. E., Yannakoudakis, H., and Kochmar, E., 2014,
“Grammatical error correction using hybrid systems and type filtering..” in CoNLL
Shared Task, pp. 15–24.
Foster, J., and Andersen, Ø. E., 2009, “Generrate: generating errors for use in grammatical
error detection,” in Proceedings of the fourth workshop on innovative use of nlp for
building educational applications (Association for Computational Linguistics). pp. 82–
90.
Han, N.-R., Chodorow, M., and Leacock, C., 2004, “Detecting errors in english article
usage with a maximum entropy classifier trained on a large, diverse corpus..” in LREC
Heafield, K., 2011, “Kenlm: Faster and smaller language model queries,” in Proceedings
of the Sixth Workshop on Statistical Machine Translation (Association for Computa-
tional Linguistics). pp. 187–197.
Heidorn, G. E., Jensen, K., Miller, L. A., Byrd, R. J., and Chodorow, M. S., 1982, “The
epistle text-critiquing system,” IBM Systems Journal 21, 305–326.
Hoang, D. T., Chollampatt, S., and Ng, H. T., 2016, “Exploiting n-best hypothe-
ses to improve an smt approach to grammatical error correction,” arXiv preprint
arXiv:1606.00210
Ji, J., Wang, Q., Toutanova, K., Gong, Y., Truong, S., and Gao, J., 2017, “A
nested attention neural hybrid model for grammatical error correction,” arXiv preprint
arXiv:1707.02026
Junczys-Dowmunt, M., and Grundkiewicz, R., 2014, “The amu system in the conll-2014
shared task: Grammatical error correction by data-intensive and feature-rich statistical
machine translation,” in Proceedings of the Eighteenth Conference on Computational
Natural Language Learning: Shared Task, pp. 25–33.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M., 2017, “Opennmt: Open-
source toolkit for neural machine translation,” arXiv preprint arXiv:1701.02810
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan,
B., Shen, W., Moran, C., Zens, R., et al., 2007, “Moses: Open source toolkit for sta-
tistical machine translation,” in Proceedings of the 45th annual meeting of the ACL on
interactive poster and demonstration sessions (Association for Computational Linguis-
tics). pp. 177–180.
Kunchukuttan, A., Chaudhury, S., and Bhattacharyya, P., 2014, “Tuning a grammar cor-
rection system for increased precision..” in CoNLL Shared Task, pp. 60–64.
Luong, M.-T., Pham, H., and Manning, C. D., 2015, “Effective approaches to attention-
based neural machine translation,” arXiv preprint arXiv:1508.04025
Madnani, N., Tetreault, J., and Chodorow, M., 2012, “Exploring grammatical error cor-
rection with not-so-crummy machine translation,” in Proceedings of the Seventh Work-
shop on Building Educational Applications Using NLP (Association for Computational
Linguistics). pp. 44–53.
Mizumoto, T., Hayashibe, Y., Komachi, M., Nagata, M., and Matsumoto, Y., 2012, “The
effect of learner corpus size in grammatical error correction of esl writings,” Proceed-
ings of COLING 2012: Posters, 863–872.
Mizumoto, T., Komachi, M., Nagata, M., and Matsumoto, Y., 2011, “Mining revision log
of language learning sns for automated japanese error correction of second language
learners..” in IJCNLP, pp. 147–155.
Napoles, C., and Callison-Burch, C., 2017, “Systematically adapting machine translation
for grammatical error correction,” in Proceedings of the 12th Workshop on Innovative
Use of NLP for Building Educational Applications, pp. 345–356.
Napoles, C., Sakaguchi, K., Post, M., and Tetreault, J., 2015, “Ground truth for gram-
matical error correction metrics,” in Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference
on Natural Language Processing, Vol. 2, pp. 588–593.
Nicholls, D., 2003, “The cambridge learner corpus: Error coding and analysis for lexi-
cography and elt,” in Proceedings of the Corpus Linguistics 2003 conference, Vol. 16,
pp. 572–581.
Och, F. J., 2003, “Minimum error rate training in statistical machine translation,” in Pro-
ceedings of the 41st Annual Meeting on Association for Computational Linguistics-
Volume 1 (Association for Computational Linguistics). pp. 160–167.
Och, F. J., and Ney, H., 2003, “A systematic comparison of various statistical alignment
models,” Computational linguistics 29, 19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J., 2002, “Bleu: a method for automatic
evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pp. 311–318.
Rozovskaya, A., Chang, K.-W., Sammons, M., and Roth, D., 2013, “The university of
illinois system in the conll-2013 shared task,” in Proceedings of the Seventeenth Con-
ference on Computational Natural Language Learning: Shared Task, pp. 13–19.
Rozovskaya, A., and Roth, D., 2011, “Algorithm selection and model adaptation for esl
correction tasks,” in Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies-Volume 1 (Association for
Computational Linguistics). pp. 924–933.
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., and
Arikawa, S., 1999, Byte Pair encoding: A text compression scheme that accelerates
pattern matching, Tech. Rep. (Technical Report DOI-TR-161, Department of Infor-
matics, Kyushu University).
Sjöbergh, J., and Knutsson, O., 2005, “Faking errors to avoid making errors: Very weakly
supervised learning for error detection in writing,” in Proceedings of RANLP, pp. 506–
512.
Sun, C., Jin, X., Lin, L., Zhao, Y., and Wang, X., 2015, “Convolutional neural networks
for correcting english article errors,” in Natural Language Processing and Chinese
Computing (Springer). pp. 102–110.
Susanto, R. H., 2015, Systems Combination for Grammatical Error Correction, Ph.D.
thesis
Sutskever, I., Vinyals, O., and Le, Q. V., 2014, “Sequence to sequence learning with neural
networks,” in Advances in neural information processing systems, pp. 3104–3112.
Tetreault, J. R., and Chodorow, M., 2008, “The ups and downs of preposition error detec-
tion in esl writing,” in Proceedings of the 22nd International Conference on Computa-
tional Linguistics-Volume 1 (Association for Computational Linguistics). pp. 865–872.
Xie, Z., Avati, A., Arivazhagan, N., Jurafsky, D., and Ng, A. Y., 2016, “Neural language
correction with character-based attention,” arXiv preprint arXiv:1603.09727
Yannakoudakis, H., Briscoe, T., and Medlock, B., 2011, “A new dataset and method for
automatically grading esol texts,” in Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies-Volume 1
(Association for Computational Linguistics). pp. 180–189.
Yuan, Z., and Briscoe, T., 2016, “Grammatical error correction using neural machine
translation..” in HLT-NAACL, pp. 380–386.
Yuan, Z., Briscoe, T., and Felice, M., 2016, “Candidate re-ranking for smt-based gram-
matical error correction..” in BEA@ NAACL-HLT, pp. 256–266.
Yuan, Z., and Felice, M., 2013, “Constrained grammatical error correction using statistical
machine translation..” in CoNLL Shared Task, pp. 52–61.