GRAMMATICAL ERROR CORRECTION
B.Tech Project Stage 1 Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology (Honors)
by
G. Krishna Chaitanya(140050038)
Supervisor: Prof. Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai 400076 (India)
14 November 2017
Abstract
Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. Earlier attempts at grammatical error correction involved rule-based and classifier approaches, which are limited to correcting only particular types of errors in a sentence. As sentences may contain multiple errors of different types, a practical error correction system should be able to detect and correct all of them.
In this report, we investigate GEC as a translation task from incorrect to correct English and explore some machine translation approaches for developing end-to-end GEC systems for all error types. We apply Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) approaches to GEC and show that, unlike the earlier methods which focus on individual errors, they can correct multiple errors of different types in a sentence. We also discuss some of the weaknesses of machine translation approaches. Finally, we experiment with a candidate re-ranking technique to re-rank the hypotheses generated by machine translation systems: with regression models, we predict a grammaticality score for each candidate hypothesis and re-rank the candidates according to that score.
Acknowledgement
I would like to extend my heartfelt gratitude to my guide, Prof. Pushpak Bhattacharyya
and my co-guide, Abhijit Mishra for their constant guidance and support throughout the
project. I am extremely grateful to them for spending their valuable time clarifying my
doubts whenever I approached them.
G. Krishna Chaitanya
IIT Bombay
14 November 2017
Contents
Abstract

Acknowledgement

1 Introduction
   1.1 Outline

2 Overview of Grammatical Error Correction
   2.1 Types of grammatical errors
   2.2 Properties of an ideal GEC system
   2.3 Evaluation Metrics
       2.3.1 BLEU
       2.3.2 GLEU
       2.3.3 Max-Match (M2) Scorer

3 Literature Survey
   3.1 Rule based approach
   3.2 Classifier based approach
   3.3 Machine Translation approach
       3.3.1 Statistical Machine Translation
       3.3.2 Neural Machine Translation

4 Datasets
   4.1 Natural data
   4.2 Synthetic data

5 Statistical Machine Translation
   5.1 Components of an SMT model
       5.1.1 Language model
       5.1.2 Translation model
       5.1.3 Decoder
   5.2 Training an SMT model for GEC

6 Neural Machine Translation
   6.1 Architecture of NMT model
   6.2 Training an NMT model for GEC
       6.2.1 Word level translation
       6.2.2 Sub-word level translation
             6.2.2.1 Byte Pair Encoding (BPE)
             6.2.2.2 Training NMT using BPE

7 Scoring grammaticality of machine translation hypotheses
   7.1 Generating training data
   7.2 Regression models
       7.2.1 Feature based regression
       7.2.2 Deep-representation based regression

8 Results and Discussions

9 Future work

10 Conclusion

References
Chapter 1
Introduction
Today, millions of people around the world are learning English. In fact, non-native
English speakers currently outnumber native speakers and their numbers will keep
increasing in the future. Non-native English speakers usually make errors in text, and
these errors may belong to different error types (see section 2.1) and also vary in their
complexity. A practical grammatical error correction (GEC) system to correct errors
in English text promises to benefit millions of learners around the world. From a
commercial perspective, there is great potential for many practical applications, such as
proofreading tools that help non-native speakers identify and correct their writing errors
without human intervention or educational software for automated language learning and
assessment.
Publicly usable Web services for assisting second language learning have been growing
in recent years. For example, there are language learning social networking services
such as Lang-8 and English grammar checkers such as Ginger and Grammarly. However,
popular commercial proofreading tools only target a few error types that are easy to
correct, such as spelling mistakes (*baeutiful/beautiful) or wrong past participle forms
of irregular verbs (*runned/run), and do not include those aspects of English that are
harder to learn. An error correction system that can only correct one or a few types
of errors will be of limited use to learners. Instead, a good system should be able to
correct a variety of error types and corrections should be performed at a global rather
than local level, taking interacting errors into account. GEC models can also be added
to the pipeline of several Natural Language Generation (NLG) systems, such as machine
translation, question answering and speech-to-text, to enhance the quality of the candidate
hypotheses predicted by these systems.
1.1 Outline
Chapter 2 gives an overview of grammatical error correction. It discusses the types of
grammatical errors that can occur in a sentence, properties of an ideal grammatical er-
ror correction system and popular evaluation metrics like BLEU, GLEU and Max-Match
Scorer (F-score). In chapter 3, we present a survey of traditional approaches to gram-
matical error correction. Chapter 4 describes the popular datasets used for training GEC
models. We also describe the genERRate tool used to introduce grammatical errors into
sentences. In chapter 5, we describe the components of an SMT system and their training
to generate a baseline SMT model for GEC. In chapter 6, we discuss the architecture
and implementation of an NMT model for GEC using word-level and sub-word (Byte
Pair Encoding) level translation units. In chapter 7, we explore a candidate re-ranking
technique to re-rank the candidate hypotheses generated by an NLG system. In chapter
8, we present the results of our experiments, compare them and discuss general
observations. In chapter 9, we look into possible future steps for GEC. Finally, we
conclude the report in chapter 10.
Chapter 2
Overview of Grammatical Error Correction
Grammatical error correction (GEC) is the task of automatically correcting grammati-
cal errors in text. More specifically, the task is to build a system that takes an input text,
analyses the context of the text to identify and correct any grammatical errors, and re-
turns a corrected version that retains the original meaning. There are different kinds of
grammatical errors that can occur in a text, and some of them are difficult to correct. The
following sections describe some of the common errors, the difficulties in correcting them,
and the metrics used for evaluating a GEC system.
2.1 Types of grammatical errors
Many different types of grammatical errors can occur in a text. Some are very frequent,
some are less frequent; some adversely affect readability and some do not affect it
much. The following are some of the common error types.
• Subject-Verb agreement: These errors occur when the subjects and verbs of a
sentence do not agree in person or in number.
Ex: Incorrect: He walk to college.
Correct: He walks to college.
• Verb tense: These errors occur when an incorrect verb tense is used in a sentence.
Ex: Incorrect: I have seen him yesterday.
Correct: I saw him yesterday.
• Noun Agreement: Nouns must agree in number with the nouns they are referencing.
This means that singular nouns must be used to refer to singular nouns and plural
nouns must be used to refer to plural nouns.
Ex: Incorrect: There are a lot of restaurant in the college.
Correct: There are a lot of restaurants in the college.
• Pronoun: Like subjects that agree with verbs and nouns that agree with other
nouns, pronouns must agree in gender, person and number with their antecedents.
Ex: Incorrect: The girls won her game.
Correct: The girls won their game.
• Word form: Word form errors occur when the correct word is chosen but an incorrect
form (part of speech) of the word is used.
Ex: Incorrect: The danger (noun) tiger ran through the village.
Correct: The dangerous (adjective) tiger ran through the village.
• Word order: These errors occur when the words of a sentence are not in the proper
order, so the sentence does not convey the intended meaning.
Ex: Incorrect: He played yesterday football.
Correct: He played football yesterday.
• Preposition: These errors occur due to a missing or incorrectly used preposition in
a sentence.
Ex: Incorrect: The train will arrive within five minutes.
Correct: The train will arrive in five minutes.
• Article: These errors occur due to a missing or incorrectly used article in a sen-
tence.
Ex: Incorrect: The Paris is big city.
Correct: Paris is a big city.
• Double negatives: Double negatives are two negative words used in the same sen-
tence. Using two negatives turns the thought or sentence into a positive one.
Ex: Incorrect: I can’t hardly believe.
Correct: I can hardly believe.
2.2 Properties of an ideal GEC system
This section describes some of the key properties to be satisfied by an ideal grammatical
error correction system.
• Error coverage denotes the ability of a system to identify and correct a variety of
error types.
• Error complexity indicates the capacity of a system to address complex errors such
as those where multiple errors interact. An ideal GEC system should also correct
errors which depend on long range contextual information.
• Generalizability refers to the ability of a system to identify errors in new, unseen
contexts and propose corrections beyond those observed in training data.
2.3 Evaluation Metrics
Evaluation metrics allow one to measure the performance of the system. When evaluat-
ing a GEC system, the system’s output is compared to gold-standard references provided
by human experts. There are several metrics that have been proposed and used to evaluate
GEC systems. In this section, we describe the three commonly used metrics.
2.3.1 BLEU
BLEU was first proposed by Papineni et al. (2002) and is now the dominant method
for MT evaluation. It scores the text produced by an MT system by its similarity to
professional human translations: the closer the output, the higher the score.
Unlike metrics which rely on references with explicitly labelled error annotations,
BLEU only requires corrected references. Since both the original and corrected sentences
are in the same language (i.e. English) and most words in the sentence do not need
changing, BLEU scores for GEC systems are relatively high compared with standard MT
tasks. BLEU also allows multiple references, which is useful for errors with multiple
alternative corrections, but it does not provide a per-error-type breakdown of system
performance.
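The clipped (modified) n-gram precision at the heart of BLEU can be sketched in a few lines of Python. This is an illustrative single-sentence snippet only; it omits BLEU's brevity penalty and the geometric averaging over n-gram orders, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the best-matching reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

refs = ["he walks to college".split()]
print(modified_precision("he walk to college".split(), refs, 1))  # 0.75
```

Because source and reference share most words in GEC, even an uncorrected sentence scores 0.75 here, which illustrates why BLEU scores for GEC are relatively high.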
2.3 Evaluation Metrics 6
2.3.2 GLEU
Generalized Language Evaluation Understanding (GLEU), proposed by Napoles
et al. (2015), is a simple variant of BLEU for GEC which takes the original source also
into account. GLEU modifies the n-gram precision in BLEU to assign extra weight to
n-grams present in the candidate that overlap with the reference but not the source, and
penalize those in the candidate that are in the source but not the reference.
Like BLEU, GLEU allows multiple references at the corpus level, and likewise does not
provide a per-error-type breakdown of system performance.
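To make the source-aware weighting concrete, here is a deliberately simplified, single-sentence, single-order sketch of the GLEU idea. The actual metric of Napoles et al. (2015) combines several n-gram orders at the corpus level; the penalty weight here is an assumption for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu_sketch(source, candidate, reference, n=1, penalty=1.0):
    """Reward candidate n-grams matching the reference; penalize candidate
    n-grams that the reference removed from the source."""
    src, cand, ref = (ngram_counts(t.split(), n) for t in (source, candidate, reference))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    # n-grams the system kept from the source although the reference removed them
    kept_errors = sum(min(c, src[g] - ref[g]) for g, c in cand.items() if src[g] > ref[g])
    total = sum(cand.values())
    return max(0.0, matched - penalty * kept_errors) / total if total else 0.0

src, ref = "he walk to college", "he walks to college"
print(gleu_sketch(src, src, ref))  # 0.5 : leaving the error uncorrected is penalized
print(gleu_sketch(src, ref, ref))  # 1.0 : the full correction scores highest
```

Unlike plain n-gram precision, the uncorrected source no longer scores well, which is exactly the behaviour GLEU was designed for.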
2.3.3 Max-Match (M2) Scorer
The M2 scorer, proposed by Dahlmeier and Ng (2012b), evaluates system performance
by how well its proposed corrections, or edits (e_i), match the gold-standard edits (g_i).
It computes the sequence of phrase-level edits between a source sentence and a system's
candidate that achieves the highest overlap with the gold standard annotation. The
system is evaluated by computing precision (P), recall (R) and F-score.
P = Σ_{i=1}^{n} |e_i ∩ g_i| / Σ_{i=1}^{n} |e_i|    (2.1)

R = Σ_{i=1}^{n} |e_i ∩ g_i| / Σ_{i=1}^{n} |g_i|    (2.2)

F_β = (1 + β²) · P · R / (β² · P + R)    (2.3)
The M2 scorer was the official scorer in the CoNLL 2013 and 2014 shared tasks on
GEC, where F1 was used in CoNLL-2013 and F0.5 was used in CoNLL-2014.
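Given the per-sentence edit sets, equations (2.1)-(2.3) are straightforward to compute. The sketch below assumes the system and gold edits have already been extracted and aligned; the real M2 scorer's main complexity, finding the phrase-level edit sequence with maximal overlap, is omitted, and the edit tuples are invented for the example.

```python
def m2_scores(system_edits, gold_edits, beta=0.5):
    """Corpus-level precision, recall and F_beta over per-sentence edit sets.
    Each edit is represented as a (start, end, correction) tuple."""
    tp = sum(len(e & g) for e, g in zip(system_edits, gold_edits))
    proposed = sum(len(e) for e in system_edits)
    gold = sum(len(g) for g in gold_edits)
    p = tp / proposed if proposed else 1.0
    r = tp / gold if gold else 1.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# One sentence: the system proposes two edits, one of which matches the gold edit.
system = [{(1, 2, "walks"), (0, 1, "The")}]
gold = [{(1, 2, "walks")}]
p, r, f = m2_scores(system, gold)
print(p, r, round(f, 4))  # 0.5 1.0 0.5556
```

With β = 0.5, precision is weighted more heavily than recall, matching the CoNLL-2014 setting.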
Chapter 3
Literature Survey
This chapter presents a survey of work done on grammatical error correction. It dis-
cusses the earlier rule-based and machine learning classifier approaches and the more
recent machine translation approaches to GEC.
3.1 Rule based approach
Early attempts at grammatical error correction employed hand-coded rules (Heidorn
et al. (1982); Bustamante and León (1996)). Initial rule-based systems were based on
simple pattern matching and string replacement. Gradually, rule-based systems also
incorporated syntactic analysis (i.e. POS tagging and tree parsing) and manually developed
grammar rules. In this approach, a set of rules is matched against a text that has at
least been POS tagged. Rule-based systems are generally easy to implement for
certain types of errors and can be very effective for them. However, many errors are complex,
and rule-based systems fail to correct them. Because language can be used in limitless
ways, it is impossible to define rules for every possible error. As a result, most
current GEC systems do not employ rule-based mechanisms.
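As a toy illustration of the pattern-matching style of these early systems, the snippet below fixes a simple subject-verb agreement pattern. The word lists and the rule are invented for this example; real rule-based checkers drove such rules from POS tags and parse trees.

```python
# After a third-person singular pronoun, a base-form verb from a small
# hand-coded lexicon takes an "-s"/"-es" ending.  The lexicon is illustrative.
SINGULAR_SUBJECTS = {"he", "she", "it"}
BASE_VERBS = {"walk", "go", "run", "play"}

def correct_sv_agreement(sentence):
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in SINGULAR_SUBJECTS and tokens[i + 1] in BASE_VERBS:
            verb = tokens[i + 1]
            tokens[i + 1] = verb + ("es" if verb.endswith("o") else "s")
    return " ".join(tokens)

print(correct_sv_agreement("He walk to college ."))  # He walks to college .
```

The brittleness is visible immediately: any subject or verb outside the hand-coded lists is silently missed, which is exactly the coverage problem described above.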
3.2 Classifier based approach
With the advent of large-scale annotated data, several data-driven approaches were
developed and machine learning algorithms were applied to build classifiers to correct
specific error types (Han et al. (2004); Rozovskaya and Roth (2011)). The classifier
approach to error correction has been prominent for a long time before MT, since building
a classifier does not require having annotated learner data.
In the classifier approach, GEC is cast as a multi-class classification problem.
For correcting the errors, a finite candidate set containing all the possible correction
candidates, such as a list of prepositions, is considered as the set of class labels. Features
for the classifier include surrounding n-grams, part-of-speech (POS) tags, grammatical
relations, parse trees, etc. Since the most useful features depend on the error type,
classifiers are trained individually to handle a specific error type. The linguistic features
are embedded into vectors and the classifier is trained using various machine learning
algorithms. New errors can be detected and corrected using the trained classifier by
comparing the original word in the text with the most likely candidate predicted by the
classifier. Earlier classifier approaches were mainly used to correct article and preposition
errors, since these are common errors and can be tackled easily with machine learning
approaches. Han et al. (2004) trained a maximum entropy classifier to detect article
errors and achieved an accuracy of 88%. Tetreault and Chodorow (2008) used maxi-
mum entropy models to correct errors for 34 common English prepositions in learner text.
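As a minimal illustration of this candidate-set formulation, the toy model below predicts a preposition from its immediate (previous word, next word) context by simple counting. The cited systems instead train maximum entropy classifiers over much richer features; this sketch, with its invented training sentences, only shows the problem setup.

```python
from collections import Counter, defaultdict

# The finite candidate set acts as the class-label set.
PREPOSITIONS = {"in", "on", "at", "of", "to"}

def train(sentences):
    """Count (previous word, next word) contexts observed for each preposition."""
    model = defaultdict(Counter)
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok in PREPOSITIONS and 0 < i < len(toks) - 1:
                model[(toks[i - 1], toks[i + 1])][tok] += 1
    return model

def predict(model, prev_word, next_word):
    """Return the most frequent preposition seen in this context, if any."""
    ctx = model.get((prev_word, next_word))
    return ctx.most_common(1)[0][0] if ctx else None

model = train([
    "the train will arrive in five minutes",
    "the bus will arrive in ten minutes",
    "he is good at chess",
])
print(predict(model, "arrive", "five"))  # in
```

Note that the prediction is conditioned on the surrounding words being correct, which is precisely the assumption criticized in the next paragraph.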
Each classifier corrects a single word for a specific error category individually. This
ignores dependencies between the words in a sentence. Also, by conditioning on the
surrounding context, the classifier implicitly assumes that the surrounding context is free
of grammatical errors, which is often not the case. In many real-world cases, sentences
contain multiple errors of different types, and these errors may depend on each other,
so a GEC system that can correct only one particular type will be of very limited use.
Over time, researchers have come up with multiple solutions to the problem of
correcting multiple errors in a sentence. One commonly used approach is to build
multiple classifiers, one for each error type, and cascade them into a pipeline. Rozovskaya
et al. (2013) proposed a combination of rule-based and classifier models to build GEC
systems which can correct multiple errors. However, these approaches are useful
only when the errors are independent of each other and cannot solve the problem
of dependent errors. A typical example of predictions made by the Rozovskaya et al.
(2013) system of multiple classifiers for a sentence containing dependent errors is shown
below.
Example: The dogs is barking in the street.
System predicted output: The dog are barking in the street.
Several approaches have been proposed to address the problem of interacting er-
rors. Dahlmeier and Ng (2012a) developed a beam-search decoder for correcting inter-
acting errors. The decoder iteratively generates new hypothesis corrections from current
hypotheses and scores them based on features of grammatical correctness and fluency.
These features include scores from discriminative classifiers for specific error categories,
such as articles and prepositions, and a language model (LM). This decoder approach
outperformed a pipeline system of individual classifiers and rule-based models.
3.3 Machine Translation approach
A practical GEC system should be able to correct various types of errors. In more
recent research, MT techniques have been used to successfully correct a broader set of
errors. Machine translation systems automatically translate text from a source language
into a target language. Grammatical error correction can thus be treated as a translation
problem from ungrammatical sentences into grammatical sentences.
3.3.1 Statistical Machine Translation
SMT has been the dominant MT approach for the past two decades. The model consists
of two components: a language model assigning a probability p(e) for any target sentence
e, and a translation model that assigns a conditional probability p(f | e). The language
model is learned using a monolingual corpus in the target language. The parameters of
the translation model are estimated from a parallel corpus, i.e. the set of foreign sentences
and their corresponding translations into the target language. Brockett et al. (2006) first
proposed the use of an SMT model for correcting a set of 14 countable/uncountable
noun errors made by learners of English. Their training data consisted of a large corpus
of sentences extracted from news articles, deliberately modified to include typical
countability errors. Artificial errors were introduced in a deterministic manner
using hand-coded rules, including operations such as changing quantifiers (much →
many), generating plurals (advice → advices) or inserting unnecessary determiners.
Experiments showed their SMT system was generally able to beat the standard Microsoft
Word 2003 grammar checker, although it produced a relatively higher rate of erroneous
corrections.
Mizumoto et al. (2011) used a similar SMT approach, focusing on correcting
grammatical errors made by learners of Japanese. However, their training corpus comprised
authentic learner sentences together with corrections made by native speakers on the social
learning network website Lang-8. Their results show that the approach is a viable way
of obtaining very high performance at a relatively low cost provided a large amount of
training data is available. Mizumoto et al. (2012) investigated the effect of training corpus
size on various types of grammatical errors in the English language. Their results showed
that a phrase-based SMT system is effective at correcting errors that can be identified
by a local context, but less effective at correcting errors that need long-range contextual
information. Yuan and Felice (2013) trained a POS-factored SMT system to correct five
types of errors in text for the CoNLL-2013 shared task on GEC. These five error types
involve articles, prepositions, noun number, verb form, and subject-verb agreement. The
limited training data provided for the task was not sufficient for training an effective SMT
system so they also explored alternative ways of generating pairs of incorrect and correct
sentences automatically from other existing learner corpora.
In the CoNLL-2014 shared task on GEC, several top performing systems employed
hybrid approaches. Felice et al. (2014) proposed a pipeline of a rule-based system and
a phrase-based SMT system augmented by a large web-based language model. The
generated candidates are ranked using the language model (LM), with the most probable
candidate being selected as the final corrected version. Susanto (2015) made an attempt at
combining MT and classifier models, using CoNLL-train and Lang-8 as non-native
data and English Wikipedia as native data. Junczys-Dowmunt and Grundkiewicz (2014)
also employed the phrase-based SMT approach. They used the word-level Levenshtein
distance between source and target as a translation model feature. To increase
precision, they tuned the feature weights for F-score using the k-best Margin Infused
Relaxed Algorithm (MIRA) of Cherry and Foster (2012) and Minimum Error Rate Training
(MERT) of Och (2003). Kunchukuttan et al. (2014) subsequently found that tuning for
F-score to increase precision yielded worse performance.
More recently, Napoles and Callison-Burch (2017) proposed a lightweight approach
to GEC called Specialized Machine translation for Error Correction (SMEC), a
single model that handles morphological changes, spelling corrections, and
phrasal substitutions. This model is developed by examining different aspects of the
SMT pipeline, identifying and applying modifications tailored for GEC, introducing
artificial data, and evaluating how each of these specializations contributes to the overall
performance. The analysis provided in this work will help improve future efforts in GEC,
and can be used to inform approaches rooted in both neural and statistical MT.
Other approaches using machine translation for error correction are not aimed at
training SMT systems but rather at using them as auxiliary tools for producing round-trip
translations (i.e. translations into a pivot foreign language and back into English). Hermet
and Desilets (2009) focused on sentences containing preposition errors and generated a
round-trip translation via French and their model was able to correct 66.4% of errors.
Madnani et al. (2012) used round-trip translations obtained from the Google Translate
API via 8 different pivot languages for an all-errors task.
3.3.2 Neural Machine Translation
Recently, neural machine translation (NMT) systems have achieved substantial
improvements in translation quality over phrase-based MT systems (Sutskever et al.
(2014); Bahdanau et al. (2014)). Thus, there is growing interest in applying neural
systems to GEC. Sun et al. (2015) employed a Convolutional Neural Network (CNN) for
article error correction. Instead of building classifiers using pre-defined syntactic and/or
semantic features, a CNN model is trained from surrounding words with pre-trained word
embeddings. Yuan and Briscoe (2016) proposed a neural machine translation approach
for GEC, using recurrent neural networks to perform sequence-to-sequence mapping from
erroneous to well-formed sentences.
The core component of most NMT systems is a sequence-to-sequence (seq2seq)
model which consists of an encoder and a decoder. An encoder encodes a sequence of
source words into a vector and then a decoder generates a sequence of target words from
the vector. Different network architectures have been proposed for NMT. Sutskever et al.
(2014) and Cho et al. (2014) used RNNs for both encoding and decoding. Sutskever et al.
(2014) used a multilayer Long Short-Term Memory (LSTM) to encode a source sentence
into a fixed-sized vector, and another LSTM to decode a target sentence from the vector
whereas Cho et al. (2014) used two Gated Recurrent Unit (GRU) models, one as the
encoder and another as the decoder.
Unlike the phrase-based SMT models, the seq2seq model can capture long-distance,
or even global, word dependencies, which are crucial to correcting global grammatical
errors. In order to achieve better performance on GEC, a seq2seq model has to address
several task-specific challenges like dealing with an extremely large vocabulary size, and
capturing structure at different levels of granularity in order to correct errors of different
types. For example, while correcting spelling and local grammar errors requires only
word-level or sub-word-level information, e.g., violets → violates (spelling) or violate →
violates (verb form), correcting errors in word order or usage requires global semantic
relationships among phrases and words. Yuan and Briscoe (2016) addressed the large
vocabulary problem by restricting the vocabulary to a limited number of high-frequency
words and resorting to standard word translation dictionaries to provide translations for
the words that are out of the vocabulary (OOV). However, this approach often fails to
take into account the OOVs in context for making correction decisions, and does not
generalize well to correcting words that are unseen in the parallel training data. An
alternative approach, proposed by Xie et al. (2016), applies a character-level sequence
to sequence neural model. Although the model eliminates the OOV issue, it cannot
effectively leverage word-level information for GEC, even if it is used together with a
separate word-based language model.
Bahdanau et al. (2014) and Luong et al. (2015) used attention mechanisms in NMT
and showed that attention-based models are better than non-attentional ones in handling
long sentences. Ji et al. (2017) proposed a hybrid neural model with nested attention
layers for GEC. It consists of a word-based seq2seq model as a back-bone which closely
follows the basic neural seq2seq architecture with attention as proposed by Bahdanau
et al. (2014), and additional character-level encoder, decoder, and attention components,
which focus on words that are outside the word-level model vocabulary. This nested atten-
tion model is shown to be very effective at correcting global word errors and significantly
outperforms previous neural models for GEC as measured on the standard CoNLL-14
benchmark dataset.
Chapter 4
Datasets
Learner corpora for GEC consist of text produced by non-native English speakers. There
are two broad categories of such data for GEC. The first is error-coded text, in which
annotators have coded the spans of learner text containing an error. The second is parallel
datasets, which contain the original text and a corrected version of the text, without
explicitly coded error corrections.
Synthetic learner corpora for GEC have also been developed by artificially introducing
errors into grammatically correct sentences.
4.1 Natural data
There are many publicly available datasets of both kinds, error-annotated corpora and parallel
corpora. Some commonly used error-annotated corpora include the NUS Corpus of
Learner English (NUCLE; 57k sentence pairs) (Dahlmeier et al. (2013)), the Cambridge
Learner Corpus (CLC; 1.9M pairs) (Nicholls (2003)), and a subset of the CLC, the First
Certificate in English (FCE; 34k pairs) (Yannakoudakis et al. (2011)). MT systems are
trained on parallel text, which can be extracted from error-coded corpora by applying the
annotated corrections, resulting in a clean corpus with nearly-perfect word and sentence
alignments.
Two popular parallel corpora are the Automatic Evaluation of Scientific Writing
corpus, with more than 1 million sentences of scientific writing corrected by professional
proofreaders (Daudaravicius et al. (2016)), and the Lang-8 Corpus of Learner English,
which contains 1 million sentence pairs scraped from an online forum for language
learners, with corrections made by other members of the lang-8.com online community.
In our experiments, we used the Lang-8 parallel corpus and the synthetic Brown corpus
(see section 4.2) for training an MT system for GEC.
4.2 Synthetic data
Artificial errors have been employed previously in targeted error detection. Sjöbergh
and Knutsson (2005) introduced split compound errors and word order errors into
Swedish texts and used the resulting artificial data to train their error detection system.
Brockett et al. (2006) introduced errors involving mass/count noun confusions into En-
glish news wire text and then used the resulting parallel corpus to train a phrasal SMT
system to perform error correction.
We generated synthetic data by introducing errors into the Brown corpus using the
GenERRate error generation tool (Foster and Andersen (2009)). This tool takes as input
a corpus and an error analysis file consisting of a list of errors, and produces an error-
tagged corpus of ungrammatical sentences. The errors are introduced by inserting,
deleting, moving and substituting POS-tagged words in a sentence, as specified in the
error configuration file.
Example:
The couple was married Aug. 2 , 1913 .
↓
GenERRate
↓
The couples was married Aug. 2 , 1913 .
The couple was marrying Aug. 2 , 1913 .
The couple was marries Aug. 2 , 1913 .
The couple were married Aug. 2 , 1913 .
Using this tool, we generated a parallel corpus containing approximately 900k sentence
pairs and used them for training an MT system for GEC.
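The insert/delete/move/substitute operations can be illustrated with a short script. This is not the GenERRate tool itself, but a toy injector showing how ungrammatical variants can be derived from a clean sentence; the operation set and seed are arbitrary, and the real tool drives its operations from POS tags and the error configuration file.

```python
import random

def inject_errors(sentence, n_variants=3, seed=0):
    """Produce ungrammatical variants of a sentence by deleting,
    duplicating or swapping adjacent tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = set()
    while len(variants) < n_variants:
        toks = tokens[:]
        op = rng.choice(("delete", "duplicate", "swap"))
        i = rng.randrange(len(toks) - 1)
        if op == "delete":
            del toks[i]
        elif op == "duplicate":
            toks.insert(i, toks[i])
        else:
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
        variant = " ".join(toks)
        if variant != sentence:
            variants.add(variant)
    return sorted(variants)

for v in inject_errors("The couple was married Aug. 2 , 1913 ."):
    print(v)
```

Pairing each variant with the original sentence yields (incorrect, correct) training pairs of the kind used above.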
Chapter 5
Statistical Machine Translation
In this chapter, we describe the structure of the SMT model and apply it to GEC
by training the model on the Lang-8 and synthetic Brown data (see section 4.2).
The statistical machine translation approach is based on the noisy-channel model.
The best translation e* for a foreign sentence f is:

e* = argmax_e p(e|f) = argmax_e p(f|e) p(e)    (5.1)

There are three main components in an SMT model, computing the quantities p(e),
p(f|e) and e*. Each component is described in the sections below.
5.1 Components of an SMT model
The three components of SMT model are language model (LM), translation model
(TM) and decoder.
5.1.1 Language model
A language model (LM) is a function that takes an English sentence as input and
returns the probability that it is a valid English sentence. It computes the parameter p(e)
in eq(5.1). N-gram language models are commonly used in SMT systems.
For a sentence e = {w1, w2, ...., wm}, the N-gram LM computes the probability p(e)
from the conditional probabilities of each word wi in the sentence:

p(e) = p(w1, w2, ...., wm) = ∏_{i=1}^{m} p(wi | wi−n+1, ...., wi−1)    (5.2)

where

p(wi | wi−n+1, ...., wi−1) = #(wi−n+1, ...., wi) / #(wi−n+1, ...., wi−1)    (5.3)
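Eq. (5.3) is simply a ratio of n-gram counts. A minimal bigram (n=2) version, without the smoothing that a real LM toolkit such as KenLM would apply, can be sketched as:

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]  # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """p(e) as a product of bigram conditional probabilities, eq. (5.2)/(5.3).

    Note: an unseen bigram yields probability 0 here; a real LM smooths this.
    """
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # #(w_{i-1}, w_i) / #(w_{i-1})
    return p

corpus = [["he", "is", "going"], ["she", "is", "going"], ["he", "is", "here"]]
uni, bi = train_bigram_counts(corpus)
print(sentence_prob(["she", "is", "here"], uni, bi))  # 1/9 ≈ 0.111
```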
5.1.2 Translation model
A translation model (TM) estimates the lexical correspondence between the two
languages. It computes the parameter p(f|e) in eq. (5.1). Translation models are trained
on parallel corpora, which are sentence-aligned and then word-aligned using a set of
statistical models developed at IBM in the 1980s. These word alignments are used to
extract phrase-phrase translations, or hierarchical rules as required, and corpus-wide
statistics on these rules are used to estimate probabilities.
5.1.3 Decoder
The decoder chooses the best translation e∗ from the pool of all possible candidate
translations, i.e., it selects the hypothesis e that maximizes p(f|e)p(e). Beam search is
one of the most popular decoding algorithms used in SMT systems.
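A toy beam search over per-position candidate words can illustrate the idea. The scoring function here is a hypothetical stand-in for the combined model score p(f|e)p(e) that a real decoder would use:

```python
def beam_search(candidates_per_position, score_fn, beam_size=2):
    """Keep the beam_size best partial hypotheses at each output position.

    candidates_per_position: list of candidate-word lists, one per position.
    score_fn: maps a word sequence to a score (higher is better).
    """
    beam = [([], 0.0)]
    for candidates in candidates_per_position:
        expanded = []
        for hyp, _ in beam:
            for word in candidates:
                new_hyp = hyp + [word]
                expanded.append((new_hyp, score_fn(new_hyp)))
        # prune: keep only the beam_size highest-scoring hypotheses
        beam = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
    return beam[0][0]

# Stand-in scorer that prefers the grammatical words; a real decoder would
# combine translation-model and language-model log probabilities instead.
PREFERRED = {"He", "is", "going"}
score = lambda hyp: sum(1.0 for w in hyp if w in PREFERRED)

print(beam_search([["He"], ["is", "are"], ["going", "go"]], score))
# -> ['He', 'is', 'going']
```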
5.2 Training an SMT model for GEC
In our experiments, the SMT-based GEC system is built using Moses, an open-source
SMT toolkit developed by Koehn et al. (2007). We used the Lang-8 and synthetic Brown
data (see section 4.2) for training, tuning and testing the SMT model.
                  #sentence pairs
Lang-8                     938495
Synthetic brown            815921
Total                     1754416
Train                     1300000
Tune                       100000
Test                       354416

Table 5.1: Train/tune/test split for SMT
Word alignments are generated using MGIZA++, a multi-threaded implementation of
GIZA++ (Och and Ney (2003)), and are used to construct phrase tables. Translation
models are generated from the train split containing 1300k sentence pairs. For generating
the LMs (up to 5-gram), we used the KenLM toolkit (Heafield (2011)) and the
grammatically correct sentences in both the train and tune splits. Results of this
experiment are discussed in chapter 8.
Chapter 6
Neural Machine Translation
NMT as an MT approach to GEC has shown promising results (Sutskever et al. (2014);
Bahdanau et al. (2014)). Compared with conventional SMT, NMT has several advantages.
First, unlike SMT which consists of components that are trained separately and combined
during decoding (see section 5.1), NMT learns a single large neural network which inputs
a source sentence and outputs a translation. As a result, training NMT systems for
end-to-end translation tasks is much simpler than building SMT systems, which require multiple
processing steps. Second, an NMT system can learn translation regularities directly from
the training data, without the need to explicitly design features to capture them, which
is quite difficult in SMT. The typical architecture of an NMT model is described in the
section below.
6.1 Architecture of NMT model
NMT applies an encoder-decoder framework. An encoder first reads a variable length
input sentence and encodes all (or parts) of the input sentence into a vector representation.
A decoder then outputs a translation for the input sentence from the vector representation.
Given a source sentence X = { x1,x2,...,xm}, and a target sentence Y = { y1,y2,...,yn},
NMT models the conditional probability of translating the source sentence X to the target
sentence Y as:
p(y1, y2..., yn|x1, x2, ..., xm) (6.1)
In this section, we describe the OpenNMT (Klein et al. (2017)) architecture (see fig
6.1) we used in our experiments.
In our experiments, we used a Bidirectional LSTM (Bi-LSTM) encoder, which consists
of a forward LSTM layer and a backward LSTM layer. The forward LSTM reads the
input sentence from the first word to the last word (from x1 to xm), and the backward
LSTM reads the input sentence in reverse order (from xm to x1). By using a Bi-LSTM,
both historical and future information is captured. The intermediate vector is generated
by concatenating the outputs of the forward and backward LSTM layers and then applying
a global attention model over the entire source sequence (see fig 6.2). The attention
mechanism helps the decoder focus on the most relevant information in the input
sentence, instead of remembering the entire input sentence. A simple LSTM layer is used
for decoding the target sentence from the intermediate vector.

Figure 6.1: OpenNMT architecture, an encoder-decoder model with Attention mechanism
Figure 6.2: Attention mechanism in OpenNMT model
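The global attention step can be sketched in plain Python: dot-product alignment scores between the decoder state and each encoder output are softmaxed into weights, and the context vector is the weighted sum of encoder outputs. Dimensions and the scoring function are simplified relative to OpenNMT's actual implementation:

```python
import math

def attend(decoder_state, encoder_outputs):
    """Return attention weights and context vector (global dot attention)."""
    # alignment scores: dot product of the decoder state with each encoder output
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_outputs]
    # softmax the scores into a distribution over source positions
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # context vector: attention-weighted sum of the encoder outputs
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

# Toy 2-dimensional encoder outputs for a 3-token source sentence.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w, c = attend([1.0, 0.0], enc)
print([round(x, 3) for x in w])
```

The decoder uses the context vector, rather than a single fixed sentence encoding, at each output step.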
6.2 Training an NMT model for GEC
Given a corpus of parallel sentences, an NMT system is trained to maximize the
log-likelihood:
max_θ Σ_{n=1}^{N} log P(Y^n | X^n, θ) = max_θ Σ_{n=1}^{N} Σ_{t=1}^{T′} log P(y^n_t | y^n_1, y^n_2, ..., y^n_{t−1}, X^n, θ)    (6.2)
where θ = [θenc, θdec] represents all the parameters, N is the total number of training
examples in the corpus and (Xn, Yn) is the nth pair. We maximize the log-likelihood
using Stochastic Gradient Descent (SGD) with decaying learning rate. We performed
experiments using word level and sub-word level translation units, which are described in
the following sections. For training the NMT models, we used the Torch implementation
of OpenNMT framework. Results of these experiments are discussed in chapter 8.
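Given the per-token conditional probabilities produced by the model, the objective in eq. (6.2) reduces to a sum of token log-probabilities over all sentences; for example:

```python
import math

def corpus_log_likelihood(token_probs_per_sentence):
    """Sum of log P(y_t | y_<t, X, θ) over all sentences and timesteps, eq. (6.2)."""
    return sum(math.log(p)
               for sent in token_probs_per_sentence
               for p in sent)

# Hypothetical per-token probabilities for two target sentences.
probs = [[0.9, 0.8, 0.95], [0.7, 0.85]]
print(corpus_log_likelihood(probs))
```

Training maximizes this quantity; SGD updates θ in the direction of its gradient.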
6.2.1 Word level translation
During preprocessing, sentences are tokenized into sequences of tokens. In word-level
translation, each token is a word in the sentence.
We trained two NMT models, one using only the natural Lang-8 data and another
using both Lang-8 and synthetic brown data.
6.2.2 Sub-word level translation
In sub-word-level translation, tokenization generates tokens that are character
n-grams. Common sub-word translation units include character-level units, syllable-level
units and Byte Pair Encoding (BPE) units. We trained our NMT models using BPE units.
6.2.2.1 Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) (Shibata et al. (1999)) is a simple data compression
technique that iteratively replaces the most frequent pair of bytes in a sequence with a
single, unused byte. We adapt this algorithm for word segmentation. Instead of merging
frequent pairs of bytes, we merge characters or character sequences.
First, we initialize the symbol vocabulary with the character vocabulary and represent
each word as a sequence of characters, plus a special end-of-word symbol, which
allows us to restore the original tokenization after translation. We iteratively count all
symbol pairs and replace each occurrence of the most frequent pair (’A’, ’B’) with a new
symbol ’AB’. Each merge operation produces a new symbol which represents a character
n-gram. Frequent character n-grams (or whole words) are eventually merged into a single
symbol. For efficiency, we do not consider pairs that cross word boundaries. So, the algo-
rithm can be run on the dictionary extracted from a text, with each word being weighted
by its frequency.
Example 6.1 Sample BPE generation example
Vocabulary : ’low’, ’lowest’, ’newer’, ’wider’
BPE units (’.’ is end-of-word symbol ) :
r . → r.
l o→ lo
lo w→ low
e r. → er.
BPE-level translation models have been shown to perform well for translation between
related languages. We can cast GEC as translation between two closely related languages
that share a vocabulary but differ in grammatical structure.
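The merge loop described above can be sketched following the standard BPE word-segmentation algorithm; the word frequencies below are hypothetical, so the resulting merge order differs from the small example given earlier:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs, weighting each word by its frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated characters with '.' as the end-of-word symbol.
vocab = {"l o w .": 5, "l o w e s t .": 2, "n e w e r .": 6, "w i d e r .": 3}
merges = []
for _ in range(4):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent symbol pair
    merges.append(best)
    vocab = merge_vocab(best, vocab)
print(merges)
```

Each recorded merge becomes one entry in the BPE codes file; applying the merges in order segments unseen words into learned sub-word units.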
6.2.2.2 Training NMT using BPE
We trained an NMT model using BPE units, with both the Lang-8 and synthetic Brown
data used for training. The OpenNMT framework supports sub-word-level tokenization
in addition to word-level tokenization and generates the BPE units from the parallel data.
For our experiment, BPE units were generated using 30k merge operations and the
models were trained with the same parameters as the word-level models.
Chapter 7
Scoring grammaticality of machine translation hypotheses
Over the last decade, re-ranking techniques, especially discriminative re-ranking, have
shown significant improvements in MT systems, not only for GEC but also for other
translation tasks (Hoang et al. (2016); Yuan et al. (2016)). For each source sentence,
rather than directly outputting the candidate with the highest probability, an n-best list
of candidate translations is collected from an MT system and then re-ranked using
re-ranking algorithms. Various machine learning algorithms have been adapted to these
re-ranking tasks, including boosting, perceptrons and SVMs.
In this chapter, we discuss our experiments on a re-ranking technique that re-ranks the
candidate hypotheses of any NLG system by assigning a grammaticality score to each
candidate hypothesis. The lower the score, the more grammatical the candidate. Our
scoring system is a supervised regressor trained on a large number of sentences of
varying grammaticality. In the following sections, we discuss the types of regression
models we experimented with and the process of generating training data for them.
7.1 Generating training data
To make our system as generic as possible, we wanted our training data to be as
diverse as possible, encompassing different levels of grammatical errors. We therefore
used the sentences from the test split of the SMT system (see section 5.2) and their
3-best candidate translations from our SMT model for GEC (see chapter 5), giving a
total of approximately 1200k sentences.
For each instance in our training data, we obtained the grammaticality score by
computing an edit measure that indicates how much effort it would take to edit the
incorrect sentence into its correct form; the higher the effort, the lower the
grammaticality. We improvise the edit scores by assigning a score to each word on the
source side (phrase from the source sentence) and target side (phrase from the target
sentence) of an edit based on its POS tag, and then taking the maximum of the sums of
scores on the two sides of the edit.
Example 7.1 Improvised edit score for a sample sentence
Source : He are a going by the office .
Target : He is going to the office .
edit 1 : (’are’ 3.0; ’a’ 1.0→ ’is’ 3.0)
edit 1 score = max(3.0+1.0, 3.0) = 4.0
edit 2 : (’by’ 2.0→ ’to’ 2.0)
edit 2 score= max(2.0, 2.0) = 2.0
Grammaticality score = 4.0 + 2.0 = 6.0
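This improvised scoring can be sketched as follows. The POS weight table is hypothetical, and the edits are given pre-extracted; a real implementation would derive them from an alignment between the source and target sentences:

```python
# Hypothetical weights per POS tag; unlisted tags default to 1.0.
POS_WEIGHTS = {"VERB": 3.0, "ADP": 2.0, "DET": 1.0}

def edit_score(source_words, target_words):
    """max(sum of source-side weights, sum of target-side weights) for one edit."""
    src = sum(POS_WEIGHTS.get(tag, 1.0) for _, tag in source_words)
    tgt = sum(POS_WEIGHTS.get(tag, 1.0) for _, tag in target_words)
    return max(src, tgt)

def grammaticality_score(edits):
    """Sum of per-edit scores; a higher score means more editing effort."""
    return sum(edit_score(s, t) for s, t in edits)

# Edits for "He are a going by the office ." -> "He is going to the office ."
edits = [
    ([("are", "VERB"), ("a", "DET")], [("is", "VERB")]),  # edit 1: score 4.0
    ([("by", "ADP")], [("to", "ADP")]),                   # edit 2: score 2.0
]
print(grammaticality_score(edits))  # 6.0, as in Example 7.1
```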
We then explore various traditional feature-based and deep-representation-based
regression techniques for training the model.
7.2 Regression models
Once the training data is created, we explore a few regression frameworks following
the traditional feature-based paradigm, namely (a) Linear Regression (LR), (b) Support
Vector Regression (SVR) and (c) regression using a Multilayer Perceptron (MLP), as
well as a deep-representation-learning paradigm based on long short-term memory
(LSTM). For training the models, we used an 80:20% train:test split of the generated
training data. We used the Pearson correlation coefficient as the evaluation metric,
measuring the correlation between the true and predicted values.
7.2.1 Feature based regression
For implementing the feature-based regressors (SVR, LR, MLP), the features we used
are: (a) averaged-out word embeddings (of dimension 300) of the input text obtained
using SpaCy (an open-source library for NLP) and (b) the log-likelihood of the sentence
obtained from a 5-gram language model generated while training the SMT system (see
section 5.1.1). For SVR, we tried Linear, Polynomial (degree=2) and RBF kernels with
default parameters. For MLP, the hidden layer size was empirically set to 8.
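The feature vector fed to these regressors can be sketched as an averaged embedding with the LM log-likelihood appended. The toy 3-dimensional embeddings below stand in for SpaCy's 300-dimensional vectors, and the LM score is a supplied number rather than an actual language-model query:

```python
def build_features(tokens, embeddings, lm_log_likelihood):
    """Average the token embeddings and append the LM log-likelihood."""
    dim = len(next(iter(embeddings.values())))
    zero = [0.0] * dim  # fallback for out-of-vocabulary tokens
    vecs = [embeddings.get(t, zero) for t in tokens]
    avg = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return avg + [lm_log_likelihood]

# Toy 3-d embeddings standing in for SpaCy's 300-d word vectors.
emb = {"he": [1.0, 0.0, 0.0], "is": [0.0, 1.0, 0.0], "going": [0.0, 0.0, 1.0]}
feats = build_features(["he", "is", "going"], emb, lm_log_likelihood=-4.2)
print(feats)
```

Because the averaging collapses word order, two hypotheses differing by one word produce nearly identical feature vectors, which foreshadows the weak correlations reported in chapter 8.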
7.2.2 Deep-representation based regression
For deep-representation-based regression, we implemented several recurrent neural
network (RNN) variants using the LSTM as the basic unit. The first variant uses a
stack of 2 LSTM layers on top of an input embedding layer, followed by a dense
layer with linear activation. The second variant is the same as the first, except that it
uses a single bidirectional LSTM layer, trying to capture contextual information from
both directions.

For all the above configurations, we used a dropout of 0.25 at the penultimate
layer, mean squared logarithmic error as the loss function and Adam as the optimizer.
Chapter 8
Results and Discussions
In this chapter, we discuss the results of our baseline SMT, word-level NMT and
sub-word-level NMT models, as well as the results of the grammaticality scoring models
described in chapter 7.

We use the BLEU, GLEU and M2 scorer (F0.5) evaluation metrics to evaluate our MT
models. Table 8.1 and Table 8.2 show the results of our MT-based GEC models on the
validation data used in our experiments and on the CoNLL-2014 test data.
Model BLEU GLEU
SMT 83.2 73.54
Word level NMT (no synthetic data) 75.97 67.99
Word level NMT (using synthetic data) 79.63 72.71
Sub-word level NMT 52.7 47.26
Table 8.1: Performance of MT models on validation data
Model BLEU GLEU F0.5 score
Our Models
SMT 84.96 56.84 22.63
Word level NMT (no synthetic data) 83.47 56.89 15.71
Word level NMT (using synthetic data) 86.52 58.11 21.2
Sub-word level NMT 79.45 55.7 16.74
Top 3 systems in CoNLL-2014 shared task
CAMB - - 37.33
CUUI - - 36.79
AMU - - 35.01
Table 8.2: Performance of MT models on CoNLL-2014 test data
From Table 8.1, we can see that the SMT system performs well on the validation data
compared to the NMT models; it may have seen many of the necessary correction
mappings for the validation data during training. Table 8.2 shows that it also performs
well on the CoNLL-2014 test data in terms of F-score, whereas the NMT model trained
using both Lang-8 and synthetic Brown data gives higher BLEU and GLEU values.

In the examples below, we refer to the NMT model trained using only Lang-8 data as
NMT1, the model trained using both synthetic and Lang-8 data as NMT2, and the
sub-word-level NMT model as NMT3.
Example 8.1 Correcting errors that depend on local contextual information
Source : I going the to college .
SMT : I go to college .
NMT1 : I ’m going to college .
NMT2 : I am going to college .
NMT3 : I am going to college .
Reference: I am / ’m going to the college .
Example 8.2 Correcting errors that depend on global contextual information
Source : The player in the ground are talking to each other .
SMT : The player in the ground are talking to each other .
NMT1 : The player in the ground is talking to each other .
NMT2 : The player in the ground are talking to each other .
NMT3 : The players in the ground are talking to each other .
Reference: The players in the ground are talking to each other .
From the above examples, we can see that our models are able to correct errors that
depend on local context, but they have difficulty correcting global errors.
Table 8.3 shows the performance of the scoring models described in chapter 7 on
validation data, measured by the Pearson correlation coefficient. The scores predicted
by our models do not correlate well with the improvised edit scores (see section 7.1).
The models predict a score from the whole context of the sentence, and the averaged
word embeddings that we input to the feature-based regression models will be very
similar for all the candidate hypotheses an NLG system produces for a given input
sentence. It is therefore likely that the feature-based regression models cannot capture
the minor differences between the candidate hypotheses.
Model Pearson
Feature based regression models
SVR (RBF kernel) 0.0033
SVR (Polynomial kernel) 0.006
SVR (Linear kernel) 0.019
Linear Regression 0.12
MLP 0.13
Deep-representation based regression models
2 LSTM layers 0.024
Bi-LSTM layer 0.032
Table 8.3: Performance of our scoring models on validation data
This problem also persists in the deep-representation models; moreover, the presence of
additional context in a sentence adds some amount of bias to the score (see Example 8.3).
Example 8.3 Bias due to the additional context
Sentence A : I is going to the school . → Score = 2.33
Sentence B : I is going to the school on foot . → Score = 4.2
Chapter 9
Future work
In our future work, we will build on top of our baseline MT models to improve the
performance of the GEC system in correcting complex errors. We propose a few ideas
which can improve the quality of the GEC system.

• Using GEC evaluation metrics to train NMT models:
Currently, NMT models are trained to maximize the log-likelihood (see section
6.2). We propose to train the NMT models to maximize GEC evaluation metrics
such as the F0.5 score and GLEU instead of the log-likelihood. Since these metrics
are not differentiable, we have to train the models using policy gradient methods.

• Multi-task learning:
In GEC, most errors are interacting errors and may depend on long-range
contextual information. MT models find it difficult to learn the global contextual
information needed to correct these errors. Training the model on multiple tasks,
such as GEC and POS tagging, simultaneously might improve the performance of
the GEC system. Through multi-task learning, the system can learn better global
features, which aids in correcting complex errors.
Chapter 10
Conclusion
In this report, we discussed grammatical error correction (GEC) in non-native English
text. We treated GEC as a translation task from ungrammatical to grammatical English.
We presented a survey of traditional approaches to GEC, i.e., rule-based, classifier-based
and machine translation approaches. We also discussed the GenERRate tool, used to
generate synthetic parallel data from the Brown corpus.

We implemented a baseline SMT model using the Moses framework and both word-level
and sub-word-level NMT models using the OpenNMT framework. We trained our
models using the Lang-8 and synthetic Brown datasets. We discussed Byte Pair
Encoding and its application to sub-word-level NMT models. We have shown that our
baseline MT models find it difficult to correct global errors, and we compared them with
the top 3 systems of the CoNLL-2014 shared task on GEC.

We experimented with a candidate re-ranking technique to re-rank the candidate
hypotheses of any NLG system by assigning a grammaticality score, but this technique
did not give satisfactory results. Finally, we concluded with possible future work in
GEC using MT models and multi-task learning.
References
Bahdanau, D., Cho, K., and Bengio, Y., 2014, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473
Brockett, C., Dolan, W. B., and Gamon, M., 2006, “Correcting esl errors using phrasal
smt techniques,” in Proceedings of the 21st International Conference on Computational
Linguistics and the 44th annual meeting of the Association for Computational Linguis-
tics (Association for Computational Linguistics). pp. 249–256.
Bustamante, F. R., and León, F. S., 1996, “Gramcheck: A grammar and style checker,” in
Proceedings of the 16th conference on Computational linguistics-Volume 1 (Associa-
tion for Computational Linguistics). pp. 175–181.
Cherry, C., and Foster, G., 2012, “Batch tuning strategies for statistical machine trans-
lation,” in Proceedings of the 2012 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (Associa-
tion for Computational Linguistics). pp. 427–436.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y., 2014, “Learning phrase representations using rnn encoder-decoder for
statistical machine translation,” arXiv preprint arXiv:1406.1078
Dahlmeier, D., and Ng, H. T., 2012a, “A beam-search decoder for grammatical error cor-
rection,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (Association for
Computational Linguistics). pp. 568–578.
Dahlmeier, D., and Ng, H. T., 2012b, “Better evaluation for grammatical error correction,”
in Proceedings of the 2012 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies (Association for
Computational Linguistics). pp. 568–572.
Dahlmeier, D., Ng, H. T., and Wu, S. M., 2013, “Building a large annotated corpus of
learner english: The nus corpus of learner english..” in BEA@ NAACL-HLT, pp. 22–
31.
Daudaravicius, V., Banchs, R. E., Volodina, E., and Napoles, C., 2016, “A report on the
automatic evaluation of scientific writing shared task..” in BEA@ NAACL-HLT, pp.
53–62.
Felice, M., Yuan, Z., Andersen, Ø. E., Yannakoudakis, H., and Kochmar, E., 2014,
“Grammatical error correction using hybrid systems and type filtering..” in CoNLL
Shared Task, pp. 15–24.
Foster, J., and Andersen, Ø. E., 2009, “Generrate: generating errors for use in grammatical
error detection,” in Proceedings of the fourth workshop on innovative use of nlp for
building educational applications (Association for Computational Linguistics). pp. 82–
90.
Han, N.-R., Chodorow, M., and Leacock, C., 2004, “Detecting errors in english article
usage with a maximum entropy classifier trained on a large, diverse corpus..” in LREC
Heafield, K., 2011, “Kenlm: Faster and smaller language model queries,” in Proceedings
of the Sixth Workshop on Statistical Machine Translation (Association for Computa-
tional Linguistics). pp. 187–197.
Heidorn, G. E., Jensen, K., Miller, L. A., Byrd, R. J., and Chodorow, M. S., 1982, “The
epistle text-critiquing system,” IBM Systems Journal 21, 305–326.
Hoang, D. T., Chollampatt, S., and Ng, H. T., 2016, “Exploiting n-best hypothe-
ses to improve an smt approach to grammatical error correction,” arXiv preprint
arXiv:1606.00210
Ji, J., Wang, Q., Toutanova, K., Gong, Y., Truong, S., and Gao, J., 2017, “A
nested attention neural hybrid model for grammatical error correction,” arXiv preprint
arXiv:1707.02026
Junczys-Dowmunt, M., and Grundkiewicz, R., 2014, “The amu system in the conll-2014
shared task: Grammatical error correction by data-intensive and feature-rich statistical
machine translation,” in Proceedings of the Eighteenth Conference on Computational
Natural Language Learning: Shared Task, pp. 25–33.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M., 2017, “Opennmt: Open-
source toolkit for neural machine translation,” arXiv preprint arXiv:1701.02810
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan,
B., Shen, W., Moran, C., Zens, R., et al., 2007, “Moses: Open source toolkit for sta-
tistical machine translation,” in Proceedings of the 45th annual meeting of the ACL on
interactive poster and demonstration sessions (Association for Computational Linguis-
tics). pp. 177–180.
Kunchukuttan, A., Chaudhury, S., and Bhattacharyya, P., 2014, “Tuning a grammar cor-
rection system for increased precision..” in CoNLL Shared Task, pp. 60–64.
Luong, M.-T., Pham, H., and Manning, C. D., 2015, “Effective approaches to attention-
based neural machine translation,” arXiv preprint arXiv:1508.04025
Madnani, N., Tetreault, J., and Chodorow, M., 2012, “Exploring grammatical error cor-
rection with not-so-crummy machine translation,” in Proceedings of the Seventh Work-
shop on Building Educational Applications Using NLP (Association for Computational
Linguistics). pp. 44–53.
Mizumoto, T., Hayashibe, Y., Komachi, M., Nagata, M., and Matsumoto, Y., 2012, “The
effect of learner corpus size in grammatical error correction of esl writings,” Proceed-
ings of COLING 2012: Posters, 863–872.
Mizumoto, T., Komachi, M., Nagata, M., and Matsumoto, Y., 2011, “Mining revision log
of language learning sns for automated japanese error correction of second language
learners..” in IJCNLP, pp. 147–155.
Napoles, C., and Callison-Burch, C., 2017, “Systematically adapting machine translation
for grammatical error correction,” in Proceedings of the 12th Workshop on Innovative
Use of NLP for Building Educational Applications, pp. 345–356.
Napoles, C., Sakaguchi, K., Post, M., and Tetreault, J., 2015, “Ground truth for gram-
matical error correction metrics,” in Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference
on Natural Language Processing, Vol. 2, pp. 588–593.
Nicholls, D., 2003, “The cambridge learner corpus: Error coding and analysis for lexi-
cography and elt,” in Proceedings of the Corpus Linguistics 2003 conference, Vol. 16,
pp. 572–581.
Och, F. J., 2003, “Minimum error rate training in statistical machine translation,” in Pro-
ceedings of the 41st Annual Meeting on Association for Computational Linguistics-
Volume 1 (Association for Computational Linguistics). pp. 160–167.
Och, F. J., and Ney, H., 2003, “A systematic comparison of various statistical alignment
models,” Computational linguistics 29, 19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J., 2002, “Bleu: a method for automatic
evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pp. 311–318.
Rozovskaya, A., Chang, K.-W., Sammons, M., and Roth, D., 2013, “The university of
illinois system in the conll-2013 shared task,” in Proceedings of the Seventeenth Con-
ference on Computational Natural Language Learning: Shared Task, pp. 13–19.
Rozovskaya, A., and Roth, D., 2011, “Algorithm selection and model adaptation for esl
correction tasks,” in Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies-Volume 1 (Association for
Computational Linguistics). pp. 924–933.
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., and
Arikawa, S., 1999, Byte Pair encoding: A text compression scheme that accelerates
pattern matching, Tech. Rep. (Technical Report DOI-TR-161, Department of Infor-
matics, Kyushu University).
Sjöbergh, J., and Knutsson, O., 2005, “Faking errors to avoid making errors: Very weakly
supervised learning for error detection in writing,” in Proceedings of RANLP, pp. 506–
512.
Sun, C., Jin, X., Lin, L., Zhao, Y., and Wang, X., 2015, “Convolutional neural networks
for correcting english article errors,” in Natural Language Processing and Chinese
Computing (Springer). pp. 102–110.
Susanto, R. H., 2015, Systems Combination for Grammatical Error Correction, Ph.D.
thesis
Sutskever, I., Vinyals, O., and Le, Q. V., 2014, “Sequence to sequence learning with neural
networks,” in Advances in neural information processing systems, pp. 3104–3112.
Tetreault, J. R., and Chodorow, M., 2008, “The ups and downs of preposition error detec-
tion in esl writing,” in Proceedings of the 22nd International Conference on Computa-
tional Linguistics-Volume 1 (Association for Computational Linguistics). pp. 865–872.
Xie, Z., Avati, A., Arivazhagan, N., Jurafsky, D., and Ng, A. Y., 2016, “Neural language
correction with character-based attention,” arXiv preprint arXiv:1603.09727
Yannakoudakis, H., Briscoe, T., and Medlock, B., 2011, “A new dataset and method for
automatically grading esol texts,” in Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies-Volume 1
(Association for Computational Linguistics). pp. 180–189.
Yuan, Z., and Briscoe, T., 2016, “Grammatical error correction using neural machine
translation..” in HLT-NAACL, pp. 380–386.
Yuan, Z., Briscoe, T., and Felice, M., 2016, “Candidate re-ranking for smt-based gram-
matical error correction..” in BEA@ NAACL-HLT, pp. 256–266.
Yuan, Z., and Felice, M., 2013, “Constrained grammatical error correction using statistical
machine translation..” in CoNLL Shared Task, pp. 52–61.