Evaluating Grammatical Error Detection and Correction
Resources:
▪ Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.
▪ Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff
▪ The CoNLL-2013 Shared Task on Grammatical Error Correction, Ng et al.
▪ Better Evaluation for Grammatical Error Correction, Dahlmeier and Ng
Comparison to Standard NLP Tasks
[Figure: two panels. Standard NLP evaluation: annotator tag and system output. Error detection evaluation: learner sentence, annotator tag and system output.]
Resource: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.
Traditional Evaluation Measures
Comma restoration task
• Commas are removed from well-edited text (gold standard)
• System tries to restore commas by predicting their locations
• Comparison:
▪ Binary distinction (presence or absence of comma)
Traditional Evaluation Measures
Comma restoration task
• Comparison can be represented in a contingency table
• Measures: Accuracy (A), Precision (P), Recall (R), F-measure (F1)
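The four measures named on this slide can be computed directly from a 2x2 contingency table. The sketch below uses the standard definitions; the class name `Contingency` and the example counts are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class Contingency:
    """2x2 table for a binary task such as comma restoration."""
    tp: int  # system predicts a comma, gold has a comma
    fp: int  # system predicts a comma, gold has none
    fn: int  # system predicts none, gold has a comma
    tn: int  # system predicts none, gold has none

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, not taken from the slides.
table = Contingency(tp=40, fp=10, fn=20, tn=30)
print(table.accuracy(), table.precision(), table.recall(), table.f1())
```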
3-Way Contingency Table in Error Detection
Comma error detection task
• System seeks to find and correct errors in the writer’s usage of commas
• Intricacies:
▪ Positive class: an error of the writer that involves a comma (not the presence of a comma), i.e. a mismatch between the writer’s sentence and the annotator’s judgement
▪ Negative class: writer and annotator agree
▪ System’s judgement has not been considered yet
▪ Writer-Annotator-System (WAS)
3-Way Contingency Table in Error Detection
Contingency scheme for WAS
Considering System prediction and Writer’s form together
3-Way Contingency Table in Error Detection
Contingency scheme for WAS
Considering System prediction and Gold standard together
3-Way Contingency Table in Error Detection
The case of the * row in the simplified WAS contingency table
• Concerns different categories (TP, TN, FP, FN) depending on whether the evaluation is for detection or correction
▪ For detection: TP
▪ For correction: X Y Z is both FP (system ≠ writer) and FN (system ≠ annotator)
Problem of Skewed Data
• Distribution of positive and negative error classes is highly skewed
▪ 13% errors in preposition usage by L2 writers (Han et al., 2006)
• Baseline system always predicts “no errors”
▪ 87% accuracy
• All the measures are affected by the proportion of errors in the gold standard
▪ Prevalence
Problem of Skewed Data
• All the measures are affected by the proportion of cases that the system reports as errors
▪ Bias
• Effect on a system that performs no better than chance
▪ P increases when prevalence increases (at chance level, P = TP / bias = prevalence)
▪ R increases when bias increases (at chance level, R = TP / prevalence = bias)
Problem of Skewed Data
• The expected match between Annotator and System is the product of their probabilities for the respective categories (in this case Error/No-Error)
• The expected proportion of TP matches is equal to the product of the proportion of cases assigned the Error label by the Annotator (i.e., the prevalence) and the proportion of cases assigned the Error label by the System (i.e., the bias)
Expected proportion of TP matches = prevalence × bias
Problem of Skewed Data
• Cohen’s kappa
• Accuracy = 0.68, Precision = 0.04/(0.04+0.16) = 0.2, Recall = 0.2 and F1= 0.2
System1: Predictions are correct at chance level
Problem of Skewed Data
• Cohen’s kappa
• Accuracy = 0.80
• After removing the cases expected to show agreement by chance, the System is correct in 38% of the remaining cases
System2: Prevalence and bias remain the same
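The kappa values for System1 and System2 can be reproduced with a small sketch. The per-100 counts below are not printed on the slides; they are inferred from the reported accuracy, precision and recall (prevalence = bias = 0.2 in both cases) and should be read as an assumption.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa between annotator and system for a 2x2 table."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total      # observed agreement (accuracy)
    prevalence = (tp + fn) / total    # annotator says Error
    bias = (tp + fp) / total          # system says Error
    # Agreement expected by chance on Error plus on No-Error.
    expected = prevalence * bias + (1 - prevalence) * (1 - bias)
    return (observed - expected) / (1 - expected)


# Inferred counts per 100 cases (assumption, see lead-in).
system1 = cohen_kappa(tp=4, fp=16, fn=16, tn=64)    # accuracy 0.68
system2 = cohen_kappa(tp=10, fp=10, fn=10, tn=70)   # accuracy 0.80
print(round(system1, 3), round(system2, 3))
```

With these counts, chance agreement is 0.68 in both cases, so System1 sits at chance while System2 recovers 0.12 of the remaining 0.32.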
Problem of Skewed Data
• Cohen’s kappa
• Accuracy = 0.54, Precision = 0.40, Recall = 0.30, F1 = 0.34
System3: Increase bias and prevalence + Predictions are correct at chance level
Problem of Skewed Data
• Variability in prevalence or error rates
▪ Prevalence changes with populations of learners with different native languages
▪ Different levels of proficiency in the second language
• Variability in bias
▪ Detection system dependent
▪ Threshold for marking Error/Non-Error
▪ Higher threshold → lower bias
▪ Lower threshold → higher bias
Problem of Skewed Data: Remedies Dealing with sensitivity to bias
Vary threshold and generate precision-recall curve
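A minimal sketch of the threshold sweep, assuming the detection system outputs a confidence score per case and flags an error when the score reaches the threshold. The scores, gold labels and the function name `pr_points` are hypothetical.

```python
def pr_points(scores, gold, thresholds):
    """Return (threshold, precision, recall, bias) for each threshold."""
    points = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        tp = sum(f and g for f, g in zip(flagged, gold))
        fp = sum(f and not g for f, g in zip(flagged, gold))
        fn = sum((not f) and g for f, g in zip(flagged, gold))
        # Convention: precision is 1.0 when nothing is flagged.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        bias = (tp + fp) / len(scores)
        points.append((t, precision, recall, bias))
    return points


# Hypothetical system scores and gold labels (True = error).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
gold = [True, True, False, True, False, False, True, False]
curve = pr_points(scores, gold, thresholds=[0.25, 0.5, 0.75])
for point in curve:
    print(point)
```

Raising the threshold lowers the bias, and the resulting (recall, precision) pairs trace the curve.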
Problem of Skewed Data: Remedies Dealing with sensitivity to bias
Area under Receiver Operating Characteristic (AUC) curve
[Figure: ROC curve. x-axis: False Positive Rate p(true|false); y-axis: True Positive Rate p(true|true); the 45° line is the curve for random prediction.]
• Effect of random prediction is not nullified
▪ Area under random prediction
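The area under the ROC curve equals the probability that a randomly chosen error case receives a higher score than a randomly chosen non-error case, with ties counting one half. The sketch below uses that pairwise formulation rather than integrating the curve; the function name and data are hypothetical.

```python
def roc_auc(scores, gold):
    """AUC as the fraction of (error, non-error) pairs ranked correctly."""
    pos = [s for s, g in zip(scores, gold) if g]
    neg = [s for s, g in zip(scores, gold) if not g]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# A perfect ranking and an uninformative scorer (all scores tied).
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
uninformative = roc_auc([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
print(perfect, uninformative)
```

The uninformative scorer still obtains 0.5 rather than 0, which is the sense in which the effect of random prediction is not nullified.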
Problem of Skewed Data: Remedies Dealing with sensitivity to bias
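Area under the kappa curve (AUK)
[Figure: kappa curve. x-axis: False Positive Rate p(true|false); y-axis: Cohen’s kappa. Shown next to an ROC curve with x-axis False Positive Rate and y-axis True Positive Rate.]
• Class skewedness is already taken care of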
Counting Positives or Errors
• Positive class consists of an error in the writer’s text
• No 1:1:1 correspondence between the writer’s sentence, the annotator’s correction and the type of error
Writer: Book of my class inspired me
▪ A book in my class inspired me (Article error)
▪ Books for my class inspired me (Number error)
▪ The books of my class were inspiring to me (Article+Number error)
Counting Positives or Errors
• Assuming no ambiguity in error type, what would be the size of the unit over which an error is defined?
Writer: The book in my class inspire me
a) The book in my class inspires me
b) The books in my class inspire me
• Unit size: Morpheme level? Word level? Phrase level? String level?
• Token-based approach vs String-based approach
Counting Positives or Errors
• Variability of size can be handled with Edit Distance Measures (EDM)
▪ inspire → inspires is the same as book … inspire → book … inspires
• EDM can handle multiple overlapping errors
▪ Sequence: “…development set is similar with test set…”
▪ Correction 1: with → to, and insert the
▪ Correction 2: with → to the
▪ EDM can handle both
Counting Positives or Errors
• EDMs are good for comparison, not for providing feedback to the writer
▪ If book and inspire are not linked, feedback such as a violation of subject-verb agreement cannot be provided
Counting Negatives or Non-Errors
• Negatives consist of non-errors in the writer’s text
▪ Set complement of the positive class?
• An appropriate set of non-errors cannot be easily specified
▪ Book of my class inspire to me
▪ Negatives: a of, a my, a class, a inspire, a to, a me, a .?
▪ Should only the noun phrases be counted?
▪ He is fond beer . (how many positions are counted?)
Counting Negatives or Non-Errors
• Error data is biased towards the negative class
• The negative counting strategy has greater consequences for performance reporting
• Identifying negatives through trivial means results in inflation of true negatives (TN), keeping the other counts in the contingency table constant
▪ Increases A and kappa (P and R do not depend on TN)
Counting Negatives or Non-Errors
Before: Accuracy = 0.54, Kappa = 0.00
Inject 100 more TNs
After: Accuracy = 0.77, Kappa = 0.21
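The effect of injecting true negatives can be checked numerically. The starting counts below (TP=12, FP=18, FN=28, TN=42 out of 100) are an assumption: they are not printed on the slide but are consistent with the chance-level figures reported earlier (Accuracy 0.54, Precision 0.40, Recall 0.30).

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    prevalence = (tp + fn) / total
    bias = (tp + fp) / total
    expected = prevalence * bias + (1 - prevalence) * (1 - bias)
    return (observed - expected) / (1 - expected)


# Assumed starting table (see lead-in), then the same table with 100 extra TNs.
before = dict(tp=12, fp=18, fn=28, tn=42)
after = dict(before, tn=before["tn"] + 100)

for name, table in (("before", before), ("after", after)):
    print(name, round(accuracy(**table), 2), round(cohen_kappa(**table), 2))
```

Precision (12/30) and recall (12/40) are identical in both tables, since neither depends on TN.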
Helping Our Own (HOO) and CoNLL Evaluation
• Given: Gold standard (G), System Edits (E)
• Example
▪ Learner sentence S: There is no a doubt, tracking system has brought many benefits in this information age
▪ System correction H: There is no doubt, tracking system has brought many benefits in this information age .
▪ Performance: P = 1/1, R = 1/3, F = 1/2
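The scores on this slide follow from counting matching edits. In the sketch below, edits are (original, correction) pairs; the three gold edits are an illustrative assumption (the slide only implies that there are three, of which the system finds one).

```python
def edit_prf(system_edits: set, gold_edits: set):
    """Precision, recall and F1 over sets of edits."""
    matched = len(system_edits & gold_edits)
    precision = matched / len(system_edits) if system_edits else 1.0
    recall = matched / len(gold_edits) if gold_edits else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed gold edits for the learner sentence (illustration only).
gold = {("a doubt", "doubt"), ("system", "systems"), ("has", "have")}
# The system only removes the article.
system = {("a doubt", "doubt")}

p, r, f = edit_prf(system, gold)
print(p, r, f)
```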
HOO Evaluation
• Extraction of the system edit from the writer’s text (source) and the system output (hypothesis) is done with the GNU wdiff utility
▪ Source: Our baseline system feeds word into PB-SMT pipeline
▪ Hypothesis: Our baseline system feeds a word into PB-SMT pipeline
▪ System edit: (ε → a), inserting article a
▪ Gold standard edit: (word → {a word, words})
• The hypothesis matches the first gold standard edit but is flagged as invalid
HOO Evaluation: Modification
• Key idea
▪ There may be multiple ways to arrive at the same correction
▪ Extraction of the set of edits that matches the gold standard maximally
MaxMatch (M²) Algorithm
• Notations
▪ S: set of writer sentences
▪ H: set of hypotheses or system outputs
▪ G: set of gold standard annotations, each of which is a set of edits
MaxMatch (M²) Algorithm
• Notations
▪ An edit is a triple <a,b,C>
▪ Start and end token offsets a and b with respect to a source sentence
▪ A correction C
▪ For a gold standard edit, C is a set of corrections
▪ For a system edit, C is a single correction
MaxMatch (M²) Algorithm
Evaluation of system output
• Extracting a set of system edits (E) for each source-hypothesis pair (S, H)
▪ Construction of the edit lattice
▪ Searching through the lattice to extract the optimal set of edits
• Evaluating the system edits with respect to the gold standard
Edit Lattice
• Edit metric: Levenshtein distance
▪ Minimum number of insertions, deletions and substitutions needed to transform one string into another
• How to compute the Levenshtein distance?
▪ Use a 2-D matrix (Levenshtein matrix) to store edit costs of substrings of string pairs
▪ Compute individual cell entries (edit costs) with dynamic programming
▪ Rightmost corner cell stores the optimal edit cost
How similar are two strings?
• Spell correction
▪ The user typed “graffe”. Which is closest?
▪ graf
▪ graft
▪ grail
▪ giraffe
• Computational Biology
▪ Align two sequences of nucleotides:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
▪ Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Also for Machine Translation, Information Extraction, Speech Recognition
Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations
▪ Insertion
▪ Deletion
▪ Substitution
needed to transform one into the other
Minimum Edit Distance
• If each operation has a cost of 1
▪ Distance between the two example strings (INTENTION and EXECUTION) is 5
• If substitutions cost 2 (Levenshtein)
▪ Distance between them is 8
How to find the Min Edit Distance?
• Searching for a path (sequence of edits) from the start string to the final string:
▪ Initial state: the word we’re transforming
▪ Operators: insert, delete, substitute
▪ Goal state: the word we’re trying to get to
▪ Path cost: what we want to minimize: the number of edits
Minimum Edit as Search
• But the space of all edit sequences is huge!
▪ We can’t afford to navigate naïvely
▪ Lots of distinct paths wind up at the same state
▪ We don’t have to keep track of all of them, just the shortest path to each of those revisited states
Defining Min Edit Distance
• For two strings
▪ X of length n
▪ Y of length m
• We define D(i,j)
▪ the edit distance between X[1..i] and Y[1..j]
▪ i.e., the first i characters of X and the first j characters of Y
▪ The edit distance between X and Y is thus D(n,m)
Dynamic Programming for Minimum Edit Distance
• Dynamic programming: a tabular computation of D(n,m)
• Solving problems by combining solutions to subproblems
• Bottom-up
▪ We compute D(i,j) for small i,j
▪ And compute larger D(i,j) based on previously computed smaller values
▪ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Defining Min Edit Distance (Levenshtein)
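Initialization:
$$D(i,0) = i, \qquad D(0,j) = j$$
Recurrence relation, for each i = 1…n and each j = 1…m:
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$
Termination: D(n,m) is the distance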
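The initialization, recurrence and termination steps can be sketched as a short function. The name `min_edit_distance` and the `sub_cost` parameter are choices made for this illustration; with `sub_cost=2` it follows the Levenshtein variant used on these slides.

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 2) -> int:
    """Tabular (bottom-up) minimum edit distance between two strings."""
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i              # delete all i characters
    for j in range(1, m + 1):
        d[0][j] = j              # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                             # deletion
                d[i][j - 1] + 1,                             # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost), # substitution or match
            )
    return d[n][m]


print(min_edit_distance("intention", "execution"))              # substitutions cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))  # all operations cost 1
```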
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1 2
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
$$D(1,1) = \min \begin{cases} D(0,1) + 1 \\ D(1,0) + 1 \\ D(0,0) + \begin{cases} 2 & \text{if } X(1) \neq Y(1) \\ 0 & \text{if } X(1) = Y(1) \end{cases} \end{cases}$$
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
Computing alignments
• Edit distance isn’t sufficient
▪ We often need to align each character of the two strings to each other
• We do this by keeping a “backtrace”
• Every time we enter a cell, remember where we came from
• When we reach the end, trace back the path from the upper right corner to read off the alignment
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
Adding Backtrace to Minimum Edit Distance
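Base conditions:
$$D(i,0) = i, \qquad D(0,j) = j$$
Termination: D(n,m) is the distance
Recurrence relation, for each i = 1…n and each j = 1…m:
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$
$$ptr(i,j) = \begin{cases} \text{LEFT} & \text{(insertion)} \\ \text{DOWN} & \text{(deletion)} \\ \text{DIAG} & \text{(substitution)} \end{cases}$$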
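A minimal sketch of the backtrace, assuming the pointer names LEFT, DOWN and DIAG from this slide. Ties are broken in favour of DIAG, which is a choice made here; other tie-breaking rules give other equally cheap alignments.

```python
def align(source: str, target: str, sub_cost: int = 2):
    """Return (distance, operations) using a pointer table for the backtrace."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        d[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (d[i - 1][j - 1] + (0 if same else sub_cost), "DIAG"),
                (d[i - 1][j] + 1, "DOWN"),   # deletion
                (d[i][j - 1] + 1, "LEFT"),   # insertion
            ]
            # min() keeps the first candidate on ties, so DIAG is preferred.
            d[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])
    # Follow the pointers from the final cell back to the origin.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            kind = "match" if source[i - 1] == target[j - 1] else "substitute"
            ops.append((kind, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "DOWN":
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


distance, operations = align("intention", "execution")
for op in operations:
    print(op)
```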
Back to MaxMatch
▪ Source: Our baseline system feeds word into PB-SMT pipeline
▪ Hypothesis: Our baseline system feeds a word into PB-SMT pipeline
▪ System edit: (ε → a), inserting article a
▪ Gold standard edit: (word → {a word, words})
Levenshtein Matrix Construction
[Figure: empty Levenshtein matrix.]
Columns (hypothesis tokens, indices 0-10): # Our baseline system feeds a word into PB-SMT pipeline .
Rows (source tokens, indices 0-9): # Our baseline system feeds word into PB-SMT pipeline .
Levenshtein Matrix Construction
(index)       0   1    2         3       4      5  6     7     8       9         10
(token)       #   Our  baseline  system  feeds  a  word  into  PB-SMT  pipeline  .
0 #           0   1    2         3       4      5  6     7     8       9         10
1 Our         1   0    1         2       3      4  5     6     7       8         9
2 baseline    2   1    0         1       2      3  4     5     6       7         8
3 system      3   2    1         0       1      2  3     4     5       6         7
4 feeds       4   3    2         1       0      1  2     3     4       5         6
5 word        5   4    3         2       1      1  1     2     3       4         5
6 into        6   5    4         3       2      2  2     1     2       3         4
7 PB-SMT      7   6    5         4       3      3  3     2     1       2         3
8 pipeline    8   7    6         5       4      4  4     3     2       1         2
9 .           9   8    7         6       5      5  5     4     3       2         1
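The matrix on this slide is a token-level Levenshtein matrix in which every insertion, deletion and substitution costs 1. A short sketch that rebuilds it (the function name is chosen for this illustration):

```python
def levenshtein_matrix(source_tokens, hypothesis_tokens):
    """Token-level Levenshtein matrix with unit costs for all edits."""
    n, m = len(source_tokens), len(hypothesis_tokens)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source_tokens[i - 1] == hypothesis_tokens[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                       # delete a source token
                d[i][j - 1] + 1,                       # insert a hypothesis token
                d[i - 1][j - 1] + (0 if same else 1),  # match or substitute
            )
    return d


source = "Our baseline system feeds word into PB-SMT pipeline .".split()
hypothesis = "Our baseline system feeds a word into PB-SMT pipeline .".split()
matrix = levenshtein_matrix(source, hypothesis)
for row in matrix:
    print(" ".join(f"{cell:2d}" for cell in row))
```

The bottom-right cell is 1, reflecting the single inserted token.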
Finding Shortest Paths
(index)       0   1    2         3       4      5  6     7     8       9         10
(token)       #   Our  baseline  system  feeds  a  word  into  PB-SMT  pipeline  .
0 #           0   1    2         3       4      5  6     7     8       9         10
1 Our         1   0    1         2       3      4  5     6     7       8         9
2 baseline    2   1    0         1       2      3  4     5     6       7         8
3 system      3   2    1         0       1      2  3     4     5       6         7
4 feeds       4   3    2         1       0      1  2     3     4       5         6
5 word        5   4    3         2       1      1  1     2     3       4         5
6 into        6   5    4         3       2      2  2     1     2       3         4
7 PB-SMT      7   6    5         4       3      3  3     2     1       2         3
8 pipeline    8   7    6         5       4      4  4     3     2       1         2
9 .           9   8    7         6       5      5  5     4     3       2         1
Edit Lattice
• A lattice of all the shortest paths from the top-left corner to the bottom-right corner
• Each vertex corresponds to a cell in the Levenshtein matrix
• Each edge corresponds to an atomic edit operation
▪ Insert, delete, substitute, match
• Each path corresponds to a shortest sequence of edits that transforms the source into the hypothesis
Edit Lattice Construction
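[Figure: edit lattice. Vertices are matrix cells; each edge is labeled with its edit and weight.]
(0,0) -Our(1)-> (1,1) -baseline(1)-> (2,2) -system(1)-> (3,3) -feeds(1)-> (4,4) -ε/a(1)-> (4,5) -word(1)-> (5,6) -into(1)-> (6,7) -PB-SMT(1)-> (7,8) -pipeline(1)-> (8,9) -.(1)-> (9,10)
Edit lattice for “Our baseline system feeds (ε → a) word into PB-SMT pipeline .”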
Augmenting Edit Lattice
• Annotators can use longer phrases and can use unchanged words from context
▪ word → {a word, words}
• Should we allow an arbitrary number of unchanged words in an edit?
▪ Avoid very large edits with many unchanged words
▪ Put a limit (u) on the number of unchanged words in an edit
Augmenting Edit Lattice
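• Allow phrase-level edits
▪ Add transitive edges, as long as the edit stays within the limit u and changes at least one word
• Let two edges be adjacent (the second starts at the vertex where the first ends)
▪ Transitive edge: a single edge from the start of the first edge to the end of the second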
Adding Transitive Edges
[Figure: the edit lattice for “Our baseline system feeds (ε → a) word into PB-SMT pipeline .” with transitive edges added.]
Transitive edges (weight in parentheses):
▪ feeds/feeds a(2)
▪ word/a word(2)
▪ system feeds/system feeds a(3)
▪ word into/a word into(3)
▪ feeds word/feeds a word(3)
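The transitive edges in the figure can be enumerated with a small sketch. It assumes a lattice that is a single chain of atomic edges (as in this example) and a limit of u = 2 unchanged words, which is the value that reproduces the five edges shown; a general lattice would need a transitive closure over all paths. Function and variable names are hypothetical.

```python
def transitive_edges(chain, max_unchanged=2):
    """Combine consecutive atomic edges of a chain into phrase-level edits.

    chain: list of (source_token, hypothesis_token); "" marks an empty side.
    Returns (start_index, end_index, label, weight) tuples.
    """
    extra = []
    for start in range(len(chain)):
        for end in range(start + 2, len(chain) + 1):  # at least two atomic edges
            span = chain[start:end]
            unchanged = sum(1 for s, h in span if s == h)
            changed = len(span) - unchanged
            if changed >= 1 and unchanged <= max_unchanged:
                src = " ".join(s for s, _ in span if s)
                hyp = " ".join(h for _, h in span if h)
                # Weight is the sum of the atomic weights (1 each).
                extra.append((start, end, f"{src}/{hyp}", len(span)))
    return extra


tokens = ["Our", "baseline", "system", "feeds"]
chain = [(t, t) for t in tokens] + [("", "a")]
chain += [(t, t) for t in ["word", "into", "PB-SMT", "pipeline", "."]]

edges = transitive_edges(chain)
for _, _, label, weight in edges:
    print(f"{label}({weight})")
```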
Favoring Gold Standard Match
[Figure: the edit lattice for “Our baseline system feeds (ε → a) word into PB-SMT pipeline .” with transitive edges.]
• Change the weight of each edge that matches a gold standard edit to a large negative value
▪ word/a word(-45)
Finding the Optimal System Edits
• Perform a single-source shortest path search with negative weights from the start vertex to the end vertex
▪ Bellman-Ford algorithm
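A minimal sketch of the search on the example lattice. The vertices and edge labels follow the figures; the reweighting rule -(u+1)·|E| is an assumption that reproduces the -45 shown for word/a word (u = 2, 15 edges), and the function names are hypothetical.

```python
def bellman_ford(edges, start):
    """Single-source shortest paths; edges are (u, v, weight, label)."""
    vertices = {start}
    for u, v, _, _ in edges:
        vertices.update((u, v))
    dist = {v: float("inf") for v in vertices}
    pred = {}
    dist[start] = 0
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, weight, label in edges:
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                pred[v] = (u, label)
                changed = True
        if not changed:
            break
    return dist, pred


def path_labels(pred, start, end):
    labels, node = [], end
    while node != start:
        node, label = pred[node]
        labels.append(label)
    return labels[::-1]


cells = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (4, 5),
         (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)]
atomic = ["Our", "baseline", "system", "feeds", "ε/a",
          "word", "into", "PB-SMT", "pipeline", "."]
edges = [(cells[k], cells[k + 1], 1, atomic[k]) for k in range(len(atomic))]
edges += [
    ((3, 3), (4, 5), 2, "feeds/feeds a"),
    ((4, 4), (5, 6), 2, "word/a word"),
    ((2, 2), (4, 5), 3, "system feeds/system feeds a"),
    ((4, 4), (6, 7), 3, "word into/a word into"),
    ((3, 3), (5, 6), 3, "feeds word/feeds a word"),
]

# Assumed reweighting of the edge that matches the gold edit word -> a word.
u_limit = 2
bonus = -(u_limit + 1) * len(edges)   # -45 for this lattice
edges = [(a, b, bonus if label == "word/a word" else w, label)
         for a, b, w, label in edges]

dist, pred = bellman_ford(edges, (0, 0))
best = path_labels(pred, (0, 0), (9, 10))
print(dist[(9, 10)], best)
```

The shortest path passes through the reweighted word/a word edge, so the extracted system edit matches the gold standard edit.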
Proof of Correctness
Theorem: The set of edits corresponding to the shortest path has the maximum overlap with the gold standard annotation.