Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff

Evaluating Grammatical Error Detection and Correction

Resources:
Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.
Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff
The CoNLL-2013 Shared Task on Grammatical Error Correction, Ng et al.
Better Evaluation for Grammatical Error Correction, Dahlmeier and Ng

Comparison to Standard NLP Tasks

[Figure: two evaluation settings. Standard NLP evaluation compares the annotator tag with the system output. Error detection evaluation involves three parties: the learner sentence, the annotator tag and the system output.]

Resource: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.

Traditional Evaluation Measures

Comma restoration task
Commas are removed from well-edited text (gold standard)
System tries to restore commas by predicting their locations
Comparison:
▪ Binary distinction (presence or absence of comma)

Traditional Evaluation Measures

Comma restoration task
Comparison can be represented in a contingency table
Accuracy (A)
Precision (P)
Recall (R)
F-measure (F1)
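The four measures can be computed directly from the cells of a 2x2 contingency table. The sketch below is illustrative only: the function name and the example counts are assumptions, not figures from the comma restoration task.

```python
def contingency_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from 2x2 contingency counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no gold positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts: 40 commas restored correctly, 10 spurious commas,
# 20 missed commas, 930 positions correctly left without a comma.
a, p, r, f1 = contingency_metrics(tp=40, fp=10, fn=20, tn=930)
print(f"A={a:.3f} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that accuracy is the only one of the four that uses TN, which matters once the classes are skewed.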

3-Way Contingency Table in Error Detection

Comma error detection task
System seeks to find and correct errors in the writer’s usage of commas
Intricacies:
▪ Positive class: an error of the writer that involves a comma (not the presence of a comma), i.e. a mismatch between the writer’s sentence and the annotator’s judgement
▪ Negative class: writer and annotator agree
▪ System’s judgement has not been considered yet
▪ Writer-Annotator-System (WAS)


Contingency scheme for WAS

Considering System prediction and Writer’s form together


Contingency scheme for WAS

Considering System prediction and Gold standard together



Simplified contingency scheme


The case of the * row in the simplified WAS contingency table
Concerns different categories (TP, TN, FP, FN) depending on whether the evaluation is for detection or correction
▪ For detection: TP
▪ For correction: X Y Z (writer, annotator and system all differ) is both FP (system ≠ writer) and FN (system ≠ annotator)

Problem of Skewed Data

Distribution of positive and negative error classes is highly skewed
13% errors in preposition usage by L2 writers (Han et al., 2006)
Baseline system always predicts “no errors”
▪ 87% accuracy

All the measures are affected by the proportion of errors in the gold standard
▪ Prevalence

All the measures are affected by the proportion of cases that the system reports as errors
▪ Bias

Effect on a system that performs no better than chance
For such a system P equals the prevalence and R equals the bias
▪ Increase in P when prevalence increases
▪ Increase in R when bias increases


• Expected match between Annotator and System: the product of their probabilities for the respective categories (in this case Error/No-Error)

• The expected proportion of TP matches is equal to the product of the proportion of cases assigned the Error label by the Annotator (i.e., the prevalence) and the proportion of cases assigned the Error label by the System (i.e., the bias)

Expected proportion of TP match
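Written out as a formula, the statement above reads as follows. The numeric instance uses a prevalence and a bias of 0.2 each, chosen only for illustration.

```latex
\[
E[\mathrm{TP}] \;=\; P(\text{Annotator} = \text{Error}) \times P(\text{System} = \text{Error})
\;=\; \text{prevalence} \times \text{bias}
\]
\[
\text{e.g. } \text{prevalence} = 0.2,\ \text{bias} = 0.2 \;\Rightarrow\; E[\mathrm{TP}] = 0.2 \times 0.2 = 0.04
\]
```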


• Cohen’s kappa

• Accuracy = 0.68, Precision = 0.04/(0.04+0.16) = 0.2, Recall = 0.2 and F1= 0.2

System1: Predictions are correct at chance level


• Cohen’s kappa

• Accuracy = 0.80

• After removing the cases expected to show agreement by chance, the System is correct in 38% of the remaining cases

System2: Prevalence and bias remain the same


• Cohen’s kappa

• Accuracy = 0.54, Precision = 0.40, Recall = 0.30, F1 = 0.34

System3: Increased bias and prevalence + Predictions are correct at chance level
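A minimal sketch of Cohen's kappa for the three systems. The System1 cells follow from the precision formula on its slide (TP = 0.04, FP = 0.16) together with the reported recall and accuracy. The System2 and System3 cells are assumptions, back-computed so that they reproduce the reported accuracy, precision and recall; the slides do not list them.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a 2x2 contingency table."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total       # observed agreement (accuracy)
    prevalence = (tp + fn) / total     # annotator says Error
    bias = (tp + fp) / total           # system says Error
    # Agreement expected by chance on Error plus on No-Error.
    expected = prevalence * bias + (1 - prevalence) * (1 - bias)
    return (observed - expected) / (1 - expected)


# Cells as counts per 100 cases: (TP, FP, FN, TN).
systems = {
    "System1": (4, 16, 16, 64),    # A = 0.68, P = R = 0.2
    "System2": (10, 10, 10, 70),   # assumed cells: A = 0.80, same prevalence and bias
    "System3": (12, 18, 28, 42),   # assumed cells: A = 0.54, P = 0.40, R = 0.30
}
for name, cells in systems.items():
    print(name, f"kappa = {cohen_kappa(*cells):.3f}")
```

Under these cells, System1 and System3 come out at chance level (kappa of about 0) despite very different accuracies, while System2 comes out at 0.375, in line with the 38% figure.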


Variability in prevalence or error rates
Prevalence changes with populations of learners with different native languages
Different levels of proficiency in the second language

Variability in bias
Detection system dependent
Threshold for marking Error/Non-Error
▪ Higher threshold → lower bias
▪ Lower threshold → higher bias

Problem of Skewed Data: Remedies

Dealing with sensitivity to bias

Vary threshold and generate precision-recall curve
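A small sketch of the threshold sweep, assuming the detection system outputs a confidence score per case and flags an error when the score reaches the threshold. The scores, gold labels and threshold values are made up for illustration.

```python
def precision_recall_curve(scores, gold, thresholds):
    """Return one (threshold, precision, recall) point per threshold.

    A case is flagged as Error when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        tp = sum(f and g for f, g in zip(flagged, gold))
        fp = sum(f and not g for f, g in zip(flagged, gold))
        fn = sum((not f) and g for f, g in zip(flagged, gold))
        # Convention (an assumption): precision is 1.0 when nothing is flagged.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]          # hypothetical system scores
gold = [True, True, False, True, False, False]   # hypothetical annotator labels
for t, p, r in precision_recall_curve(scores, gold, [0.2, 0.5, 0.85]):
    print(f"threshold={t:.2f} P={p:.2f} R={r:.2f}")
```

Raising the threshold lowers the bias: fewer cases are flagged, so precision tends to rise while recall falls.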


Area under Receiver Operating Characteristic (AUC) curve

[Figure: ROC curve. x-axis: False Positive Rate p(true|false); y-axis: True Positive Rate p(true|true); the 45° diagonal is the curve for random prediction.]

Effect of random prediction is not nullified

Area under random prediction
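The area under the ROC curve can be sketched with the trapezoid rule. The helper names and data below are hypothetical, and the sketch assumes all scores are distinct (ties are not handled). The last line shows that the random-prediction diagonal alone already contributes an area of 0.5.

```python
def roc_points(scores, gold):
    """ROC points (FPR, TPR), lowering the threshold one case at a time."""
    positives = sum(gold)
    negatives = len(gold) - positives
    ranked = sorted(zip(scores, gold), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_error in ranked:
        if is_error:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]          # hypothetical system scores
gold = [True, True, False, True, False, False]   # hypothetical annotator labels
print(area_under(roc_points(scores, gold)))      # AUC of the example system
print(area_under([(0.0, 0.0), (1.0, 1.0)]))      # 0.5 for the random diagonal
```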


Area under kappa curve (AUK)

[Figure: kappa curve. x-axis: False Positive Rate p(true|false); y-axis: Cohen’s kappa.]

Class skewedness is already taken care of

[Figure: ROC curve. x-axis: False Positive Rate; y-axis: True Positive Rate.]

Counting Positives or Errors

Positive class consists of an error in the writer’s text
No 1:1:1 correspondence between the writer’s sentence, the annotator’s correction and the type of error

Writer: Book of my class inspired me
▪ A book in my class inspired me (Article error)
▪ Books for my class inspired me (Number error)
▪ The books of my class were inspiring to me (Article+Number error)


Assuming no ambiguity in error type, what would be the size of the unit over which an error is defined?

The book in my class inspire me

a) The book in my class inspires me
b) The books in my class inspire me

• Unit size: Morpheme level? Word level? Phrase level? String level?

• Token-based approach vs String-based approach


Variability of size can be handled with Edit Distance Measures (EDM)
inspire → inspires is the same as book … inspire → book … inspires

EDM can handle multiple overlapping errors
Sequence: “… development set is similar with test set …”
Correction 1: with → to, and insertion of the
Correction 2: with → to the

EDM can handle both


EDMs are good for comparison, not for providing feedback to the writer
If book and inspire are not linked, feedback such as a violation of subject-verb agreement cannot be provided

Counting Negatives or Non-Errors

Negatives consist of non-errors in the writer’s text: the set complement of the positive class?
An appropriate set of non-errors cannot be easily specified
▪ Book of my class inspire to me
▪ Negatives: a of, a my, a class, a inspire, a to, a me, a .?
▪ Should only the noun phrases be counted?
▪ He is fond beer . ( positions or positions)


Error data is biased towards the negative class
The negative counting strategy has great consequences for performance reporting
Identifying negatives through trivial means results in inflation of true negatives (TN), keeping the other counts in the contingency table constant
▪ Increases A and kappa (P and R do not depend on TN)


Accuracy: 0.54, Kappa = 0.00

Inject 100 more TNs

Accuracy: 0.77, Kappa = 0.21
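The effect of injecting true negatives can be reproduced with a short sketch. The starting counts (TP = 12, FP = 18, FN = 28, TN = 42 out of 100 cases) are an assumption, chosen because they give the reported accuracy of 0.54 and a kappa of 0; the slide does not state the actual counts.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 contingency table."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    prevalence = (tp + fn) / total
    bias = (tp + fp) / total
    expected = prevalence * bias + (1 - prevalence) * (1 - bias)
    return accuracy, (accuracy - expected) / (1 - expected)


before = accuracy_and_kappa(tp=12, fp=18, fn=28, tn=42)
after = accuracy_and_kappa(tp=12, fp=18, fn=28, tn=42 + 100)  # inject 100 TNs
print("before: A=%.2f kappa=%.2f" % before)
print("after:  A=%.2f kappa=%.2f" % after)
```

TP, FP and FN are untouched, so precision and recall stay the same while accuracy and kappa both rise.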

Helping Our Own (HOO) and CoNLL Evaluation

Given: Gold standard (G), System Edits (E)

Example
Learner sentence S
▪ There is no a doubt, tracking system has brought many benefits in this information age
System correction H
▪ There is no doubt, tracking system has brought many benefits in this information age .
Performance
▪ P = 1/1, R = 1/3, F = 1/2
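The figures above follow from counting system edits that also appear in the gold standard. In the sketch below an edit is a hypothetical (start, end, correction) tuple over token offsets. Only the deletion of "a" is taken from the example; the two remaining gold edits are placeholders, assumed solely so that the gold standard holds three edits as implied by R = 1/3.

```python
def edit_prf(system_edits, gold_edits):
    """Precision, recall and F1 over sets of (start, end, correction) edits."""
    matched = len(system_edits & gold_edits)
    precision = matched / len(system_edits) if system_edits else 1.0
    recall = matched / len(gold_edits) if gold_edits else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Tokens: There(0) is(1) no(2) a(3) doubt(4) ,(5) tracking(6) system(7) has(8) ...
system_edits = {(3, 4, "")}          # delete "a"
gold_edits = {
    (3, 4, ""),                      # delete "a" (matches the system edit)
    (7, 8, "systems"),               # placeholder gold edit (assumption)
    (8, 9, "have"),                  # placeholder gold edit (assumption)
}
p, r, f = edit_prf(system_edits, gold_edits)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```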

HOO Evaluation


Extraction of system edits from the writer’s text (source) and the system output (hypothesis) is done with the GNU wdiff utility
Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline
System edit: (ε → a), inserting the article a

Gold standard edit: (word → {a word, words})

Hypothesis matches the first gold standard edit but is flagged as invalid

HOO Evaluation: Modification

Key idea
There may be multiple ways to arrive at the same correction
Extraction of the set of edits that matches the gold standard maximally

MaxMatch (M²) Algorithm

Notations
S: set of writer sentences
H: set of hypotheses or system outputs
G: set of gold standard annotations
▪ each annotation is a set of edits


Notations
An edit is a triple <a,b,C>
▪ Start and end token offsets a and b with respect to a source sentence
▪ A correction C
▪ For a gold standard edit, C is a set of corrections
▪ For a system edit, C is a single correction
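The triple can be sketched as two small record types. The class names, field names and the matching rule (same offsets, and the system correction is one of the gold corrections) are assumptions made for illustration, not definitions taken from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SystemEdit:
    start: int          # token offset a
    end: int            # token offset b
    correction: str     # a single correction C


@dataclass(frozen=True)
class GoldEdit:
    start: int
    end: int
    corrections: FrozenSet[str]   # C is a set of acceptable corrections


def matches(system: SystemEdit, gold: GoldEdit) -> bool:
    """Assumed rule: identical span and the correction is one of the gold options."""
    return ((system.start, system.end) == (gold.start, gold.end)
            and system.correction in gold.corrections)


# Source tokens: Our(0) baseline(1) system(2) feeds(3) word(4) into(5) ...
gold = GoldEdit(4, 5, frozenset({"a word", "words"}))
print(matches(SystemEdit(4, 4, "a"), gold))        # insertion of "a": no match
print(matches(SystemEdit(4, 5, "a word"), gold))   # phrase-level edit: match
```

The two system edits yield the same output sentence, yet only the phrase-level one lines up with the gold standard edit.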


Evaluation of system output
Extracting a set of system edits for each source-hypothesis pair
▪ Construction of the edit lattice
▪ Searching through the lattice to extract the optimal set of edits
Evaluating system edits with respect to the gold standard

Edit Lattice

Edit metric: Levenshtein distance
Minimum number of insertions, deletions and substitutions needed to transform one string into another

How to compute the Levenshtein distance?
▪ Use a 2-D matrix (Levenshtein matrix) to store edit costs of substrings of the string pair
▪ Compute individual cell entries (edit costs) with dynamic programming
▪ Rightmost corner cell stores the optimal edit cost

Levenshtein Edit Metric

Slides from Jurafsky course page

How similar are two strings?

Spell correction
The user typed “graffe”
Which is closest?
▪ graf
▪ graft
▪ grail
▪ giraffe

• Computational Biology
• Align two sequences of nucleotides:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

• Resulting alignment:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also for Machine Translation, Information Extraction, Speech Recognition

Edit Distance

The minimum edit distance between two strings is the minimum number of editing operations
▪ Insertion
▪ Deletion
▪ Substitution
needed to transform one into the other

Minimum Edit Distance

Two strings and their alignment:

Minimum Edit Distance

If each operation has a cost of 1
▪ Distance between these is 5

If substitutions cost 2 (Levenshtein)
▪ Distance between them is 8

How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start string to the final string:
▪ Initial state: the word we’re transforming
▪ Operators: insert, delete, substitute
▪ Goal state: the word we’re trying to get to
▪ Path cost: what we want to minimize: the number of edits

Minimum Edit as Search

But the space of all edit sequences is huge!
We can’t afford to navigate naïvely
Lots of distinct paths wind up at the same state.
▪ We don’t have to keep track of all of them
▪ Just the shortest path to each of those revisited states.

Defining Min Edit Distance
For two strings
▪ X of length n
▪ Y of length m
We define D(i,j)
▪ the edit distance between X[1..i] and Y[1..j]
▪ i.e., the first i characters of X and the first j characters of Y
The edit distance between X and Y is thus D(n,m)

Dynamic Programming for Minimum Edit Distance

Dynamic programming: a tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
▪ We compute D(i,j) for small i,j
▪ And compute larger D(i,j) based on previously computed smaller values
▪ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

Defining Min Edit Distance (Levenshtein)

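Initialization
D(i,0) = i
D(0,j) = j

Recurrence Relation (N and M are the lengths of X and Y)
For each i = 1…N
  For each j = 1…M

\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)}
\end{cases}
\]

Termination
D(N,M) is distance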
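The recurrence can be turned into a short tabular computation. This is a minimal sketch rather than a reference implementation; the function name and the `sub_cost` parameter are illustrative choices.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Fill the table D bottom-up and return D(N, M).

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (2 for the Levenshtein variant used on these slides).
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: D(i,0) = i
        D[i][0] = i
    for j in range(1, m + 1):          # initialization: D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                            # deletion
                D[i][j - 1] + 1,                            # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost) # substitution or match
            )
    return D[n][m]


print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```

The two printed values agree with the distances of 8 and 5 quoted for this string pair.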

N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N

The Edit Distance Table

N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1  2
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N

\[
D(1,1) = \min
\begin{cases}
D(0,1) + 1 \\
D(1,0) + 1 \\
D(0,0) + \begin{cases} 2 & \text{if } X(1) \neq Y(1) \\ 0 & \text{if } X(1) = Y(1) \end{cases}
\end{cases}
\]

N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N


Computing alignments

Edit distance isn’t sufficient
We often need to align each character of the two strings to each other
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we came from
When we reach the end, trace back the path from the upper right corner to read off the alignment

N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N


Adding Backtrace to Minimum Edit Distance

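Base conditions: D(i,0) = i, D(0,j) = j
Termination: D(N,M) is distance

Recurrence Relation (N and M are the lengths of X and Y):
For each i = 1…N
  For each j = 1…M

\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)}
\end{cases}
\]

\[
ptr(i,j) =
\begin{cases}
\text{LEFT} & \text{(insertion)} \\
\text{DOWN} & \text{(deletion)} \\
\text{DIAG} & \text{(substitution)}
\end{cases}
\]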
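A sketch of the same table with one back-pointer per cell, followed by the trace back from D(N,M). Two choices here are assumptions: ties are broken in favour of DIAG, and `*` marks a gap in the alignment.

```python
def align(source, target, sub_cost=2):
    """Return (distance, alignment) where alignment is a list of character pairs."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"     # only deletions lead here
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"     # only insertions lead here
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "DIAG"),
                (D[i - 1][j] + 1, "DOWN"),
                (D[i][j - 1] + 1, "LEFT"),
            ]
            # min() keeps the first minimum, so DIAG wins ties.
            D[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])
    # Trace back from the final cell to (0, 0).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "DOWN":
            pairs.append((source[i - 1], "*"))   # deletion
            i -= 1
        else:
            pairs.append(("*", target[j - 1]))   # insertion
            j -= 1
    pairs.reverse()
    return D[n][m], pairs


distance, pairs = align("intention", "execution")
print(distance)
print(" ".join(s for s, _ in pairs))
print(" ".join(t for _, t in pairs))
```

Several alignments can share the minimum cost; keeping a single pointer per cell returns just one of them.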

MinEdit with Backtrace

Alignment

Back to MaxMatch

Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline
System edit: (ε → a), inserting the article a

Gold standard edit: (word → {a word, words})

Levenshtein Matrix Construction

Columns (hypothesis tokens): 0 #, 1 Our, 2 baseline, 3 system, 4 feeds, 5 a, 6 word, 7 into, 8 PB-SMT, 9 pipeline, 10 .
Rows (source tokens): 0 #, 1 Our, 2 baseline, 3 system, 4 feeds, 5 word, 6 into, 7 PB-SMT, 8 pipeline, 9 .

Levenshtein Matrix Construction

Columns (hypothesis tokens): 0 #, 1 Our, 2 baseline, 3 system, 4 feeds, 5 a, 6 word, 7 into, 8 PB-SMT, 9 pipeline, 10 .

               0  1  2  3  4  5  6  7  8  9 10
0 #            0  1  2  3  4  5  6  7  8  9 10
1 Our          1  0  1  2  3  4  5  6  7  8  9
2 baseline     2  1  0  1  2  3  4  5  6  7  8
3 system       3  2  1  0  1  2  3  4  5  6  7
4 feeds        4  3  2  1  0  1  2  3  4  5  6
5 word         5  4  3  2  1  1  1  2  3  4  5
6 into         6  5  4  3  2  2  2  1  2  3  4
7 PB-SMT       7  6  5  4  3  3  3  2  1  2  3
8 pipeline     8  7  6  5  4  4  4  3  2  1  2
9 .            9  8  7  6  5  5  5  4  3  2  1

Finding Shortest Paths

Columns (hypothesis tokens): 0 #, 1 Our, 2 baseline, 3 system, 4 feeds, 5 a, 6 word, 7 into, 8 PB-SMT, 9 pipeline, 10 .

               0  1  2  3  4  5  6  7  8  9 10
0 #            0  1  2  3  4  5  6  7  8  9 10
1 Our          1  0  1  2  3  4  5  6  7  8  9
2 baseline     2  1  0  1  2  3  4  5  6  7  8
3 system       3  2  1  0  1  2  3  4  5  6  7
4 feeds        4  3  2  1  0  1  2  3  4  5  6
5 word         5  4  3  2  1  1  1  2  3  4  5
6 into         6  5  4  3  2  2  2  1  2  3  4
7 PB-SMT       7  6  5  4  3  3  3  2  1  2  3
8 pipeline     8  7  6  5  4  4  4  3  2  1  2
9 .            9  8  7  6  5  5  5  4  3  2  1

Edit Lattice

A lattice of all the shortest paths from the top-left corner to the bottom-right corner

Each vertex corresponds to a cell in the Levenshtein matrix

Each edge corresponds to an atomic edit operation
▪ Insert, delete, substitute, match

Each path corresponds to a shortest sequence of edits that transforms the source into the hypothesis

Edit Lattice Construction

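[Figure: edit lattice, a single path of atomic edges; edge weights in parentheses]
(0,0) → Our (1) → (1,1) → baseline (1) → (2,2) → system (1) → (3,3) → feeds (1) → (4,4) → ε/a (1) → (4,5) → word (1) → (5,6) → into (1) → (6,7) → PB-SMT (1) → (7,8) → pipeline (1) → (8,9) → . (1) → (9,10)

Edit lattice for “Our baseline system feeds (ε → a) word into PB-SMT pipeline .”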

Augmenting Edit Lattice

Annotators can use longer phrases and can use unchanged words from the context
▪ word → {a word, words}
Should we allow an arbitrary number of unchanged words in an edit?
▪ Avoid very large edits with many unchanged words
▪ Put a limit (u) on the number of unchanged words in an edit

Augmenting Edit Lattice

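Allow phrase level edits
Add transitive edges, provided the edit stays within the limit u and changes at least one word
Let two edges be adjacent (the first ends at the vertex where the second starts)
▪ Transitive edge: a single edge from the start of the first to the end of the second, weighted by the sum of the two weights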

Adding Transitive Edges

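[Figure: edit lattice with transitive edges; edge weights in parentheses]
Atomic path: (0,0) → Our (1) → (1,1) → baseline (1) → (2,2) → system (1) → (3,3) → feeds (1) → (4,4) → ε/a (1) → (4,5) → word (1) → (5,6) → into (1) → (6,7) → PB-SMT (1) → (7,8) → pipeline (1) → (8,9) → . (1) → (9,10)
Transitive edges:
▪ (3,3) → (4,5): feeds/feeds a (2)
▪ (4,4) → (5,6): word/a word (2)
▪ (2,2) → (4,5): system feeds/system feeds a (3)
▪ (4,4) → (6,7): word into/a word into (3)
▪ (3,3) → (5,6): feeds word/feeds a word (3)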

Favoring Gold Standard Match

[Figure: the same edit lattice with transitive edges; the edge (4,4) → (5,6), word/a word, matches the gold standard edit and is reweighted from 2 to -45]

Change the weight of each edge that matches a gold standard edit to −(u+1)×|E|
▪ word/a word (-45)

Finding the Optimal System Edits

Perform a single-source shortest path search with negative weights from the start vertex to the end vertex
▪ Bellman-Ford algorithm
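A minimal Bellman-Ford sketch over the example lattice. The vertex pairs for the transitive edges are derived from the edge labels, and the tuple-based edge representation is an assumption made for illustration; this is not the shared task scorer.

```python
def bellman_ford(edges, source, target):
    """edges: list of (u, v, weight, label). Returns (cost, labels on the shortest path)."""
    vertices = {source, target}
    for u, v, _, _ in edges:
        vertices.update((u, v))
    dist = {v: float("inf") for v in vertices}
    pred = {}
    dist[source] = 0
    for _ in range(len(vertices) - 1):     # at most |V| - 1 relaxation rounds
        changed = False
        for u, v, w, label in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = (u, label)
                changed = True
        if not changed:
            break
    path, node = [], target
    while node != source:                  # follow predecessors back to the start
        node, label = pred[node]
        path.append(label)
    path.reverse()
    return dist[target], path


atomic = [
    ((0, 0), (1, 1), 1, "Our"), ((1, 1), (2, 2), 1, "baseline"),
    ((2, 2), (3, 3), 1, "system"), ((3, 3), (4, 4), 1, "feeds"),
    ((4, 4), (4, 5), 1, "ε/a"), ((4, 5), (5, 6), 1, "word"),
    ((5, 6), (6, 7), 1, "into"), ((6, 7), (7, 8), 1, "PB-SMT"),
    ((7, 8), (8, 9), 1, "pipeline"), ((8, 9), (9, 10), 1, "."),
]
transitive = [
    ((3, 3), (4, 5), 2, "feeds/feeds a"),
    ((4, 4), (5, 6), -45, "word/a word"),   # matches the gold edit, reweighted
    ((2, 2), (4, 5), 3, "system feeds/system feeds a"),
    ((4, 4), (6, 7), 3, "word into/a word into"),
    ((3, 3), (5, 6), 3, "feeds word/feeds a word"),
]
cost, path = bellman_ford(atomic + transitive, (0, 0), (9, 10))
print(cost)
print(path)
```

The lattice is acyclic, so the negative weight cannot create a negative cycle, and the cheapest path is pulled through the edge that matches the gold standard edit.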

Proof of Correctness

Theorem
The set of edits corresponding to the shortest path has the maximum overlap with the gold standard annotation.

Proof of Correctness

Proof
Consider the edit sequence on the shortest path and its number of matching edits
Assume there is another edit sequence with a higher path cost but more matching edits

Bound on right hand side

Bound on left hand side

Contradiction
