A method for unsupervised broad-coverage lexical error detection and correction

4th Workshop on Innovative Uses of NLP for Building Educational Applications

NAACL, June 5, 2009

Nai-Lung Tsao and David Wible
National Central University, Taiwan

The Research Context: IWiLL Online Writing Platform

www.iwillnow.org

• Since 2000, with the support of MOE & Taipei Bureau of Education, IWiLL has been used in Taiwan by:
  - 455 schools
  - 2,804 teachers
  - 161,493 students and 22,791 independent learners
• Teachers have authored 9,429 web-based lessons with the system’s authoring tool.
• The learner corpus (English TLC) has archived over 32,000 English essays.
• 5 million words of machine-readable running text written by Taiwan’s learners using the IWiLL writing platform.
• 100,000 tokens of teacher comments on these student texts.

Second Language Learners’ Error Detection and Correction

• Lexical and Lexico-grammatical errors

- an open-ended class

- driving teachers crazy

- either no rules involved or rules of very limited productivity

Two components to our system

INPUT: user-produced string, e.g. ‘on my opinion’

1. Target Language Knowledgebase: hybrid n-grams extracted from BNC

2. Edit Distance Algorithm: compares user’s string & hybrid n-grams

Error Detection/Correction

The Knowledgebase of Hybrid N-grams



What, Why, and How

What is a hybrid n-gram?

An n-gram that admits items of different levels

- Traditional n-gram: ‘in my opinion’

- Hybrid n-gram: ‘in [dps] opinion’

Why use hybrid n-grams?

- Traditional n-grams and error precision

- POS n-grams and recall

Error Detection

True positive: Enjoy to canoe > unattested > marked as error

False positive: Enjoy canoeing > unattested > marked as error

We could extract the POS gram V + VVg, based on attested strings like: enjoy hiking OR like watching

But this would accept: hope exploring
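The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each token is a hypothetical tuple of (word form, lexeme, detailed POS, rough POS), the tag VVi and the three tiny knowledgebases are invented for the example, and the matching rule (a stored item matches a token if it equals any of the token's levels) is a simplification rather than the authors' implementation.

```python
# Hypothetical annotated tokens: (word form, lexeme, detailed POS, rough POS).
enjoy_canoeing = [("enjoy", "enjoy", "VVi", "V"), ("canoeing", "canoe", "VVg", "V")]
hope_exploring = [("hope", "hope", "VVi", "V"), ("exploring", "explore", "VVg", "V")]

# Three toy knowledgebases built from the attested strings
# 'enjoy hiking' and 'like watching' (illustrative only).
word_ngrams = {("enjoy", "hiking"), ("like", "watching")}   # traditional n-grams
pos_ngrams = {("V", "VVg")}                                 # POS n-gram
hybrid_ngrams = {("enjoy", "VVg"), ("like", "VVg")}         # hybrid n-grams


def accepted(tokens, knowledgebase):
    """True if some stored n-gram matches the string slot by slot.

    A slot matches when the stored item equals any level of the token.
    """
    return any(
        len(gram) == len(tokens)
        and all(item in token for item, token in zip(gram, tokens))
        for gram in knowledgebase
    )


# Traditional n-grams: the correct 'enjoy canoeing' is unattested (false positive).
print(accepted(enjoy_canoeing, word_ngrams))    # False
# POS n-grams: the erroneous 'hope exploring' is accepted (missed error).
print(accepted(hope_exploring, pos_ngrams))     # True
# Hybrid n-grams: 'enjoy canoeing' is accepted, 'hope exploring' is not.
print(accepted(enjoy_canoeing, hybrid_ngrams))  # True
print(accepted(hope_exploring, hybrid_ngrams))  # False
```

Mixing a lexical item in one slot with a POS tag in another keeps the selectional restriction of enjoy while still generalizing over the verbs that follow it.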

How hybrid n-grams are extracted for the knowledgebase

How the hybrid n-grams are extracted



4 categories of info for each item in an n-gram:

                  item 1     item 2
word form         enjoyed    hiking
lexeme            enjoy      hike
[POS detailed]    VVd        VVg
{POS rough}       V          V

Some hybrid n-grams for enjoyed hiking

enjoyed + V
enjoy + V
enjoyed + VVg
enjoy + VVg
VVd + VVg
enjoyed + hike
enjoy + hike
V + hiking
etc.

Potential Hybrid N-grams for a string
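One way to enumerate the potential hybrid n-grams of a string is to pick one of the four levels for each item and take every combination. The sketch below assumes that reading; the tuple layout and the function name `potential_hybrid_ngrams` are hypothetical, and the result also includes the fully lexical and fully POS combinations.

```python
from itertools import product

# One annotated item: (word form, lexeme, detailed POS, rough POS).
ENJOYED = ("enjoyed", "enjoy", "VVd", "V")
HIKING = ("hiking", "hike", "VVg", "V")


def potential_hybrid_ngrams(tokens):
    """Return every n-gram that picks exactly one level per item."""
    return set(product(*tokens))


grams = potential_hybrid_ngrams([ENJOYED, HIKING])
print(len(grams))                # 16 (4 levels x 4 levels)
print(("enjoy", "VVg") in grams) # True
print(("V", "hiking") in grams)  # True
```

For a string of n items this yields up to 4^n combinations, which is one reason the knowledgebase is later pruned and filtered by frequency.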


Edit Distance Component: Steps in measuring edit distance

1. Generate all hybrid n-grams from the learner input string (Set C)

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S).
      We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.
   b. Prune Set S using filter factor or coverage

3. Rank candidates by weighted edit distance between members of C and S




Input from learner: enjoyed hiking

Set C = hybrid n-grams generated from the learner string

2. b. Prune Set S using filter factor or coverage
   c. Eliminate n-grams under frequency threshold

3. Calculate weighted edit distance between members of C and S

[Figure: Set C for ‘enjoyed hiking’, with four levels per item (word form: enjoyed, hiking; lexeme: enjoy, hike; detailed POS: VVd, VVg; rough POS: V, V), shown against Set S, the target knowledgebase hybrid n-grams. Panels pair enjoy with hike, hiking, VVg, or V, and hike with enjoy, enjoyed, VVd, or V.]

Pruning Set S of Candidates

enjoy + V: 100 tokens

enjoy + VVg: 80 tokens

We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set. Here enjoy + VVg accounts for 80 of the 100 tokens of enjoy + V, so enjoy + V is pruned (X) and enjoy + VVg is kept.

Pruning of the Knowledgebase will affect error recall

The remaining Set S is filtered for frequency of member hybrid n-grams
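The pruning and frequency filtering described above might look like the following sketch. The 80% ratio comes from the slide; everything else is an assumption made for illustration: the `PARENTS` table saying which items are more general than which, the `min_freq` parameter (the slides do not state the actual threshold), and the function names.

```python
# Hypothetical generalization table: item -> more general items that subsume it.
PARENTS = {
    "VVg": {"V"},
    "VVd": {"V"},
    "hiking": {"hike", "VVg", "V"},
    "enjoyed": {"enjoy", "VVd", "V"},
}


def subsumes(general, specific):
    """True if `general` is a strictly more general n-gram than `specific`."""
    return (
        general != specific
        and len(general) == len(specific)
        and all(g == s or g in PARENTS.get(s, ()) for g, s in zip(general, specific))
    )


def prune(counts, ratio=0.8, min_freq=1):
    """Drop subsuming n-grams dominated by a subsumed one, then filter by frequency.

    ratio=0.8 follows the slide; min_freq is a placeholder threshold.
    """
    pruned = {
        a
        for a in counts
        for b in counts
        if subsumes(a, b) and counts[b] / counts[a] >= ratio
    }
    return {g: c for g, c in counts.items() if g not in pruned and c >= min_freq}


counts = {("enjoy", "V"): 100, ("enjoy", "VVg"): 80}
print(prune(counts))  # {('enjoy', 'VVg'): 80}
```

With 79 tokens for enjoy + VVg instead of 80, the ratio falls below 0.8 and both n-grams would survive, so the more permissive enjoy + V pattern would remain available as a candidate.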






Weighting of Edit Distance: ‘enjoyed to hike’

Learner string: enjoyed to hike

Generate Set C of hybrid n-grams:
enjoyed to hike; enjoy VVt; enjoy VV to hike; VVd to hike; etc.

Generate Set S of hybrid n-grams:
enjoyed hiking; enjoyed hike; enjoy VVg; VVd hiking; V hiking; VVd hike; enjoy learning

Distance = 1: string c and string s are identical but for one slot

Correction candidates are those with a distance of 1 or lower.
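Because only substitution is allowed, the unweighted distance reduces to counting the slots where two equal-length n-grams differ. The sketch below assumes that reading; the function names and the toy Sets C and S are hypothetical, with ‘in [dps] opinion’ taken from the earlier slide.

```python
def substitution_distance(c, s):
    """Number of differing slots, or None when the lengths differ."""
    if len(c) != len(s):
        return None
    return sum(1 for x, y in zip(c, s) if x != y)


def correction_candidates(set_c, set_s, max_distance=1):
    """Members of Set S within max_distance of some member of Set C."""
    found = set()
    for s in set_s:
        for c in set_c:
            d = substitution_distance(c, s)
            if d is not None and d <= max_distance:
                found.add(s)
    return found


# Toy data for the learner string 'on my opinion'.
set_c = {("on", "my", "opinion"), ("on", "[dps]", "opinion")}
set_s = {("in", "[dps]", "opinion"), ("at", "that", "time")}
print(correction_candidates(set_c, set_s))  # {('in', '[dps]', 'opinion')}
```

The hybrid member of Set C is what brings ‘in [dps] opinion’ within distance 1; the purely lexical ‘on my opinion’ differs from it in two slots.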

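Ranking of candidates with distance = 1 from learner string

If the differing element has the same lexeme but a different word form, the candidate is closer than one whose differing element has a different lexeme.

If the differing element has the same rough POS but a different detailed POS, the candidate is closer than one whose differing element has a different rough POS.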
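These two ranking preferences can be expressed as a per-slot cost. The sketch below is a guess at one workable scheme, not the weighting used in the system: the numeric weights (0.25 and 0.5), the `Token` layout, the POS tags in the example, and the simplification that every candidate slot is fully annotated are all assumptions.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lexeme: str
    pos_detailed: str
    pos_rough: str


def slot_cost(learner, candidate):
    """Cost of substituting one slot; the weights are illustrative only."""
    if learner.form == candidate.form:
        lexical = 0.0
    elif learner.lexeme == candidate.lexeme:
        lexical = 0.25   # same lexeme, different word form
    else:
        lexical = 0.5    # different lexeme
    if learner.pos_detailed == candidate.pos_detailed:
        pos = 0.0
    elif learner.pos_rough == candidate.pos_rough:
        pos = 0.25       # same rough POS, different detailed POS
    else:
        pos = 0.5        # different rough POS
    return lexical + pos


def rank(learner, candidates):
    """Sort same-length candidates that differ in at most one slot by cost."""
    scored = []
    for cand in candidates:
        if len(cand) != len(learner):
            continue
        costs = [slot_cost(a, b) for a, b in zip(learner, cand)]
        if sum(1 for c in costs if c > 0) <= 1:
            scored.append((sum(costs), cand))
    return sorted(scored, key=lambda pair: pair[0])


she, will = Token("she", "she", "PNP", "P"), Token("will", "will", "VM0", "V")
learner = [she, will, Token("ran", "run", "VVd", "V")]
will_run = [she, will, Token("run", "run", "VVi", "V")]
will_go = [she, will, Token("go", "go", "VVi", "V")]

ranked = rank(learner, [will_go, will_run])
print([" ".join(t.form for t in cand) for _, cand in ranked])
# ['she will run', 'she will go']
```

Here ‘she will run’ outranks ‘she will go’ because ran and run share a lexeme, which mirrors the first preference above.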

Examples 1: C-selection
• Enjoy to swim > enjoy swimming
• Enjoy to shop > enjoy shopping
• Enjoy to canoe > enjoy canoeing
• Enjoy to learn > *need to learn; ?want to learn; enjoy learning
• Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
• Hope finding > hope to find
• Let us to know > let us know
• Get used to say > *get used to; *have used to say

Collocation with C-selection
• Spend time to fix > spend time fixing; take time to fix
• Take time fixing > take time to fix
• Take time recuperating > take time to recuperate
• Spend time to recuperate > spend time recuperating; take time to recuperate

Examples 2: Preposition
Fixed expressions:
• On the outset > at the outset
• In different reasons > for different reasons
• In that time > at that time; by that time
• On that time > at that time; by that time
• On my opinion > in my opinion
• In my point of view > from my point of view
• I am interested of > I am interested in
• She is interested of > she is interested in
• I am interesting in > I am interested in
• She is interesting in > she is interested in
• Just on the time when > just at the time when; *just to the time when

Examples 3: Preposition/Particle
Verb + preposition (particle):
• Discuss to each other > *discussing to each other (should be discuss WITH each other)
• Discuss this to them > discuss this with them
• Waited to her > waited for her
• Waited to them > waited for them

Noun + preposition:
• His admiration to > his admiration for
• His accomplishment on > * No suggestion
• The opposite side to > the opposite side of
• A crisis on > a crisis of; a crisis in
• A crisis on his work > a crisis of his work (*a crisis on his work)

Examples 4: Content Word Choice
• Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
• Frame of mood > ??change of mood; frame of mind; *frame of reference

Examples 5: Morpho-syntactic
• She will ran > she will run
• She will runs > she will run

Pronoun case:
• What made she change > *what made she change (no correction; should be made HER change)

Noun countability or number errors:
• In modern time > in modern times

Number agreement in head noun and determiner:
• Too much people > too many people
• So much things > so many things
• So many thing > so many things
• One of the man > one of the men
• One of the problem > one of the problems
• In my opinions > in my opinion
• A lot of problem > a lot of problems

Complementizer selection:
• I wonder that > I wonder if; I wonder whether

Future Work

• Improving POS tagging using 2nd order model

• Machine learning of weighting for the various features determining edit distance

• Incorporation of this into our IWiLL online writing environment

• Incorporate MI for the knowledgebase’s hybrid n-grams

Thank you
