A method for unsupervised broad-coverage lexical error detection and correction. 4th Workshop on Innovative Uses of NLP for Building Educational Applications, NAACL, June 5, 2009. Nai-Lung Tsao and David Wible, National Central University, Taiwan.

Page 1: A method for unsupervised broad-coverage lexical error detection and correction

A method for unsupervised broad-coverage lexical error detection and correction

4th Workshop on Innovative Uses of NLP for Building Educational Applications

NAACL, June 5, 2009

Nai-Lung Tsao and David Wible
National Central University, Taiwan

Page 2: A method for unsupervised broad-coverage lexical error detection and correction

The Research Context: IWiLL Online Writing Platform

www.iwillnow.org

• Since 2000, with the support of the MOE and the Taipei Bureau of Education, IWiLL has been used in Taiwan by:
  - 455 schools
  - 2,804 teachers
  - 161,493 students and 22,791 independent learners
• Teachers have authored 9,429 web-based lessons with the system’s authoring tool.
• The learner corpus (English TLC) has archived over 32,000 English essays:
  - 5 million words of machine-readable running text written by Taiwan’s learners using the IWiLL writing platform
  - 100,000 tokens of teacher comments on these student texts

Page 3: A method for unsupervised broad-coverage lexical error detection and correction

Second Language Learners’ Error Detection and Correction

• Lexical and Lexico-grammatical errors

- an open-ended class

- driving teachers crazy

- either no rules involved or rules of very limited productivity

Page 4: A method for unsupervised broad-coverage lexical error detection and correction

Two components to our system

INPUT: user-produced string (‘on my opinion’)

1. Target Language Knowledgebase: hybrid n-grams extracted from BNC

2. Edit Distance Algorithm: compares user’s string & hybrid n-grams

Error Detection/Correction

Page 5: A method for unsupervised broad-coverage lexical error detection and correction

The Knowledgebase of Hybrid N-grams: What, Why, and How

1. Target Language Knowledgebase: hybrid n-grams extracted from BNC

What is a hybrid n-gram?

An n-gram that admits items of different levels

- Traditional n-gram: ‘in my opinion’

- Hybrid n-gram: ‘in [dps] opinion’

Why use hybrid n-grams?

- Traditional n-grams and error precision

- POS n-grams and recall

Error detection with traditional n-grams:

- True positive: enjoy to canoe > unattested > marked as error

- False positive: enjoy canoeing > unattested > marked as error

Based on attested strings like ‘enjoy hiking’ OR ‘like watching’, we could extract the POS gram V + VVg. But this would accept: hope exploring
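The trade-off on this slide can be sketched in a few lines of Python. This is only an illustration: the `ATTESTED` strings, the `Token` type, the `is_flagged` helper and the hand-assigned tags are assumptions made for the example, not BNC data or the authors' implementation. It shows a word n-gram check flagging the correct ‘enjoy canoeing’, a POS n-gram check accepting the erroneous ‘hope exploring’, and a hybrid check (lexeme + detailed POS) handling both.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str    # surface word form
    lexeme: str  # lemma
    pos: str     # detailed POS tag (e.g. VVg)
    rough: str   # rough POS class (e.g. V)

# Hypothetical attested strings standing in for corpus evidence.
ATTESTED = [
    (Token("enjoy", "enjoy", "VVi", "V"), Token("hiking", "hike", "VVg", "V")),
    (Token("like", "like", "VVi", "V"), Token("watching", "watch", "VVg", "V")),
]

WORD_GRAMS = {(a.word, b.word) for a, b in ATTESTED}      # enjoy hiking, like watching
POS_GRAMS = {(a.rough, b.pos) for a, b in ATTESTED}       # V + VVg
HYBRID_GRAMS = {(a.lexeme, b.pos) for a, b in ATTESTED}   # enjoy + VVg, like + VVg

def is_flagged(pair, level):
    """Return True if the two-token string is unattested at the given level."""
    a, b = pair
    if level == "word":
        return (a.word, b.word) not in WORD_GRAMS
    if level == "pos":
        return (a.rough, b.pos) not in POS_GRAMS
    return (a.lexeme, b.pos) not in HYBRID_GRAMS

enjoy_canoeing = (Token("enjoy", "enjoy", "VVi", "V"),
                  Token("canoeing", "canoe", "VVg", "V"))
hope_exploring = (Token("hope", "hope", "VVi", "V"),
                  Token("exploring", "explore", "VVg", "V"))

for name, pair in [("enjoy canoeing", enjoy_canoeing),
                   ("hope exploring", hope_exploring)]:
    print(name, {lvl: is_flagged(pair, lvl) for lvl in ("word", "pos", "hybrid")})
```

In this toy setting the word-level check flags both strings (one false positive), the POS-level check flags neither (one missed error), and the hybrid check flags only ‘hope exploring’.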

How hybrid n-grams are extracted for the knowledgebase

Page 6: A method for unsupervised broad-coverage lexical error detection and correction

How the hybrid n-grams are extracted

1. Target Language Knowledgebase: hybrid n-grams extracted from BNC

4 categories of info for each item in an n-gram (example: ‘enjoyed hiking’):

word form         enjoyed    hiking
lexeme            enjoy      hike
[POS detailed]    VVd        VVg
{POS rough}       V          V

Potential hybrid n-grams for a string

Some hybrid n-grams for ‘enjoyed hiking’:

- enjoyed + V
- enjoy + V
- enjoyed + VVg
- enjoy + VVg
- VVd + VVg
- enjoyed + hike
- enjoy + hike
- V + hiking
- etc.
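One way to read this slide is that the potential hybrid n-grams of a string are the combinations obtained by picking one of the four levels for each item. A minimal sketch under that assumption follows; the function name `hybrid_ngrams` and the tuple layout are hypothetical, and whether the fully lexical and fully POS combinations count as "hybrid" is not stated in the slides.

```python
from itertools import product

def hybrid_ngrams(tokens):
    """Enumerate every combination of levels for a tagged string.

    Each token is (word form, lexeme, detailed POS, rough POS).
    """
    # dict.fromkeys removes repeated levels (e.g. word form == lexeme)
    # while keeping their order.
    slots = [list(dict.fromkeys(token)) for token in tokens]
    return set(product(*slots))

grams = hybrid_ngrams([
    ("enjoyed", "enjoy", "VVd", "V"),
    ("hiking", "hike", "VVg", "V"),
])

print(len(grams))                 # 4 levels x 4 levels = 16 combinations
print(("enjoy", "VVg") in grams)  # the 'enjoy + VVg' n-gram from the slide
```

The count grows as 4^n with string length n, which is one reason the later steps prune and filter the knowledgebase side.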

Page 7: A method for unsupervised broad-coverage lexical error detection and correction

Two components:

INPUT: user-produced string (‘on my opinion’)

1. Target Language Knowledgebase: hybrid n-grams extracted from BNC

2. Edit Distance Algorithm: compares user’s string & hybrid n-grams

Error Detection/Correction

Page 8: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

1. Generate all hybrid n-grams from the learner input string (Set C)

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

   b. Prune Set S using filter factor or coverage

   We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.

3. Rank candidates by weighted edit distance between members of C and S

Page 9: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

1. Generate all hybrid n-grams from the learner input string (Set C)

Input from learner: enjoyed hiking

Set C = hybrid n-grams generated from learner string:

- enjoyed + V
- enjoy + V
- enjoyed + VVg
- enjoy + VVg
- VVd + VVg
- enjoyed + hike
- enjoy + hike
- V + hiking
- etc.

Page 10: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

1. Generate all hybrid n-grams from the learner input string (Set C)

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

   b. Prune Set S using filter factor or coverage

   c. Eliminate n-grams under frequency threshold

   We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.

3. Calculate weighted edit distance between members of C and S
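Steps 2a and 2c can be sketched as a lookup over a frequency table. Everything here is an assumption for illustration: the toy `KB` counts, the `min_freq` value, the `CONTENT_CLASSES` labels, and the reading of "derivable from content words" as "contains a word form or lexeme of a content word from the learner string". Pruning (step 2b) is left out.

```python
# Hypothetical rough-POS labels treated as content-word classes.
CONTENT_CLASSES = {"V", "N", "ADJ", "ADV"}

def candidate_set(tokens, knowledgebase, min_freq=5):
    """Collect Set S: same-length, frequent n-grams sharing a content word."""
    anchors = set()
    for word, lexeme, pos, rough in tokens:
        if rough in CONTENT_CLASSES:
            anchors.update((word, lexeme))
    return {
        gram
        for gram, freq in knowledgebase.items()
        if len(gram) == len(tokens)        # substitution only: same length
        and freq >= min_freq               # step 2c: frequency threshold
        and anchors.intersection(gram)     # step 2a: shares a content word
    }

# Toy knowledgebase: hybrid n-gram -> token count (made-up numbers).
KB = {
    ("enjoy", "VVg"): 80,
    ("enjoy", "V"): 100,
    ("enjoy", "hiking"): 3,
    ("like", "VVg"): 50,
    ("hope", "to", "VVi"): 60,
}

learner = [("enjoyed", "enjoy", "VVd", "V"), ("hiking", "hike", "VVg", "V")]
set_s = candidate_set(learner, KB)
print(sorted(set_s))
```

Here ‘enjoy + hiking’ is dropped by the frequency threshold, ‘like + VVg’ shares no content word with the learner string, and the three-item n-gram is excluded by the same-length restriction.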

Page 11: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

   b. Prune Set S using filter factor or coverage

   c. Eliminate n-grams under frequency threshold

Page 13: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

[Diagram: the lexemes ‘enjoy’ and ‘hike’ from the learner string ‘enjoyed hiking’; Target Knowledgebase Hybrid N-grams; Set S]

Set C = hybrid n-grams generated from learner string ‘enjoyed hiking’: enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.

Page 14: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

[Diagram: the four levels of each item in ‘enjoyed hiking’ (enjoyed, enjoy, VVd, V; hiking, hike, VVg, V); Target Knowledgebase Hybrid N-grams; Set S]

Set C = hybrid n-grams generated from learner string ‘enjoyed hiking’: enjoyed + V; enjoy + V; enjoyed + VVg; enjoy + VVg; VVd + VVg; enjoyed + hike; enjoy + hike; V + hiking; etc.

Page 15: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

[Diagram: the four levels of each item in ‘enjoyed hiking’ (enjoyed, enjoy, VVd, V; hiking, hike, VVg, V); Target Knowledgebase Hybrid N-grams; Set S]

Page 16: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

[Diagram: ‘enjoy’ together with the four levels of the second item (hiking, hike, VVg, V); Target Knowledgebase Hybrid N-grams; Set S]

Page 17: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

[Diagram: ‘hike’ together with the four levels of the first item (enjoyed, enjoy, VVd, V); Target Knowledgebase Hybrid N-grams; Set S]

Page 18: A method for unsupervised broad-coverage lexical error detection and correction

Pruning Set S of Candidates

- enjoy + V: 100 tokens (X, pruned)
- enjoy + VVg: 80 tokens

We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.

Page 19: A method for unsupervised broad-coverage lexical error detection and correction

Pruning Set S of Candidates

- enjoy + VVg: 80 tokens

We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.

Pruning of the knowledgebase will affect error recall.

The remaining Set S is filtered for frequency of member hybrid n-grams.
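The 80% rule can be sketched as follows. The `BROADER` table (which item generalizes which) and the ‘hope’ counts are hypothetical; the ‘enjoy’ counts are the ones on the slide. Integer arithmetic is used for the percentage check to avoid floating-point edge cases at exactly 80%.

```python
# Hypothetical generalization table: item -> items that subsume it.
BROADER = {"VVg": {"V"}, "VVd": {"V"}, "VVi": {"V"}}

def subsumes(general, specific):
    """True if `general` equals `specific` except for broader items in some slots."""
    if general == specific or len(general) != len(specific):
        return False
    return all(g == s or g in BROADER.get(s, set())
               for g, s in zip(general, specific))

def prune(freqs, threshold_pct=80):
    """Drop a subsuming n-gram when one subsumed n-gram covers >= threshold_pct of it."""
    pruned = {
        general
        for general, g_count in freqs.items()
        for specific, s_count in freqs.items()
        if subsumes(general, specific) and s_count * 100 >= threshold_pct * g_count
    }
    return {gram: count for gram, count in freqs.items() if gram not in pruned}

freqs = {
    ("enjoy", "V"): 100,   # from the slide
    ("enjoy", "VVg"): 80,  # from the slide
    ("hope", "V"): 100,    # made-up counts for contrast
    ("hope", "VVg"): 2,
}
print(prune(freqs))
```

With these counts ‘enjoy + V’ is removed because ‘enjoy + VVg’ accounts for 80 of its 100 tokens, while ‘hope + V’ survives because ‘hope + VVg’ covers only 2%.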

Page 20: A method for unsupervised broad-coverage lexical error detection and correction

Edit Distance Component: Steps in measuring edit distance

1. Generate all hybrid n-grams from the learner input string (Set C)

2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)

   b. Prune Set S using filter factor or coverage

   We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.

3. Rank candidates by weighted edit distance between members of C and S

Page 21: A method for unsupervised broad-coverage lexical error detection and correction

Weighting of Edit Distance: ‘enjoyed to hike’

Learner string: enjoyed to hike

Generate Set C of hybrid n-grams: enjoy VVt; enjoy VV to hike; VVd to hike; etc.

Generate Set S of hybrid n-grams: enjoyed hiking; enjoyed hike; enjoy VVg; VVd hiking; V hiking; VVd hike; enjoy VVg; enjoy learning

Distance = 1: string c and string s are identical but for one slot.

Correction candidates are those with a distance of 1 or lower.

Ranking of candidates with distance = 1 from learner string:

- Differing element = same lexeme but different word form is closer than different lexeme

- Differing element = same rough POS but different detailed POS is closer than different rough POS
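A minimal sketch of this ranking, under loudly stated assumptions: the numeric weights, the `LEXEME_OF` and `ROUGH_OF` lookup tables, the typed `(kind, value)` item format, and the example ‘will ran’ (chosen because it needs substitution only) are all hypothetical. The slides give only the ordering of closeness, and list learned weights as future work. Comparing a candidate directly against the levels of each learner token is equivalent to taking the minimum slot-difference over all members of Set C when C contains every level combination.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    lexeme: str
    pos: str
    rough: str

# Hypothetical lookup tables and weights (not from the slides).
LEXEME_OF = {"run": "run", "be": "be", "go": "go", "will": "will", "must": "must"}
ROUGH_OF = {"VVi": "V", "VVd": "V", "VVg": "V"}
SAME_LEXEME, SAME_ROUGH_POS, UNRELATED = 0.25, 0.5, 1.0

def slot_cost(token, item):
    """Cost of substituting one candidate item for one learner token."""
    kind, value = item                      # kind: word / lexeme / pos / rough
    if getattr(token, kind) == value:
        return 0.0                          # slot already matches
    if kind == "word" and LEXEME_OF.get(value) == token.lexeme:
        return SAME_LEXEME                  # same lexeme, different word form
    if kind == "pos" and ROUGH_OF.get(value) == token.rough:
        return SAME_ROUGH_POS               # same rough POS, different detailed POS
    return UNRELATED

def rank(tokens, set_s):
    """Keep candidates differing in at most one slot, closest first."""
    scored = []
    for cand in set_s:
        if len(cand) != len(tokens):
            continue                        # substitution only
        costs = [slot_cost(t, it) for t, it in zip(tokens, cand)]
        if sum(c > 0 for c in costs) <= 1:  # distance of 1 or lower
            scored.append((sum(costs), cand))
    return [cand for _, cand in sorted(scored)]

learner = [Token("will", "will", "VM0", "V"), Token("ran", "run", "VVd", "V")]
set_s = [
    (("word", "must"), ("word", "go")),   # two differing slots: excluded
    (("word", "will"), ("word", "be")),   # different lexeme
    (("word", "will"), ("pos", "VVi")),   # same rough POS
    (("word", "will"), ("word", "run")),  # same lexeme
]
ranked = rank(learner, set_s)
for cand in ranked:
    print(cand)
```

With these weights ‘will run’ ranks first, then ‘will + VVi’, then ‘will be’; ‘must go’ is dropped because it differs in two slots.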

Page 22: A method for unsupervised broad-coverage lexical error detection and correction

Examples 1

C-selection
• Enjoy to swim > enjoy swimming
• Enjoy to shop > enjoy shopping
• Enjoy to canoe > enjoy canoeing
• Enjoy to learn > *need to learn; ?want to learn; enjoy learning
• Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
• Hope finding > hope to find
• Let us to know > let us know
• Get used to say > *get used to; *have used to say

Collocation with C-selection
• Spend time to fix > spend time fixing; take time to fix
• Take time fixing > take time to fix
• Take time recuperating > take time to recuperate
• Spend time to recuperate > spend time recuperating; take time to recuperate

Page 23: A method for unsupervised broad-coverage lexical error detection and correction

Examples 2

Preposition

Fixed expressions:
• On the outset > at the outset
• In different reasons > for different reasons
• In that time > at that time; by that time
• On that time > at that time; by that time
• On my opinion > in my opinion
• In my point of view > from my point of view
• I am interested of > I am interested in
• She is interested of > she is interested in
• I am interesting in > I am interested in
• She is interesting in > she is interested in
• Just on the time when > just at the time when; *just to the time when

Page 24: A method for unsupervised broad-coverage lexical error detection and correction

Examples 3

Preposition/Particle:

Verb + preposition (particle)
• Discuss to each other > *discussing to each other (should be discuss WITH each other)
• Discuss this to them > discuss this with them
• Waited to her > waited for her
• Waited to them > waited for them

Noun + preposition
• His admiration to > his admiration for
• His accomplishment on > *no suggestion
• The opposite side to > the opposite side of
• A crisis on > a crisis of; a crisis in
• A crisis on his work > a crisis of his work (*a crisis on his work)

Page 25: A method for unsupervised broad-coverage lexical error detection and correction

Examples 4

Content Word Choice
• Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
• Frame of mood > ??change of mood; frame of mind; *frame of reference

Page 26: A method for unsupervised broad-coverage lexical error detection and correction

Examples 5

Morpho-syntactic
• She will ran > she will run
• She will runs > she will run

Pronoun case:
• What made she change > *what made she change (no correction; should be made HER change)

Noun countability or number errors:
• In modern time > in modern times

Number agreement in head noun and determiner
• Too much people > too many people
• So much things > so many things
• So many thing > so many things
• One of the man > one of the men
• One of the problem > one of the problems
• In my opinions > in my opinion
• A lot of problem > a lot of problems

Complementizer selection:
• I wonder that > I wonder if; I wonder whether

Page 27: A method for unsupervised broad-coverage lexical error detection and correction

Future Work

• Improving POS tagging using 2nd order model

• Machine learning of weighting for the various features determining edit distance

• Incorporation of this into our IWiLL online writing environment

• Incorporate MI for the knowledgebase’s hybrid n-grams

Page 28: A method for unsupervised broad-coverage lexical error detection and correction

Thank you