Taking the Kitchen Sink Seriously:
An Ensemble Approach to Word Sense Disambiguation from
Christopher Manning et al.
Overview
● 23 student WSD projects combined in a 2-layer voting scheme (an ensemble of ensemble classifiers).
● Performed well on SENSEVAL-2: 4th place out of 21 supervised systems on the English Lexical Sample task.
● Offers some valuable lessons for both WSD and ensemble methods in general.
System Overview
● 23 different "1st order" classifiers.
– Independently developed WSD systems.
– Use a variety of algorithms (naïve Bayes, n-gram, etc.).
● These 1st order classifiers combined into a variety of 2nd order classifiers/voting mechanisms.
– 2nd order classifiers vary with respect to:
● Algorithms used to combine 1st order classifiers.
● Number of voters: each takes the top k 1st order classifiers, where k is one of {1, 3, 5, 7, 9, 11, 13, 15}.
Voting Algorithms
● Majority vote (each vote has weight 1).
● Weighted voting, with weights determined by EM.
– Tries to choose weights that maximize the likelihood of 2nd order training instances, where the probability of a sense (given the votes) is defined as the sum of weighted votes for that sense.
● Maximum entropy using features derived from the votes of the 1st order classifiers.
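The first two voting schemes can be sketched in a few lines. The following is a minimal, runnable illustration, not the paper's code: the function names and data layout are assumptions, and the EM sketch implements the stated model that P(sense | votes) is the sum of the weights of the voters choosing that sense, with weights summing to 1.

```python
from collections import Counter

def majority_vote(votes):
    """Unweighted vote: return the sense chosen by the most voters."""
    return Counter(votes).most_common(1)[0][0]

def em_vote_weights(all_votes, truths, n_iters=50):
    """Estimate voter weights by EM (a sketch of the scheme above).

    Model: P(sense | votes) = sum of the weights of the voters that
    chose that sense, with weights summing to 1.
    E-step: each voter that picked the true sense gets credited with
    its share of that probability.  M-step: renormalize the credit
    into new weights.  all_votes[t] holds every voter's sense label
    for training instance t.
    """
    n = len(all_votes[0])
    w = [1.0 / n] * n
    for _ in range(n_iters):
        credit = [0.0] * n
        for votes, y in zip(all_votes, truths):
            p = sum(w[i] for i in range(n) if votes[i] == y)
            if p == 0.0:          # no voter got this instance right
                continue
            for i in range(n):
                if votes[i] == y:
                    credit[i] += w[i] / p
        total = sum(credit)
        if total == 0.0:
            break
        w = [c / total for c in credit]
    return w
```

On data where one voter is consistently right and another is usually wrong, EM shifts nearly all the weight onto the reliable voter, which is the intended effect of likelihood maximization here.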
Classifier Construction Process
● For each word:
– Train each 1st order classifier on ¾ of the training data
– Use the remaining ¼ to rank the performance of the 1st order classifiers
– For each 2nd order classifier:
● Take the top k 1st order classifiers for this word
● Train the 2nd order on ¾ of the training data using this ensemble
– Rank the performance of the 2nd order classifiers with the remaining ¼ of the training data
– Take the top 2nd order as the classifier for this word. Retrain on all the training data.
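The per-word split-and-rank steps above can be sketched as follows. This is an illustrative skeleton under assumed interfaces: `fit`/`predict` are hypothetical method names for the student systems, and the exact ¾/¼ split procedure is not specified in the slides.

```python
import random

def three_quarter_split(data, seed=0):
    """Shuffle, then use the first 3/4 for training and the last 1/4
    for ranking (the split procedure itself is an assumption)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = (3 * len(data)) // 4
    return data[:cut], data[cut:]

def rank_by_held_out_accuracy(classifiers, train, held_out):
    """Train each classifier on `train`, score it on `held_out`
    (pairs of instance and gold sense), return them sorted best-first."""
    def accuracy(clf):
        clf.fit(train)
        return sum(clf.predict(x) == y for x, y in held_out) / len(held_out)
    return sorted(classifiers, key=accuracy, reverse=True)
```

Per word, one would rank the 1st order classifiers this way, build a 2nd order voter over the top k for each k in {1, 3, ..., 15}, rank those 2nd order classifiers with the same held-out quarter, and finally retrain the winner on all the training data.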
Results
● 61.7% accuracy in SENSEVAL-2 competition (4th place).
● After competition, improved performance:
– Used global performance (i.e., over all words) as a tie breaker for the rankings of both 1st and 2nd order classifiers.
– Improved accuracy to 63.9% (would have been 2nd).
Results for 2nd Order Classifiers
● Results are averaged over all words.
● Note MaxEnt's ability to resist dilution.
Evaluating Effects of Combination
● We want different classifiers to make different mistakes.
● We can measure this differentiation as the average, over all pairs of 1st order classifiers, of the fraction of errors that are shared: the fewer errors a pair shares, the greater the error independence.
● When error independence and word difficulty grow, the advantage of combination grows.
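One plausible formalization of the pairwise measure above is the overlap of each pair's error sets, averaged over all pairs; the exact definition used in the paper is not given on the slide, so the intersection-over-union form here is an assumption.

```python
from itertools import combinations

def avg_shared_error_fraction(error_sets):
    """Average, over all pairs of classifiers, of the fraction of
    errors the pair shares (|intersection| / |union| of their error
    sets).  Lower values mean more independent mistakes, which is
    what makes combination pay off."""
    fracs = []
    for a, b in combinations(error_sets, 2):
        union = a | b
        if union:
            fracs.append(len(a & b) / len(union))
    return sum(fracs) / len(fracs) if fracs else 0.0
```

For example, two classifiers whose error sets are {1, 2, 3} and {3, 4} share one of four distinct errors, giving 0.25.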
Lessons for WSD
● Every word is a separate problem.
– Every 1st and 2nd order classifier was the best performer on at least some words.
● Implementation details:
– Large or small window sizes work better than medium window sizes.
– This suggests that senses are determined on both a very local, collocational level and a very general, topical level.
– Smoothing is very important.
Lessons for Ensemble Methods
● Variety within the ensemble is desirable.
– Qualitatively different approaches are better than minor perturbations in similar approaches.
– We can measure the extent to which this ideal is achieved.
● Variety in combination algorithms helps as well.
– In particular, it can help with overfitting (because different algorithms will start overtraining at different points).