
Page 1:

Automatic acquisition for low frequency lexical items

Nuria Bel, Sergio Espeja, Montserrat Marimon

Page 2:

Lexical acquisition

• Induction of information about the linguistic properties of lexical items on the basis of their occurrences in texts.

• Historically, two lines of research:

– Induction of patterns of behavior that predict classes and their members.

– Linguistically justified patterns are sought as cues for classifying words into predefined classes.

Page 3:

The problem

• Both approaches use as much data as possible and run into problems with low frequency items (patterns or words).

• Brent (1993): a pattern that is clearly more frequent than others cannot be due to chance (noise).

• According to Zipf’s law (1935), most of the words in a corpus of any length will occur only a few times, often just once.

• Moreover, informative patterns have a very unbalanced distribution; for instance, in a corpus of more than 3M words:

– “applicable” occurs 440 times, 37% of them with “to”

– “favorable” occurs 60 times, 5% of them with “to”

Page 4:

State of the art

• Trying to eliminate noise with preprocessed text (Briscoe and Carroll, 1997; Korhonen, 2002) and different statistical filters.

• Using linguistic generalizations to obtain better distributed cues (Chesley and Salmon-Alt, 2006; Preiss et al., 2007).

• These approaches improve precision (approx. 80%) but not recall (approx. 50%).

Page 5:

State of the art (2)

• Using Levin classes, which are based on distributional differences taken as cues, Merlo and Stevenson (2001) and Joanis et al. (2007) obtained about 70% accuracy with a Decision Tree.

• Korhonen (2002) used WordNet classes to obtain probabilities for smoothing her statistical filters and obtained 71.2% recall.

• Linguistic information helps because it is independent of frequency information.

• But the cues are uncertain: cues may be optional (their absence, or silence, is not reliable negative evidence), cues may be shared by two classes, …

Page 6:

Our proposal: Classes, features and cues

To compensate for the lack of occurrences with linguistic knowledge:

• Lexical classes are based on particular properties, i.e. grammatical features, each having one or more cues.

• We have built a Bayesian classifier for each grammatical feature instead of for each class.

• We use probabilistic information obtained from the linguistic definition of the classes, instead of from data: it is therefore not affected by frequency phenomena or unbalanced classes.

Page 7:

A probabilistic version of lexical classes

If linguistic classes are based on different grammatical features (typed grammars), and each feature can be related to different contexts of occurrence (cues), then we can predict the contexts in which a noun of a particular class will appear without looking at the data.

Page 8:

A probabilistic version of lexical classes

P(count|trans) = 6/6

P(mass|trans) = 3/6
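Figures like the ones above can be read as likelihoods derived from the type system itself: of the lexical types compatible with a feature, how many license a given cue. A minimal sketch of that reading, with a hypothetical type inventory and cue names (not the actual Spanish Resource Grammar types):

```python
# Hypothetical inventory of noun types: each type is defined by a set of
# grammatical features and the set of cue contexts its definition licenses.
# Names and contents are illustrative placeholders, not the SRG types.
LEXICAL_TYPES = {
    "n_trans_count":   ({"trans", "count"},         {"det_sg", "bare_pl", "de_comp"}),
    "n_trans_mass":    ({"trans", "mass", "count"}, {"bare_sg", "de_comp"}),
    "n_intrans_count": ({"intrans", "count"},       {"det_sg", "bare_pl"}),
    "n_intrans_mass":  ({"intrans", "mass"},        {"bare_sg"}),
}

def cue_likelihood(cue, feature):
    """P(cue | feature): share of the types carrying `feature` that license `cue`."""
    with_feature = [entry for entry in LEXICAL_TYPES.values() if feature in entry[0]]
    if not with_feature:
        return 0.0
    return sum(cue in cues for _, cues in with_feature) / len(with_feature)

print(cue_likelihood("de_comp", "trans"))  # 2/2 = 1.0 in this toy inventory
print(cue_likelihood("bare_sg", "count"))  # 1/3 in this toy inventory
```

Because these numbers come from the type definitions rather than from corpus counts, they remain available even for a word that is seen only once.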

Page 9:

A probabilistic version of lexical classes

• Linguistic classes provide us with the likelihood information, P(v|sf), which normally has to be calculated from sample data and is therefore affected by Zipfian frequency phenomena.

• We can tune this likelihood according to known characteristics of the cues, e.g. the uncertainty of silence (an optional cue may simply not show up in a handful of occurrences).
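One possible way to implement such tuning (the damping scheme and the constant below are assumptions for illustration, not values taken from the slides) is to soften the negative evidence contributed by an absent cue:

```python
def cue_evidence(p_cue_given_sf, present, silence_weight=0.3):
    """Contribution of one cue to P(v | sf).

    A present cue contributes its full likelihood. An absent cue ("silence")
    contributes only damped negative evidence, since an optional cue may simply
    not have appeared in the few occurrences available. `silence_weight` is a
    hypothetical tuning constant, not a value from the experiments.
    """
    if present:
        return p_cue_given_sf
    # Interpolate between ignoring the absence (1.0) and full negative evidence.
    return (1.0 - silence_weight) + silence_weight * (1.0 - p_cue_given_sf)
```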

Page 10:

Assigning features to words (Anderson, 1991; Xu and Tenenbaum, 2007):

Z: σ → SF, where σ is a word’s signature, the set of its occurrences {v_1, v_2, ..., v_z}, in a given corpus.

Z'(sf_{i,x} | σ) = ∏_{j=1}^{z} P(sf_{i,x} | v_j)

P(sf_{i,x} | v_j) = P(v_j | sf_{i,x}) · P(sf_{i,x}) / Σ_k P(v_j | sf_{i,k}) · P(sf_{i,k})

P(v_j | sf_{i,x}) = ∏_{l=1}^{m} P(c_l | sf_{i,x})

Z(sf_i, x) = yes if Z'(sf_{i,x} = yes | σ) > Z'(sf_{i,x} = no | σ), and no otherwise.
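A minimal sketch of this decision rule, assuming each occurrence v_j is represented as a vector of binary cues c_l, and with the likelihoods P(c_l | sf) and priors given as illustrative placeholder values (in the proposal they come from the linguistic class definitions rather than from data):

```python
import math

# Illustrative likelihoods P(cue present | sf = yes/no). The cue names and
# values are placeholders, not the ones used in the experiments.
LIKELIHOOD = {
    ("de_comp", "yes"): 0.60, ("de_comp", "no"): 0.05,
    ("bare_pl", "yes"): 0.40, ("bare_pl", "no"): 0.30,
}
PRIOR = {"yes": 0.5, "no": 0.5}  # P(sf); uniform placeholder

def posterior(value, occurrence):
    """P(sf = value | v_j), up to the normalising constant shared by yes/no."""
    p = PRIOR[value]
    for cue, present in occurrence.items():
        p_cue = LIKELIHOOD[(cue, value)]
        # Absence of a cue counts as (1 - P(cue | sf)); the slides note that
        # this "silence" evidence can be further tuned (see the sketch above).
        p *= p_cue if present else (1.0 - p_cue)
    return p

def assign_feature(signature):
    """Z(sf, x): combine all occurrences v_1..v_z of word x and decide yes/no."""
    log_score = {"yes": 0.0, "no": 0.0}
    for v in signature:                     # Z'(sf | sigma) = prod_j P(sf | v_j)
        norm = posterior("yes", v) + posterior("no", v)
        for value in log_score:
            log_score[value] += math.log(posterior(value, v) / norm)
    return "yes" if log_score["yes"] > log_score["no"] else "no"

# A noun seen only once, with a "de" complement but no bare plural in that occurrence:
print(assign_feature([{"de_comp": True, "bare_pl": False}]))  # -> "yes"
```

Because the likelihoods are not estimated from corpus counts, the same decision can be made for a noun that occurs only once, which is exactly the low-frequency case targeted here.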

Page 11:

Evaluation

• CT-IULA corpus, approx. 1M PoS-tagged words.

• The gold standard was the manually encoded lexicon of the Spanish Resource Grammar.

• Baseline: a majority-class classifier, as computed from the gold-standard lexicon.

• Same materials as used in Bel et al. (2007), in order to compare with a Decision Tree (Weka; Witten and Frank, 2005).

Page 12:

Results

Test set of 50 randomly chosen Spanish nouns occurring just once in the corpus.

Accuracy:
           Baseline     Z
trans        0.58      0.88
intrans      0.56      0.80
mass         0.66      0.72
pcomp        0.90      0.84
count        1.00      0.98
Total        0.74      0.84

Page 13:

Compared with DT (Bel et al. 2007)

                  DT                         Z
           Prec.  Rec.   F1           Prec.  Rec.   F1

50 nouns occurring just once:
trans      0.75   0.46   0.57         0.94   0.76   0.84
intrans    0.85   0.95   0.89         0.78   0.89   0.83
mass       0.50   0.16   0.25         0.71   0.29   0.41
pcomp      0.00   0.00   0.00         0.28   0.40   0.33
count      0.97   1.00   0.98         1.00   0.98   0.98

289 randomly chosen Spanish nouns:
trans      0.73   0.45   0.55         0.77   0.37   0.50
intrans    0.84   0.94   0.89         0.48   0.67   0.56
mass       0.40   0.26   0.31         0.62   0.20   0.30
pcomp      0.40   0.08   0.13         0.44   0.33   0.37
count      0.97   0.99   0.98         0.89   0.96   0.92

Page 14:

Conclusions

Our general conclusion, based on these experiments, is that linguistic knowledge, obtained by abstraction and generalization, can be used in conjunction with powerful probabilistic methods to overcome the unbalanced distribution of linguistic data in particular, and to support the acquisition of lexical information in general.