1
Noun Homograph Disambiguation Using Local Context in Large Text Corpora
Marti A. Hearst
Presented by: Heng Ji, Mar. 29, 2004
2
Outline
Introduction
Motivations of Algorithm
Feature Selection
Crucial Problem and Detail Algorithm
Experiment Results
Conclusions & Discussions
3
Introduction
What is a homograph? One of two or more words spelled alike but different in meaning.
What is noun homograph disambiguation? Determining which of a set of pre-determined senses should be assigned to that noun.
Why is noun homograph disambiguation useful?
4
Noun Compound Interpretation
5
Noun Compound Interpretation
Improve Information Retrieval Results
[Diagram: retrieval results labeled with the "ORG" sense vs. the "stick" sense]
6
Extend key words?
[Diagram: expanded query results labeled "ORG" and "stick"]
7
How to do it? -- Motivations
Intuition 1: Humans can identify word sense from local context
Intuition 2: Humans' identification ability comes from familiarity with frequent contexts
Intuition 3: Different senses can be distinguished by:
-- different high-frequency contexts
-- different syntactic, orthographic, or lexical features
Combining Intuitions 1, 2, 3: Similar-sense terms will tend to have similar contexts!
8
Feature Selection
Principles: Selective & General
Example: "bank"
"Numerous residences, banks, and libraries" -> parallel buildings
"They use holes in trees, banks, or rocks for nests" -> parallel natural objects
"are found on the west bank of the Nile" -> ["direction"] bank of the "proper name"
"Headed the Chase Manhattan Bank in New York" -> Name + Capitalization
Neighboring words alone are not enough -- syntactic information is needed!
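As a rough illustration of how such cues might be pulled from a sentence, here is a minimal sketch; it is not Hearst's actual feature extractor, and the feature names and cues are assumptions:

```python
# Sketch of context-feature extraction for a target noun, assuming a
# whitespace-tokenized sentence; feature names are illustrative.
def extract_features(tokens, target_index):
    features = []
    target = tokens[target_index]
    # Orthographic cue: capitalization often signals a proper-name sense
    # ("Chase Manhattan Bank" vs. "river bank").
    if target[0].isupper():
        features.append("capitalized")
    # Immediate lexical context on each side of the target noun.
    if target_index > 0:
        features.append("prev=" + tokens[target_index - 1].lower())
    if target_index + 1 < len(tokens):
        features.append("next=" + tokens[target_index + 1].lower())
    # Shallow syntactic cue: "bank of the ..." prepositional attachment.
    if tokens[target_index + 1:target_index + 2] == ["of"]:
        features.append("followed-by-of")
    return features

print(extract_features("west bank of the Nile".split(), 1))
# -> ['prev=west', 'next=of', 'followed-by-of']
```

A real extractor would work over phrase-segmented, POS-tagged input, as the algorithm slide below describes.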
9
Feature Set
10
Crucial Problem: need large annotated data?
Problem: Cost of manual tagging is high The size of corpus is usually large Statistics vary a great deal across different domains Automating the tagging of the training corpus will result in “Circularity
problem” ( Dagan and Itai, 1994) Solution: Construct the training corpus incrementally An initial model M1, is trained using small corpus C1 M1 is used to disambiguate the rest of ambiguous words All words that can be disambiguated with strong confidence will be combined
with C1 to form C2 M2 is trained using C2; and repeat.
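The incremental construction can be sketched as a generic self-training loop. This is a minimal sketch, not the paper's implementation: `train_model` and `classify_with_confidence` are hypothetical stand-ins, and the confidence threshold is illustrative.

```python
# Bootstrapping sketch: grow the labeled corpus C1 -> C2 -> ... by accepting
# only confidently classified sentences at each round.
def bootstrap(labeled_seed, unlabeled, train_model, classify_with_confidence,
              threshold=0.75, max_rounds=5):
    corpus = list(labeled_seed)          # C1: small hand-labeled corpus
    for _ in range(max_rounds):
        model = train_model(corpus)      # Mk trained on Ck
        newly_labeled = []
        still_unlabeled = []
        for sentence in unlabeled:
            sense, confidence = classify_with_confidence(model, sentence)
            if confidence >= threshold:  # strong evidence: accept the label
                newly_labeled.append((sentence, sense))
            else:
                still_unlabeled.append(sentence)
        if not newly_labeled:            # nothing new to add: stop iterating
            break
        corpus.extend(newly_labeled)     # Ck+1 = Ck + confident labels
        unlabeled = still_unlabeled
    return train_model(corpus)
```

Because only high-confidence labels are recycled, the loop avoids amplifying its own early mistakes, which is how it sidesteps the circularity problem noted above.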
11
Algorithm
[Flowchart]
Training:
Manually label a small set of samples
Record context features
Test:
Input (segmented into phrases & POS tagged)
Check context features of the target noun
Compare evidence
Choose the sense with the most evidence
Output
Samples with high comparative evidence are fed back into training
12
Comparative Evidence
Definition: choose Max(CE_i), where:
CE_i = E_i / (E_1 + E_2 + ... + E_n)
E_i = sum over j = 1..m of (2*f_ij - 1)
CE: Comparative Evidence
n: number of senses
m: number of evidence features found in the test sentence
f_ij: frequency with which feature j is recorded in sentences containing sense i
Procedure:
Choose the sense with the maximum comparative evidence
If the largest CE is not larger than the second largest CE by a threshold (the margin), the sentence cannot be classified!
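A minimal sketch of this decision rule follows; the evidence weight (2*f_ij - 1) is a reconstruction of the garbled formula, and the margin threshold value is illustrative, not taken from the paper:

```python
# Sketch of the comparative-evidence decision rule with a margin threshold.
def choose_sense(freqs, threshold=0.2):
    """freqs[i][j] = f_ij: frequency with which evidence feature j
    (among the m features found in the test sentence) co-occurs with sense i."""
    # E_i: total evidence for sense i (weighting is a reconstruction).
    evidence = [sum(2 * f - 1 for f in row) for row in freqs]
    total = sum(evidence)
    if total <= 0:
        return None                       # no usable evidence
    ce = [e / total for e in evidence]    # CE_i = E_i / sum of all E_i
    ranked = sorted(range(len(ce)), key=lambda i: ce[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if ce[best] - ce[runner_up] < threshold:
        return None                       # margin too small: unclassifiable
    return best                           # index of the winning sense
```

For example, with two senses and feature counts `[[3, 2], [1, 0]]`, sense 0 wins comfortably; with equal counts the margin test fires and the sentence is left unclassified.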
13
Experiment Result – “tank”
[Chart: "Results for word "tank"", accuracy (%) vs. training size (20-70), comparing Supervised Learning with Supervised + Unsupervised]
14
Experiment Result – “bank”
[Chart: "Results for word "bank"", accuracy (%) vs. training size (10-50), comparing Supervised Learning with Supervised + Unsupervised]
15
Experiment Result – “bass”
[Chart: "Result for word "bass"", accuracy (%) vs. training size (10-25), comparing Supervised Learning with Supervised + Unsupervised]
16
Experiment Result – “country”
[Chart: "Result for "country"", accuracy (%) vs. training size (10-40), comparing Supervised Learning with Supervised + Unsupervised]
17
Experiment Result – “Record”
[Chart: "Results for "Record" with Supervised Learning", accuracy (%) vs. training size (20-40), comparing Record1 and Record2]
Record1: "archived event" vs. "pinnacle achievement"
Record2: "archived event" vs. "musical disk"
18
Conclusions and Future Work
Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed
Results show the method is successful
Unsupervised learning:
helps to improve accuracy on general words
has limitations on difficult words like "country"
also helps to reduce the amount of manual work
Use of partial syntactic information: richer than common statistical techniques
Proposed Improvements:
Bootstrapping from bilingual corpora
Improve the evidence metric (adjust weights automatically; weight over the entire corpus and each sense; add more feature types)
Integrate WordNet
19
Discussion 1: Initial Training
A good training base needs to be obtained in advance, i.e., initial hand tagging is required. But once training is complete, noun homograph disambiguation is fast.
This initial set is still large (20-30 occurrences for each sense), so the cost of tagging is still high!
20
Discussion 2: Resources
Advantages of an unrestricted corpus over dictionaries: it includes sufficient contextual variety, and unfamiliar words can be integrated automatically
Assumption: the context around an instance of a sense of the homograph is meaningfully related to that sense
Need a semantic lexicon?
"Numerous residences, banks, and libraries" -> parallel buildings
"They use holes in trees, banks, or rocks for nests" -> parallel natural objects
21
References
Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora.
Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's...
Chin (1999). Word Sense Disambiguation Using Statistical Techniques.
Peh and Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet.
Dagan and Itai (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus.