This presentation describes our proposed model for disambiguating polysemous queries, based on the Extended Lesk algorithm.
DISAMBIGUATING POLYSEMOUS QUERIES FOR DOCUMENT RETRIEVAL
Presented by
Varsha V. Joshi (111003016)
Madhusudan R. Dad (111003022)
Yugal S. Bagul (111003023)
Under the guidance of
Dr. Mrs. Y. V. Haribhakta
College of Engineering, Pune
May, 2014
Introduction
● Need for information retrieval
● Growing dependency on digital sources for information retrieval
● Emerging importance of search engines
● Search engines operate on natural language
● Natural language processing is difficult for a machine
● Many problems are involved; WSD is one of them
WSD
● Why WSD in document retrieval?
➢ Words have multiple meanings
➢ Ambiguity decreases the efficiency of a search engine
● What is WSD?
➢ WSD is the process of determining the accurate sense of a word in a sentence when the word has more than one meaning (or sense)
WSD Approaches
● Various approaches have been proposed for disambiguation of words
➢ Supervised vs. unsupervised
➢ They use different databases
● Some examples:
➢ Based on Magnini domains
➢ Using Roget's Thesaurus
➢ Based on WordNet
Learning from the Approaches
● To disambiguate a word:
➢ All the senses of the word should be known
➢ A measure is needed to allocate the proper sense
➢ The definition of the word must be expanded so that the search engine can recognize its meaning
Proposed Model
● Aim: process a polysemous query so that the search engine displays relevant documents as output
● The top retrieved documents should contain highly relevant information
Proposed Model...
● We propose a model in which:
➢ Noise in the query is removed
➢ Words are categorised as polysemous and non-polysemous
➢ Words are marked with their appropriate part of speech
➢ The appropriate sense is allocated to the target word
➢ The query is expanded using the most relevant words in the gloss definition
System Architecture
Tools Used
● WordNet
● Solr Search Engine
● WordNet-Similarity package
● NLTK
WordNet
● What is WordNet?
● Structure of WordNet
● Relationships in WordNet:
➢ Hypernym, hyponym
➢ Meronym and holonym
➢ Coordinate terms
WordNet Example..
● Consider the word 'car'
● WordNet has 5 synsets for the noun car
● One of them: "car, elevator car -- (where passengers ride up and down; 'the car was on the top floor')"
● Relationships:
➢ Vehicle is a hypernym of car, and car is a hyponym of vehicle
➢ Accelerator is a meronym of car, and car is a holonym of accelerator
➢ Car and bike are coordinate terms
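The relations above can be sketched as a tiny lookup table. The data here is an invented fragment for illustration, not real WordNet content:

```python
# Tiny invented fragment of WordNet-style relations (not real data).
HYPERNYM = {"car": "vehicle", "bike": "vehicle"}   # is-a relation
MERONYM = {"car": ["accelerator", "wheel"]}        # has-part relation

def hypernym(word):
    """More general term for a word."""
    return HYPERNYM.get(word)

def holonym(part):
    """Whole(s) that this word is a part of."""
    return [whole for whole, parts in MERONYM.items() if part in parts]

def coordinate_terms(word):
    """Words sharing a hypernym with the given word."""
    h = HYPERNYM.get(word)
    return [w for w, p in HYPERNYM.items() if p == h and w != word]
```

So car and bike come out as coordinate terms because both map to the same hypernym, vehicle.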
Solr Search Engine
WordNet-Similarity and NLTK
● WordNet-Similarity: an open-source package
➢ Provides similarity measures using different methods
● NLTK: a set of libraries and programs for natural language processing
➢ POS tagging
➢ Access to WordNet through NLTK
Lesk Algorithm
● Classical algorithm for WSD
● Consider the example "pine cone"
● PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
● CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
● Pine#1 ∩ cone#3 = 2, the largest gloss overlap, so these senses are chosen
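A minimal sketch of simple Lesk on the pine/cone example. The stop-word list and the crude plural stripper are our own simplifications, not part of the classical algorithm:

```python
# Toy simple-Lesk sketch on the pine/cone glosses from the slide.
STOP = {"of", "or", "a", "to", "which", "with", "this", "whether",
        "through", "away"}

def tokens(gloss):
    """Lowercase the gloss, drop stop words, strip a plural 's'."""
    words = set()
    for w in gloss.lower().split():
        if w in STOP:
            continue
        if w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]          # trees -> tree
        words.add(w)
    return words

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def best_senses(senses1, senses2):
    """Pick the sense pair whose glosses share the most words."""
    return max((len(tokens(g1) & tokens(g2)), s1, s2)
               for s1, g1 in senses1.items()
               for s2, g2 in senses2.items())

overlap, pine_sense, cone_sense = best_senses(PINE, CONE)
```

On this data the winner is pine#1 with cone#3, whose glosses share "evergreen" and "tree".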
Extended Lesk Algorithm
● Proposed by Satanjeev Banerjee and Ted Pedersen
● Based on extended gloss overlap
● Considers the overlap between the glosses of the input synsets and the glosses of the hypernyms, hyponyms, meronyms, holonyms and troponyms of the input synsets
Scoring Mechanism
● For an overlap of n consecutive words, add n² to the score
● Consider the gloss definitions of drawing paper and decal
● Drawing paper: paper that is specially prepared for use in drafting
● Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
Continued...
● Three words overlap: the word "paper" and the two-word phrase "specially prepared"
● Hence the score is 1² + 2² = 1 + 4 = 5
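The n² scoring above can be sketched as follows. Unlike the real Banerjee-Pedersen measure, this toy version does not exclude overlaps made up entirely of function words:

```python
def overlap_score(gloss1, gloss2):
    """Repeatedly find the longest common consecutive word sequence,
    add its length squared to the score, and remove it from both."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    total = 0
    while True:
        best = None  # (length, start index in a, start index in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return total
        k, i, j = best
        total += k * k            # overlap of n words contributes n^2
        del a[i:i + k]
        del b[j:j + k]

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared "
         "paper to a wood or glass or metal surface")
```

Running it on the two glosses finds "specially prepared" (2² = 4) and then "paper" (1² = 1), giving the slide's score of 5.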
Computing Relatedness
● Consider the set of relations RELS = {gloss, hype, hypo}
● The relatedness measure between synsets A and B is computed as:
relatedness(A,B) = score(gloss(A), gloss(B)) + score(hype(A), hype(B)) + score(hypo(A), hypo(B)) + score(hype(A), gloss(B)) + score(gloss(A), hype(B))
● e.g. the relatedness between temple#n#1 and build#v#1 is 33 with extended Lesk, but 0 with simple Lesk
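A sketch of summing scores over the relation pairs in the formula. The synset glosses here are invented stand-ins, and the score function is simplified to shared-word counts rather than squared phrase overlaps:

```python
# Toy relatedness sketch; a synset is a dict of relation -> gloss text.
STOP = {"a", "of", "or", "the"}

def score(g1, g2):
    """Simplified score: number of shared non-stop words."""
    w1 = {w for w in g1.lower().split() if w not in STOP}
    w2 = {w for w in g2.lower().split() if w not in STOP}
    return len(w1 & w2)

# The relation pairs from the formula on this slide.
PAIRS = [("gloss", "gloss"), ("hype", "hype"), ("hypo", "hypo"),
         ("hype", "gloss"), ("gloss", "hype")]

def relatedness(A, B):
    return sum(score(A[r1], B[r2]) for r1, r2 in PAIRS)

# Invented glosses loosely inspired by the temple/build example.
temple = {"gloss": "place of worship",
          "hype": "building structure",
          "hypo": "shrine small place of worship"}
build = {"gloss": "construct a building",
         "hype": "make create",
         "hypo": "erect a structure"}
```

Even with these toy glosses, the cross pair score(hype(temple), gloss(build)) contributes an overlap ("building") that a gloss-only comparison would miss, which is the point of the extended measure.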
Implementation:
● Query Preprocessing
● Polysemy Detection
● Disambiguation
● Query Expansion
● Document Retrieval
● Input query for demonstration:
“Bark of Pine is tough!”
Query Preprocessing
● Query refining
● Compoundify, e.g. White House → White_House
● POS tagging:
[('bark', 'n'), ('of', None), ('pine', 'n'), ('is', 'v'), ('tough', 'a')]
● Stop word removal:
[('bark', 'n'), ('pine', 'n'), ('tough', 'a')]
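The preprocessing steps can be sketched as below. COMPOUNDS, STOP and TAGS are small invented tables standing in for the real resources (WordNet's compound list and NLTK's POS tagger):

```python
# Hypothetical mini-pipeline for the demonstration query.
COMPOUNDS = {("white", "house"): "white_house"}
STOP = {"of", "is", "the", "a"}
TAGS = {"bark": "n", "pine": "n", "tough": "a", "is": "v"}  # stand-in tagger

def compoundify(words):
    """Join adjacent words that form a known compound."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
            out.append(COMPOUNDS[(words[i], words[i + 1])])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

def preprocess(query):
    words = compoundify(query.lower().rstrip("!?.").split())
    tagged = [(w, TAGS.get(w)) for w in words]   # POS tagging
    return [(w, t) for w, t in tagged if w not in STOP]
```

For "Bark of Pine is tough!" this reproduces the tagged, stop-word-filtered output shown above.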
Polysemy Detection
● Simple method: based on the number of synsets, e.g. WordNet lists several senses for "America":
1. (39) United States, United States of America, America, the States, US, U.S., USA, U.S.A. -- (North American republic containing 50 states - 48 conterminous states in North America plus Alaska in northwest North America and the Hawaiian Islands in the Pacific Ocean; achieved independence in 1776)
2. (1) America -- (North America and South America and Central America)
● Our model also uses the relatedness concept for polysemy detection
● Polysemous words in the input query: bark and tough
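A minimal sketch of synset-count-based detection, with invented sense counts (the real counts come from WordNet, and the model additionally applies the relatedness-based check):

```python
# Invented sense counts for illustration only.
SYNSET_COUNT = {"bark": 9, "pine": 1, "tough": 7}

def polysemous(words):
    """Flag words that have more than one sense."""
    return [w for w in words if SYNSET_COUNT.get(w, 0) > 1]
```

With these counts, the query words "bark" and "tough" are flagged while "pine" is not, matching the slide.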
Disambiguation Process
● Construction of a window
window size = 4
bark = {pine, tough}
tough = {bark, pine}
● Each polysemous word is disambiguated using its window words
● Senses labelled: bark#n#01 and tough#a#07
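Window construction can be sketched as:

```python
def context_window(words, target, size=4):
    """Collect up to size//2 content words on each side of the target."""
    i = words.index(target)
    half = size // 2
    left = words[max(0, i - half):i]
    right = words[i + 1:i + 1 + half]
    return set(left + right)
```

For the three content words of the demonstration query this yields bark = {pine, tough} and tough = {bark, pine}, as above.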
Query Expansion
● Use of gloss definitions
● Combine the gloss definitions with the input query:
"bark of pine is tough tough protective covering of the woody stems and roots of trees and other woody plants resistant to cutting or chewing"
● Remove noise and duplicate words
● Cluster based on relatedness
● Output: "bark pine tough roots stems trees covering plants cutting chewing"
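A sketch of the expansion step, appending the gloss of the chosen bark sense to the query while dropping stop words and duplicates (the relatedness-based clustering that trims the final list is omitted here):

```python
STOP = {"of", "the", "and", "or", "is", "a", "other", "to"}

def expand(query_words, gloss):
    """Append gloss words to the query, skipping stop words and duplicates."""
    seen, out = set(), []
    for w in query_words + gloss.lower().split():
        if w not in STOP and w not in seen:
            seen.add(w)
            out.append(w)
    return out

bark_gloss = ("tough protective covering of the woody stems and "
              "roots of trees and other woody plants")
expanded = expand(["bark", "pine", "tough"], bark_gloss)
```

Note how the duplicate "tough" (present in both the query and the gloss) appears only once in the expanded query.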
Document Retrieval
● The expanded query is given as input to the Solr search engine
● Documents with the maximum number of matching words are retrieved first
Testing and Evaluation
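As a toy stand-in for the retrieval step (Solr itself scores with tf-idf/BM25, not a plain overlap count), documents can be ordered by how many expanded-query words they contain:

```python
def rank(docs, query_words):
    """Order documents by descending count of matching query words."""
    q = set(query_words)
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))

docs = ["the bark of a pine tree is tough",
        "my dog has a loud bark",
        "cooking with pine nuts"]
ranked = rank(docs, ["bark", "pine", "tough", "trees", "covering"])
```

The expanded query pushes the document about tree bark above the one about a dog's bark, which is the intended effect of the disambiguation.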
Performance Evaluation
1. Disambiguation Performance
2. Retrieval Performance
● Disambiguation is the centre of the entire model
● Effective retrieval is the result of efficient disambiguation
● Over 75% successful disambiguation
● Table of disambiguated queries
Disambiguation Performance
Retrieval Performance
1. Average performance rating
● The unknown nature of the documents implies performance evaluation over heterogeneous data
● Rating scale: 5 = completely relevant, 3 = moderately relevant, 1 = less relevant, 0 = irrelevant
2. Standard Performance Measures
1. True positive
2. False positive
3. True negative
4. False negative
● Importance of these measures
● Inferences derived from the values
3. Performance over increasing collection size
● Effect of increasing the size of the collection
● Applications of this model require it to perform consistently as the collection size changes
● Behaviour of the results over changing collection size:
● Acceptable Disambiguation performance
● Effective Query expansion
● Refinement in the result in terms of top ranked documents
● Consistency over increasing collection
Conclusion
Issues
● Requires a grammatically well-formed query
● The limitations of the Lesk algorithm cannot be avoided
● Performance over the entire set of retrieved documents
Future Direction
1. Domain assignment
2. Clustering of terms
3. Ranking of documents
Publications
● Submitted a literature survey, 'Disambiguating Polysemous Queries for Document Retrieval', to the International Journal of Engineering, Science and Innovative Technology (IJESIT)
● Designed a model, 'Disambiguating Polysemous Queries for Document Retrieval', which we expect to publish soon
Bibliography
“Homonymy and polysemy in information retrieval” by Robert Krovetz
“Words Polysemy Analysis: Implementation of the Word Sense Disambiguation Algorithm Based On Magnini Domains” by Francis C. Fernandez-Reyes, Exiquio C. Leyva Perez, Rogelio Lau Fernandez
“Unsupervised Word Sense Disambiguation Rivaling Supervised Methods” by David Yarowsky
“Using Bilingual Materials to Develop Word Sense Disambiguation Methods” by William A. Gale , Kenneth W. Church , David Yarowsky
“A Proposal for Word Sense Disambiguation using Conceptual Distance” by Eneko Agirre and German Rigau
"Word Sense Disambiguation Using ID Tags – Identifying Meaning in Polysemous Words in English" by Nikola Dobrić, Alpen-Adria-Universität Klagenfurt
WordNet: http://wordnet.princeton.edu/wordnet/documentation/
“Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora ” by David Yarowsky
Semantic Similarity: http://en.wikipedia.org/wiki/Semantic_similarity
"Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures" by Alexander Budanitsky and Graeme Hirst
Rada R., Mili H., Bicknell E. and Blettner M., "Development and Application of a Metric on Semantic Nets," in IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, 17-30, 1989
"Integrating Subject Field Codes into WordNet" by B. Magnini y G. Cavaglia
”An Extended Analysis of a Method of All Words Sense Disambiguation” by Varada Kolhatkar
"Word sense disambiguation using WordNet and the Lesk algorithm" by Jonas Ekedahl and Koraljka Golub
"Extended Gloss Overlaps as a Measure of Semantic Relatedness" by Ted Pedersen and Satanjeev Banerjee
"A WordNet-based Semantic Similarity Measure Enhanced by Internet-based Knowledge" by Gang Liu, Ruili Wang, Jeremy Buckley, Helen M. Zhou
"Semantic similarity based on corpus statistics and lexical taxonomy" by J. Jiang and D. Conrath
"Using corpus statistics and WordNet relations for sense identification" by C. Leacock, M. Chodorow, and G. Miller
"Verb semantics and lexical selection" by Z. Wu and M. Palmer
"Lexical chains as representations of context for the detection and correction of malapropisms" by G. Hirst and D. St-Onge
P. Resnik, WordNet and class-based probabilities. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 239–263. MIT Press, 1998
Dekang Lin. Automatic retrieval and clustering of similar words. In COLING-ACL, pages 768–774, 1998
Miller, G: Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4) (1990)
WordNet: http://en.wikipedia.org/wiki/Wordnet
P. D. Turney, "Similarity of semantic relations," Comput. Linguist., 32(3), pp. 379-416, 2006
Yarowsky, David, "One Sense Per Collocation," in Proceedings, ARPA Human Language Technology Workshop, Princeton, 1993
Yarowsky, David, "Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994