This presentation describes our proposed model for disambiguating polysemous queries, based on the Extended Lesk algorithm.
DISAMBIGUATING POLYSEMOUS QUERIES FOR DOCUMENT RETRIEVAL
Presented by
Varsha V. Joshi (111003016)
Madhusudan R. Dad (111003022)
Yugal S. Bagul (111003023)
Under the guidance of
Dr. Mrs. Y. V. Haribhakta
College of Engineering, Pune
May, 2014
Introduction
● Need for information retrieval
● Growing dependency on digital sources for information retrieval
● Emerging importance of search engines
● Search engines operate on natural language
● Natural language processing is difficult for a machine
● Many problems are involved; WSD is one of them
WSD
● Why WSD in document retrieval?
➢ Words have multiple meanings
➢ Ambiguity decreases the efficiency of a search engine
● What is WSD?
➢ WSD is the process of determining the accurate sense of a word in a sentence when the word has more than one meaning (or sense)
WSD Approaches
● Various approaches have been proposed for disambiguation of words
➢ Supervised vs. unsupervised
➢ They use different databases
● Some examples:
➢ Based on Magnini domains
➢ Using Roget's Thesaurus
➢ Based on WordNet
Learning from the Approaches
● To disambiguate a word:
➢ All the senses of the word should be known
➢ A measure is needed to allocate the proper sense
➢ The definition of the word must be expanded so that the search engine can recognize its meaning
Proposed Model
● Aim: process a polysemous query so that the search engine displays relevant documents as output
● The top retrieved documents should contain highly relevant information
Proposed Model...
● We propose a model in which:
➢ Noise in the query is removed
➢ Words are categorised as polysemous and non-polysemous
➢ Words are marked with their appropriate part of speech
➢ The appropriate sense is allocated to the target word
➢ The query is expanded using the most relevant words in the gloss definition
System Architecture
Tools Used
● WordNet
● Solr Search Engine
● WordNet-Similarity package
● NLTK
WordNet
● What is WordNet?
● Structure of WordNet
● Relationships in WordNet:
➢ Hypernym, hyponym
➢ Meronym and holonym
➢ Coordinate terms
WordNet Example..
● Consider the word 'car'
● WordNet has 5 synsets for the noun car
● One of them: "car, elevator car -- (where passengers ride up and down; 'the car was on the top floor')"
● Relationships:
➢ Vehicle is a hypernym of car, and car is a hyponym of vehicle
➢ Accelerator is a meronym of car, and car is a holonym of accelerator
➢ Car and bike are coordinate terms
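The relations above can be sketched as a tiny lookup table. The data here is an invented fragment for illustration, not real WordNet content:

```python
# Tiny invented fragment of WordNet-style relations (not real data).
HYPERNYM = {"car": "vehicle", "bike": "vehicle"}   # is-a relation
MERONYM = {"car": ["accelerator", "wheel"]}        # has-part relation

def hypernym(word):
    """More general term for a word."""
    return HYPERNYM.get(word)

def holonym(part):
    """Whole(s) that this word is a part of."""
    return [whole for whole, parts in MERONYM.items() if part in parts]

def coordinate_terms(word):
    """Words sharing a hypernym with the given word."""
    h = HYPERNYM.get(word)
    return [w for w, p in HYPERNYM.items() if p == h and w != word]
```

So car and bike come out as coordinate terms because both map to the same hypernym, vehicle.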
Solr Search Engine
WordNet-Similarity and NLTK
● WordNet-Similarity: an open-source package
➢ Provides similarity measures using different methods
● NLTK: a set of libraries and programs for natural language processing
➢ POS tagging
➢ Access to WordNet through NLTK
Lesk Algorithm
● Classical algorithm for WSD
● Consider the example "pine cone"
● PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
● CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
● Pine#1 ∩ cone#3 = 2, the largest gloss overlap, so these senses are chosen
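A minimal sketch of simple Lesk on the pine/cone example. The stop-word list and the crude plural stripper are our own simplifications, not part of the classical algorithm:

```python
# Toy simple-Lesk sketch on the pine/cone glosses from the slide.
STOP = {"of", "or", "a", "to", "which", "with", "this", "whether",
        "through", "away"}

def tokens(gloss):
    """Lowercase the gloss, drop stop words, strip a plural 's'."""
    words = set()
    for w in gloss.lower().split():
        if w in STOP:
            continue
        if w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]          # trees -> tree
        words.add(w)
    return words

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def best_senses(senses1, senses2):
    """Pick the sense pair whose glosses share the most words."""
    return max((len(tokens(g1) & tokens(g2)), s1, s2)
               for s1, g1 in senses1.items()
               for s2, g2 in senses2.items())

overlap, pine_sense, cone_sense = best_senses(PINE, CONE)
```

On this data the winner is pine#1 with cone#3, whose glosses share "evergreen" and "tree".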
Extended Lesk Algorithm
● Proposed by Satanjeev Banerjee and Ted Pedersen
● Based on extended gloss overlap
● Considers the overlap between the glosses of the input synsets and the glosses of the hypernyms, hyponyms, meronyms, holonyms and troponyms of the input synsets
Scoring Mechanism
● For an overlap of n consecutive words, add n² to the score
● Consider the gloss definitions of drawing paper and decal
● Drawing paper: paper that is specially prepared for use in drafting
● Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
Continued...
● Three words overlap: the word "paper" and the two-word phrase "specially prepared"
● Hence the score is 1² + 2² = 1 + 4 = 5
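The n² scoring above can be sketched as follows. Unlike the real Banerjee-Pedersen measure, this toy version does not exclude overlaps made up entirely of function words:

```python
def overlap_score(gloss1, gloss2):
    """Repeatedly find the longest common consecutive word sequence,
    add its length squared to the score, and remove it from both."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    total = 0
    while True:
        best = None  # (length, start index in a, start index in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return total
        k, i, j = best
        total += k * k            # overlap of n words contributes n^2
        del a[i:i + k]
        del b[j:j + k]

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared "
         "paper to a wood or glass or metal surface")
```

Running it on the two glosses finds "specially prepared" (2² = 4) and then "paper" (1² = 1), giving the slide's score of 5.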
Computing Relatedness
● Consider the set of relations RELS = {gloss, hype, hypo}
● The relatedness measure between synsets A and B is computed as:
relatedness(A,B) = score(gloss(A), gloss(B)) + score(hype(A), hype(B)) + score(hypo(A), hypo(B)) + score(hype(A), gloss(B)) + score(gloss(A), hype(B))
● e.g. the relatedness between temple#n#1 and build#v#1 is 33 with extended Lesk, but 0 with simple Lesk
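A sketch of summing scores over the relation pairs in the formula. The synset glosses here are invented stand-ins, and the score function is simplified to shared-word counts rather than squared phrase overlaps:

```python
# Toy relatedness sketch; a synset is a dict of relation -> gloss text.
STOP = {"a", "of", "or", "the"}

def score(g1, g2):
    """Simplified score: number of shared non-stop words."""
    w1 = {w for w in g1.lower().split() if w not in STOP}
    w2 = {w for w in g2.lower().split() if w not in STOP}
    return len(w1 & w2)

# The relation pairs from the formula on this slide.
PAIRS = [("gloss", "gloss"), ("hype", "hype"), ("hypo", "hypo"),
         ("hype", "gloss"), ("gloss", "hype")]

def relatedness(A, B):
    return sum(score(A[r1], B[r2]) for r1, r2 in PAIRS)

# Invented glosses loosely inspired by the temple/build example.
temple = {"gloss": "place of worship",
          "hype": "building structure",
          "hypo": "shrine small place of worship"}
build = {"gloss": "construct a building",
         "hype": "make create",
         "hypo": "erect a structure"}
```

Even with these toy glosses, the cross pair score(hype(temple), gloss(build)) contributes an overlap ("building") that a gloss-only comparison would miss, which is the point of the extended measure.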
Implementation:
● Query Preprocessing
● Polysemy Detection
● Disambiguation
● Query Expansion
● Document Retrieval
● Input query for demonstration:
“Bark of Pine is tough!”
Query Preprocessing
● Query refining
● Compoundify, e.g. White House → White_House
● POS tagging:
[('bark', 'n'), ('of', None), ('pine', 'n'), ('is', 'v'), ('tough', 'a')]
● Stop word removal:
[('bark', 'n'), ('pine', 'n'), ('tough', 'a')]
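The preprocessing steps can be sketched as below. COMPOUNDS, STOP and TAGS are small invented tables standing in for the real resources (WordNet's compound list and NLTK's POS tagger):

```python
# Hypothetical mini-pipeline for the demonstration query.
COMPOUNDS = {("white", "house"): "white_house"}
STOP = {"of", "is", "the", "a"}
TAGS = {"bark": "n", "pine": "n", "tough": "a", "is": "v"}  # stand-in tagger

def compoundify(words):
    """Join adjacent words that form a known compound."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
            out.append(COMPOUNDS[(words[i], words[i + 1])])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

def preprocess(query):
    words = compoundify(query.lower().rstrip("!?.").split())
    tagged = [(w, TAGS.get(w)) for w in words]   # POS tagging
    return [(w, t) for w, t in tagged if w not in STOP]
```

For "Bark of Pine is tough!" this reproduces the tagged, stop-word-filtered output shown above.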
Polysemy Detection
● Simple method: based on the number of synsets, e.g. WordNet lists several senses for "America":
1. (39) United States, United States of America, America, the States, US, U.S., USA, U.S.A. -- (North American republic containing 50 states - 48 conterminous states in North America plus Alaska in northwest North America and the Hawaiian Islands in the Pacific Ocean; achieved independence in 1776)
2. (1) America -- (North America and South America and Central America)
● Our model also uses the relatedness concept for polysemy detection
● Polysemous words in the input query: bark and tough
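A minimal sketch of synset-count-based detection, with invented sense counts (the real counts come from WordNet, and the model additionally applies the relatedness-based check):

```python
# Invented sense counts for illustration only.
SYNSET_COUNT = {"bark": 9, "pine": 1, "tough": 7}

def polysemous(words):
    """Flag words that have more than one sense."""
    return [w for w in words if SYNSET_COUNT.get(w, 0) > 1]
```

With these counts, the query words "bark" and "tough" are flagged while "pine" is not, matching the slide.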
Disambiguation Process
● Construction of a window
window size = 4
bark = {pine, tough}
tough = {bark, pine}
● Each polysemous word is disambiguated using its window words
● Senses labelled: bark#n#01 and tough#a#07
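Window construction can be sketched as:

```python
def context_window(words, target, size=4):
    """Collect up to size//2 content words on each side of the target."""
    i = words.index(target)
    half = size // 2
    left = words[max(0, i - half):i]
    right = words[i + 1:i + 1 + half]
    return set(left + right)
```

For the three content words of the demonstration query this yields bark = {pine, tough} and tough = {bark, pine}, as above.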
Query Expansion
● Use of gloss definitions
● Combine the gloss definitions with the input query:
"bark of pine is tough tough protective covering of the woody stems and roots of trees and other woody plants resistant to cutting or chewing"
● Remove noise and duplicate words
● Cluster based on relatedness
● Output: "bark pine tough roots stems trees covering plants cutting chewing"
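A sketch of the expansion step, appending the gloss of the chosen bark sense to the query while dropping stop words and duplicates (the relatedness-based clustering that trims the final list is omitted here):

```python
STOP = {"of", "the", "and", "or", "is", "a", "other", "to"}

def expand(query_words, gloss):
    """Append gloss words to the query, skipping stop words and duplicates."""
    seen, out = set(), []
    for w in query_words + gloss.lower().split():
        if w not in STOP and w not in seen:
            seen.add(w)
            out.append(w)
    return out

bark_gloss = ("tough protective covering of the woody stems and "
              "roots of trees and other woody plants")
expanded = expand(["bark", "pine", "tough"], bark_gloss)
```

Note how the duplicate "tough" (present in both the query and the gloss) appears only once in the expanded query.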
Document Retrieval
● The expanded query is given as input to the Solr search engine
● Documents with the maximum number of matching words are retrieved first
Testing and Evaluation
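As a toy stand-in for the retrieval step (Solr itself scores with tf-idf/BM25, not a plain overlap count), documents can be ordered by how many expanded-query words they contain:

```python
def rank(docs, query_words):
    """Order documents by descending count of matching query words."""
    q = set(query_words)
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))

docs = ["the bark of a pine tree is tough",
        "my dog has a loud bark",
        "cooking with pine nuts"]
ranked = rank(docs, ["bark", "pine", "tough", "trees", "covering"])
```

The expanded query pushes the document about tree bark above the one about a dog's bark, which is the intended effect of the disambiguation.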
Performance Evaluation
1. Disambiguation Performance
2. Retrieval Performance
● Disambiguation is the centre of the entire model
● Effective retrieval is the result of efficient disambiguation
● Over 75% successful disambiguation
● Table of disambiguated queries
Disambiguation Performance
Retrieval Performance
1. Average performance rating
● The unknown nature of the documents implies performance evaluation over heterogeneous data
● Rating scale: 5 = completely relevant, 3 = moderately relevant, 1 = less relevant, 0 = irrelevant
2. Standard Performance Measures
1. True positive
2. False positive
3. True negative
4. False negative
● Importance of these measures
● Inferences derived from the values
3. Performance over increasing collection size
● Effect of increasing the size of the collection
● Applications of this model require it to perform consistently as the collection size changes
● Behaviour of the results over changing collection size:
● Acceptable Disambiguation performance
● Effective Query expansion
● Refinement in the result in terms of top ranked documents
● Consistency over increasing collection
Conclusion
Issues
● Requires a grammatically well-formed query
● The limitations of the Lesk algorithm cannot be avoided
● Performance over the entire set of retrieved documents
Future Direction
1. Domain assignment
2. Clustering of terms
3. Ranking of documents
Publications
● Submitted a literature survey, 'Disambiguating Polysemous Queries for Document Retrieval', to the International Journal of Engineering, Science and Innovative Technology (IJESIT)
● Designed a model, 'Disambiguating Polysemous Queries for Document Retrieval', which we expect to publish soon
Bibliography
“Homonymy and polysemy in information retrieval” by Robert Krovetz
“Words Polysemy Analysis: Implementation of the Word Sense Disambiguation Algorithm Based On Magnini Domains” by Francis C. Fernandez-Reyes, Exiquio C. Leyva Perez, Rogelio Lau Fernandez
“Unsupervised Word Sense Disambiguation Rivaling Supervised Methods” by David Yarowsky
“Using Bilingual Materials to Develop Word Sense Disambiguation Methods” by William A. Gale , Kenneth W. Church , David Yarowsky
“A Proposal for Word Sense Disambiguation using Conceptual Distance” by Eneko Agirre and German Rigau
"Word Sense Disambiguation Using ID Tags – Identifying Meaning in Polysemous Words in English" by Nikola Dobrić, Alpen-Adria-Universität Klagenfurt
WordNet: http://wordnet.princeton.edu/wordnet/documentation/
“Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora ” by David Yarowsky
Semantic Similarity: http://en.wikipedia.org/wiki/Semantic_similarity
"Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures" by Alexander Budanitsky and Graeme Hirst
Rada R., Mili H., Bicknell E. and Blettner M., "Development and Application of a Metric on Semantic Nets," in IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, 17-30, 1989
"Integrating Subject Field Codes into WordNet" by B. Magnini y G. Cavaglia
”An Extended Analysis of a Method of All Words Sense Disambiguation” by Varada Kolhatkar
"Word sense disambiguation using WordNet and the Lesk algorithm" by Jonas Ekedahl and Koraljka Golub
"Extended Gloss Overlaps as a Measure of Semantic Relatedness" by Ted Pedersen and Satanjeev Banerjee
"A WordNet-based Semantic Similarity Measure Enhanced by Internet-based Knowledge" by Gang Liu, Ruili Wang, Jeremy Buckley, Helen M. Zhou
"Semantic similarity based on corpus statistics and lexical taxonomy" by J. Jiang and D. Conrath
"Using corpus statistics and WordNet relations for sense identification" by C. Leacock, M. Chodorow, and G. Miller
"Verb semantics and lexical selection" by Z. Wu and M. Palmer
"Lexical chains as representations of context for the detection and correction of malapropisms" by G. Hirst and D. St-Onge
P. Resnik, WordNet and class-based probabilities. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 239–263. MIT Press, 1998
Dekang Lin. Automatic retrieval and clustering of similar words. In COLING-ACL, pages 768–774, 1998
Miller, G: Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4) (1990)
WordNet: http://en.wikipedia.org/wiki/Wordnet
P. D. Turney, "Similarity of semantic relations," Comput. Linguist., 32(3), pp. 379-416, 2006
Yarowsky, David, "One Sense Per Collocation," in Proceedings, ARPA Human Language Technology Workshop, Princeton, 1993
Yarowsky, David, "Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994