Disambiguating Polysemous Queries For Document Retrieval

DESCRIPTION

This presentation describes our proposed model for disambiguating polysemous queries, which is based entirely on the Extended Lesk algorithm.

Page 1: Disambiguating Polysemous Queries For Document Retrieval

DISAMBIGUATING POLYSEMOUS QUERIES FOR DOCUMENT RETRIEVAL

Presented by

Varsha V. Joshi 111003016
Madhusudan R. Dad 111003022
Yugal S. Bagul 111003023

Under the guidance of

Dr. Mrs. Y. V. Haribhakta

College of Engineering, Pune

May, 2014

Page 2: Disambiguating Polysemous Queries For Document Retrieval

Introduction

● Need for information retrieval
● Dependency on digital sources for information retrieval
● Growing importance of search engines
● Search engines work on natural language
● Natural language processing is difficult for a machine
● Many problems are involved; word sense disambiguation (WSD) is one of them

Page 3: Disambiguating Polysemous Queries For Document Retrieval

WSD

● Why WSD in document retrieval?
➢ Words have multiple meanings
➢ This ambiguity decreases the efficiency of a search engine

● What is WSD?
➢ WSD is the process of determining the correct sense of a word in a sentence when the word has more than one meaning (or sense)

Page 4: Disambiguating Polysemous Queries For Document Retrieval

WSD Approaches

● Various approaches have been proposed for disambiguation of words
➢ Supervised vs. unsupervised
➢ Use of different databases

● Some examples:
➢ Based on Magnini domains
➢ Using Roget's Thesaurus
➢ Based on WordNet

Page 5: Disambiguating Polysemous Queries For Document Retrieval

Learning from the Approaches

● To disambiguate a word:
➢ All the senses of the word should be known
➢ Some measure is needed to assign the proper sense
➢ The definition of the word needs to be expanded so that the search engine can recognize the meaning

Page 6: Disambiguating Polysemous Queries For Document Retrieval

Proposed Model

● Aim: Process a polysemous query so that the search engine can return relevant documents as output

● Top retrieved documents should contain highly relevant information

Page 7: Disambiguating Polysemous Queries For Document Retrieval

Proposed Model...

● We propose a model in which
➢ Noise in the query is removed
➢ Words are categorised as polysemous or non-polysemous
➢ Words are marked with their appropriate part of speech
➢ The appropriate sense is assigned to each target word
➢ The query is expanded using the most relevant words in the gloss definition

Page 8: Disambiguating Polysemous Queries For Document Retrieval

System Architecture

Page 9: Disambiguating Polysemous Queries For Document Retrieval

Tools Used

● WordNet
● Solr Search Engine
● WordNet-Similarity package
● NLTK

Page 10: Disambiguating Polysemous Queries For Document Retrieval

WordNet

● What is WordNet?
● Structure of WordNet
● Relationships in WordNet:
➢ Hypernym and hyponym
➢ Meronym and holonym
➢ Coordinate terms

Page 11: Disambiguating Polysemous Queries For Document Retrieval

WordNet Example..

● Consider the word 'car'
● WordNet has 5 synsets for the noun car
● One of them: “car, elevator car -- (where passengers ride up and down; "the car was on the top floor")”
● Relationships:
➢ Vehicle is a hypernym of car, and car is a hyponym of vehicle
➢ Accelerator is a meronym of car, and car is a holonym of accelerator
➢ Car and bike are coordinate terms

Page 12: Disambiguating Polysemous Queries For Document Retrieval

Solr Search Engine

Page 13: Disambiguating Polysemous Queries For Document Retrieval

WordNet-Similarity and NLTK

● WordNet-Similarity: an open-source package
➢ Provides similarity measures using different methods

● NLTK: a set of libraries and programs for natural language processing
➢ POS tagging
➢ WordNet access through NLTK

Page 14: Disambiguating Polysemous Queries For Document Retrieval

Lesk Algorithm

● Classical algorithm for WSD
● Consider the example “pine cone”
● PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
● CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
● Pine#1 ∩ cone#3 = 2 (the shared words are "evergreen" and "tree")
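The pine/cone example can be reproduced with a few lines of Python. This is only a minimal sketch of the simple Lesk idea: the stop-word list and the crude plural stripping (so that "trees" matches "tree") are assumptions added for illustration and are not taken from the slides.

```python
import re
from itertools import product

# Small illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"of", "with", "which", "to", "a", "this", "whether", "or", "through"}

PINE = [
    "kinds of evergreen tree with needle-shaped leaves",
    "waste away through sorrow or illness",
]
CONE = [
    "solid body which narrows to a point",
    "something of this shape whether solid or hollow",
    "fruit of certain evergreen trees",
]

def content_words(gloss):
    """Lower-case the gloss, drop stop words, and crudely strip a plural 's'."""
    words = re.findall(r"[a-z\-]+", gloss.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in words if w not in STOP_WORDS}

def simple_lesk(glosses_a, glosses_b):
    """Return (overlap, sense_a, sense_b) for the best sense pair (1-based)."""
    best = (0, None, None)
    pairs = product(enumerate(glosses_a, 1), enumerate(glosses_b, 1))
    for (i, gloss_a), (j, gloss_b) in pairs:
        overlap = len(content_words(gloss_a) & content_words(gloss_b))
        if overlap > best[0]:
            best = (overlap, i, j)
    return best

print(simple_lesk(PINE, CONE))  # (2, 1, 3): pine#1 and cone#3 share two words
```

Every other sense pair has an overlap of 0 here, which is why pine#1 and cone#3 are selected.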

Page 15: Disambiguating Polysemous Queries For Document Retrieval

Extended Lesk Algorithm

● Proposed by Satanjeev Banerjee and Ted Pedersen
● Based on extended gloss overlap
● Considers the overlap between the glosses of the input synsets and the glosses of their hypernyms, hyponyms, meronyms, holonyms and troponyms

Page 16: Disambiguating Polysemous Queries For Document Retrieval

Scoring mechanism

● For an overlap of n consecutive words, add n^2 to the score
● Consider the gloss definitions of drawing paper and decal
● Drawing paper: paper that is specially prepared for use in drafting
● Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface

Page 17: Disambiguating Polysemous Queries For Document Retrieval

Continued...

● Three words overlap: the single word "paper" and the two-word phrase "specially prepared".

● Hence the score is 1^2 + 2^2 = 1 + 4 = 5
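The scoring rule can be sketched as a loop that repeatedly finds the longest phrase shared by the two glosses, adds the square of its length, and removes it from both. This is an illustrative sketch only: the function names are hypothetical, and it omits the filtering of overlaps made up purely of function words, which does not affect this example.

```python
def longest_common_phrase(a, b):
    """Return (length, start_a, start_b) of the longest shared run of words."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def overlap_score(gloss_a, gloss_b):
    """Sum n^2 over shared phrases, longest first, removing each once found."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Replace the matched phrase with unique markers so it cannot match again.
        a[i:i + n] = [object()]
        b[j:j + n] = [object()]

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(overlap_score(drawing_paper, decal))  # 5 = 2^2 ("specially prepared") + 1^2 ("paper")
```

Squaring the phrase length rewards one long shared phrase more than the same number of scattered single-word matches.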

Page 18: Disambiguating Polysemous Queries For Document Retrieval

Computing Relatedness

● Consider the set of relations RELS = {gloss, hype, hypo}
● The relatedness measure between synsets A and B is computed as:

Page 19: Disambiguating Polysemous Queries For Document Retrieval

Continued...

relatedness(A,B) = score(gloss(A),gloss(B))
                 + score(hype(A),hype(B))
                 + score(hypo(A),hypo(B))
                 + score(hype(A),gloss(B))
                 + score(gloss(A),hype(B))

e.g. the relatedness between temple#n#1 and build#v#1 is 33 using extended Lesk and 0 with simple Lesk
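The sum of relation pairs can be sketched as follows. Everything here is a toy stand-in: each synset is represented as a plain dictionary of strings, the strings are invented for illustration (they are not WordNet glosses), and `word_overlap` is a placeholder for the n^2 phrase-overlap score.

```python
def word_overlap(text_a, text_b):
    """Placeholder scorer (assumption): number of distinct shared words."""
    return len(set(text_a.split()) & set(text_b.split()))

# The five relation pairs used in the relatedness formula.
RELATION_PAIRS = [
    ("gloss", "gloss"),
    ("hype", "hype"),
    ("hypo", "hypo"),
    ("hype", "gloss"),
    ("gloss", "hype"),
]

def relatedness(a, b, score=word_overlap, pairs=RELATION_PAIRS):
    """Sum the score over every (relation of A, relation of B) pair."""
    return sum(score(a[rel_a], b[rel_b]) for rel_a, rel_b in pairs)

# Invented toy data: relation name -> text of the related glosses.
A = {"gloss": "red apple", "hype": "edible fruit", "hypo": "green apple"}
B = {"gloss": "sweet fruit pie", "hype": "baked dish", "hypo": "apple pie"}

print(relatedness(A, B))                              # all five pairs
print(relatedness(A, B, pairs=[("gloss", "gloss")]))  # gloss-only, as in simple Lesk
```

With this toy data the gloss-only comparison finds nothing in common while the extended sum does, which mirrors the temple/build contrast on a small scale.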

Page 20: Disambiguating Polysemous Queries For Document Retrieval

Implementation:

● Query Preprocessing
● Polysemy Detection
● Disambiguation
● Query Expansion
● Document Retrieval

Page 21: Disambiguating Polysemous Queries For Document Retrieval

● Input query for demonstration:

“Bark of Pine is tough!”

Page 22: Disambiguating Polysemous Queries For Document Retrieval

Query Preprocessing:

● Query refining
● Compoundify
e.g. White House = White_House
● POS tagging
[('bark', 'n'), ('of', None), ('pine', 'n'), ('is', 'v'), ('tough', 'a')]
● Stop word removal
[('bark', 'n'), ('pine', 'n'), ('tough', 'a')]
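Two of these preprocessing steps can be sketched in plain Python. This is a hedged illustration, not the presented implementation: the compound list, the stop-word list and both function names are assumptions, and the tagged input is copied from the POS tagging step rather than produced by a tagger.

```python
# Illustrative resources (assumptions): a real system would take compounds
# from WordNet and use a fuller stop-word list.
KNOWN_COMPOUNDS = {"white house"}
STOP_WORDS = {"of", "is", "the", "a", "an"}

def compoundify(tokens, compounds=KNOWN_COMPOUNDS):
    """Join adjacent tokens that form a known two-word compound with '_'."""
    out, i = [], 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and pair in compounds:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def remove_stop_words(tagged, stop_words=STOP_WORDS):
    """Drop stop words and tokens that received no part-of-speech tag."""
    return [(w, t) for w, t in tagged if t is not None and w not in stop_words]

tagged = [('bark', 'n'), ('of', None), ('pine', 'n'), ('is', 'v'), ('tough', 'a')]
print(compoundify("the White House lawn".split()))
print(remove_stop_words(tagged))
```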

Page 23: Disambiguating Polysemous Queries For Document Retrieval

Polysemy Detection

● Normal method based on the number of synsets, e.g. the WordNet senses of 'America':

1. (39) United States, United States of America, America, the States, US, U.S., USA, U.S.A. -- (North American republic containing 50 states - 48 conterminous states in North America plus Alaska in northwest North America and the Hawaiian Islands in the Pacific Ocean; achieved independence in 1776)

2. (1) America -- (North America and South America and Central America)

● Use of the relatedness concept for polysemy detection

Polysemous words in the input query are: bark and tough

Page 24: Disambiguating Polysemous Queries For Document Retrieval

Disambiguation Process

● Construction of window
window size = 4
bark = {pine, tough}
tough = {bark, pine}
● Disambiguating each polysemous word using its window words
● Sense labelled:
bark#n#01 and tough#a#07
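The window construction and sense selection can be sketched as below. Several things are assumptions made for illustration: that the window size counts the target word itself, that the nearest words are preferred, the function names, and the toy relatedness values (which are invented, not measured).

```python
def context_window(words, index, size=4):
    """Return up to size-1 words nearest to words[index], in query order."""
    others = [i for i in range(len(words)) if i != index]
    nearest = sorted(others, key=lambda i: abs(i - index))[:size - 1]
    return [words[i] for i in sorted(nearest)]

def best_sense(target_senses, context_senses, relatedness):
    """Pick the target sense with the highest total relatedness to the context."""
    def total(sense):
        return sum(relatedness(sense, other) for other in context_senses)
    return max(target_senses, key=total)

query = ["bark", "pine", "tough"]
print(context_window(query, 0))  # ['pine', 'tough']
print(context_window(query, 2))  # ['bark', 'pine']

# Invented toy scores standing in for the extended Lesk relatedness measure.
TOY_SCORES = {("bark#n#1", "pine#n#1"): 5, ("bark#n#2", "pine#n#1"): 1}

def toy_relatedness(a, b):
    return TOY_SCORES.get((a, b), 0)

print(best_sense(["bark#n#1", "bark#n#2"], ["pine#n#1"], toy_relatedness))
```

In a full system the candidate senses and the relatedness scores would come from WordNet and the extended gloss overlap measure rather than from a hand-written table.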

Page 25: Disambiguating Polysemous Queries For Document Retrieval

Query Expansion

● Use of gloss definitions
● Combine gloss definitions and input query
● “bark of pine is tough tough protective covering of the woody stems and roots of trees and other woody plants resistant to cutting or chewing”
● Remove noise and duplicate words
● Clustering based on relatedness
● Output: “bark pine tough roots stems trees covering plants cutting chewing”
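The combine, de-noise and de-duplicate steps can be sketched as follows. This is a partial illustration under stated assumptions: the stop-word list, the function name and the point at which the combined text is split into two glosses are all assumed, and the relatedness-based clustering is not reproduced, so the result keeps more words than the final output listed above.

```python
# Illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"of", "is", "the", "and", "other", "to", "or"}

def expand_query(query, glosses, stop_words=STOP_WORDS):
    """Append gloss words to the query, dropping stop words and duplicates."""
    seen, expanded = set(), []
    for word in " ".join([query, *glosses]).lower().split():
        word = word.strip("!?.,;:\"'")
        if word and word not in stop_words and word not in seen:
            seen.add(word)
            expanded.append(word)
    return expanded

glosses = [
    "tough protective covering of the woody stems and roots of trees "
    "and other woody plants",
    "resistant to cutting or chewing",
]
print(" ".join(expand_query("Bark of Pine is tough!", glosses)))
```

Keeping the first occurrence of each word preserves the original query terms at the front of the expanded query.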

Page 26: Disambiguating Polysemous Queries For Document Retrieval

Document retrieval

● Expanded query as an input to the Solr search engine
● Documents with the maximum number of matching words will be retrieved first
● Testing and Evaluation
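The "most matching words first" idea can be illustrated with a toy ranker. This is not how Solr computes relevance (its scoring takes more than a raw match count into account); it is a simplified sketch, and the documents and function name are invented for the example.

```python
def rank_by_matches(query_terms, documents):
    """Order document ids by how many distinct query terms each contains."""
    terms = set(query_terms)
    scored = []
    for doc_id, text in documents.items():
        matches = len(terms & set(text.lower().split()))
        if matches:
            scored.append((matches, doc_id))
    # Most matches first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Invented toy collection.
documents = {
    "d1": "the bark of a dog is loud",
    "d2": "pine bark protects trees and stems",
    "d3": "stock markets fell",
}
expanded_query = ["bark", "pine", "tough", "roots", "stems", "trees"]
print(rank_by_matches(expanded_query, documents))  # ['d2', 'd1']
```

The tree-related document outranks the dog-related one because the expansion terms match only the intended sense of "bark".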

Page 27: Disambiguating Polysemous Queries For Document Retrieval

Performance Evaluation

1. Disambiguation Performance

2. Retrieval Performance

Page 28: Disambiguating Polysemous Queries For Document Retrieval

Disambiguation Performance

● Disambiguation is the centre of the entire model

● Effective retrieval is the result of efficient disambiguation

● Over 75% successful disambiguation

● Table of disambiguated queries

Page 29: Disambiguating Polysemous Queries For Document Retrieval
Page 30: Disambiguating Polysemous Queries For Document Retrieval
Page 31: Disambiguating Polysemous Queries For Document Retrieval

Retrieval Performance

Page 32: Disambiguating Polysemous Queries For Document Retrieval

1. Average Performance rating

● Unknown nature of documents

● Implies performance evaluation over heterogeneous data

● Rating scale: 5 = completely relevant, 3 = moderately relevant, 1 = less relevant, 0 = irrelevant
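An average rating on this scale can be computed as below. The judgements in the example are hypothetical and the function name is an assumption; the slides do not state how many documents were rated per query.

```python
# Rating scale from the slide.
RATING_SCALE = {
    "completely relevant": 5,
    "moderately relevant": 3,
    "less relevant": 1,
    "irrelevant": 0,
}

def average_rating(judgements):
    """Mean rating of the retrieved documents for one query."""
    if not judgements:
        return 0.0
    return sum(RATING_SCALE[j] for j in judgements) / len(judgements)

# Hypothetical judgements for the top four documents of one query.
example = ["completely relevant", "moderately relevant", "irrelevant", "less relevant"]
print(average_rating(example))  # 2.25
```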

Page 33: Disambiguating Polysemous Queries For Document Retrieval
Page 34: Disambiguating Polysemous Queries For Document Retrieval

2. Standard Performance Measures

1. True Positive
2. False Positive
3. True Negative
4. False Negative

● Importance of these measures
● Inference derived from the values
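The slides do not say which measures were derived from these four counts. Assuming the usual ones (precision, recall and accuracy), a minimal sketch looks like this; the counts in the example are hypothetical.

```python
def retrieval_measures(tp, fp, tn, fn):
    """Precision, recall and accuracy from the four outcome counts."""
    total = tp + fp + tn + fn
    return {
        # Share of retrieved documents that are relevant.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of relevant documents that were retrieved.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all documents classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Hypothetical counts for one query over a 100-document collection.
print(retrieval_measures(tp=8, fp=2, tn=85, fn=5))
```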

Page 35: Disambiguating Polysemous Queries For Document Retrieval
Page 36: Disambiguating Polysemous Queries For Document Retrieval

3. Performance over increasing collection size

● Effect of increasing size of collection

● The applications of this model need it to perform consistently as the collection size changes

● Behavior of the results over changing size:

Page 37: Disambiguating Polysemous Queries For Document Retrieval
Page 38: Disambiguating Polysemous Queries For Document Retrieval

Conclusion

● Acceptable disambiguation performance

● Effective query expansion

● Refinement of the results in terms of top-ranked documents

● Consistency over increasing collection size

Page 39: Disambiguating Polysemous Queries For Document Retrieval

Issues

● Requires a grammatically well formed query

● Limitations of the Lesk algorithm cannot be avoided

● The performance over entire set of retrieved documents

Page 40: Disambiguating Polysemous Queries For Document Retrieval

Future Direction

1. Domain assignment

2. Clustering of terms

3. Ranking of documents

Page 41: Disambiguating Polysemous Queries For Document Retrieval

Publications

● Submitted a literature survey, 'Disambiguating Polysemous Queries for Document Retrieval', to the 'International Journal of Engineering, Science and Innovative Technology (IJESIT)'

● Designed a model 'Disambiguating Polysemous Queries for Document Retrieval', expecting to publish it soon

Page 42: Disambiguating Polysemous Queries For Document Retrieval

Bibliography

“Homonymy and polysemy in information retrieval” by Robert Krovetz

“Words Polysemy Analysis: Implementation of the Word Sense Disambiguation Algorithm Based On Magnini Domains” by Francis C. Fernandez-Reyes, Exiquio C. Leyva Perez, Rogelio Lau Fernandez

“Unsupervised Word Sense Disambiguation Rivaling Supervised Methods” by David Yarowsky

“Using Bilingual Materials to Develop Word Sense Disambiguation Methods” by William A. Gale , Kenneth W. Church , David Yarowsky

“A Proposal for Word Sense Disambiguation using Conceptual Distance” by Eneko Agirre and German Rigau

”WORD SENSE DISAMBIGUATION USING ID TAGS – IDENTIFYING MEANING IN POLYSEMOUS WORDS IN ENGLISH” by Nikola Dobrić , Alpen-Adria Universität Klagenfurt

WordNet: http://wordnet.princeton.edu/wordnet/documentation/

“Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora ” by David Yarowsky

Semantic Similarity: http://en.wikipedia.org/wiki/Semantic_similarity

“Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures” by Alexander Budanitsky and Graeme Hirst

Rada R., Mili H., Bicknell E. and Blettner M., Development and Application of a Metric on Semantic Nets, in IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, 17-30, 1989

“Integrating Subject Field Codes into WordNet” by B. Magnini and G. Cavaglia

”An Extended Analysis of a Method of All Words Sense Disambiguation” by Varada Kolhatkar

“Word sense disambiguation using WordNet and the Lesk algorithm ” by Jonas EKEDAHL and Koraljka GOLUB

Page 43: Disambiguating Polysemous Queries For Document Retrieval

● “Extended Gloss Overlaps as a Measure of Semantic Relatedness” by Ted Pedersen and Satanjeev Banerjee

● “A WordNet-based Semantic Similarity Measure Enhanced by Internet-based Knowledge” by Gang Liu, Ruili Wang, Jeremy Buckley, Helen M. Zhou

● “Semantic similarity based on corpus statistics and lexical taxonomy” by J. Jiang and D. Conrath

● “Using corpus statistics and WordNet relations for sense identification” by C. Leacock, M. Chodorow, and G. Miller

● “Verb semantics and lexical selection” by Z. Wu and M. Palmer

● “Lexical chains as representations of context for the detection and correction of malapropisms” by G. Hirst and D. St-Onge

● P. Resnik, WordNet and class-based probabilities. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 239–263. MIT Press, 1998

● Dekang Lin. Automatic retrieval and clustering of similar words. In COLING-ACL, pages 768–774, 1998

● Miller, G: Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4) (1990)

● WordNet: http://en.wikipedia.org/wiki/Wordnet

● P. D. Turney, “Similarity of semantic relations,” Comput. Linguist., 32(3), pp. 379-416, 2006

● Yarowsky, David, "One Sense Per Collocation," in Proceedings, ARPA Human Language Technology Workshop, Princeton, 1993.

● Yarowsky, David, "Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.