CS626/449 : Speech, NLP and the Web/Topics in AI Programming (Lecture 5: Wordnet; Application in Query Expansion) Pushpak Bhattacharyya CSE Dept., IIT Bombay


CS626/449 : Speech, NLP and the Web/Topics in AI Programming

(Lecture 5: Wordnet; Application in Query Expansion)

Pushpak Bhattacharyya, CSE Dept., IIT Bombay


Lexical Matrix


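Wordnet - Lexical Matrix (with examples)

Rows are word meanings (M1 … Mm); columns are word forms (F1 … Fn).

Word Meanings | F1            | F2          | F3          | …                 | Fn
M1            | E1,1 (depend) | E1,2 (bank) | E1,3 (rely) |                   |
M2            |               | E2,2 (bank) |             | E2,… (embankment) |
M3            |               | E3,2 (bank) | E3,3        |                   |
…             |               |             |             | …                 |
Mm            |               |             |             |                   | Em,n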
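
The matrix can be read as a small data structure. The following is a minimal Python sketch, assuming a toy matrix that holds only the labelled entries from the slide; the names `MATRIX`, `senses_of` and `synonyms_of` are illustrative and do not belong to any Wordnet API. A row with several forms shows synonymy; a column with several meanings shows polysemy.

```python
# Toy lexical matrix: each meaning (row) maps to the set of word forms (columns)
# that can express it. Only the labelled entries from the slide are included.
MATRIX = {
    "M1": {"depend", "bank", "rely"},
    "M2": {"bank", "embankment"},
    "M3": {"bank"},
}

def senses_of(form):
    """Return the meanings a word form can express (polysemy if more than one)."""
    return sorted(m for m, forms in MATRIX.items() if form in forms)

def synonyms_of(form, meaning):
    """Return the other word forms that share the given meaning (synonymy)."""
    return sorted(MATRIX[meaning] - {form})

print(senses_of("bank"))          # ['M1', 'M2', 'M3']
print(synonyms_of("bank", "M1"))  # ['depend', 'rely']
```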


Psycholinguistic Theory
• Human lexical memory for nouns is organized as a hierarchy.
• Can a canary sing? - Pretty fast response.
• Can a canary fly? - Slower response.
• Does a canary have skin? - Slowest response.

Hierarchy (each property is stored at the most general node it applies to):
Animal (can move, has skin)
  Bird (can fly)
    canary (can sing)

Wordnet - a lexical reference system based on psycholinguistic theories of human lexical memory.
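
The response-time pattern can be mimicked with a small sketch. It assumes, as the slide suggests, that answering takes longer the more is-a links must be climbed before the property is found; the dictionaries and the function name are illustrative only.

```python
# Toy is-a hierarchy from the slide; each node stores only its own properties.
PARENT = {"canary": "bird", "bird": "animal", "animal": None}
PROPERTIES = {
    "canary": {"can sing"},
    "bird": {"can fly"},
    "animal": {"can move", "has skin"},
}

def levels_to_property(node, prop):
    """Count how many is-a links must be climbed before `prop` is found.

    Returns None if neither the node nor any ancestor carries the property.
    """
    hops = 0
    while node is not None:
        if prop in PROPERTIES[node]:
            return hops
        node = PARENT[node]
        hops += 1
    return None

for prop in ("can sing", "can fly", "has skin"):
    print(prop, levels_to_property("canary", prop))
```

The three questions need 0, 1 and 2 links respectively, matching the fast, slower and slowest responses.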


Linked Wordnets in India

• Hindi Wordnet
• Dravidian Language Wordnets
• North East Language Wordnet
• Marathi Wordnet
• Sanskrit Wordnet
• English Wordnet
• Bengali Wordnet
• Punjabi Wordnet
• Konkani Wordnet


Semantic relations in wordnet
1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy

1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).


Synset: the foundation (house)

1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
2. house -- (an official assembly having legislative powers; "the legislature has two houses")
3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
4. family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")
5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be presented; "the house was full")
6. firm, house, business firm -- (members of a business organization that owns or operates one or more establishments; "he worked for a brokerage house")
7. house -- (aristocratic family line; "the House of York")
8. house -- (the members of a religious community living together)
9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he counted the house")
10. house -- (play in which children take the roles of father or mother or children and pretend to interact like adults; "the children were playing house")
11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
12. house -- (the management of a gambling house or casino; "the house gets a percentage of every bet")


Synset - DSF format (2/2)

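ID :: 121
CATEGORY :: NOUN
CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम (the love that arises in the heart for those younger than oneself)
EXAMPLE :: "चाचा नेहरू को बच्चों से बहुत ही स्नेह था" ("Chacha Nehru had great affection for children")
SYNSET :: स्नेह, नेह, लगाव, ममता (words for affection)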


Creation of Synsets

Three principles:
• Minimality
• Coverage
• Replaceability


Gloss and Example

Crucially needed for concept explication, for building a wordnet from another wordnet, and for wordnet linking.

{earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)


Semantic Relations

• Hypernymy and Hyponymy
– Relation between word senses (synsets)
– X is a hyponym of Y if X is a kind of Y
– Hyponymy is transitive and asymmetrical
– Hypernymy is the inverse of Hyponymy

(lion -> animal -> animate entity -> entity)
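
Transitivity and asymmetry can be checked mechanically. The sketch below is a minimal illustration using only the chain from the slide; `HYPERNYM`, `hypernym_chain` and `is_hyponym_of` are hypothetical names, and each word is assumed to have a single direct hypernym.

```python
# Direct hypernym (is-a-kind-of) links from the slide's example chain.
HYPERNYM = {
    "lion": "animal",
    "animal": "animate entity",
    "animate entity": "entity",
}

def hypernym_chain(word):
    """Follow direct hypernym links upward and return every ancestor in order."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym_of(x, y):
    """X is a hyponym of Y if Y is reachable from X through hypernym links."""
    return y in hypernym_chain(x)

print(hypernym_chain("lion"))           # ['animal', 'animate entity', 'entity']
print(is_hyponym_of("lion", "entity"))  # True  (transitive)
print(is_hyponym_of("entity", "lion"))  # False (asymmetrical)
```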


Semantic Relations (continued)

• Meronymy and Holonymy
– Part-whole relation: a branch is a part of a tree
– X is a meronym of Y if X is a part of Y
– Holonymy is the inverse relation of Meronymy

{kitchen} is a meronym of {house}


Lexical Relation

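• Antonymy
– Oppositeness in meaning
– Relation between word forms
– Often determined by phonetics, word length etc.

({rise, ascend} vs. {fall, descend}: the synsets are opposite in meaning, but as word forms rise pairs with fall and ascend with descend)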


Gradation

State        | Childhood, Youth, Old age
Temperature  | Hot, Warm, Cold
Action       | Sleep, Doze, Wake


WordNet Sub-Graph (English)

Central synset: {house, home}
Gloss: a place that serves as the living quarters of one or more families
Hypernymy: {dwelling, abode}
Hyponymy: hermitage, cottage
Meronymy: study, bedroom, kitchen, guestroom, veranda, backyard


WordNet Sub-Graph: Hindi

Central synset: गाय, गऊ (gaaya, gauu) Cow
Gloss: सींगवाला एक शाकाहारी मादा चौपाया (siingwaalaa eka shaakaahaarii maadaa chaupaayaa) A horned, herbivorous, four-legged female animal
Hypernym: चौपाया, पशु (chaupaayaa, pashu) Four-legged animal
Hyponym: कामधेनु (kaamadhenu) A kind of cow; मैनी गाय (mainii gaaya) A kind of cow
Meronym: थन (thana) Udder; पूँछ (puunchh) Tail
Attribute: शाकाहारी (shaakaahaarii) Herbivorous
Ability Verb: पगुराना (paguraanaa) Ruminate
Antonym: बैल (baila) Ox


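Wordnet Subgraph (Marathi)

Central synset: झाड, वृक्ष, तरू (tree)
GLOSS: मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष: "झाडे पर्यावरण शुद्ध करण्याचे काम करतात" (a kind of plant having roots, a trunk, branches, leaves, etc.: "Trees do the work of purifying the environment")
HYPERNYMY: वनस्पती (plant)
HYPONYMY: आंबा (mango), लिंबू (lemon)
MERONYMY: मूळ (root), खोड (trunk)
HOLONYMY: रान (forest), बाग (garden)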


Pan-India Dictionary Standard

Columns: Senses | Hindi | Marathi | Bengali | Oriya | Tamil
Each cell holds the words of that language for the sense, e.g. (W1, W2, W3, W4, W5, W6), (W1, W2, W3) or (W1, W2, W3, W4).

Sense: (sun)
  Hindi: (सूर्य, सूरज, भानु, भास्कर, प्रभाकर, दिनकर, अंशुमान, अंशुमाली)
  Marathi: (सूर्य, भानु, दिवाकर, भास्कर, रवि, दिनेश, दिनमणी)
  Bengali, Oriya, Tamil: ...

Sense: (cub, lad, laddie, sonny, sonny boy)
  Hindi: (लड़का, बालक, बच्चा, छोकड़ा, छोरा, छोकरा, लौंडा)
  Marathi: (मुलगा, पोरगा, पोर, पोरगे)
  Bengali, Oriya, Tamil: …

Sense: (son, boy)
  Hindi: (पुत्र, बेटा, लड़का, लाल, सुत, बच्चा, नंदन, पूत, चिरंजीव, चिरंजी)
  Marathi: (मुलगा, पुत्र, लेक, चिरंजीव, तनय)
  Bengali, Oriya, Tamil: …


Sanskrit Wordnet: a new effort
- A column in the concept-based multilingual dictionary

Columns: Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
Row format: Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)

Concept 4066: any of various long-tailed primates (excluding the prosimians)
  L1 (English): (monkey)
  L2 (Hindi): (बंदर, बन्दर, बानर, वानर, कीश, कपि, मर्कट, ..)
  L3 (Sanskrit): (वानरः, कपिः, प्लवङ्गः, प्लवगः, शाखामृगः, वलीमुखः, मर्कटः, ..)

Concept 2186: a typical star that is the source of light and heat for the planets in the solar system
  L1 (English): (sun)
  L2 (Hindi): (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..)
  L3 (Sanskrit): (सूर्यः, सविता, आदित्यः, मित्रः, अरुणः, भानुः, पूषा, अर्कः, ..)


Query Expansion

Acknowledgement: part of the slides borrowed from my Dual Degree Project Student Nishikant Dhanuka


Problem with Keywords

• May not retrieve relevant documents that include synonymous terms
– “restaurant” vs. “cafe”
– “India” vs. “Bharat”

• May retrieve irrelevant documents that include ambiguous terms
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)


Why Search Engines Fail to Search Relevant Documents


Query Expansion

Definition
• adding more terms (keyword spices) to a user’s basic query

Goal
• to improve Precision and/or Recall

Example
• User Query: car
• Expanded Query: car, cars, automobile, automobiles, auto, etc.


Naïve Methods

• Finding synonyms of query terms and searching for synonyms as well

• Finding various morphological forms of words by stemming each word in the query

• Fixing spelling errors and automatically searching for the corrected form

• Re-weighting the terms in original query


Query Expansion Issues

• Two major issues
– Which terms to include?
– Which terms to weight more?

• Concept based versus term based QE
– Is it better to expand based upon the individual terms in the query, or upon the overall concept of the query?


Objective

To find a proper set of words which, when added to the basic search query, improves Precision without losing a considerable amount of Recall.
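
For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below computes both for two hypothetical result lists; the document ids and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}                # hypothetical relevance judgements
basic = ["d1", "d2", "d5", "d6", "d7", "d8"]       # hypothetical result of the basic query
expanded = ["d1", "d2", "d3", "d5"]                # hypothetical result after expansion

print(precision_recall(basic, relevant))     # precision 2/6, recall 0.5
print(precision_recall(expanded, relevant))  # (0.75, 0.75)
```

In this toy case the expanded query raises precision and does not lose recall, which is the behaviour the objective asks for.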


Existing QE techniques

• Global methods (static; analysis of all documents in the collection)
– Query expansion
  • Thesauri (or WordNet)
  • Automatic thesaurus generation

• Local methods (dynamic; analysis of documents in result set)
– Relevance feedback
– Pseudo relevance feedback


Global Analysis


Thesaurus based QE

• For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus
– feline → feline cat
• May weight added terms less than original query terms.
• Generally increases recall.
• May significantly decrease precision, particularly with ambiguous terms.
– “interest rate” → “interest rate fascinate evaluate”
• There is a high cost of manually producing a thesaurus
– And for updating it for scientific changes
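
The expansion step and the lower weight for added terms can be sketched as follows. This is a minimal illustration, assuming a tiny hand-built thesaurus that mirrors the slide's two examples; `THESAURUS`, `expand_query` and the weight 0.5 are hypothetical choices, not values given in the lecture.

```python
# Hypothetical hand-built thesaurus; entries mirror the slide's examples.
THESAURUS = {
    "feline": ["cat"],
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(query, added_weight=0.5):
    """Return {term: weight}: original terms get 1.0, thesaurus terms get added_weight."""
    weights = {}
    for term in query.lower().split():
        weights[term] = 1.0
        for related in THESAURUS.get(term, []):
            # Keep the existing weight if the term is already part of the query.
            weights.setdefault(related, added_weight)
    return weights

print(expand_query("feline"))
# {'feline': 1.0, 'cat': 0.5}
print(expand_query("interest rate"))
# {'interest': 1.0, 'fascinate': 0.5, 'rate': 1.0, 'evaluate': 0.5}
```

The second call reproduces the precision problem: each term is expanded in isolation, so the wrong senses of "interest" and "rate" enter the query.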


Automatically Generated Thesauri

• Attempt to generate a thesaurus automatically by analyzing the collection of documents

• Two main approaches
– Co-occurrence based (co-occurring words are more likely to be similar)
– Shallow analysis of grammatical relations
  • Entities that are grown, cooked, eaten, and digested are more likely to be food items.

• Co-occurrence based is more robust, grammatical relations are more accurate.
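
The co-occurrence approach can be sketched in a few lines. The example below is a simplification under stated assumptions: it counts raw document-level co-occurrence on a hypothetical four-document collection, whereas a practical system would normally normalise the counts; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for every pair of distinct words, the documents containing both."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts

def related_terms(term, counts, top_k=3):
    """Return up to top_k words that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [word for word, _ in scores.most_common(top_k)]

docs = [  # hypothetical toy collection
    "car engine repair",
    "car engine oil",
    "automobile engine repair",
    "fruit apple orchard",
]
counts = cooccurrence_counts(docs)
print(related_terms("engine", counts, top_k=2))
```

Here "car" and "repair" each share two documents with "engine", so they are the two suggested thesaurus entries for it.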


Example


Semantic Network/ Wordnet

• To expand a query, find the word in the semantic network and follow the various arcs to other related words.
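
Following arcs amounts to a bounded graph traversal. The sketch below is a minimal breadth-first version over a hypothetical network fragment (the `house` arcs echo the English sub-graph shown earlier; the `building` node and all names are assumptions for illustration).

```python
from collections import deque

# Hypothetical fragment of a semantic network: word -> list of (relation, word) arcs.
NETWORK = {
    "house": [("hypernym", "dwelling"), ("meronym", "kitchen"), ("hyponym", "cottage")],
    "dwelling": [("hyponym", "house"), ("hypernym", "building")],
    "kitchen": [("holonym", "house")],
    "cottage": [("hypernym", "house")],
}

def expand(word, max_hops=1, relations=None):
    """Collect words reachable from `word` within max_hops arcs (breadth-first).

    If `relations` is given, only arcs with those labels are followed.
    """
    seen = {word}
    queue = deque([(word, 0)])
    result = []
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue
        for relation, neighbour in NETWORK.get(current, []):
            if relations is not None and relation not in relations:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                result.append(neighbour)
                queue.append((neighbour, hops + 1))
    return result

print(expand("house"))                                      # ['dwelling', 'kitchen', 'cottage']
print(expand("house", max_hops=2, relations={"hypernym"}))  # ['dwelling', 'building']
```

Limiting the hop count and the relation types is one way to keep the expansion from drifting away from the original query.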


Global Methods: Summary

• Pros
– Thesauri and Semantic Networks (WordNet) can be used to suggest good expansion words to users (“more like this”)

• Cons
– Little improvement has been found with automatic techniques that expand the query without user intervention
– Overall, not as useful as Relevance Feedback; may be as good as Pseudo Relevance Feedback


Local Analysis


Relevance Feedback

• Relevance feedback: user feedback on relevance of docs in initial set of results
– The user issues a (short, simple) query.
– The user marks returned documents as relevant or non-relevant.
– The system computes a better representation of the information need based on feedback.
– Relevance feedback can go through one or more iterations.
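
The slides do not say how the better representation is computed. One classical choice, shown here only as an assumption, is the Rocchio update: the new query is alpha times the old query, plus beta times the average relevant document vector, minus gamma times the average non-relevant document vector. The parameter values and toy vectors below are illustrative.

```python
from collections import Counter

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a re-weighted query vector {term: weight} from marked documents.

    All vectors are term -> weight mappings; terms whose weight is not positive are dropped.
    """
    new_query = Counter()
    for term, w in query.items():
        new_query[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(nonrelevant_docs)
    return {t: w for t, w in new_query.items() if w > 0}

# Hypothetical toy vectors (term -> weight).
query = {"space": 1.0, "satellite": 1.0}
relevant = [{"satellite": 1.0, "nasa": 1.0}, {"satellite": 1.0, "launch": 1.0}]
nonrelevant = [{"space": 1.0, "oil": 1.0}]
print(rocchio(query, relevant, nonrelevant))
```

Terms from relevant documents ("nasa", "launch") enter the query with small positive weights, "satellite" is reinforced, and "oil", which occurs only in the non-relevant document, is dropped.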


Relevance Feedback Example: Initial Query and Top 8 Results

• Query: New space satellite applications

(+ marks the documents the user judged relevant.)

• + 1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
• + 2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
•   3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
•   4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
•   5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
•   6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
•   7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
• + 8. 0.509, 12/02/87, Telecommunications Tale of Two Companies


Relevance Feedback Example: Expanded Query

Weight  | Term
2.074   | new
15.106  | space
30.816  | satellite
5.660   | application
5.991   | nasa
5.196   | eos
4.196   | launch
3.972   | aster
3.516   | instrument
3.446   | arianespace
3.004   | bundespost
2.806   | ss
2.790   | rocket
2.053   | scientist
2.003   | broadcast
1.172   | earth
0.836   | oil
0.646   | measure


Top 8 Results After Relevance Feedback

• + 1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
• + 2. 0.500, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
•   3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
•   4. 0.493, 07/31/89, NASA Uses 'Warm' Superconductors For Fast Circuit
• + 5. 0.492, 12/02/87, Telecommunications Tale of Two Companies
•   6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
•   7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
•   8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million


Relevance Feedback: Problems

Why do most search engines not use relevance feedback?

• Users are often reluctant to provide explicit feedback

• It’s often harder to understand why a particular document was retrieved after applying relevance feedback


Pseudo Relevance Feedback

• Automatic local analysis
• Pseudo relevance feedback attempts to automate the manual part of relevance feedback.
• Retrieve an initial set of relevant documents.
• Assume that top m ranked documents are relevant.
• Do relevance feedback
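
The loop can be sketched end to end on a toy collection. This is a simplified illustration, not the lecture's method: documents are scored by counting query-term occurrences, and the feedback step only adds the most frequent new terms from the top m documents instead of re-weighting the whole query. All names and documents are hypothetical.

```python
from collections import Counter

def score(query_terms, doc):
    """Score a document by how many query-term occurrences it contains."""
    words = doc.lower().split()
    return sum(words.count(t) for t in query_terms)

def rank(query_terms, docs):
    """Return document indices sorted by descending score (ties keep input order)."""
    return sorted(range(len(docs)), key=lambda i: -score(query_terms, docs[i]))

def pseudo_relevance_feedback(query_terms, docs, m=2, n_new_terms=2):
    """Assume the top m documents are relevant and add their most frequent new terms."""
    top = rank(query_terms, docs)[:m]
    counts = Counter()
    for i in top:
        counts.update(w for w in docs[i].lower().split() if w not in query_terms)
    new_terms = [w for w, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

docs = [  # hypothetical toy collection
    "nasa satellite launch delayed",
    "satellite launch cost rises",
    "oil prices rise again",
    "new rocket launch planned",
]
expanded = pseudo_relevance_feedback(["satellite"], docs, m=2, n_new_terms=1)
print(expanded)              # ['satellite', 'launch']
print(rank(expanded, docs))  # [0, 1, 3, 2]
```

After expansion the rocket-launch document, which never mentions "satellite", moves ahead of the unrelated oil document without any input from the user.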