WIKIPEDIA AS SENSE INVENTORY TO IMPROVE DIVERSITY IN WEB SEARCH RESULTS

Celina Santamaría, Julio Gonzalo and Javier Artiles
UNED (National University of Distance Education), c/Juan del Rosal, 16, 28040 Madrid, Spain
ACL 2010


OUTLINE

Motivation
Test Set
  Set of Words
  Set of Documents
  Manual Annotation
Coverage of Web Search Results: Wikipedia vs WordNet
Diversity in Google Search Results
Sense Frequency Estimators for Wikipedia
Association of Wikipedia Senses to Web Pages
  VSM Approach
  WSD Approach
  Classification Results
  Precision/Coverage Trade-off
  Using Classification to Promote Diversity
Related Work
Conclusions

MOTIVATION

For very short (one-word) queries, disambiguation may not be possible.

We focus on two broad-coverage lexical resources: WordNet and Wikipedia.


TEST SET

Set of Words
Set of Documents
Manual Annotation

SET OF WORDS

Corpus annotation: two annotators handled 40 nouns.

15 nouns from the Senseval-3 lexical sample dataset.

25 additional words which satisfy two conditions: they are all ambiguous, and they are all names of music bands in one of their senses.

CORPUS

The Senseval set is: {argument, arm, atmosphere, bank, degree, difference, disc, image, paper, party, performance, plan, shelter, sort, source}.

The Bands set is: {amazon, apple, camel, cell, columbia, cream, foreigner, fox, genesis, jaguar, oasis, pioneer, police, puma, rainbow, shell, skin, sun, tesla, thunder, total, traffic, trapeze, triumph, yes}.

TABLE 1: COVERAGE OF SEARCH RESULTS: WIKIPEDIA VS. WORDNET

For each noun in the set, we looked up all its possible senses in WordNet 3.0 and in Wikipedia disambiguation pages.

Wikipedia has an average of 22 senses per noun: 25.2 in the Bands set and 16.1 in the Senseval set.

WordNet has a much smaller figure, 4.5 senses per noun: 3.12 in the Bands set and 6.13 in the Senseval set.

SET OF DOCUMENTS

Step 1: for each noun, we retrieved the top 150 results from Google.

Step 2: for each document, we stored both the snippet and the whole HTML document.

We assume "one sense per document".

MANUAL ANNOTATION

Two annotators decided, for every document, whether there was an appropriate sense in each of the dictionaries.

They provided annotations for 100 documents per noun.

If a URL in the list was corrupt or not available, it had to be discarded, so the 150 retrieved documents per noun were reduced to 100 annotated documents per noun.

COVERAGE OF WEB SEARCH RESULTS: WIKIPEDIA VS WORDNET

TOP TEN RESULTS NOT COVERED BY WIKIPEDIA

32% of the top ten documents are not covered by Wikipedia. These documents were manually examined.

A majority of the missing senses consists of names of (generally not well-known) companies (45%) and products or services (26%).

The other frequent type (12%) of non-annotated document is disambiguation pages.

DEGREE OF OVERLAP BETWEEN WIKIPEDIA AND WORDNET SENSES

Just 3% of the documents fit WordNet only. Wikipedia seems to extend the coverage of WordNet.

DIVERSITY IN GOOGLE SEARCH RESULTS

Diversity is not a major priority for ranking results: the top ten results cover, on average, only 3 Wikipedia senses, while the average number of senses listed in Wikipedia is 22.

In the first 100 documents, this number grows to 6.85 senses per noun.

On average, 63% of the pages in search results belong to the most frequent sense of the query word.


SENSE FREQUENCY ESTIMATORS FOR WIKIPEDIA

Wikipedia disambiguation pages don't contain the relative importance of senses for a given word. Two possible estimators:

Internal relevance: incoming links for the URL of a given sense in Wikipedia. Stable.

External relevance: number of visits for the URL of a given sense (as reported in http://stats.grok.se). Not stable.

MEASURED CORRELATION

For each noun w and for each sense wi, we consider three values:

The proportion of documents retrieved for w which are manually assigned to each sense wi.

inlinks(wi): the relative amount of incoming links to each sense wi.

visits(wi): the relative number of visits to the URL for each sense wi.

MEASURED CORRELATION

We have measured the correlation between these three values using a linear regression correlation coefficient:

A correlation value of .54 for the number of visits.

A correlation value of .71 for the number of incoming links.

Both estimators seem to be positively correlated with the proportion of documents manually assigned to each sense.

MEASURED CORRELATION

freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi), for k = 0, 0.1, 0.2, ..., 1

When k is 0.9, the function has its maximal correlation value of .73:

freq(wi) = 0.9 * inlinks(wi) + 0.1 * visits(wi)

This weighted estimator provides a slight advantage over the use of incoming links only (0.73 vs 0.71).
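The sweep over k can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the correlation was implemented, so a plain Pearson coefficient is assumed here, and the function names (normalize, freq, pearson, best_k) and the three-sense example numbers are made up for the sketch.

```python
from math import sqrt


def normalize(counts):
    """Turn raw counts per sense into relative values that sum to 1."""
    total = sum(counts.values())
    return {sense: value / total for sense, value in counts.items()}


def freq(inlinks, visits, k):
    """freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi), on relative values."""
    rel_in, rel_vis = normalize(inlinks), normalize(visits)
    return {s: k * rel_in[s] + (1 - k) * rel_vis[s] for s in rel_in}


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed stand-in for the slides' measure)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_k(observed, inlinks, visits):
    """Try k = 0, 0.1, ..., 1 and return the (k, correlation) pair with the highest correlation."""
    senses = sorted(observed)
    target = [observed[s] for s in senses]
    results = []
    for step in range(11):
        k = step / 10
        estimate = freq(inlinks, visits, k)
        results.append((k, pearson(target, [estimate[s] for s in senses])))
    return max(results, key=lambda pair: pair[1])


# Made-up numbers for three senses of one noun (illustration only, not data from the study).
observed = {"fruit": 0.2, "company": 0.7, "band": 0.1}
inlinks = {"fruit": 300, "company": 900, "band": 50}
visits = {"fruit": 4000, "company": 5000, "band": 1000}
k, corr = best_k(observed, inlinks, visits)
print(k, round(corr, 3))
```

In the study the correlation is measured over all nouns and senses of the test set; the toy example uses a single noun only to keep the sketch short.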

ASSOCIATION OF WIKIPEDIA SENSES TO WEB PAGES

TWO DIFFERENT TECHNIQUES

Two different techniques:
Vector Space Model (VSM)
WSD system

Two baselines:
random assignment of senses
most frequent sense


VSM APPROACH

For each word sense, we represent its Wikipedia page in a (unigram) vector space model.

idf weights are computed in two different ways, which gives three variants:

VSM: idf computed in the collection of retrieved documents.

VSM-GT: uses the statistics provided by the Google Terabyte collection.

VSM-mixed: VSM + VSM-GT.

VSM APPROACH

Compute the cosine similarity between the document and each sense, and assign the sense with the highest similarity to the document.

In case of ties, pick the first sense in the Wikipedia disambiguation page.

VSM-GT+freq: in case of ties, we pick the sense which has the largest frequency according to our estimator.
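The assignment step can be sketched as follows. This is a minimal illustration, not the authors' implementation: tokenization is reduced to whitespace splitting, the idf formula (log(N/df) + 1) is an assumption, and the names idf_weights, tfidf_vector, cosine and assign_sense are hypothetical. The idf statistics come from the retrieved documents only, as in the plain VSM variant.

```python
import math
from collections import Counter


def idf_weights(documents):
    """idf from a collection of tokenized documents; the +1 smoothing is an assumption."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Unigram tf-idf vector; terms without idf statistics are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_sense(doc_tokens, sense_pages, idf, sense_freq=None):
    """sense_pages: list of (sense, tokens) in disambiguation-page order.

    Returns the sense with the highest cosine similarity. Ties go to the first
    listed sense, or to the most frequent sense when sense_freq is given
    (the "+freq" variant).
    """
    doc_vec = tfidf_vector(doc_tokens, idf)
    scores = [(sense, cosine(doc_vec, tfidf_vector(tokens, idf)))
              for sense, tokens in sense_pages]
    best = max(score for _, score in scores)
    tied = [sense for sense, score in scores if score == best]
    if sense_freq is not None:
        return max(tied, key=lambda s: sense_freq.get(s, 0.0))
    return tied[0]


# Toy data (made up for the example).
sense_pages = [
    ("Jaguar (animal)", "big cat jungle predator".split()),
    ("Jaguar Cars", "luxury car manufacturer engine".split()),
]
results = ["jaguar car engine review".split(), "jaguar cat jungle habitat".split()]
idf = idf_weights(results)
print(assign_sense(results[0], sense_pages, idf))
```

A document that shares no terms with any sense page scores 0 against every sense, which is exactly the tie case the two tie-breaking rules address.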

WSD APPROACH

TiMBL, a state-of-the-art supervised WSD system, uses Memory-Based Learning. Three variants differ in the training examples they use:

TiMBL-core: occurrences of the word in the Wikipedia page for the word sense.

TiMBL-inlinks: occurrences of the word in Wikipedia pages pointing to the page for the word sense.

TiMBL-all: core + inlinks.

TIMBL

First: disambiguate all occurrences of word w in the page p.

Then: choose the sense which appears most frequently in the page according to the TiMBL results.

In case of ties: pick the first sense listed in the Wikipedia disambiguation page.

TiMBL-core+freq: in case of ties, we pick the sense with the highest frequency according to our estimator. When no sense reaches 30% of the cases in the page to be disambiguated, we also resort to the most frequent sense heuristic.
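The page-level decision can be sketched as a vote over per-occurrence predictions. This is a hedged illustration: the predictions are assumed to come from an external classifier such as TiMBL (not called here), every predicted label is assumed to appear in sense_order, and the function name page_sense and its parameters are hypothetical.

```python
from collections import Counter


def page_sense(predictions, sense_order, sense_freq=None, min_share=0.30):
    """Pick one sense for a page from per-occurrence predictions.

    predictions: predicted sense for each occurrence of w in page p.
    sense_order: senses in the order listed on the disambiguation page.
    sense_freq:  optional estimated sense frequencies (the "+freq" variant).
    """
    counts = Counter(predictions)
    top = max(counts.values(), default=0)
    if sense_freq is not None:
        # "+freq" variant: fall back to the most frequent sense when no
        # sense reaches min_share of the occurrences in the page.
        share = top / len(predictions) if predictions else 0.0
        if share < min_share:
            return max(sense_order, key=lambda s: sense_freq.get(s, 0.0))
    tied = [s for s in sense_order if counts.get(s, 0) == top]
    if sense_freq is not None:
        return max(tied, key=lambda s: sense_freq.get(s, 0.0))
    return tied[0]  # default tie-break: first sense on the disambiguation page


order = ["Oasis (band)", "Oasis (landform)"]
print(page_sense(["Oasis (landform)", "Oasis (band)", "Oasis (landform)"], order))
```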

TABLE 4:

Precision: the number of pages correctly classified divided by the total number of predictions.


USING CLASSIFICATION TO PROMOTE DIVERSITY

We fill each position in the rank (starting at rank 1) with the document which has the highest similarity to some of the senses which are not yet represented in the rank.
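A greedy version of this re-ranking can be sketched as below. The slide does not say what happens once every sense already has a representative, so the sketch assumes a new round is started at that point; the function name diversify and the input layout (a similarity score per document and sense) are also assumptions made for the example.

```python
def diversify(doc_scores, top_n=10):
    """Greedy re-ranking that favours senses not yet represented.

    doc_scores: {doc_id: {sense: similarity}}.
    Returns a list of at most top_n document ids.
    """
    all_senses = {s for scores in doc_scores.values() for s in scores}
    if not all_senses:
        return list(doc_scores)[:top_n]
    remaining = dict(doc_scores)
    ranking, represented = [], set()
    while remaining and len(ranking) < top_n:
        missing = all_senses - represented
        if not missing:
            # Assumption: once all senses are covered, start a new round.
            represented.clear()
            missing = set(all_senses)
        best_doc, best_sense, best_score = None, None, float("-inf")
        for doc, scores in remaining.items():
            for sense in missing:
                score = scores.get(sense, 0.0)
                if score > best_score:
                    best_doc, best_sense, best_score = doc, sense, score
        ranking.append(best_doc)
        represented.add(best_sense)
        del remaining[best_doc]
    return ranking


# Toy similarities (made up): d2 is a strong "animal" page but comes after d3,
# because "animal" is already represented by d1 when rank 2 is filled.
scores = {
    "d1": {"animal": 0.9, "car": 0.1},
    "d2": {"animal": 0.8, "car": 0.2},
    "d3": {"animal": 0.1, "car": 0.7},
}
print(diversify(scores, top_n=3))
```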

ALTERNATIVE RANKING FOR COMPARISON

clustering (centroids): this method applies Hierarchical Agglomerative Clustering.

clustering (top ranked): this time the top ranked document (in the original Google rank) of each cluster is selected.

random: randomly selects ten documents from the set of retrieved results.

upper bound: coverage is not 100% because some words have more than ten meanings in Wikipedia and we are only considering the top ten documents.

Coverage: number of senses in the top 10 / number of senses in all results.

Coverage of senses goes from 49% to 77%.

The coverage of Wikipedia senses in the top ten results is 70% larger than in the original ranking.
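The coverage measure itself is a simple ratio and can be sketched as follows. The input layout (one annotated sense per ranked result, with None for results that have no Wikipedia sense) and the function name sense_coverage are assumptions made for the example.

```python
def sense_coverage(ranked_senses, top_n=10):
    """Number of distinct senses in the top_n results divided by the
    number of distinct senses in all results."""
    all_senses = {s for s in ranked_senses if s is not None}
    top_senses = {s for s in ranked_senses[:top_n] if s is not None}
    return len(top_senses) / len(all_senses) if all_senses else 0.0


# Toy ranking (made up): 2 of the 4 senses appear in the top three results.
ranking = ["a", "a", "b", None, "c", "d"]
print(sense_coverage(ranking, top_n=3))
```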

Using Wikipedia to enhance diversity seems to work much better than clustering.

Bias: only Wikipedia senses are considered to estimate diversity.

Our results do not imply that the Wikipedia modified rank is better than the original Google rank.

Wikipedia can be used as a reference to improve search results diversity for one-word queries.

CONCLUSIONS

We have investigated whether generic lexical resources can be used to promote diversity in Web search results for one-word, ambiguous queries. We have compared WordNet and Wikipedia:

(i) unsurprisingly, Wikipedia has a much better coverage of senses in search results, and is therefore more appropriate for the task;

(ii) the distribution of senses in search results can be estimated using the internal graph structure of Wikipedia and the relative number of visits received by each sense in Wikipedia;

(iii) associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings.