Improving Similarity Measures for Short Segments of Text
Scott Wen-tau Yih & Chris Meek, Microsoft Research


Page 1: Improving Similarity Measures for Short Segments of Text

Improving Similarity Measures for Short Segments of Text

Scott Wen-tau Yih & Chris Meek
Microsoft Research

Page 2: Improving Similarity Measures for Short Segments of Text

Query Suggestion

How similar are they?

mariners vs. seattle mariners
mariners vs. 1st mariner bank

query: mariners

Page 3: Improving Similarity Measures for Short Segments of Text

Keyword Expansion for Online Ads

Chocolate Cigarettes

Chocolate candy

Chocolate cigars

Nostalgic candy

Novelty candy

Candy cigarettes

Old fashioned candy

How similar are they?

chocolate cigarettes vs. cigarettes
chocolate cigarettes vs. chocolate cigars
chocolate cigarettes vs. old fashioned candy

Page 4: Improving Similarity Measures for Short Segments of Text

Measuring Similarity

Goal: create a similarity function

fsim: (String1, String2) → ℝ

Rank suggestions: fix String1 as q; vary String2 as s1, s2, …, sk

Whether the function is symmetric is not important

For query suggestion – fsim(q,s)

fsim(“mariners”, “seattle mariners”) = 0.9

fsim(“mariners”, “1st mariner bank”) = 0.6
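The ranking use of fsim above can be sketched as follows; the `word_overlap` stub here is an illustrative placeholder, not the paper's measure, so its scores differ from the 0.9/0.6 examples:

```python
def rank_suggestions(q, suggestions, fsim):
    """Fix the query q and sort candidate suggestions by fsim(q, s), best first."""
    return sorted(suggestions, key=lambda s: fsim(q, s), reverse=True)

def word_overlap(q, s):
    """Illustrative stub similarity: word-set overlap ratio (Jaccard)."""
    Q, S = set(q.split()), set(s.split())
    return len(Q & S) / max(len(Q | S), 1)

ranked = rank_suggestions("mariners",
                          ["1st mariner bank", "seattle mariners"],
                          word_overlap)
# → ["seattle mariners", "1st mariner bank"]
```

Note the asymmetry point above: nothing in this interface requires fsim(q, s) = fsim(s, q), since q is always held fixed.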

Page 5: Improving Similarity Measures for Short Segments of Text

Enabling Useful Applications

Web search
- Ranking query suggestions
- Segmenting web sessions using query logs

Online advertising
- Suggesting alternative keywords to advertisers
- Matching similar keywords to show ads

Document writing
- Providing alternative phrasing
- Correcting spelling errors

Page 6: Improving Similarity Measures for Short Segments of Text

Challenges

Short text segments may not overlap
- "Microsoft Research" vs. "MSR" → 0 cosine score

Ambiguous terms
- "Bill Gates" vs. "Utility Bill" → 0.5 cosine score
- "taxi runway" vs. "taxi" → 0.7 cosine score

Text segments may rarely co-occur in corpus
- "Hyatt Vancouver" vs. "Haytt Vancover" → 1 page
- Longer query → fewer pages

Page 7: Improving Similarity Measures for Short Segments of Text

Our Contributions

Web-relevance similarity measure
- Represent the input text segments as real-valued term vectors using Web documents
- Improve the term-weighting scheme based on relevant keyword extraction

Learning similarity measure
- Fit user preference for the application better
- Compare learning a similarity function vs. learning a ranking function

Page 8: Improving Similarity Measures for Short Segments of Text

Outline

Introduction
- Problem, applications, challenges

Our methods
- Web-relevance similarity function
- Combining similarity measures using learning
  - Learning a similarity function
  - Learning a ranking function

Experiments on query suggestion

Page 9: Improving Similarity Measures for Short Segments of Text

Web-relevance Similarity Measure

Query expansion of x using a search engine:
- Let Dn(x) be the set of top-n documents
- Build a term vector vi for each document di ∈ Dn(x)
  - Elements are scores representing the relevancy of the words in document di
- C(x) = (1/n) Σi vi / ||vi||  (centroid of the L2-normalized vectors)
- QE(x) = C(x) / ||C(x)||  (L2-normalized)

The similarity score is simply the inner product:

fsim(q, s) = QE(q) · QE(s)
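The construction above can be sketched as follows; the per-document term vectors are placeholder dicts here, whereas the paper fills them with relevancy scores derived from retrieved Web documents:

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a {term: weight} dict to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def query_expansion_vector(doc_term_vectors):
    """QE(x): L2-normalized centroid of per-document term vectors.

    doc_term_vectors: one {term: relevancy_score} dict per retrieved
    document (placeholder scores; the paper uses learned relevancy).
    """
    n = len(doc_term_vectors)
    centroid = Counter()
    for v in doc_term_vectors:          # C(x) = (1/n) * sum_i v_i / ||v_i||
        for t, w in l2_normalize(v).items():
            centroid[t] += w / n
    return l2_normalize(centroid)       # QE(x) = C(x) / ||C(x)||

def fsim(qe_q, qe_s):
    """Inner product of two (already unit-norm) expansion vectors."""
    return sum(w * qe_s.get(t, 0.0) for t, w in qe_q.items())
```

Because QE(x) is unit-length, fsim(q, s) is the cosine of the angle between the two expansion vectors and lies in [0, 1] for non-negative weights.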

Page 10: Improving Similarity Measures for Short Segments of Text

Web-kernel Similarity

Relevancy = TFIDF [Sahami & Heilman '06]

Why TFIDF?
- High TF: important or relevant to the document
- High DF: stopwords or words in template blocks

A crude estimate of the importance of the word. Can we do better than TFIDF?

Page 11: Improving Similarity Measures for Short Segments of Text

Web-relevance Similarity

Relevancy = Prob(relevance | wj, di)

Keyword extraction can judge the importance of the words more accurately! [Yih et al. WWW-06]
- Assigns relevancy scores (probabilities) to words/phrases
- Machine-learning model learned by logistic regression
- Uses more than 10 categories of features
  - Query-log frequency (high-DF words may be popular queries)
  - The position of the word in the document
  - The format, hyperlinks, etc.

Page 12: Improving Similarity Measures for Short Segments of Text

Learning Similarity

Similarity measures should depend on the application:
q = "Seattle Mariners"; s1 = "Seattle"; s2 = "Seattle Mariners Ticket"

Let human subjects decide what's similar.

Parametric similarity function fsim(q, s | w)
- Learn the parameters (weights) from data
- Use machine learning to combine multiple base similarity measures

Page 13: Improving Similarity Measures for Short Segments of Text

Base Similarity Measures

Surface-matching methods: suppose Q and S are the sets of words in a given pair of query q and suggestion s.

Matching: |Q ∩ S|
Dice: 2|Q ∩ S| / (|Q| + |S|)
Jaccard: |Q ∩ S| / |Q ∪ S|
Overlap: |Q ∩ S| / min(|Q|, |S|)
Cosine: |Q ∩ S| / sqrt(|Q| × |S|)

Corpus-based methods: Web-relevance, Web-kernel, KL-divergence
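The five surface-matching measures translate directly into code; a small sketch over the word sets:

```python
import math

def surface_measures(q, s):
    """The slide's surface-matching measures over word sets Q and S."""
    Q, S = set(q.split()), set(s.split())
    inter = len(Q & S)
    return {
        "matching": inter,                                # |Q ∩ S|
        "dice":     2 * inter / (len(Q) + len(S)),        # 2|Q ∩ S| / (|Q|+|S|)
        "jaccard":  inter / len(Q | S),                   # |Q ∩ S| / |Q ∪ S|
        "overlap":  inter / min(len(Q), len(S)),          # |Q ∩ S| / min(|Q|,|S|)
        "cosine":   inter / math.sqrt(len(Q) * len(S)),   # |Q ∩ S| / sqrt(|Q|·|S|)
    }

m = surface_measures("chocolate cigarettes", "candy cigarettes")
# one shared word ("cigarettes") out of two per side, so dice = cosine = 0.5
```

These all score 0 whenever the word sets are disjoint, which is exactly the "Microsoft Research" vs. "MSR" failure mode the corpus-based measures are meant to fix.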

Page 14: Improving Similarity Measures for Short Segments of Text

Learning Similarity Function

Data: pairs of query and suggestion (qi, sj)
- Label: relevance judgment (rel = 1 or rel = 0)
- Features: scores on (qi, sj) provided by multiple base similarity measures

We combine them using logistic regression:

z = w1·Cosine(q,s) + w2·Dice(q,s) + w3·Matching(q,s) + w4·Web-relevance(q,s) + w5·KL-divergence(q,s) + …

fsim(q, s | w) = Prob(rel | q, s; w) = exp(z) / (1 + exp(z))
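A minimal sketch of this combination; the weights would be fit to the human relevance labels, and the values passed in below are illustrative, not the paper's learned parameters:

```python
import math

def combined_similarity(base_scores, weights, bias=0.0):
    """fsim(q, s | w): logistic combination of base similarity scores.

    base_scores: [Cosine(q,s), Dice(q,s), Matching(q,s), ...]
    weights:     parallel list of learned weights (illustrative here).
    """
    z = bias + sum(w * f for w, f in zip(weights, base_scores))
    return 1.0 / (1.0 + math.exp(-z))   # = exp(z) / (1 + exp(z))
```

A score above 0.5 corresponds to z > 0, i.e. the weighted evidence favors rel = 1.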

Page 15: Improving Similarity Measures for Short Segments of Text

Learning Ranking Function

We compare suggestions sj, sk for the same query q.

Data: tuples of a query q and suggestions sj, sk
- Label: [sim(q,sj) > sim(q,sk)] or [sim(q,sj) < sim(q,sk)]
- Features: scores on the pairs (q,sj) and (q,sk) provided by multiple base similarity measures

Learn a probabilistic model using logistic regression:

Prob([sim(q,sj) > sim(q,sk)] | q, sj, sk; w)
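One standard way to parameterize such a pairwise model is a logistic regression over the difference of the two suggestions' feature vectors; a hedged sketch under that assumption (the paper's exact feature construction may differ):

```python
import math

def prob_prefer(feats_j, feats_k, weights):
    """P([sim(q,sj) > sim(q,sk)] | q, sj, sk; w).

    feats_j, feats_k: base-measure scores for (q,sj) and (q,sk);
    modeled as a logistic over their element-wise difference.
    """
    z = sum(w * (fj - fk) for w, fj, fk in zip(weights, feats_j, feats_k))
    return 1.0 / (1.0 + math.exp(-z))
```

When the two suggestions have identical features the model is indifferent (probability 0.5), and swapping sj and sk flips z's sign, so the two orderings' probabilities sum to 1.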

Page 16: Improving Similarity Measures for Short Segments of Text

Experiments

Data: query suggestion dataset [Metzler et al. '07]
- |Q| = 122, |(Q,S)| = 4852; {Excellent, Good} vs. {Fair, Bad}

Results
- 10-fold cross-validation
- Evaluation metrics: AUC and Precision@k

Example judgments:

Query                    Suggestion             Label
shell oil credit card    shell gas cards        Excellent
shell oil credit card    texaco credit card     Fair
tarrant county college   fresno city college    Bad
tarrant county college   dallas county schools  Good

Page 17: Improving Similarity Measures for Short Segments of Text

AUC Scores

[Bar chart: AUC scores of the ten compared methods; values as extracted: 0.739, 0.735, 0.703, 0.691, 0.664, 0.627, 0.627, 0.626, 0.617, 0.606]

Page 18: Improving Similarity Measures for Short Segments of Text

Precision@3

[Bar chart: Precision@3 of the ten compared methods; values as extracted: 0.569, 0.556, 0.508, 0.483, 0.436, 0.456, 0.456, 0.456, 0.444, 0.389]

Page 19: Improving Similarity Measures for Short Segments of Text

Conclusions

Web-relevance
- New term-weighting scheme from keyword extraction
- Outperforms existing methods on query suggestion

Learning similarity
- Fits the application: better suggestion ranking
- Learning a similarity function vs. learning a ranking function

Future work
- Experiment with alternative combination methods
- Explore other probabilistic models for similarity
- Apply our similarity measures to different tasks