Improving Similarity Measures for Short Segments of Text
Scott Wen-tau Yih & Chris Meek, Microsoft Research


Page 1: Improving Similarity Measures for Short Segments of Text

Improving Similarity Measures for Short Segments of Text

Scott Wen-tau Yih & Chris Meek
Microsoft Research

Page 2: Improving Similarity Measures for Short Segments of Text

Query Suggestion

How similar are they?

mariners vs. seattle mariners
mariners vs. 1st mariner bank

query: mariners

Page 3: Improving Similarity Measures for Short Segments of Text

Keyword Expansion for Online Ads

Chocolate Cigarettes

Chocolate candy

Chocolate cigars

Nostalgic candy

Novelty candy

Candy cigarettes

Old fashioned candy

How similar are they?

chocolate cigarettes vs. cigarettes
chocolate cigarettes vs. chocolate cigars
chocolate cigarettes vs. old fashioned candy

Page 4: Improving Similarity Measures for Short Segments of Text

Measuring Similarity

Goal: create a similarity function

fsim: (String1, String2) → ℝ

Rank suggestions: fix String1 as q; vary String2 as s1, s2, …, sk

Whether the function is symmetric is not important

For query suggestion – fsim(q,s)

fsim(“mariners”, “seattle mariners”) = 0.9

fsim(“mariners”, “1st mariner bank”) = 0.6
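The ranking use of fsim above can be sketched as follows; the `word_overlap` stub here is an illustrative placeholder, not the paper's measure, so its scores differ from the 0.9/0.6 examples:

```python
def rank_suggestions(q, suggestions, fsim):
    """Fix the query q and sort candidate suggestions by fsim(q, s), best first."""
    return sorted(suggestions, key=lambda s: fsim(q, s), reverse=True)

def word_overlap(q, s):
    """Illustrative stub similarity: word-set overlap ratio (Jaccard)."""
    Q, S = set(q.split()), set(s.split())
    return len(Q & S) / max(len(Q | S), 1)

ranked = rank_suggestions("mariners",
                          ["1st mariner bank", "seattle mariners"],
                          word_overlap)
# → ["seattle mariners", "1st mariner bank"]
```

Note the asymmetry point above: nothing in this interface requires fsim(q, s) = fsim(s, q), since q is always held fixed.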

Page 5: Improving Similarity Measures for Short Segments of Text

Enabling Useful Applications

Web search
- Ranking query suggestions
- Segmenting web sessions using query logs

Online advertising
- Suggesting alternative keywords to advertisers
- Matching similar keywords to show ads

Document writing
- Providing alternative phrasing
- Correcting spelling errors

Page 6: Improving Similarity Measures for Short Segments of Text

Challenges

Short text segments may not overlap
- "Microsoft Research" vs. "MSR" → 0 cosine score

Ambiguous terms
- "Bill Gates" vs. "Utility Bill" → 0.5 cosine score
- "taxi runway" vs. "taxi" → 0.7 cosine score

Text segments may rarely co-occur in corpus
- "Hyatt Vancouver" vs. "Haytt Vancover" → 1 page
- Longer query → fewer pages

Page 7: Improving Similarity Measures for Short Segments of Text

Our Contributions

Web-relevance similarity measure
- Represent the input text segments as real-valued term vectors using Web documents
- Improve the term-weighting scheme based on relevant keyword extraction

Learning similarity measure
- Fit user preference for the application better
- Compare learning a similarity function vs. learning a ranking function

Page 8: Improving Similarity Measures for Short Segments of Text

Outline

Introduction
- Problem, applications, challenges

Our methods
- Web-relevance similarity function
- Combining similarity measures using learning
  - Learning a similarity function
  - Learning a ranking function

Experiments on query suggestion

Page 9: Improving Similarity Measures for Short Segments of Text

Web-relevance Similarity Measure

Query expansion of x using a search engine:
- Let Dn(x) be the set of top-n documents
- Build a term vector vi for each document di ∈ Dn(x)
  - Elements are scores representing the relevancy of the words in document di
- C(x) = (1/n) Σi vi / ||vi||  (centroid of the L2-normalized vectors)
- QE(x) = C(x) / ||C(x)||  (L2-normalized)

The similarity score is simply the inner product:

fsim(q, s) = QE(q) · QE(s)
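The construction above can be sketched as follows; the per-document term vectors are placeholder dicts here, whereas the paper fills them with relevancy scores derived from retrieved Web documents:

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a {term: weight} dict to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def query_expansion_vector(doc_term_vectors):
    """QE(x): L2-normalized centroid of per-document term vectors.

    doc_term_vectors: one {term: relevancy_score} dict per retrieved
    document (placeholder scores; the paper uses learned relevancy).
    """
    n = len(doc_term_vectors)
    centroid = Counter()
    for v in doc_term_vectors:          # C(x) = (1/n) * sum_i v_i / ||v_i||
        for t, w in l2_normalize(v).items():
            centroid[t] += w / n
    return l2_normalize(centroid)       # QE(x) = C(x) / ||C(x)||

def fsim(qe_q, qe_s):
    """Inner product of two (already unit-norm) expansion vectors."""
    return sum(w * qe_s.get(t, 0.0) for t, w in qe_q.items())
```

Because QE(x) is unit-length, fsim(q, s) is the cosine of the angle between the two expansion vectors and lies in [0, 1] for non-negative weights.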

Page 10: Improving Similarity Measures for Short Segments of Text

Web-kernel Similarity

Relevancy = TFIDF [Sahami & Heilman '06]

Why TFIDF?
- High TF: important or relevant to the document
- High DF: stopwords or words in template blocks

A crude estimate of the importance of the word. Can we do better than TFIDF?

Page 11: Improving Similarity Measures for Short Segments of Text

Web-relevance Similarity

Relevancy = Prob(relevance | wj, di)

Keyword extraction can judge the importance of the words more accurately! [Yih et al. WWW-06]
- Assigns relevancy scores (probabilities) to words/phrases
- Machine-learning model learned by logistic regression
- Uses more than 10 categories of features
  - Query-log frequency (high-DF words may be popular queries)
  - The position of the word in the document
  - The format, hyperlinks, etc.

Page 12: Improving Similarity Measures for Short Segments of Text

Learning Similarity

Similarity measures should depend on the application:
q = "Seattle Mariners"; s1 = "Seattle"; s2 = "Seattle Mariners Ticket"

Let human subjects decide what's similar.

Parametric similarity function fsim(q, s | w)
- Learn the parameters (weights) from data
- Use machine learning to combine multiple base similarity measures

Page 13: Improving Similarity Measures for Short Segments of Text

Base Similarity Measures

Surface-matching methods: suppose Q and S are the sets of words in a given pair of query q and suggestion s.

Matching: |Q ∩ S|
Dice: 2|Q ∩ S| / (|Q| + |S|)
Jaccard: |Q ∩ S| / |Q ∪ S|
Overlap: |Q ∩ S| / min(|Q|, |S|)
Cosine: |Q ∩ S| / sqrt(|Q| × |S|)

Corpus-based methods: Web-relevance, Web-kernel, KL-divergence
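The five surface-matching measures translate directly into code; a small sketch over the word sets:

```python
import math

def surface_measures(q, s):
    """The slide's surface-matching measures over word sets Q and S."""
    Q, S = set(q.split()), set(s.split())
    inter = len(Q & S)
    return {
        "matching": inter,                                # |Q ∩ S|
        "dice":     2 * inter / (len(Q) + len(S)),        # 2|Q ∩ S| / (|Q|+|S|)
        "jaccard":  inter / len(Q | S),                   # |Q ∩ S| / |Q ∪ S|
        "overlap":  inter / min(len(Q), len(S)),          # |Q ∩ S| / min(|Q|,|S|)
        "cosine":   inter / math.sqrt(len(Q) * len(S)),   # |Q ∩ S| / sqrt(|Q|·|S|)
    }

m = surface_measures("chocolate cigarettes", "candy cigarettes")
# one shared word ("cigarettes") out of two per side, so dice = cosine = 0.5
```

These all score 0 whenever the word sets are disjoint, which is exactly the "Microsoft Research" vs. "MSR" failure mode the corpus-based measures are meant to fix.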

Page 14: Improving Similarity Measures for Short Segments of Text

Learning Similarity Function

Data: pairs of query and suggestion (qi, sj)
- Label: relevance judgment (rel = 1 or rel = 0)
- Features: scores on (qi, sj) provided by multiple base similarity measures

We combine them using logistic regression:

z = w1·Cosine(q,s) + w2·Dice(q,s) + w3·Matching(q,s) + w4·Web-relevance(q,s) + w5·KL-divergence(q,s) + …

fsim(q, s | w) = Prob(rel | q, s; w) = exp(z) / (1 + exp(z))
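A minimal sketch of this combination; the weights would be fit to the human relevance labels, and the values passed in below are illustrative, not the paper's learned parameters:

```python
import math

def combined_similarity(base_scores, weights, bias=0.0):
    """fsim(q, s | w): logistic combination of base similarity scores.

    base_scores: [Cosine(q,s), Dice(q,s), Matching(q,s), ...]
    weights:     parallel list of learned weights (illustrative here).
    """
    z = bias + sum(w * f for w, f in zip(weights, base_scores))
    return 1.0 / (1.0 + math.exp(-z))   # = exp(z) / (1 + exp(z))
```

A score above 0.5 corresponds to z > 0, i.e. the weighted evidence favors rel = 1.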

Page 15: Improving Similarity Measures for Short Segments of Text

Learning Ranking Function

We compare suggestions sj, sk for the same query q.

Data: tuples of a query q and suggestions sj, sk
- Label: [sim(q,sj) > sim(q,sk)] or [sim(q,sj) < sim(q,sk)]
- Features: scores on the pairs (q,sj) and (q,sk) provided by multiple base similarity measures

Learn a probabilistic model using logistic regression:

Prob([sim(q,sj) > sim(q,sk)] | q, sj, sk; w)
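One standard way to parameterize such a pairwise model is a logistic regression over the difference of the two suggestions' feature vectors; a hedged sketch under that assumption (the paper's exact feature construction may differ):

```python
import math

def prob_prefer(feats_j, feats_k, weights):
    """P([sim(q,sj) > sim(q,sk)] | q, sj, sk; w).

    feats_j, feats_k: base-measure scores for (q,sj) and (q,sk);
    modeled as a logistic over their element-wise difference.
    """
    z = sum(w * (fj - fk) for w, fj, fk in zip(weights, feats_j, feats_k))
    return 1.0 / (1.0 + math.exp(-z))
```

When the two suggestions have identical features the model is indifferent (probability 0.5), and swapping sj and sk flips z's sign, so the two orderings' probabilities sum to 1.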

Page 16: Improving Similarity Measures for Short Segments of Text

Experiments

Data: query suggestion dataset [Metzler et al. '07]
- |Q| = 122, |(Q,S)| = 4852; {Excellent, Good} vs. {Fair, Bad}

Results
- 10-fold cross-validation
- Evaluation metrics: AUC and Precision@k

Example judgments:

Query                    Suggestion             Label
shell oil credit card    shell gas cards        Excellent
shell oil credit card    texaco credit card     Fair
tarrant county college   fresno city college    Bad
tarrant county college   dallas county schools  Good

Page 17: Improving Similarity Measures for Short Segments of Text

AUC Scores

[Bar chart: AUC scores of the ten compared methods; values as extracted: 0.739, 0.735, 0.703, 0.691, 0.664, 0.627, 0.627, 0.626, 0.617, 0.606]

Page 18: Improving Similarity Measures for Short Segments of Text

Precision@3

[Bar chart: Precision@3 of the ten compared methods; values as extracted: 0.569, 0.556, 0.508, 0.483, 0.436, 0.456, 0.456, 0.456, 0.444, 0.389]

Page 19: Improving Similarity Measures for Short Segments of Text

Conclusions

Web-relevance
- New term-weighting scheme from keyword extraction
- Outperforms existing methods on query suggestion

Learning similarity
- Fits the application: better suggestion ranking
- Learning a similarity function vs. learning a ranking function

Future work
- Experiment with alternative combination methods
- Explore other probabilistic models for similarity
- Apply our similarity measures to different tasks