Modeling and Solving Term Mismatch for Full-Text Retrieval
Le Zhao ([email protected]), School of Computer Science, Carnegie Mellon University
April 16, 2012 @ Microsoft Research, Redmond


Page 1: Modeling and Solving Term Mismatch for Full-Text Retrieval

1

Modeling and Solving Term Mismatch for Full-Text Retrieval

Le Zhao ([email protected])

School of Computer Science
Carnegie Mellon University

April 16, 2012 @ Microsoft Research, Redmond

Page 2: Modeling and Solving Term Mismatch for Full-Text Retrieval

2

What is Full-Text Retrieval

• The task

• The Cranfield evaluation [Cleverdon 1960]
– abstracts away the user
– allows objective & automatic evaluations

[Diagram: User → Query → Retrieval Engine (over the Document Collection) → Results → User]

Page 3: Modeling and Solving Term Mismatch for Full-Text Retrieval

3

Where are We (Going)?

• Current retrieval models
– formal models since the 1970s, best ones from the 1990s
– based on simple collection statistics (tf.idf), no deep understanding of natural language texts

• Perfect retrieval
– Query: “information retrieval”, A: “… text search …” (the document should imply the query)
– Textual entailment (a difficult natural language task)
– Searcher frustration [Feild, Allan and Jones 2010]
– Still far away; what has been holding us back?

Page 4: Modeling and Solving Term Mismatch for Full-Text Retrieval

4

Two Long Standing Problems in Retrieval

• Term mismatch
– [Furnas, Landauer, Gomez and Dumais 1987]
– No clear definition in retrieval

• Query dependent term importance (P(t | R))
– Traditionally, idf (rareness)
– P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998]
– Few clues about estimation

• This work
– connects the two problems,
– shows they can result in huge gains in retrieval,
– and uses a predictive approach toward solving both problems.

Page 5: Modeling and Solving Term Mismatch for Full-Text Retrieval

5

What is Term Mismatch & Why Care?

• Job search
– You look for information retrieval jobs on the market. They want text search skills.
– Costs you job opportunities (50%, even if you are careful)

• Legal discovery
– You look for bribery or foul play in corporate documents. They say grease, pay off.
– Costs you cases

• Patent/Publication search
– Costs businesses

• Medical record retrieval
– Costs lives

[Speaker notes: customize the example to the venue, e.g. desktop search on Win7; only 33% of re-finding queries are exact repeats (Teevan et al. 2007). Another example: lady gaga shocking picture.]
Page 6: Modeling and Solving Term Mismatch for Full-Text Retrieval

6

Prior Approaches

• Document:
– Full text indexing (instead of only indexing key words)
– Stemming (include morphological variants)
– Document expansion (inlink anchor, user tags)

• Query:
– Query expansion, reformulation

• Both:
– Latent Semantic Indexing
– Translation based models

Page 7: Modeling and Solving Term Mismatch for Full-Text Retrieval

7

• Definition

• Significance (theory & practice)

• Mechanism (what causes the problem)

• Model and solution

Main Questions Answered

Page 8: Modeling and Solving Term Mismatch for Full-Text Retrieval

8

Definition of Mismatch P(t̄ | Rq)

Directly calculated given relevance judgments for q

mismatch P(t̄ | Rq) == 1 – term recall P(t | Rq)

[Venn diagram: within the Collection, the set of all relevant jobs Relevant(q) and the documents that contain t (“retrieval”); the relevant jobs outside the term’s document set are the jobs mismatched]

Definition Importance Prediction Solution
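Since P(t | Rq) is directly computable from relevance judgments, here is a minimal sketch of that calculation in Python. The in-memory judgment representation and the whitespace tokenization are illustrative assumptions, not the thesis code.

```python
from typing import List

def term_recall(term: str, relevant_docs: List[str]) -> float:
    """P(t | Rq): fraction of relevant documents that contain the term."""
    if not relevant_docs:
        return 0.0
    hits = sum(1 for doc in relevant_docs if term in set(doc.lower().split()))
    return hits / len(relevant_docs)

def term_mismatch(term: str, relevant_docs: List[str]) -> float:
    """P(t-bar | Rq) = 1 - term recall."""
    return 1.0 - term_recall(term, relevant_docs)

# Toy example: three relevant "job postings" and the query term "retrieval"
relevant = [
    "senior engineer for text search and ranking",
    "information retrieval scientist",
    "search relevance engineer, query understanding",
]
print(term_recall("retrieval", relevant))    # 0.333...
print(term_mismatch("retrieval", relevant))  # 0.666...
```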

Page 9: Modeling and Solving Term Mismatch for Full-Text Retrieval

9

How Often Do Terms Mismatch?

(Example TREC-3 topics; P(t | R) of one term in each query)

Oil Spills: 0.9914
Term limitations for US Congress members: 0.9831
Insurance Coverage which pays for Long Term Care: 0.6885
School Choice Voucher System and its effects on the US educational program: 0.2821
Vitamin the cure or cause of human ailments: 0.1071

Definition Importance Prediction Solution

Page 10: Modeling and Solving Term Mismatch for Full-Text Retrieval

10

• Definition
• P(t | R) or P(t̄ | R), simple
• estimated from relevant documents
• analyze mismatch

• Significance (theory & practice)

• Mechanism (what causes the problem)

• Model and solution

Main Questions


Page 11: Modeling and Solving Term Mismatch for Full-Text Retrieval

11

Term Mismatch & Probabilistic Retrieval Models

Binary Independence Model [Robertson and Spärck Jones 1976]
– Optimal ranking score for each document d [formula, annotated with its term recall and idf (rareness) parts; a reconstruction follows below]
– Term weight for Okapi BM25
– Other advanced models behave similarly
– Used as effective features in Web search engines

Definition Importance: Theory Prediction Solution
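The ranking formula on this slide is an image in the original deck; below is a reconstruction of the standard Binary Independence Model (Robertson/Spärck Jones) weight it refers to, written so that the term recall part and the idf-like part the slide annotates are explicit. This is my rendering of the cited model, not text copied from the slide.

```latex
% BIM optimal ranking score, with p_t = P(t | R) (term recall)
% and s_t = P(t | \bar{R}) (roughly df_t / N when relevant docs are rare):
\[
  \mathrm{score}(d) \;=\; \sum_{t \in q \cap d} \log \frac{p_t\,(1 - s_t)}{(1 - p_t)\,s_t}
  \;=\; \underbrace{\sum_{t \in q \cap d} \log \frac{p_t}{1 - p_t}}_{\text{term recall part}}
  \;+\; \underbrace{\sum_{t \in q \cap d} \log \frac{1 - s_t}{s_t}}_{\text{idf (rareness) part}}
\]
```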

Page 12: Modeling and Solving Term Mismatch for Full-Text Retrieval

12

Term Mismatch & Probabilistic Retrieval Models

Binary Independence Model [Robertson and Spärck Jones 1976]
– Optimal ranking score for each document d
– “Relevance Weight”, “Term Relevance”
• P(t | R): the only part about the query & relevance

Definition Importance: Theory Prediction Solution

Term recall Idf (rareness)

Page 13: Modeling and Solving Term Mismatch for Full-Text Retrieval

13

• Definition

• Significance• Theory (as idf & only part about relevance)• Practice?

• Mechanism (what causes the problem)

• Model and solution

Main Questions

Page 14: Modeling and Solving Term Mismatch for Full-Text Retrieval

14

Term Mismatch & Probabilistic Retrieval Models

Binary Independence Model [Robertson and Spärck Jones 1976]
– Optimal ranking score for each document d
– “Relevance Weight”, “Term Relevance”
• P(t | R): the only part about the query & relevance

Definition Importance: Practice: Mechanism Prediction Solution

Term recall Idf (rareness)

Page 15: Modeling and Solving Term Mismatch for Full-Text Retrieval

Without Term Recall

• The emphasis problem for idf-only term weighting– (Not only for BIM, but also tf.idf models)– Emphasize high idf (rare) terms in query

• “prognosis/viability of a political third party in U.S.” (Topic 206)

15

Definition Importance: Practice: Mechanism Prediction Solution

Page 16: Modeling and Solving Term Mismatch for Full-Text Retrieval

16

Ground Truth (Term Recall)

Query: prognosis/viability of a political third party

Term: party, political, third, viability, prognosis
True P(t | R): 0.9796, 0.7143, 0.5918, 0.0408, 0.0204
idf: 2.402, 2.513, 2.187, 5.017, 7.471

idf-only weighting puts the emphasis on viability and prognosis, the terms with the lowest term recall: wrong emphasis.

Definition Importance: Practice: Mechanism Prediction Solution

Page 17: Modeling and Solving Term Mismatch for Full-Text Retrieval

17

Top Results (Language model)

1. … discouraging prognosis for 1991 …

2. … Politics … party … Robertson's viability as a candidate …

3. … political parties …

4. … there is no viable opposition …

5. … A third of the votes …

6. … politics … party … two thirds …

7. … third ranking political movement…

8. … political parties …

9. … prognosis for the Sunday school …

10. … third party provider …

All are false positives. Emphasis / Mismatch problem, not precision.

(Large Web search engines are doing better, but still have top 10 false positives.

Emphasis / Mismatch also a problem for large search engines!)

Definition Importance: Practice: Mechanism Prediction Solution

Query: prognosis/viability of a political third party

Page 18: Modeling and Solving Term Mismatch for Full-Text Retrieval

18

Without Term Recall

• The emphasis problem for idf-only term weighting– Emphasize high idf (rare) terms in query

• “prognosis/viability of a political third party in U.S.” (Topic 206)

– False positives throughout rank list• especially detrimental at top rank

– No term recall hurts precision at all recall levels

• How significant is the emphasis problem?

Definition Importance: Practice: Mechanism Prediction Solution

Page 19: Modeling and Solving Term Mismatch for Full-Text Retrieval

19

Failure Analysis of 44 Topics from TREC 6-8

RIA workshop 2003 (7 top research IR systems, >56 expert*weeks). Failure analyses of retrieval models & techniques are still standard today.

[Pie chart: Emphasis 64%, Mismatch 27%, Precision 9%. Emphasis failures → recall term weighting; mismatch failures → mismatch guided expansion; basis for both: term mismatch prediction]

Definition Importance: Practice: Mechanism Prediction Solution


Page 20: Modeling and Solving Term Mismatch for Full-Text Retrieval

20

• Definition

• Significance• Theory: as idf & only part about relevance• Practice: explains common failures,

other behavior: Personalization, WSD, structured

• Mechanism (what causes the problem)

• Model and solution

Main Questions

Page 21: Modeling and Solving Term Mismatch for Full-Text Retrieval

21

Failure Analysis of 44 Topics from TREC 6-8

RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)

[Pie chart: Emphasis 64%, Mismatch 27%, Precision 9%. Emphasis failures → recall term weighting; mismatch failures → mismatch guided expansion; basis for both: term mismatch prediction]

Definition Importance: Practice: Potential Prediction Solution


Page 22: Modeling and Solving Term Mismatch for Full-Text Retrieval

22

True Term Recall Effectiveness

• +100% over BIM (in precision at all recall levels)

– [Robertson and Spärck Jones 1976]

• +30-80% over Language Model, BM25 (in MAP)

– This work

• For a new query w/o relevance judgments, – Need to predict– Predictions don’t need to be very accurate

to show performance gain

Definition Importance: Practice: Potential Prediction Solution

Page 23: Modeling and Solving Term Mismatch for Full-Text Retrieval

23

• Definition

• Significance• Theory: as idf & only part about relevance• Practice: explains common failures, other behavior,• +30 to 80% potential from term weighting

• Mechanism (what causes the problem)

• Model and solution

Main Questions

Page 24: Modeling and Solving Term Mismatch for Full-Text Retrieval

24

How Often Do Terms Mismatch?

(Examples from TREC 3 topics; P(t | R) and idf of one term in each query)

Oil Spills: P(t | R) = 0.9914, idf = 5.201
Term limitations for US Congress members: P(t | R) = 0.9831, idf = 2.010
Insurance Coverage which pays for Long Term Care: P(t | R) = 0.6885, idf = 2.010
School Choice Voucher System and its effects on the US educational program: P(t | R) = 0.2821, idf = 1.647
Vitamin the cure or cause of human ailments: P(t | R) = 0.1071, idf = 6.405

Differs from idf

Definition Importance Prediction: Idea Solution

Varies 0 to 1

Same term, different Recall

Page 25: Modeling and Solving Term Mismatch for Full-Text Retrieval

25

Statistics

Term recall across all query terms (average ~55-60%)

TREC 3 titles (4.9 terms/query): average 55% term recall. TREC 9 descriptions (6.3 terms/query): average 59% term recall.

[Bar charts: Term Recall P(t | R) for individual query terms]

Definition Importance Prediction: Idea Solution

Page 26: Modeling and Solving Term Mismatch for Full-Text Retrieval

26

Statistics

Term recall on shorter queries (average ~70%)

TREC 9 titles (2.5 terms/query): average 70% term recall. TREC 13 titles (3.1 terms/query): average 66% term recall.

[Bar charts: Term Recall P(t | R) for individual query terms]

Definition Importance Prediction: Idea Solution

Page 27: Modeling and Solving Term Mismatch for Full-Text Retrieval

27

Query dependent (but for many terms, variance is small)

Statistics

364 recurring words from TREC 3-7, 350 topics

Term Recall for Repeating Terms

Definition Importance Prediction: Idea Solution

Page 28: Modeling and Solving Term Mismatch for Full-Text Retrieval

P(t | R) vs. idf

28

[Scatter plots: P(t | R) vs. df/N (Greiff, 1998), and P(t | R) vs. idf for TREC 4 desc query terms]

Definition Importance Prediction: Idea Solution

Page 29: Modeling and Solving Term Mismatch for Full-Text Retrieval

29

Prior Prediction Approaches

• Croft/Harper combination match (1979)– treats P(t | R) as a tuned constant, or estimated from PRF– when >0.5, rewards docs that match more query terms

• Greiff’s (1998) exploratory data analysis– Used idf to predict overall term weighting– Improved over basic BIM

• Metzler’s (2008) generalized idf– Used idf to predict P(t | R)– Improved over basic BIM

• Simple feature (idf), limited success– Missing piece: P(t | R) = term recall = 1 – term mismatch

Definition Importance Prediction: Idea Solution

Page 30: Modeling and Solving Term Mismatch for Full-Text Retrieval

30

What Factors can Cause Mismatch?

• Topic centrality (Is concept central to topic?)– “Laser research related or potentially related to defense”– “Welfare laws propounded as reforms”

• Synonyms (How often they replace original term?)– “retrieval” == “search” == …

• Abstractness– “Laser research … defense”

“Welfare laws”– “Prognosis/viability” (rare & abstract)

Definition Importance Prediction: Idea Solution

Page 31: Modeling and Solving Term Mismatch for Full-Text Retrieval

31

• Definition

• Significance

• Mechanism• Causes of mismatch: Unnecessary concepts,

replaced by synonyms or more specific terms

• Model and solution

Main Questions

Page 32: Modeling and Solving Term Mismatch for Full-Text Retrieval

32

Designing Features to Model the Factors

• We need to– Identify synonyms/searchonyms of a query term– in a query dependent way

• External resource? (WordNet, wiki, or query log)– Biased (coverage problem, collection independent)– Static (not query dependent)– Not easy, not used here

• Term-term similarity in concept space!– Local LSI (Latent Semantic Indexing)

• Top ranked documents (e.g. 200)• Dimension reduction (LSI keep e.g. 150 dimensions)

Definition Importance Prediction: Implement Solution
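As a rough illustration of the local LSI step described above (top ranked documents, dimension reduction, then term-term similarity in the reduced space), here is a sketch using scikit-learn. The 200-document and 150-dimension settings come from the slide; TfidfVectorizer/TruncatedSVD stand in for the thesis's LSI implementation, and the function name and remaining parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_similarities(query_term, top_docs, n_dims=150):
    """Term-term cosine similarities to `query_term` in a reduced
    concept space built from the top retrieved documents."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(top_docs)                # docs x terms
    n_dims = min(n_dims, min(X.shape) - 1)
    svd = TruncatedSVD(n_components=n_dims)
    svd.fit(X)
    # Term representations: rows of V * Sigma (terms x concepts)
    term_vecs = svd.components_.T * svd.singular_values_
    vocab = vec.vocabulary_
    if query_term not in vocab:
        return {}
    qv = term_vecs[vocab[query_term]]
    norms = np.linalg.norm(term_vecs, axis=1) * (np.linalg.norm(qv) + 1e-12) + 1e-12
    sims = term_vecs @ qv / norms
    terms = vec.get_feature_names_out()
    return dict(sorted(zip(terms, sims), key=lambda x: -x[1])[:10])

# top_docs would be e.g. the top 200 documents retrieved for the query, per the slide.
```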

Page 33: Modeling and Solving Term Mismatch for Full-Text Retrieval

33

Synonyms from Local LSI

Queries: “Term limitation for US Congress members”, “Insurance Coverage which pays for Long Term Care”, “Vitamin the cure or cause of human ailments”; P(t | Rq) of the analyzed terms: 0.9831, 0.6885, 0.1071

[Bar charts: top similar terms from local LSI (e.g. term, limit, ballot, elect; term, long, nurse, care; ail, health, disease, basler) and their similarity with each query term, on a 0 to 0.5 scale]

Definition Importance Prediction: Implement Solution

Page 34: Modeling and Solving Term Mismatch for Full-Text Retrieval

34

Synonyms from Local LSI

Queries: “Term limitation for US Congress members”, “Insurance Coverage which pays for Long Term Care”, “Vitamin the cure or cause of human ailments”; P(t | Rq) of the analyzed terms: 0.9831, 0.6885, 0.1071

[Same bar charts as the previous slide, annotated with the feature ideas:]

(1) Magnitude of self similarity – Term centrality
(2) Avg similarity of supporting terms – Concept centrality
(3) How likely synonyms replace term t in the collection

Definition Importance Prediction: Implement Solution

Page 35: Modeling and Solving Term Mismatch for Full-Text Retrieval

35

Features that Model the Factors

• Term centrality: self-similarity (length of t) after dimension reduction. Correlation with P(t | R): 0.3719

• Concept centrality: avg similarity of supporting terms (top synonyms). Correlation with P(t | R): 0.3758

• Replaceability: how frequently synonyms appear in place of the original query term in collection documents. Correlation with P(t | R): –0.1872

• Abstractness: users modify abstract terms with concrete terms (e.g. “effects on the US educational program”, “prognosis of a political third party”). Correlation with P(t | R): –0.1278

Definition Importance Prediction: Experiment Solution

(For comparison, the correlation of idf with P(t | R) is –0.1339.)
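To make the first two features concrete, here is a hedged sketch: given the reduced-space term vectors from the local LSI sketch above, term centrality is taken as the query term's own vector length after dimension reduction, and concept centrality as the average similarity of its top supporting terms, following the bullet definitions above. The exact formulas in the thesis may differ, and replaceability and abstractness are omitted here.

```python
import numpy as np

def term_centrality(term_vecs, vocab, query_term):
    """Self-similarity: vector length of the term after dimension reduction."""
    return float(np.linalg.norm(term_vecs[vocab[query_term]]))

def concept_centrality(term_vecs, vocab, query_term, top_k=5):
    """Average cosine similarity between the query term and its
    top-k most similar terms (its supporting terms / searchonyms)."""
    qv = term_vecs[vocab[query_term]]
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(qv) + 1e-12
    sims = term_vecs @ qv / norms
    sims[vocab[query_term]] = -np.inf   # exclude the term itself
    top = np.sort(sims)[-top_k:]
    return float(np.mean(top))
```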

Page 36: Modeling and Solving Term Mismatch for Full-Text Retrieval

36

Prediction Model

Regression modeling
– Model M: <f1, f2, .., f5> → P(t | R)
– Train on one set of queries (known relevance)
– Test on another set of queries (unknown relevance)
– RBF kernel Support Vector regression (a sketch follows below)

Definition Importance Prediction: Implement Solution
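A minimal sketch of the regression setup described above, assuming per-term feature vectors and true P(t | R) values are already available for the training queries. scikit-learn's RBF-kernel SVR stands in for whatever SVM package the thesis used; the toy feature values and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train: one row of 5 features per (query, term) pair from the training queries
# y_train: the true term recall P(t | R), computed from relevance judgments
X_train = np.array([[0.8, 0.6, 0.1, 0.0, 2.4],
                    [0.2, 0.3, 0.5, 1.0, 7.5]])   # toy values
y_train = np.array([0.95, 0.05])

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.05))
model.fit(X_train, y_train)

# For a test query (unknown relevance), predict P(t | R) for each of its terms
X_test = np.array([[0.7, 0.5, 0.2, 0.0, 3.0]])
pred = np.clip(model.predict(X_test), 0.0, 1.0)   # keep predictions in [0, 1]
```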

Page 37: Modeling and Solving Term Mismatch for Full-Text Retrieval

37

Experiments

• Term recall prediction error– L1 loss (absolute prediction error)

• Term recall based term weighting retrieval – Mean Average Precision (overall retrieval success)– Precision at top 10 (precision at top of rank list)

Definition Importance Prediction: Experiment Solution
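For reference, the L1 loss used above is the mean absolute difference between predicted and true term recall over the query terms being evaluated (standard definition; the exact averaging in the thesis may differ):

```latex
\[
  L_1 \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl|\hat{P}(t_i \mid R) - P(t_i \mid R)\bigr|
\]
```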

Page 38: Modeling and Solving Term Mismatch for Full-Text Retrieval

38

Term Recall Prediction Example

Query: prognosis/viability of a political third party (trained on TREC 3)

Term: party, political, third, viability, prognosis
True P(t | R): 0.9796, 0.7143, 0.5918, 0.0408, 0.0204
Predicted: 0.7585, 0.6523, 0.6236, 0.3080, 0.2869

The predicted values preserve the relative emphasis of the true values across terms.

Definition Importance Prediction: Experiment Solution

Page 39: Modeling and Solving Term Mismatch for Full-Text Retrieval

39

Term Recall Prediction Error

[Bar chart: Average Absolute Error (L1 loss, lower is better) on TREC 4, ranging 0 to 0.3, for five predictors: Average (constant), IDF only, All 5 features, Tuning meta-parameters, TREC 3 recurring words]

Definition Importance Prediction: Experiment Solution

Page 40: Modeling and Solving Term Mismatch for Full-Text Retrieval

40

• Definition

• Significance

• Mechanism

• Model and solution• Can be predicted,

Framework to design and evaluate features

Main Questions

Page 41: Modeling and Solving Term Mismatch for Full-Text Retrieval

41

A General View of Retrieval Modeling as Transfer Learning

• The traditional restricted view sees a retrieval model as– a document classifier for a given query.

• More general view: A retrieval model really is– a meta-classifier, responsible for many queries,– mapping a query to a document classifier.

• Learning a retrieval model == transfer learning
– Using knowledge from related tasks (training queries) to classify documents for a new task (test query)
– Our features and model facilitate the transfer
– The more general view → more principled investigations and more advanced techniques

Page 42: Modeling and Solving Term Mismatch for Full-Text Retrieval

42

Using P(t | R) in Retrieval Models

• In BM25– Binary Independence Model

• In Language Modeling (LM)– Relevance Model [Lavrenko and Croft 2001]

Definition Importance Prediction Solution: Weighting

Only term weighting, no expansion.
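One simple way to apply the predicted P(t | R) as query term weights is to emit an Indri-style #weight query, as sketched below with the predicted values from the earlier example. Using the predicted recall directly as the weight is an illustration only; the thesis plugs P(t | R) into the BM25 and relevance-model term weights rather than this exact form.

```python
def recall_weighted_query(term_recalls):
    """Build an Indri-style #weight query from predicted P(t | R) values.
    term_recalls: dict mapping query term -> predicted term recall."""
    parts = [f"{p:.4f} {t}" for t, p in term_recalls.items()]
    return "#weight( " + " ".join(parts) + " )"

predicted = {"prognosis": 0.2869, "viability": 0.3080,
             "political": 0.6523, "third": 0.6236, "party": 0.7585}
print(recall_weighted_query(predicted))
# #weight( 0.2869 prognosis 0.3080 viability 0.6523 political 0.6236 third 0.7585 party )
```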

Page 43: Modeling and Solving Term Mismatch for Full-Text Retrieval

43

Predicted Recall Weighting: 10-25% gain (MAP)

Definition Importance Prediction Solution: Weighting

“*”: significantly better by sign & randomization tests

Datasets: train -> test

[Bar chart: MAP of Baseline LM desc vs. Recall (Necessity) LM desc on train -> test sets 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; most gains marked significant]

Page 44: Modeling and Solving Term Mismatch for Full-Text Retrieval

44

Predicted Recall Weighting: 10-20% gain (top Precision)

Definition Importance Prediction Solution: Weighting

[Bar chart: Prec@10 of Baseline LM desc vs. Recall (Necessity) LM desc on train -> test sets 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; several gains marked significant]

“*”: Prec@10 is significantly better.“!”: Prec@20 is significantly better.

Datasets: train -> test

Page 45: Modeling and Solving Term Mismatch for Full-Text Retrieval

45

• Definition

• Significance

• Mechanism

• Model and solution
• Term weighting solves the emphasis problem for long queries
• Mismatch problem?

Main Questions

Page 46: Modeling and Solving Term Mismatch for Full-Text Retrieval

46

Failure Analysis of 44 Topics from TREC 6-8

RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)

[Pie chart: Emphasis 64%, Mismatch 27%, Precision 9%. Emphasis failures → recall term weighting; mismatch failures → mismatch guided expansion; basis for both: term mismatch prediction]

Definition Importance Prediction Solution: Expansion


Page 47: Modeling and Solving Term Mismatch for Full-Text Retrieval

47

Recap: Term Mismatch

• Term mismatch ranges 30%-50% on average

• Relevance matching can degrade quickly for multi-word queries

• Solution: Fix every query term

Definition Importance Prediction Solution: Expansion

Page 48: Modeling and Solving Term Mismatch for Full-Text Retrieval

48

Conjunctive Normal Form (CNF) Expansion

E.g. Keyword query: placement of cigarette signs on television watched by children

Manual CNF: (placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND (cigarette OR cigar OR tobacco) AND (television OR TV OR cable OR network) AND (watch OR view) AND (children OR teen OR juvenile OR kid OR adolescent)

– Expressive & compact (1 CNF == 100s alternatives)– Used by lawyers, librarians and other expert searchers– But, tedious to create

Definition Importance Prediction Solution: Expansion
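A CNF query like the one above is just an AND of OR-groups, one group per query concept. The sketch below renders such a structure as a query string using the plain boolean syntax shown on the slide; mapping it onto a specific engine's operators is left open, and the function name is an assumption.

```python
def render_cnf(groups):
    """groups: list of lists, each inner list holding a query term and its alternatives."""
    clauses = []
    for alts in groups:
        clause = " OR ".join(alts)
        clauses.append(f"({clause})" if len(alts) > 1 else clause)
    return " AND ".join(clauses)

cnf = [["placement", "place", "promotion", "logo", "sign", "signage", "merchandise"],
       ["cigarette", "cigar", "tobacco"],
       ["television", "TV", "cable", "network"],
       ["watch", "view"],
       ["children", "teen", "juvenile", "kid", "adolescent"]]
print(render_cnf(cnf))   # reproduces the manual CNF shown above
```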

Page 49: Modeling and Solving Term Mismatch for Full-Text Retrieval

49

Diagnostic Intervention

• Diagnose term mismatch – terms that need help:
placement of cigarette signs on television watched by children

• Guided expansion intervention:
(placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent)

• Goal
– Least amount of user effort → near optimal performance
– E.g. expand 2 terms → 90% improvement (a sketch of this selection follows after this slide)

Definition Importance Prediction Solution: Expansion
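The diagnostic intervention above can be read as: rank the query terms by predicted P(t | R), let the user expand only the k lowest-recall terms, and keep the rest as plain keywords. The snippet below is a simplified illustration of that selection (not the exact experimental protocol); the recall and expansion values are toy inputs.

```python
def diagnostic_cnf(term_recalls, user_expansions, k=2):
    """Expand only the k query terms with the lowest predicted P(t | R)."""
    worst = set(sorted(term_recalls, key=term_recalls.get)[:k])
    clauses = []
    for term in term_recalls:
        alts = [term] + user_expansions.get(term, []) if term in worst else [term]
        clauses.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " AND ".join(clauses)

recalls = {"placement": 0.2, "cigarette": 0.8, "television": 0.9,
           "watch": 0.7, "children": 0.3}
expansions = {"placement": ["place", "promotion", "sign"],
              "children": ["teen", "juvenile", "kid"]}
print(diagnostic_cnf(recalls, expansions, k=2))
# (placement OR place OR promotion OR sign) AND cigarette AND television
# AND watch AND (children OR teen OR juvenile OR kid)
```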

Page 50: Modeling and Solving Term Mismatch for Full-Text Retrieval

50

Diagnostic Intervention (We Hope to)

[Pipeline: User → Keyword query (child AND cigar) → Diagnosis system (P(t | R) or idf) → Problem query terms (child > cigar) → User expansion → Expansion terms (child: teen) → Query formulation (CNF or Keyword): (child OR teen) AND cigar → Retrieval engine → Evaluation]

Definition Importance Prediction Solution: Expansion


Page 52: Modeling and Solving Term Mismatch for Full-Text Retrieval

52

We Ended up Using Simulation

[Pipeline: Expert user → Keyword query (child AND cigar) → Diagnosis system (P(t | R) or idf) → Problem query terms (child > cigar) → User expansion, simulated online from a full CNF built offline, e.g. (child OR teen) AND (cigar OR tobacco) → Expansion terms (child: teen) → Query formulation (CNF or Keyword): (child OR teen) AND cigar → Retrieval engine → Evaluation]

Definition Importance Prediction Solution: Expansion

Page 53: Modeling and Solving Term Mismatch for Full-Text Retrieval

53

Diagnostic Intervention Datasets

• Document sets– TREC 2007 Legal track, 7 million tobacco company– TREC 4 Ad hoc track, 0.5 million newswire

• CNF Queries– TREC 2007 by lawyers, TREC 4 by Univ. Waterloo

• Relevance Judgments– TREC 2007 sparse, TREC 4 dense

• Evaluation measures– TREC 2007 statAP, TREC 4 MAP

Definition Importance Prediction Solution: Expansion

Page 54: Modeling and Solving Term Mismatch for Full-Text Retrieval

54

P(t | R) vs. idf diagnosis

Results – Diagnosis

Diagnostic CNF expansion on TREC 4 and 2007

[Line chart: gain in retrieval (MAP) vs. # query terms selected (0, 1, 2, 3, 4, All), from 0% to 100%, comparing P(t | R) diagnosis vs. idf diagnosis on TREC 2007 and TREC 4]

Definition Importance Prediction Solution: Expansion

Page 55: Modeling and Solving Term Mismatch for Full-Text Retrieval

55

Results – Form of Expansion

CNF vs. bag-of-word expansion

[Line chart: retrieval performance (MAP) vs. # query terms selected (0, 1, 2, 3, 4, All), comparing CNF vs. bag-of-word expansion on TREC 4 and TREC 2007]

P(t | R) guided expansion on TREC 4 and 2007

Definition Importance Prediction Solution: Expansion

Page 56: Modeling and Solving Term Mismatch for Full-Text Retrieval

56

• Definition

• Significance

• Mechanism

• Model and solution• Term weighting for long queries• Term mismatch prediction diagnoses problem terms,

and produces simple & effective CNF queries

Main Questions

Page 57: Modeling and Solving Term Mismatch for Full-Text Retrieval

57

Efficient P(t | R) Prediction

• 3-10X speedup (close to simple keyword retrieval), while maintaining 70-90% of the gain

• Predict using P(t | R) values from similar, previously-seen queries

[SIGIR 2012]

Definition Importance Prediction: Efficiency Solution: Weighting
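A hedged sketch of the idea on this slide: keep a history of P(t | R) values observed for each term in past queries and use their average as the prediction for a new query, falling back to a default (or the feature-based model) for unseen terms. The class name and the fallback value are assumptions; the actual method in the SIGIR 2012 paper is more involved.

```python
class HistoricRecallPredictor:
    """Predict P(t | R) for a new query from values seen in past queries."""
    def __init__(self, default=0.6):
        self.history = {}       # term -> list of past P(t | R) values
        self.default = default  # fallback, roughly the average term recall reported earlier

    def observe(self, term, recall):
        self.history.setdefault(term, []).append(recall)

    def predict(self, term):
        past = self.history.get(term, [])
        return sum(past) / len(past) if past else self.default

pred = HistoricRecallPredictor()
pred.observe("term", 0.9831)      # from "Term limitations for US Congress members"
pred.observe("term", 0.6885)      # from "Insurance Coverage ... Long Term Care"
print(pred.predict("term"))       # 0.8358
print(pred.predict("viability"))  # unseen term, falls back to the default
```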

Page 58: Modeling and Solving Term Mismatch for Full-Text Retrieval

58

Structural Mismatch in QA

• Semantic role label (SRL) structure

• Example

Who got the cheese?

Rat sold the cheese to a mouse called Mole.

• Problem– Structure & term level mismatch (arg0 → arg2 of target2)

• Solution– Jointly model field translations, learnt from true QA pairs– Use Indri to query these alternative structures

• 20% gain in MAP vs. strict question structure

[Diagram: SRL structures; the question has [Target] with [arg0] and [arg1]; the answer has [Target1] with [arg0], [arg1], [arg2] and [Target2] with [arg1], [arg2]]

[CIKM 2009]

Definition Importance Prediction: Structure Solution: Expansion

[Speaker note: only show this slide at a place with significant NLP representation; it motivates the use of NLP in IR, which is hard because of mismatch.]
Page 59: Modeling and Solving Term Mismatch for Full-Text Retrieval

59

Contributions & Future Work

• Two long standing problems: mismatch & P(t | R)

• Definition and initial quantitative analysis of mismatch– Todo: analyses for new features and prediction methods

• Role of term mismatch in basic retrieval theory– Principled ways to solve term mismatch– Todo: more advanced: learning to rank, transfer learning

• Ways to automatically predict term mismatch– Initial modeling of causes of mismatch, features– Efficient prediction using historic information– Todo: better analysis & modeling of the causes

Definition Importance Prediction Solution

Page 60: Modeling and Solving Term Mismatch for Full-Text Retrieval

60

Contributions & Future Work

• Effectiveness of ad hoc retrieval– Term weighting & diagnostic expansion– Todo: better techniques: automatic CNF expansion,– Todo: better formalism: transfer learning, & more tasks

• Diagnostic intervention– Term level diagnosis guides targeted expansion– Todo: diagnose specific types of mismatch problems

or different problems (mismatch/emphasis/precision)• Guide NLP, Personalization, etc. to solve the real problem

– Todo: proactively identify search and other user needs

Definition Importance Prediction Solution

Page 61: Modeling and Solving Term Mismatch for Full-Text Retrieval

61

Le’s Other Work during CMU Years

• Building datasets and tools:– ClueWeb09 (dataset), REAP (corpus for the intelligent tutor)– Open source IR toolkit Lemur/Indri– Large Scale Computing, Hadoop tutorial & HWs

• Other research:– Structured documents, queries and models of retrieval– IR tasks: Legal e-Discovery, Bio/Medical/Chemical Patent retrieval– IR for Human Language Technology applications:

• QA (Javelin), Tutoring (REAP), KB (Read the Web), IE (Wei Xu@NYU)

Le Zhao ([email protected])

Page 62: Modeling and Solving Term Mismatch for Full-Text Retrieval

62

Page 63: Modeling and Solving Term Mismatch for Full-Text Retrieval

63

Acknowledgements• Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft

• Friends: Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Jin Young Kim, Chengtao Wen

– Discussions & references & feedback

• Reviewers

• David Fisher, Mark Hoy, David Pane– Maintaining the Lemur toolkit

• Andrea Bastoni and Lorenzo Clemente– Maintaining LSI code for Lemur toolkit

• SVM-light, Stanford parser

• TREC – All the data

• NSF Grants IIS-0707801, IIS-0534345, IIS-1018317

• Xiangmin Jin, and my whole family

Page 64: Modeling and Solving Term Mismatch for Full-Text Retrieval

64

The Vocab Mismatch Problem

• Example query: political third party viability

[Venn diagram: within All Documents, the relevant set R overlapped by the documents matching each of the terms political, third, party, viability]

Query term t: political, third, party, viability
% rel matching t: 0.7143, 0.5918, 0.9796, 0.0408

Page 65: Modeling and Solving Term Mismatch for Full-Text Retrieval

65

[Venn diagram: within All Documents, the relevant set R and the documents matching Term 1 and Term 2]

Page 66: Modeling and Solving Term Mismatch for Full-Text Retrieval

66

Definition of Mismatch P(t̄ | Rq)

Directly calculated given relevance judgments for q

[Venn diagram: within the Collection, the relevant set Relevant(q) and the docs that contain t]

Mismatch == 1 – term necessity == 1 – term recall

Term Necessity: P(t | Rq) = 0.4; Term Mismatch: P(t̄ | Rq) = 0.6

Not Concept Necessity, Not Necessity to good performance

Page 67: Modeling and Solving Term Mismatch for Full-Text Retrieval

67

Prior Definition of Mismatch

• Vocabulary mismatch (Furnas et al., 1987)
– How likely 2 people disagree in vocab choice
– Domain experts disagree 80-90% of the time
– Leads to Latent Semantic Indexing (Deerwester et al., 1988)
– Query independent: = Avg_q P(t̄ | Rq)
– Can be reduced to our query dependent definition of term mismatch

Page 68: Modeling and Solving Term Mismatch for Full-Text Retrieval

68

Knowledge: How Necessity Explains Behavior of IR Techniques

• Why weight query bigrams 0.1, while query unigrams 0.9?– Bigram decreases term recall, weight reflects recall

• Why are bigrams not gaining stable improvements?
– Term recall is more of a problem

• Why does using document structure (field, semantic annotation) not improve performance?
– It improves precision; structural mismatch needs to be solved first

• Word sense disambiguation– Enhances precision, instead, should use in mismatch modeling!

• Identify query term sense, for searchonym id, or learning across queries• Disambiguate collection term sense for more accurate replaceability

• Personalization– biases results to what a community/person likes to read (precision)– may work well in a mobile setting, short queries

Page 69: Modeling and Solving Term Mismatch for Full-Text Retrieval

69

Why Necessity? System Failure Analysis

• Reliable Information Access (RIA) workshop (2003)
– Failure analysis for 7 top research IR systems
• 11 groups of researchers (both academia & industry)
• 28 people directly involved in the analysis (senior & junior)
• >56 human*weeks (analysis + running experiments)
• 45 topics selected from 150 TREC 6-8 topics (difficult topics)
– Causes (necessity in various disguises)
• Emphasize 1 aspect, missing another aspect (14+2 topics)
• Emphasize 1 aspect, missing another term (7 topics)
• Missing either 1 of 2 aspects, need both (5 topics)
• Missing a difficult aspect that needs human help (7 topics)
• Need to expand a general term, e.g. “Europe” (4 topics)
• Precision problem, e.g. “euro”, not “euro-…” (4 topics)

Page 70: Modeling and Solving Term Mismatch for Full-Text Retrieval

70

Page 71: Modeling and Solving Term Mismatch for Full-Text Retrieval

71

Page 72: Modeling and Solving Term Mismatch for Full-Text Retrieval

72

Local LSI Top Similar Terms

Oil spills (query term: oil): spill 0.5828, oil 0.4210, tank 0.0986, crude 0.0972, water 0.0830
Insurance coverage which pays for long term care (query term: term): term 0.3310, long 0.2173, nurse 0.2114, care 0.1694, home 0.1268
Term limitations for US Congress members (query term: term): term 0.3339, limit 0.1696, ballot 0.1115, elect 0.1042, care 0.0997
Vitamin the cure of or cause for human ailments (query term: ail): ail 0.4415, health 0.0825, disease 0.0720, basler 0.0718, dr 0.0695

Page 73: Modeling and Solving Term Mismatch for Full-Text Retrieval

73

[Error plot of necessity predictions: necessity truth, predicted necessity, and prediction trend (3rd order polynomial fit); y-axis: probability, roughly -0.8 to 1.2]

Page 74: Modeling and Solving Term Mismatch for Full-Text Retrieval

74

Necessity vs. idf (and emphasis)

Page 75: Modeling and Solving Term Mismatch for Full-Text Retrieval

75

True Necessity Weighting

TREC 4 6 8 9 10 12 14

Document collection disk 2,3 disk 4,5 d4,5 w/o cr WT10g .GOV .GOV2

Topic numbers 201-250 301-350 401-450 451-500 501-550 TD1-50 751-800

LM desc – Baseline 0.1789 0.1586 0.1923 0.2145 0.1627 0.0239 0.1789

LM desc – Necessity 0.2703 0.2808 0.3057 0.2770 0.2216 0.0868 0.2674

Improvement 51.09% 77.05% 58.97% 29.14% 36.20% 261.7% 49.47%

p - randomization 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001

p - sign test 0.0000 0.0000 0.0000 0.0005 0.0000 0.0000 0.0002

Multinomial-abs 0.1988 0.2088 0.2345 0.2239 0.1653 0.0645 0.2150

Multinomial RM 0.2613 0.2660 0.2969 0.2590 0.2259 0.1219 0.2260

Okapi desc – Baseline 0.2055 0.1773 0.2183 0.1944 0.1591 0.0449 0.2058

Okapi desc – Necessity 0.2679 0.2786 0.2894 0.2387 0.2003 0.0776 0.2403

LM title – Baseline N/A 0.2362 0.2518 0.1890 0.1577 0.0964 0.2511

LM title – Necessity N/A 0.2514 0.2606 0.2058 0.2137 0.1042 0.2674

Page 76: Modeling and Solving Term Mismatch for Full-Text Retrieval

76

Predicted Necessity Weighting: 10-25% gain (necessity weight), 10-20% gain (top Precision)

TREC train sets: 3, 3-5, 3-7, 7
Test/x-validation: 4, 6, 8, 8
LM desc – Baseline: 0.1789, 0.1586, 0.1923, 0.1923
LM desc – Necessity: 0.2261, 0.1959, 0.2314, 0.2333
Improvement: 26.38%, 23.52%, 20.33%, 21.32%
P@10 Baseline: 0.4160, 0.2980, 0.3860, 0.3860
P@10 Necessity: 0.4940, 0.3420, 0.4220, 0.4380
P@20 Baseline: 0.3450, 0.2440, 0.3310, 0.3310
P@20 Necessity: 0.4180, 0.2900, 0.3540, 0.3610

Page 77: Modeling and Solving Term Mismatch for Full-Text Retrieval

77

Predicted Necessity Weighting (ctd.)

TREC train sets: 3-9, 9, 11, 13
Test/x-validation: 10, 10, 12, 14
LM desc – Baseline: 0.1627, 0.1627, 0.0239, 0.1789
LM desc – Necessity: 0.1813, 0.1810, 0.0597, 0.2233
Improvement: 11.43%, 11.25%, 149.8%, 24.82%
P@10 Baseline: 0.3180, 0.3180, 0.0200, 0.4720
P@10 Necessity: 0.3280, 0.3400, 0.0467, 0.5360
P@20 Baseline: 0.2400, 0.2400, 0.0211, 0.4460
P@20 Necessity: 0.2790, 0.2810, 0.0411, 0.5030

Page 78: Modeling and Solving Term Mismatch for Full-Text Retrieval

78

vs. Relevance Model

Weight Only ≈ Expansion; Supervised > Unsupervised (5-10%)

Relevance Model: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) )

[Scatter plot: RM expansion weight vs. term recall; x ~ y, w1 ~ P(t1|R), w2 ~ P(t2|R)]

Test/x-validation 4 6 8 8 10 10 12 14

LM desc – Baseline 0.1789 0.1586 0.1923 0.1923 0.1627 0.1627 0.0239 0.1789

Relevance Model desc 0.2423 0.1799 0.2352 0.2352 0.1888 0.1888 0.0221 0.1774

RM reweight-Only desc 0.2215 0.1705 0.2435 0.2435 0.1700 0.1700 0.0692 0.1945

RM reweight-Trained desc 0.2330 0.1921 0.2542 0.2563 0.1809 0.1793 0.0534 0.2258

Page 79: Modeling and Solving Term Mismatch for Full-Text Retrieval

79

Feature Correlation

f1 Centr f2 Syn f3 Repl f4 DepLeaf f5 idf RMw

0.3719 0.3758 -0.1872 0.1278 -0.1339 0.6296

Predicted Necessity: 0.7989 (TREC 4 test set)

Page 80: Modeling and Solving Term Mismatch for Full-Text Retrieval

80

vs. Relevance Model

1. Relevance Model: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) )

Page 81: Modeling and Solving Term Mismatch for Full-Text Retrieval

81

vs. Relevance Model

1. Relevance Model: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) ), with x ~ y, w1 ~ P(t1|R), w2 ~ P(t2|R)

2. RM Reweight-Only query terms

3. RM Reweight-Trained: RM weight ~ Necessity

[Scatter plot: RM weight vs. necessity, 0 to 1 on both axes]

Page 82: Modeling and Solving Term Mismatch for Full-Text Retrieval

82

vs. Relevance Model

[Bar charts (MAP, train -> test sets 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14): Baseline LM desc, Relevance Model desc, RM Reweight-Only desc, RM Reweight-Trained desc]

Weight Only ≈ Expansion. RM is unstable. Supervised > Unsupervised (5-10%).

Page 83: Modeling and Solving Term Mismatch for Full-Text Retrieval

83

Using Document Structure

• Stylistic: XML

• Syntactic/Semantic: POS, Semantic Role Label

• Current approaches– All precision oriented

• Need to solve mismatch first?

Page 84: Modeling and Solving Term Mismatch for Full-Text Retrieval

84

Motivation

• Search is important, an information portal

• Search is research worthy
– SIGIR, WWW, CIKM, ASIST, ECIR, AIRS, …

• Search is difficult
– Retrieval modeling difficulty >= sentence paraphrasing
– Studied since the 1970s, but still not fully understood; basic problems like mismatch remain
– Must adapt to the changing requirements of the mobile, social and semantic Web

• Modeling the user's needs

[Diagram: User → Query → Retrieval Model (over the Document Collections) → Results → User activities]

Page 85: Modeling and Solving Term Mismatch for Full-Text Retrieval

85

Online or Offline Study?

• Controlling confounding variables– Quality of expansion terms– User’s prior knowledge of the topic– Interaction form & effort

• Enrolling many users and repeating experiments

• Offline simulations can avoid all these and still make reasonable observations

Page 86: Modeling and Solving Term Mismatch for Full-Text Retrieval

86

Simulation Assumptions

• Real full CNFs to simulate partial expansions

• 3 assumptions about the user expansion process
– Expansion of individual terms is independent of each other
• A1: always the same set of expansion terms for a given query term, no matter which subset of query terms gets expanded
• A2: the same sequence of expansion terms, no matter …
– A3: the keyword query is re-constructed from the CNF query
• A procedure ensures the vocabulary is faithful to that of the original keyword description
• Highly effective CNF queries ensure a reasonable keyword baseline

Page 87: Modeling and Solving Term Mismatch for Full-Text Retrieval

87

Efficient P(t | R) Prediction (2)

• Causes of P(t | R) variation of the same term in different queries
– Different query semantics: Canada or Mexico vs. Canada
– Different word sense: bear (verb) vs. bear (noun)
– Different word use: Seasonal affective disorder syndrome (SADS) vs. Agoraphobia as a disorder
– Difference in association level with the topic

• Use historic occurrences to predict current– 70-90% of the total gain– 3-10X faster, close to simple keyword retrieval

Page 88: Modeling and Solving Term Mismatch for Full-Text Retrieval

88

Efficient P(t | R) Prediction (2)

• Low variation of same term in different queries

• Use historic occurrences to predict current– 3-10X faster, close to the slower method & real time

[Bar chart: MAP of Baseline LM desc, Necessity LM desc, and Efficient Prediction on train -> test sets 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; several differences marked significant]

Page 89: Modeling and Solving Term Mismatch for Full-Text Retrieval

89

Take Home Message for Ordinary Search Users (people and software)

Page 90: Modeling and Solving Term Mismatch for Full-Text Retrieval

90

Be mean! Is the term necessary for doc relevance? If not, remove, replace or expand.