Page 1

Analyzing Retrieval Models using Retrievability Measurement

Shariq Bashir

Supervisor: ao. Univ. Prof. Dr. Andreas Rauber

Institute of Software Engineering and Interactive Systems

Vienna University of Technology

[email protected]

http://www.ifs.tuwien.ac.at/~bashir/

Page 2

Outline

Introduction to Retrievability (Findability) Measure

Setup for Experiments

Findability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 3

Introduction

Retrieval Systems are used for searching information

Rely on retrieval models for ranking documents

How to select the best Retrieval Model?

Evaluate Retrieval Models

State of the Art
– Effectiveness Analysis, or
– Efficiency (Speed/Memory)

Page 4

Effectiveness Measures

Effectiveness measures (Precision, Recall, MAP) depend upon
– Few topics
– Few judged documents

Suitable for precision-oriented retrieval tasks

Less suitable for recall-oriented retrieval tasks (e.g., patent or legal retrieval)

Page 5

Findability Measure

Considers all documents

The goal is to maximize the findability of documents

Under a Retrieval Model with higher findability, documents are easier to find than under a Retrieval Model with lower findability

Applications
– Offers another measure for comparing Retrieval Models
– Identifies the subsets of documents that are hard or easy to find

Page 6

Findability Measure

Factors that affect Findability

1. User Query
– [Query = Data Mining books] vs. [Query = Han Kamber books]
• for searching the book “Data Mining Concepts and Techniques”

2. The maximum number of top links/docs checked

3. The ranking strategy of Retrieval Models

Page 7

Retrievability Measure

[Leif Azzopardi and Vishwa Vinay, CIKM 2008] Given a collection D of documents and a query set Q, the retrievability of a document d ∈ D is
r(d) = Σ_{q ∈ Q} f(k_dq, c)
– k_dq = rank of d in the result set of query q ∈ Q
– c = the point in the rank list where the user stops
– f(k_dq, c) = 1 if k_dq ≤ c, and 0 otherwise

Gini-Coefficient: summarizes the findability scores into a single retrieval-bias value (a sketch follows)
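To make the measure concrete, here is a minimal Python sketch (not taken from the thesis) of the cumulative retrievability score and of the Gini coefficient over those scores; `run_query` is a hypothetical stand-in for the retrieval model, returning a ranked list of document ids.

```python
from typing import Callable, Dict, Iterable, List

def retrievability(docs: Iterable[str], queries: Iterable[str],
                   run_query: Callable[[str], List[str]], c: int) -> Dict[str, int]:
    """r(d): number of queries that rank d within the top-c results."""
    r = {d: 0 for d in docs}
    for q in queries:
        for rank, d in enumerate(run_query(q), start=1):  # ranked result list
            if rank > c:                                   # user stops at cutoff c
                break
            if d in r:
                r[d] += 1                                  # f(k_dq, c) = 1
    return r

def gini(scores: List[float]) -> float:
    """Gini coefficient of the findability scores (0 = no bias, 1 = maximal bias)."""
    xs = sorted(scores)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)
```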

Page 8

Outline

Introduction to Findability Measure

Setup for Experiments

Retrievability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 9

Setup for Experiments

Collections
1. TREC Chemical Retrieval Track Collection 2009 (TREC-CRT)
2. USPTO Patent Collections
• USPC Class 433 (Dentistry) (DentPat)
• USPC Class 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing) (ChemAppPat)
3. Austrian News Dataset (ATNews)

TREC-CRT and ATNews are more skewed; the USPTO collections are less skewed

Page 10

Setup for Experiments

Retrieval Models
– Standard Retrieval Models
• TFIDF, NormTFIDF, BM25, SMART
– Language Models
• Jelinek-Mercer Smoothing (JM), Dirichlet Smoothing (DirS), Two-Stage Smoothing (TwoStage), Absolute Discounting Smoothing (AbsDis)

Query Generation
– All sections of patent documents
– Terms with document frequency (df) > 25% removed
– All 3- and 4-term combinations (a generation sketch follows)
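A minimal sketch of this query-generation step, under the assumption that `df` holds per-term document frequencies and `n_docs` the collection size (these names are hypothetical); note that the full combination set grows combinatorially, as the exhaustive setup implies.

```python
from itertools import combinations
from typing import Dict, List, Set

def generate_queries(doc_terms: List[str], df: Dict[str, int],
                     n_docs: int, max_df_ratio: float = 0.25) -> Set[str]:
    """All 3- and 4-term combinations of a document's terms, after dropping
    terms whose document frequency exceeds max_df_ratio of the collection."""
    terms = sorted({t for t in doc_terms if df.get(t, 0) / n_docs <= max_df_ratio})
    queries: Set[str] = set()
    for k in (3, 4):
        for combo in combinations(terms, k):   # C(|terms|, k) combinations
            queries.add(" ".join(combo))
    return queries
```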

Page 11

Setup for Experiments

(Figures: documents of TREC-CRT, ATNews, DentPat, and ChemAppPat ordered by increasing vocabulary size)

Page 12

Outline

Introduction to Retrievability Measure

Setup for Experiments

Findability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 13

Findability Scoring Functions

Standard Findability Scoring Function
– Does not consider the difference in documents' vocabulary size
– Biased towards long documents
– With r(d), Doc2 has higher Findability than Doc5
– But, due to its small vocabulary size, Doc5 cannot generate as large a query subset

All 3-term combinations — Findability percentage:
Doc2 = 3600/6545 = 0.55
Doc5 = 90/120 = 0.75

Page 14

Findability Scoring Functions

Normalize Findability
– Normalize r(d) relative to the number of queries generated from d
– This accounts for the difference in document lengths
– r^(d) = r(d) divided by the number of queries generated from d
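A minimal sketch of the normalization, assuming `r` holds the raw r(d) scores and `queries_per_doc` the number of queries generated from each document (hypothetical names):

```python
from typing import Dict

def normalized_retrievability(r: Dict[str, int],
                              queries_per_doc: Dict[str, int]) -> Dict[str, float]:
    """r^(d): divide r(d) by the number of queries generated from d,
    compensating for differences in document length / vocabulary size."""
    return {d: r[d] / queries_per_doc[d] if queries_per_doc.get(d) else 0.0
            for d in r}
```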

Page 15

Findability Scoring Functions

Comparison between r(d) and r^(d)
– Retrieval Models ordered by Gini-Coefficients (Retrieval Bias)
– Findability Ranks of Documents

Page 16

Findability Scoring Functions

Correlation between r(d) and r^(d) in terms of Gini-Coefficients (Retrieval Models ordered by r(d) and by r^(d))

TREC-CRT (c=100)
Retrieval Model   r(d)   |   Retrieval Model   r^(d)
BM25              0.48   |   DirS              0.69
TwoStage          0.49   |   AbsDis            0.69
DirS              0.51   |   JM                0.69
AbsDis            0.56   |   BM25              0.71
NormTFIDF         0.57   |   TwoStage          0.72
JM                0.59   |   NormTFIDF         0.72
TFIDF             0.78   |   TFIDF             0.94
SMART             0.92   |   SMART             0.95

ChemAppPat (c=10)
Retrieval Model   r(d)   |   Retrieval Model   r^(d)
BM25              0.33   |   JM                0.37
AbsDis            0.34   |   BM25              0.38
DirS              0.36   |   AbsDis            0.38
TwoStage          0.37   |   DirS              0.39
JM                0.39   |   TwoStage          0.42
NormTFIDF         0.40   |   NormTFIDF         0.42
TFIDF             0.47   |   TFIDF             0.56
SMART             0.85   |   SMART             0.56

Page 17

Findability Scoring Functions

Correlation between r(d) and r^(d) in terms of documents' Findability Ranks

– TREC-CRT and ATNews
• The correlation between r(d) and r^(d) is low (high difference)
• Due to the large difference between document lengths
– ChemAppPat and DentPat
• The correlation between r(d) and r^(d) is high (low difference)
• Due to the small difference between document lengths

Correlation between r(d) and r^(d)

Back

Page 18

Findability Scoring Functions

Which Findability Function is better, r(d) or r^(d)?
– From the Gini-Coefficient alone it is difficult to decide

Documents are ordered by findability score and then partitioned into 30 buckets
(Bucket 1, Bucket 2, …, Bucket 30)
Low Findability Buckets <---------------------------------> High Findability Buckets

From each bucket: 40 random docs (known items)
One query per document, of length 4–6 terms

The goal is to retrieve each known item using its own query
Effectiveness over the known items is measured through Mean Reciprocal Rank (MRR)

Low MRR Effectiveness <-------------Expected Results-------> High MRR Effectiveness
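A sketch of this known-item evaluation; `make_query` (building a 4–6 term query from a document) and `run_query` (returning a ranked list of document ids) are assumed helpers, not part of the original setup.

```python
import random
from typing import Callable, Dict, List, Sequence

def bucket_mrr(docs_by_findability: Sequence[str], n_buckets: int,
               sample_size: int, make_query: Callable[[str], str],
               run_query: Callable[[str], List[str]], seed: int = 0) -> Dict[int, float]:
    """Partition findability-ordered docs into buckets, sample known items,
    and compute each bucket's Mean Reciprocal Rank for the items' own queries."""
    rng = random.Random(seed)
    bucket_size = len(docs_by_findability) // n_buckets
    mrr: Dict[int, float] = {}
    for b in range(n_buckets):
        bucket = list(docs_by_findability[b * bucket_size:(b + 1) * bucket_size])
        sample = rng.sample(bucket, min(sample_size, len(bucket)))
        reciprocal_ranks = []
        for d in sample:
            ranking = run_query(make_query(d))   # query built from d itself
            reciprocal_ranks.append(1.0 / (ranking.index(d) + 1) if d in ranking else 0.0)
        mrr[b] = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
    return mrr
```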

Page 19

Retrievability Scoring Functions

Which Findability Function is better, r(d) or r^(d)?
– Expected Results
• High findability buckets should have high effectiveness, since their documents are easier to find than those in low findability buckets
• Positive correlation with MRR

– The r^(d) buckets show a stronger positive correlation with MRR than the r(d) buckets

(Figures: correlation between Findability and MRR on TREC-CRT and ChemAppPat)

Page 20

Outline

Introduction to Findability Measure

Setup for Experiments

Findability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 21

Query Characteristics and Findability

Current Findability Analysis Style: Q = Query Set → Findability Scores of Documents → Gini-Coefficients

Queries do not have similar quality
Some queries are more specific (target oriented) than others
What is the effect of query quality on Findability?
Need to analyze Findability with different query-quality subsets

Creating Query Quality Subsets
– Supervised quality labels: we do not have supervised labels
– Query Characteristics (QC):
• Query Result List Size
• Query Term Frequencies in the Documents
• Query Quality based on Query Performance Prediction Methods

For each QC, the large query set is partitioned into 50 subsets


Page 22

Query Characteristics and Findability

Query Subsets with Query Quality

Q = Query Set

Query Quality is predicted with the Simplified Clarity Score (SCS) [He & Ounis, SPIRE 2004]

Q ordered by SCS score

And Partitioned into 50 Subsets

Query Subset 1 = Findability Analysis

Query Subset 2 = Findability Analysis

Query Subset 50 = Findability Analysis

(Figure: TREC-CRT collection)

• X-Axis = Query Subsets ordered by Low SCS score to High SCS score

• Y-Axis = Gini-Coefficients

• Low SCS scores Subsets = High Gini-Coefficients

• High SCS scores Subsets = Low Gini-Coefficients
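For reference, a small sketch of the Simplified Clarity Score as it is commonly formulated (query-term probability against collection probability); this follows the He & Ounis definition as I understand it and is not taken from the slides.

```python
import math
from collections import Counter
from typing import Dict, List

def simplified_clarity_score(query_terms: List[str],
                             collection_tf: Dict[str, int],
                             collection_tokens: int) -> float:
    """SCS(q) = sum_t P(t|q) * log2(P(t|q) / P(t|C)), where P(t|q) is the term's
    relative frequency in the query and P(t|C) its relative frequency in the
    collection. Terms absent from the collection are skipped in this sketch."""
    qtf = Counter(query_terms)
    scs = 0.0
    for t, f in qtf.items():
        cf = collection_tf.get(t, 0)
        if cf == 0:
            continue
        p_q = f / len(query_terms)
        p_c = cf / collection_tokens
        scs += p_q * math.log2(p_q / p_c)
    return scs
```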


Page 23

Outline

Introduction to Findability Measure

Setup for Experiments

Retrievability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 24

Document Features and Findability

Findability analysis requires query processing over an exhaustive query set: large processing time and large computation resources

Relationship between Document Features and Findability Scores
– Can we predict Findability without processing an exhaustive set of queries?
– Does not require heavy processing
– Only predicts Findability Ranks
– Cannot predict Gini-Coefficients

Page 25

Document Features and Findability

The following three classes of document features are considered

– Surface Level Features
• Based on term frequencies within documents and term document frequencies within the collection
– Features based on Term Weights
• Based on the term weighting strategy of the retrieval model
– Density around Nearest Neighbors of Documents
• Based on the density around the nearest neighbors of documents

Page 26

Document Features and Findability

Surface Level Features

#   Feature           Description
1   NATF              Average of the normalized term frequencies of a document
2   freq              Frequency of the highly frequent terms of a document (tf_t,d / |d| > 0.03)
3   NATF_freq         NATF computed only over the frequent terms of the (freq) feature
4   GC_terms          Term-frequency inequality between the terms of a document
5   freq_GC           Number of terms of a document that raise the GC_terms score above 0.25
6   ADF               Average document frequency of a document's terms
7   freq_low_df       Frequency of terms of a document whose document frequency is below 5% of the collection size
8   ADF_freq          ADF computed only over the freq_low_df terms
9   Document Length   Document length
10  Vocabulary Size   Total number of unique terms

Page 27

Document Features and Findability

(Figures: TREC-CRT, ChemAppPat)

Page 28

Document Features and Findability

Combining Multiple Features
– No single feature performs best for all collections and all retrieval models
– Worth analyzing to what extent combining multiple features increases the correlation
– Regression Tree, 50%/50% training/testing split (see the sketch below)

(Table: correlation when combining multiple features vs. correlation of the best single feature, and the % increase in correlation)
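A hedged sketch of the feature-combination step, using scikit-learn's DecisionTreeRegressor as a stand-in for the regression tree and Spearman rank correlation (the slides do not specify the correlation type) over a 50%/50% split:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def combined_feature_correlation(X: np.ndarray, findability: np.ndarray,
                                 seed: int = 0) -> float:
    """X: one row of document features per document; findability: the measured
    r^(d) scores. Train on half the documents, predict the other half, and
    return the rank correlation between predicted and measured findability."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, findability,
                                              test_size=0.5, random_state=seed)
    model = DecisionTreeRegressor(random_state=seed).fit(X_tr, y_tr)
    corr, _ = spearmanr(model.predict(X_te), y_te)
    return corr
```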

Page 29

Outline

Introduction to Findability Measure

Setup for Experiments

Findability Scoring Functions

Relationship between Findability and Query Characteristics

Relationship between Findability and Document Features

Relationship between Findability and Effectiveness Measures

Page 30

Relationship between Findability and Effectiveness

Findability Measure
– Goal: maximizing Findability
– Does not need Relevance Judgments

Effectiveness Measures (Recall, Precision, MAP)
– Goal: maximizing Effectiveness
– Depends upon Relevance Judgments

Does any relationship exist between the two?
If a relationship exists: maximizing Findability -> maximizing Effectiveness
– Automatic ranking of Retrieval Models
– Tuning/increasing Retrieval Model effectiveness on the basis of the Findability Measure

Page 31

Relationship between Findability and Effectiveness

Retrieval Models
– Standard Retrieval Models and Language Models
– Low-level IR features (tf, idf, doc length, vocabulary size, collection frequency)
– Term-proximity-based Retrieval Models (features listed below)

#  Feature                Description
1  f1 (SumMinDist)        Sum of the minimum distances of all query term pairs
2  f2 (SumMaxDist)        Sum of the maximum distances of all query term pairs
3  f3 (AvgDist)           Average of the sum of all query term pair distances in the document
4  f4 (MinDistCount)      Frequency of query term pairs with a minimum distance of less than 4 terms
5  f5 (AvgPairDist)       Similar to f3; the average of the sum of distances between all query term pairs and all single query terms
6  f6 (CoOccurrence)      Frequency of co-occurrence of query term pairs within a window of less than 4 terms
7  f7 (PairCoOccurrence)  Frequency of co-occurrence of query term pairs with single query terms within a window of less than 10 terms
8  f8 (MinCover)          Shortest text segment in the document that covers all query terms at least once
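As an illustrative sketch (not the authors' implementation), two of these features, f1 (SumMinDist) and f8 (MinCover), could be computed over a tokenized document as follows:

```python
from itertools import combinations
from typing import Dict, List

def _positions(doc_tokens: List[str], query_terms: List[str]) -> Dict[str, List[int]]:
    """Token positions of each query term in the document."""
    pos: Dict[str, List[int]] = {t: [] for t in set(query_terms)}
    for i, tok in enumerate(doc_tokens):
        if tok in pos:
            pos[tok].append(i)
    return pos

def sum_min_dist(doc_tokens: List[str], query_terms: List[str]) -> int:
    """f1 (SumMinDist): sum over all query term pairs of their minimum distance."""
    pos = _positions(doc_tokens, query_terms)
    total = 0
    for a, b in combinations(sorted(pos), 2):
        if pos[a] and pos[b]:
            total += min(abs(i - j) for i in pos[a] for j in pos[b])
    return total

def min_cover(doc_tokens: List[str], query_terms: List[str]) -> int:
    """f8 (MinCover): length of the shortest window covering every query term
    that occurs in the document at least once (sliding-window scan)."""
    needed = set(query_terms) & set(doc_tokens)
    if not needed:
        return 0
    best, counts, left = len(doc_tokens) + 1, {}, 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        while len(counts) == len(needed):
            best = min(best, right - left + 1)
            t = doc_tokens[left]
            if t in counts:
                counts[t] -= 1
                if counts[t] == 0:
                    del counts[t]
            left += 1
    return best
```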

Page 32

#    Feature             Gini-Coefficient (c=100)   |   #    Feature             Recall@100
1    PairCoOccurrence    0.39                       |   1    JM                  0.184
2    MinDistCount        0.45                       |   2    DirS                0.177
3    CoOccurrence        0.49                       |   3    TwoStage            0.174
4    BM25                0.52                       |   4    AbsDis              0.170
5    SumMinDist          0.55                       |   5    MinDistCount        0.156
6    TwoStage            0.56                       |   6    BM25                0.156
7    DirS                0.57                       |   7    CoOccurrence        0.147
8    MinCover            0.58                       |   8    PairCoOccurrence    0.139
9    AbsDis              0.60                       |   9    AvgPairDist         0.134
10   JM                  0.62                       |   10   MinCover            0.130
11   NormTFIDF           0.62                       |   11   SumMinDist          0.126
12   ntf(d,q)            0.63                       |   12   ntf(d,q)            0.107
13   AvgPairDist         0.66                       |   13   SumMaxDist          0.107
14   AvgDist             0.67                       |   14   AvgDist             0.106
15   SumMaxDist          0.68                       |   15   NormTFIDF           0.082
16   |d|                 0.74                       |   16   SMART               0.074
17   sdf(d,q)            0.85                       |   17   sdf(d,q)            0.042
18   scf(d,q)            0.85                       |   18   tf(d,q)             0.016
19   TFIDF               0.91                       |   19   TFIDF               0.008
20   tf(d,q)             0.92                       |   20   scf(d,q)            0.002
21   SMART               0.93                       |   21   |d|                 0.001
22   Td                  0.99                       |   22   Td                  0.001

Page 33

Relationship between Findability and Effectiveness

Correlation = 0.80, 0.75, 0.80, 0.73

A correlation exists: it is not perfect, but retrieval models with low retrieval bias consistently appear in at least the top half of the effectiveness ranking

Page 34

Relationship between Findability and Effectiveness

Tuning Parameter Values over Findability
– Retrieval Models contain parameters
– These control query term normalization or smooth the document relevance score for unseen query terms
– We tune the parameter values over findability
– And examine the effect on the Gini-Coefficient and on Recall/Precision/MAP (a sketch of such a sweep follows)
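A minimal sketch of such a tuning loop for BM25's length-normalization parameter b; `gini_for_b` is an assumed helper (not part of the original setup) that re-scores the collection with the given b, runs the query set, and returns the Gini coefficient of the findability scores.

```python
from typing import Callable, Iterable, Tuple

def tune_b_over_findability(b_values: Iterable[float],
                            gini_for_b: Callable[[float], float]) -> Tuple[float, float]:
    """Pick the b value that minimizes retrieval bias (the Gini coefficient of
    the findability scores); Recall/Precision/MAP are then inspected separately."""
    best_b, best_gini = None, float("inf")
    for b in b_values:
        g = gini_for_b(b)                 # e.g., retrievability() + gini() from Page 7
        if g < best_gini:
            best_b, best_gini = b, g
    return best_b, best_gini

# usage sketch: b swept between 0 and 1, as on the next slide
# best_b, best_gini = tune_b_over_findability([i / 10 for i in range(11)], gini_for_b)
```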

Page 35

Relationship between Findability and Effectiveness

Parameter b values are varied between 0 and 1

Page 36

Relationship between Findability and Effectiveness

For JM, the smoothing parameter values are varied between 0 and 1

Page 37

Relationship between Findability and Effectiveness

Evolving Retrieval Models using Genetic Programming and Findability
– Genetic Programming is a branch of soft computing
– Helps to solve problems with exhaustive search spaces

Genetic Programming loop:
Retrieval Features → randomly combine IR features (initial population) → select the best Retrieval Models (Findability Measure as fitness) → Recombination (Crossover, Mutation) → next generation → repeat until 100 generations are complete

Page 38

Relationship between Findability and Effectiveness

Evolving Retrieval Model using Genetic Programming and Findability

– Solutions (Retrieval Models) are represented as tree structures
– Tree nodes are either operators (+, /, *) or ranking features
– Ranking Features
• Low-level retrieval features
• Term-proximity-based retrieval features
• Constant values (0.1 to 1)
– 100 generations are evolved with 50 solutions per generation (a compact sketch follows)
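A compact, hypothetical sketch of this evolutionary loop (mutation omitted for brevity; `FEATURES` and `gini_of` are assumed placeholders, and the real system's tree handling is certainly richer):

```python
import random
from typing import Callable, List, Union

Tree = Union[str, list]  # leaf = feature/constant name, node = [op, left, right]
OPS = ["+", "*", "/"]
FEATURES = ["tf", "idf", "doc_len", "SumMinDist", "CoOccurrence", "0.5"]  # assumed

def random_tree(depth: int) -> Tree:
    """Grow a random ranking-formula tree of operators and ranking features."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES)
    return [random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1)]

def crossover(a: Tree, b: Tree) -> Tree:
    """Tiny crossover: replace one child of a with a subtree of b."""
    if not isinstance(a, list):
        return b
    child = [a[0], a[1], a[2]]
    child[random.choice([1, 2])] = random.choice(b[1:]) if isinstance(b, list) else b
    return child

def evolve(gini_of: Callable[[Tree], float], generations: int = 100,
           pop_size: int = 50) -> Tree:
    """Keep the half of each generation with the lowest Gini (least retrieval
    bias) as survivors and refill the population by crossover."""
    pop = [random_tree(3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=gini_of)                       # lower Gini = fitter
        survivors = pop[: pop_size // 2]
        children = [crossover(random.choice(survivors), random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=gini_of)
```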

Page 39

Relationship between Findability and Effectiveness

Evolving Retrieval Model using Genetic Programming and Findability

– Two correlation analyses are tested

– (1) Relationship between Findability and Effectiveness on the basis of fittest individual of each generation

– (2) Relationship between Findability and Effectiveness on the basis of average fitness of each generation

Page 40

Relationship between Findability and Effectiveness

Evolving Retrieval Models using Genetic Programming and Findability
– (First): Relationship between Findability and Effectiveness on the basis of the fittest individual of each generation

Page 41

Relationship between Findability and Effectiveness

Evolving Retrieval Models using Genetic Programming and Findability
– (Second): Relationship between Findability and Effectiveness on the basis of the average fitness of each generation
– Generations with a low average Gini-Coefficient also have high effectiveness in terms of Recall@100

Page 42

Conclusions

Findability considers all documents, not a small set of judged documents

We propose a normalized findability scoring function that produces better findability ranks of documents

Analysis of findability and query characteristics
– Different ranges of query characteristics have different retrieval bias

Analysis of findability and document features
– Suitable for predicting document findability ranks

Relationship between findability and effectiveness
– Findability can be used for automatic ranking of retrieval models
– And to fine-tune IR systems in an unsupervised manner

Page 43

Future Work

Query Popularity and Findability
– We are not differentiating between popular and unpopular queries

Visualizing Findability
– Documents that are highly findable with one model
– Documents that are highly findable with multiple models
– Documents that are not findable with any model

Effect of Retrieval Bias in k-Nearest Neighbor classification
– Highly findable samples also affect the classification voting in k-NN

Page 44

Thank You

Page 45

Page 46

Gini-Coefficient

Gini-Coefficient calculates retrievability inequality between documents.

Also represents retrieval bias and provides a bird's-eye view.
If G = 0 there is no bias; if G = 1 only one document is findable and all other documents have r(d) = 0.

Document           r(d) with RS1   r(d) with RS2
D1                 2               9
D2                 0               7
D3                 6               12
D4                 5               14
D5                 34              18
D6                 4               11
D7                 39              19
Gini-Coefficient   0.58            0.18
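As a quick check of the table, the same numbers can be reproduced with the standard Gini formula over the sorted scores (mirroring the sketch on Page 7):

```python
def gini(xs):
    """G = sum_i (2i - n - 1) * x_(i) / (n * sum(x)), with x sorted ascending."""
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)

print(round(gini([2, 0, 6, 5, 34, 4, 39]), 2))    # RS1 -> 0.58
print(round(gini([9, 7, 12, 14, 18, 11, 19]), 2))  # RS2 -> 0.18
```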

Back

Page 47

Findability Scoring Functions

(Figures: ChemAppPat, TREC-CRT)

Page 48

Document Features and Retrievability

Features based on Term Weights
– Document terms are weighted by the retrieval model
– The terms are then added into inverted lists
– Term weights in the inverted lists are sorted by decreasing score

#  Feature            Description
1  ATW                Average of the term weights of a document
2  ATRP               Average of the term rank positions in the inverted lists
3  VTRP               Variance of the term rank positions in the inverted lists
4  DiffMedianWeights  Average difference between a document's term weights and the median term weight
5  LowRankRatio       How many terms of a document appear in the top 200 rank positions of the sorted inverted lists

Page 49

Document Features and Retrievability

On highly skewed collections, these features correlate well with findability.
On less skewed collections they do not; this may be because, in less skewed collections, the term weights of documents are less extreme due to the nearly similar document lengths.

(Figures: TREC-CRT, ChemAppPat)

Back

Page 50

Document Features and Retrievability

Document Density based Features
– These features are based on the average density of the k-nearest neighbors of a document
– k is set to 50, 100, and 150
– Density is computed both with all terms of a document and with its top 40 (highest-frequency) terms

#  Feature                         Description
1  AvgDensity (k=50)               Average density with the 50 nearest neighbors
2  AvgDensity (k=100)              Average density with the 100 nearest neighbors
3  AvgDensity (k=150)              Average density with the 150 nearest neighbors
4  AvgDensity-Top40Terms (k=50)    Average density with the 50 nearest neighbors, using the top 40 terms
5  AvgDensity-Top40Terms (k=100)   Average density with the 100 nearest neighbors, using the top 40 terms
6  AvgDensity-Top40Terms (k=150)   Average density with the 150 nearest neighbors, using the top 40 terms

Page 51

Query Expansion and Retrievability

Query Expansion methods are investigated for improving retrievability.
Terms for expansion are identified via Pseudo-Relevance Feedback (PRF).
Baseline results are promising (PRF selection with the top-N docs).
We further propose two PRF selection approaches
– Based on Document Clustering
– Based on the similarity of the retrieved documents with the Query Patent

(Diagram) q = Query → Process Query → Ranked List (D9, D4, D3, D5, D11, D2, D1, D14)
Pseudo-Relevance Feedback: extract expansion terms (E) from {D9, D4, D3, D5}
q ∪ E → Query Expansion → Process Query again
PRF selection strategies: 1. Top-N, 2. Document Clustering, 3. Query Patent Similarity

Page 52

Query Expansion and Retrievability

Baseline Approaches
– Both approaches rely on the top documents of the query result list for PRF
– Query Expansion based on Language Modeling
• Expansion terms are ranked by the sum of divergences between the documents they occur in and the importance of the terms in the whole collection
– Query Expansion based on Kullback-Leibler Divergence
• Expansion terms are ranked by the relative rareness of terms in the PRF set as opposed to the whole collection (a scoring sketch follows)
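For the Kullback-Leibler baseline, a hedged sketch of the commonly used term-scoring formulation (the exact variant used in the thesis may differ): score(t) = P(t|PRF) · log(P(t|PRF) / P(t|C)), favouring terms frequent in the PRF set but rare in the collection.

```python
import math
from collections import Counter
from typing import Dict, List

def kld_expansion_terms(prf_docs: List[List[str]],
                        collection_tf: Dict[str, int],
                        collection_tokens: int, n_terms: int = 10) -> List[str]:
    """Rank candidate expansion terms by P(t|PRF) * log(P(t|PRF) / P(t|C))."""
    prf_tf = Counter(t for doc in prf_docs for t in doc)
    prf_tokens = sum(prf_tf.values())
    scores = {}
    for t, f in prf_tf.items():
        p_prf = f / prf_tokens
        p_coll = max(collection_tf.get(t, 0), 1) / collection_tokens  # floor avoids log(inf)
        scores[t] = p_prf * math.log(p_prf / p_coll)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_terms]
```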

Page 53

Query Expansion and Retrievability

(Figures: TREC-CRT, ChemAppPat)

Page 54

Query Expansion and Retrievability

(Diagram) PRF selection via Document Clustering:
q = Query → Process Query → Ranked List (D9, D4, D3, D5, D11, D2, D1, D14)
Clustering with the top-N docs — Doc (cluster members):
D9 ( ), D4 (D9, D3), D3 (D9, D4, D5), D5 (D9, D3), D11 (D3, D5), D2 (D9, D4, D3, D5), D1 (D9, D4, D3, D5), D14 (D4)
Sort docs on cluster size: D2, D1, D3, D4, D5, D11, D14, D9
Pseudo-Relevance Feedback: extract expansion terms (E) from {D2, D1, D3, D4}
q ∪ E → Query Expansion → Process Query again

Page 55

Query Expansion and Retrievability

(Diagram repeated from Page 54: ranked list clustered with the top-N docs and sorted by cluster size)

Constructing Clusters
– Offline cluster construction, to avoid large processing time
– Each document forms its own cluster with other documents using k-Nearest Neighbors

Page 56

Query Expansion and Retrievability

PRF Selection via Query Patent Similarity
– In prior-art search, patent examiners usually extract query terms from a given query patent
– Due to the complex structure of patent documents, finding relevant keywords is always a difficult problem
– Missing terms are another problem
– Query Expansion can help to overcome this problem
– Query Expansion depends on the PRF documents
– PRF documents are ranked on the basis of their similarity to the Query Patent

Page 57

Query Expansion and Retrievability

(Diagram) PRF selection via Query Patent Similarity:
q = Query → Process Query → Ranked List (D9, D4, D3, D5, D11, D2, D1, D14)
Rank the retrieved docs based on their similarity with the Query Patent: D1, D3, D5, D2, D11, D9, D14, D4
Pseudo-Relevance Feedback: extract expansion terms (E) from {D1, D3, D5, D2}
q ∪ E → Query Expansion → Process Query again

Page 58

Query Expansion and Retrievability

The full Query Patent can be used for ranking PRF documents.
– A full Query Patent may contain thousands of terms, which could be distributed over documents not relevant to the query.

How to identify the best terms from the query patent?
We try to separate the good terms from the bad terms using term classification.

(Diagram repeated from Page 57: retrieved docs ranked by similarity with the Query Patent)

Page 59

Query Expansion and Retrievability

Training Dataset for Term Classification

– 30 random prior-art (PA) topics from TREC-CRT
– From each of the 30 PA topics, short queries of length 4 (based on high TF) are used as search queries
– Baseline score: for each query, PRF documents are ranked according to the query relevance scores
• The resulting effectiveness scores are used as the baseline score

Page 60

Query Expansion and Retrievability

Identifying good, neutral, and bad terms (diagram):
– Baseline: for each training query q, Process Query → Pseudo-Relevance Feedback → Query Expansion (q ∪ E) → Process Query → Baseline Score
– For each unique term T of the Query Patent (T1, T2, T3, T4, …, Tn): process q ∪ T → Pseudo-Relevance Feedback → Query Expansion → Process Query → (q ∪ T) Score
– If (q ∪ T) Score > Baseline Score, then T = good term
– If (q ∪ T) Score = Baseline Score, then T = neutral term
– If (q ∪ T) Score < Baseline Score, then T = bad term

Page 61

Query Expansion and Retrievability

PRF Selection via Query Patent Similarity
– Terms are classified (predicted) using term features
– Features are extracted from the Query Patents, based on the proximity distribution of the expansion term (T) and the query terms
– J48 is used for classification
– The overall accuracy on positively classified samples is 83%

Page 62

Query Expansion and Retrievability

PRF Selection via Query Patent Similarity
– Results on the TREC-CRT collection
– CCGen = PRF selection using Clustering
– QP-TS = PRF selection using Query Patent Similarity

Page 63

Query Expansion and Retrievability

PRF Selection via Query Patent Similarity
– Results on the TREC-CRT collection
– CCGen = PRF selection using Clustering
– QP-TS = PRF selection using Query Patent Similarity

Page 64

Document Features and Retrievability

The positive correlation indicates that low-retrievable docs mostly lie in low-density areas.
Their nearest neighbors have mostly similar term weights, which makes them low retrievable.

On highly skewed collections, these features correlate well; on less skewed collections they do not. This may be because, in less skewed collections, the term weights of documents are less extreme due to the similar document lengths.

(Figures: TREC-CRT, ChemAppPat)

Back