Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
AIRS2005
OUTLINE
• Introduction
• WIDIT in TREC Web Track
• Results & Discussion
Introduction
Web IR Challenges
• Size, heterogeneity, quality of data
• Diversity of user tasks, interests, characteristics
Web IR Opportunities
• Diverse sources of evidence
• Data abundance
WIDIT Approach to Web IR
• Leverage multiple sources of evidence
• Utilize multiple methods
• Apply fusion
Research Questions
• What to combine?
• How to combine?
WIDIT in Web Track 2004
Data
• Documents: 1.25 million .gov Web pages (18 GB)
• Topics (i.e. queries): 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
Task
• Retrieve relevant documents given a mixed set of query types (QT)
Main Strategy
• Fusion of multiple data representations: Static Tuning (QT-independent)
• Fusion of multiple sources of evidence: Dynamic Tuning (QT-specific)
WIDIT: Web IR System Architecture
[Architecture diagram: Documents feed the Indexing Module, which builds sub-indexes (Body Index, Anchor Index, Header Index); Topics become queries (simple and expanded). The Retrieval Module searches the sub-indexes with each query form; the Fusion Module combines the search results (Static Tuning) into a fusion result; the Re-ranking Module (Dynamic Tuning), guided by query types from the Query Classification Module, produces the final result.]
WIDIT: Indexing Module
Document Indexing
1. Strip HTML tags
• extract title, meta keywords & description, emphasized words
• parse out hyperlinks (URL & anchor texts)
2. Create surrogate documents
• anchor texts of inlinks
• header texts (title, meta text, emphasized text)
3. Create subcollection indexes
• stop & stem (Simple, Combo stemmer)
• compute SMART & Okapi term weights
4. Compute whole-collection term statistics
Query Indexing
• stop & stem
• identify nouns, phrases
• expand acronyms
• mine synonyms and definitions from Web search
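The field extraction in step 1 can be sketched with Python's standard html.parser. This is an illustrative sketch, not WIDIT's actual code; the class and helper names are invented for the example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "br", "img", "link", "hr", "input", "base"}
EMPH_TAGS = {"b", "i", "em", "strong", "h1", "h2", "h3"}

class PageParser(HTMLParser):
    """Extracts the fields used to build surrogate documents: title,
    meta keywords/description, emphasized words, and hyperlinks with
    their anchor texts (illustrative sketch, not WIDIT's parser)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = []        # meta keywords & description content
        self.emphasized = []  # emphasized words (bold, italic, headings)
        self.links = []       # {"url": ..., "anchor": ...} per hyperlink
        self._stack = []      # open-tag stack for context

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() in ("keywords", "description"):
            self.meta.append(attrs.get("content", "") or "")
        elif tag == "a" and attrs.get("href"):
            self.links.append({"url": attrs["href"], "anchor": ""})
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.title += data
        elif top in EMPH_TAGS:
            self.emphasized.append(data.strip())
        elif "a" in self._stack and self.links:
            self.links[-1]["anchor"] += data

def header_text(page):
    """Surrogate 'header' document: title + meta text + emphasized text."""
    return " ".join([page.title] + page.meta + page.emphasized)
```

The extracted anchor texts would be aggregated per link target to build the anchor-text surrogate documents.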
WIDIT: Retrieval Module
1. Parallel searching
• Multiple document indexes
o body text (title, body)
o anchor text (title, inlink anchor text)
o header text (title, meta keywords & description, first heading, emphasized words)
• Multiple query formulations
o stemming (Simple, Combo)
o expanded query (acronym, noun)
• Multiple subcollections
o for search speed and scalability
o search each subcollection using whole-collection term statistics
2. Merge subcollection search results
• merge & sort by document score
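Step 2's merge is straightforward once subcollection scores are comparable, which they are because every subcollection is scored with whole-collection term statistics. A minimal Python sketch, with an illustrative function name:

```python
import heapq
from itertools import islice

def merge_subcollection_results(result_sets, k=1000):
    """Merge ranked lists from independently searched subcollections.

    Because each subcollection is scored with whole-collection term
    statistics, scores are directly comparable across subcollections,
    so the merge is a simple sort by document score.

    Each result set is a list of (doc_id, score) pairs, already sorted
    by descending score; the top k fused pairs are returned."""
    merged = heapq.merge(*result_sets, key=lambda pair: pair[1], reverse=True)
    return list(islice(merged, k))
```

heapq.merge exploits the fact that each input list is already sorted, so the merge is linear in the number of results rather than requiring a full re-sort.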
WIDIT: Fusion Module
Fusion Formula: Weighted Sum
• FS_ws = Σ_i (w_i · NS_i)
where:
w_i = weight of system i (relative contribution of each system)
NS_i = normalized score of a document by system i = (S_i − S_min) / (S_max − S_min)
Select candidate systems to combine
• Top performers in each category (e.g. best stemmer, query expansion, document index)
• Diverse systems (e.g. content-based, link-based)
• One-time brute-force combinations to validate the complementary-strength effect
Determine system weights (w_i): Static Tuning
• Evaluate fusion formulas over a fixed set of weight values (e.g. 0.1..1.0) on training data
• Select the formulas with the best performance
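The weighted-sum formula with min-max normalization can be sketched as follows. Function names are illustrative, and the convention that a document unretrieved by a system contributes 0 for that system is an assumption.

```python
def min_max_normalize(scores):
    """NS_i = (S_i - S_min) / (S_max - S_min), per system."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a constant-score result list
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_sum_fusion(systems, weights):
    """FS_ws = sum_i w_i * NS_i over every document retrieved by any system.

    systems: system name -> {doc_id: raw score}
    weights: system name -> w_i (relative contribution of that system)
    Documents missing from a system contribute 0 for it (an assumption)."""
    fused = {}
    for name, scores in systems.items():
        w = weights[name]
        for doc, ns in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: -kv[1])
```

Min-max normalization matters here because SMART and Okapi scores live on different scales; without it the weights w_i would not reflect each system's true contribution.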
WIDIT: Query Classification Module
Statistical Classification (SC)
• Classifiers: Naïve Bayes, SVM
• Training data: titles of 2003 topics (50 TD, 150 HP, 150 NP), with and without stemming (Combo stemmer)
• Training-data enrichment for the TD class: added top-level Yahoo Government category labels
Linguistic Classification (LC)
• Word cues: create HP and NP lexicons
• Ad-hoc heuristics, e.g. HP if the topic ends in all caps, NP if it contains YYYY, TD if it is a short topic
Combination (more ad-hoc heuristics)
• if strong word cue, LC
• else if single word, TD
• else SC
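The combination rule can be sketched as below. The cue lexicons are small illustrative samples (drawn from the lexicons on the Webpage Type Identification slide), and `statistical_classifier` stands in for the trained Naïve Bayes/SVM classifier.

```python
import re

# Illustrative word-cue samples; the real HP/NP lexicons are larger.
HP_CUES = {"office", "bureau", "department", "center", "agency", "commission"}
NP_CUES = {"about", "annual", "report", "guide", "history"}

def classify_query(topic, statistical_classifier):
    """Combination rule: strong word cue -> linguistic classification (LC);
    else single word -> TD; else statistical classification (SC)."""
    tokens = topic.split()
    lower = {t.lower() for t in tokens}
    # Linguistic cues (LC)
    if re.search(r"\b(19|20)\d{2}\b", topic):        # contains "YYYY" -> NP
        return "NP"
    if tokens and tokens[-1].isalpha() and tokens[-1].isupper():  # ends in all caps -> HP
        return "HP"
    if HP_CUES & lower:
        return "HP"
    if NP_CUES & lower:
        return "NP"
    # Fallbacks
    if len(tokens) == 1:                             # single word -> topic distillation
        return "TD"
    return statistical_classifier(topic)             # trained Naive Bayes / SVM
```

Reading "YYYY" as a literal four-digit year and "ends in all caps" as an all-caps final token are interpretive assumptions about the slide's shorthand.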
WIDIT: Re-ranking Module
Re-ranking Features
• Field-specific match: query words, acronyms, phrases in URL, title, header, anchor text
• Exact match: title, header text, anchor text; body text
• Indegree & outdegree
• URL type (root, subroot, path, file): based on URL ending and slash count
• Page type (HPP, HP, NPP, NP, ??): based on word cues & heuristics
Re-ranking Formula
• Weighted sum of re-ranking features
Dynamic Tuning
• Dynamic/interactive optimization of the QT-specific re-ranking formula
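A minimal sketch of the re-ranking step. All names are illustrative, and the additive combination of the fusion score with the feature boost is an assumption; the slide only states that the formula is a weighted sum of re-ranking features.

```python
def rerank(results, features, weights):
    """Re-rank fusion results by a weighted sum of re-ranking features.

    results : list of (doc_id, fusion_score), sorted descending
    features: doc_id -> {feature name: value, assumed scaled to [0, 1]}
    weights : QT-specific feature weights (found by dynamic tuning)

    Adding the feature boost to the fusion score is an assumption."""
    def boosted(doc, score):
        f = features.get(doc, {})
        return score + sum(w * f.get(name, 0.0) for name, w in weights.items())
    return sorted(((d, boosted(d, s)) for d, s in results), key=lambda kv: -kv[1])
```

Because the weights are QT-specific, a separate weight dictionary would be tuned for TD, HP, and NP queries via the dynamic tuning interface.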
WIDIT: Dynamic Tuning Interface
[Screenshot of the Web-based dynamic tuning interface]
Dynamic Tuning: Observations
Effective re-ranking factors
• HP: indegree, outdegree, exact match, URL/page type; minimum outdegree = 1
• NP: indegree, outdegree, URL type (about 1/3 the impact of HP)
• TD: acronym, outdegree, URL type; minimum outdegree = 10
Strengths
• Combines human intelligence (pattern recognition) with the computational power of the machine
• Good for tuning systems with many parameters
• Facilitates failure analysis
Weaknesses
• Over-tuning
• Sensitivity to initial results & re-ranking parameter selection
Results
Run Descriptions
• Best fusion run F3: 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)
• Dynamic re-ranking run (DR_o): uses the official QT classification
Observations
• Dynamic tuning works well: significant improvement over the baseline for TD and HP
• NP re-ranking needs to be optimized: relatively small improvement from re-ranking

Run            MAP (TD)          MRR (NP)          MRR (HP)
DR_o           0.1349 (+38.5%)   0.6545 (+6.7%)    0.6265 (+47.2%)
F3 (baseline)  0.0974            0.6134            0.4256
TREC Median    0.1010            0.5888            0.5838
Discussion: Web IR Methods
What worked?
• Fusion: combining multiple sources of evidence (MSE)
• Dynamic tuning: helps multi-parameter tuning & failure analysis
What next?
• Expanded MSE mining: Web server and search engine logs
• Enhanced re-ranking feature selection & scoring: modified PageRank/HITS; link-noise reduction based on page layout
• Streamlined fusion optimization
Discussion: Fusion Optimization
Conventional fusion optimization approaches
• Exhaustive parameter combination
o step-wise search of the whole solution space
o computationally demanding when the number of parameters is large
• Parameter combination based on past evidence
o targeted search of a restricted solution space, i.e., parameter ranges estimated from training data
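Exhaustive parameter combination can be sketched as a grid search over fusion weights. The constraint that weights sum to 1 is an assumption made to keep the example small; the function and grid names are illustrative.

```python
from itertools import product

GRID = tuple(round(i / 10, 1) for i in range(11))  # 0.0, 0.1, ..., 1.0

def static_tuning(system_names, evaluate, grid=GRID):
    """Exhaustive step-wise search of fusion weights over a fixed grid.

    evaluate(weights) must return a training-set effectiveness score
    (e.g. MAP). Cost grows as len(grid) ** len(system_names), which is
    the 'computationally demanding' case noted above."""
    best_weights, best_score = None, float("-inf")
    for combo in product(grid, repeat=len(system_names)):
        if abs(sum(combo) - 1.0) > 1e-9:  # assumption: weights lie on the simplex
            continue
        weights = dict(zip(system_names, combo))
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

With an 11-point grid and five systems this already means 11^5 ≈ 161,000 evaluations, which is why restricted or evidence-guided searches become attractive.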
Next-generation fusion optimization approaches
• Non-linear transformation of re-ranking feature scores
o e.g. log transformation to compensate for the power-law distribution of PageRank
• Hybrid fusion optimization
o semi-automatic dynamic tuning
o automatic fusion optimization by category
Automatic Fusion Optimization
[Flowchart: from a results pool, result sets are fetched for different system categories (Category 1: top 10 systems; Category 2: top system for each query length; ... Category n); automatic fusion optimization is run on each; if the performance gain exceeds a threshold, the cycle repeats; otherwise the optimized fusion formula is output.]
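Under one reading of the flowchart, the optimization loop might be sketched as follows; every name and the exact control flow are assumptions rather than the authors' implementation.

```python
def automatic_fusion_optimization(categories, fetch_results, optimize,
                                  baseline, threshold):
    """Loop sketched by the flowchart: fetch result sets per category,
    optimize a fusion formula on each, and keep only formulas whose
    performance gain over the current best exceeds the threshold.
    (Interpretive sketch; control flow is an assumption.)"""
    best_formula, best_score = None, baseline
    for category in categories:
        pool = fetch_results(category)   # e.g. top-10 systems, or top system per query length
        formula, score = optimize(pool)
        if score - best_score > threshold:
            best_formula, best_score = formula, score
    return best_formula, best_score
```

The threshold acts as a stopping criterion: once no category yields a large enough gain, the current formula is accepted as the optimized output.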
Resources
WIDIT (Web Information Discovery Integrated Tool) Lab
• http://widit.slis.indiana.edu/
• http://elvis.slis.indiana.edu/
WIDIT projects: TREC Web track
• Dynamic Tuning Interface (Web track): http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
WIDIT projects: TREC HARD track
• Dynamic Tuning Interface (HARD track): http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm

Thank you! Questions?
SMART
Length-Normalized Term Weights
• SMART lnu weight for document terms
• SMART ltc weight for query terms

$$ d_{ik} = \frac{\log(f_{ik}) + 1}{\sqrt{\sum_{j=1}^{t} \left( \log(f_{ij}) + 1 \right)^2}} \qquad q_k = \frac{(\log(f_k) + 1) \cdot idf_k}{\sqrt{\sum_{j=1}^{t} \left[ (\log(f_j) + 1) \cdot idf_j \right]^2}} $$

where:
f_ik = number of times term k appears in document i
idf_k = inverse document frequency of term k
t = number of terms in the document/query

Document Score
• inner product of the document and query vectors:

$$ score(q, d_i) = \sum_{k=1}^{t} q_k \, d_{ik} = \mathbf{q}^T \mathbf{d} $$

where:
q_k = weight of term k in the query
d_ik = weight of term k in document i
t = number of terms common to the query & document
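The term weights above translate directly into code; a minimal Python sketch with illustrative function names:

```python
import math

def smart_doc_weights(tf):
    """d_ik = (log(f_ik) + 1) / sqrt(sum_j (log(f_ij) + 1)^2).
    tf maps each term in the document to its raw frequency."""
    raw = {term: math.log(f) + 1 for term, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

def smart_query_weights(tf, idf):
    """q_k = (log(f_k) + 1) * idf_k / sqrt(sum_j ((log(f_j) + 1) * idf_j)^2)."""
    raw = {term: (math.log(f) + 1) * idf[term] for term, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

def doc_score(query_weights, doc_weights):
    """Inner product of the query and document weight vectors."""
    return sum(qw * doc_weights.get(term, 0.0)
               for term, qw in query_weights.items())
```

Because both vectors are cosine-normalized, the inner product is bounded by 1, which also makes the resulting scores convenient inputs to the fusion module.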
Okapi
Document Ranking

$$ \sum_{T \in Q} w_{RS} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf} $$

where (k_1 + 1)tf / (K + tf) is the document term weight and (k_3 + 1)qtf / (k_3 + qtf) is the query term weight.

The Robertson-Sparck Jones weight w_RS is

$$ w_{RS} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} $$

which, with no relevance information, reduces to the simplified formula

$$ w_{RS} = \log \frac{N - n + 0.5}{n + 0.5} $$

where:
Q = query containing terms T
K = k_1 * ((1 - b) + b * (doc_length / avg_doc_length))
tf = term frequency in a document
qtf = term frequency in the query
k_1, b, k_3 = parameters (1.2, 0.75, 7..1000)
N = total number of documents in the collection
n = number of documents in which the term occurs
R = total number of relevant documents in the collection
r = number of relevant documents in which the term occurs
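A minimal Python sketch of the Okapi score with the simplified w_RS; the function name is illustrative, and the default parameter values are the ones listed above.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi document score with the simplified (no relevance
    information) w_RS = log((N - n + 0.5) / (n + 0.5)).

    query_tf/doc_tf: term -> frequency; df: term -> document frequency n;
    N: collection size."""
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = df[term]
        w_rs = math.log((N - n + 0.5) / (n + 0.5))
        score += (w_rs
                  * ((k1 + 1) * tf) / (K + tf)
                  * ((k3 + 1) * qtf) / (k3 + qtf))
    return score
```

The length factor K penalizes documents longer than average, so a term match in a short page counts for more than the same match in a long one.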
Webpage Type Identification
URL Type (Tomlinson, 2003; Kraaij et al., 2002)
• Heuristic
o root: slash_cnt = 0, or (HP_end & slash_cnt = 1)
o subroot: HP_end & slash_cnt = 2
o path: HP_end & slash_cnt >= 3
o file: rest
(HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
Page Type
• Heuristic
o if "welcome" or "home" in title, header, or anchor text: HPP
o else if "YYYY" in title or anchor: NPP
o else if NP-lexicon word: NP
o else if HP-lexicon word: HP
o else if ends in all caps: HP
o else: ??
• NP lexicon
o about, annual, report, guide, studies, history, new, how
• HP lexicon
o office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
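Both heuristics are mechanical enough to sketch directly. The lexicons are copied from the slide; the slash-counting convention (slashes in the path after the host) and the reading of "YYYY" as a literal four-digit year are assumptions.

```python
import re

HP_ENDINGS = ("index.htm", "index.html", "default.htm", "default.html", "/")
NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history", "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center", "committee",
              "agency", "administration", "council", "society", "service",
              "corporation", "commission", "board", "division", "museum", "library",
              "project", "group", "program", "laboratory", "site", "authority",
              "study", "industry"}

def url_type(url):
    """root / subroot / path / file, from the URL ending and slash count.
    Slashes are counted in the path after the host (an assumption)."""
    path = re.sub(r"^[a-z]+://[^/]*", "", url.lower())
    hp_end = path == "" or path.endswith(HP_ENDINGS)
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

def page_type(title, header, anchor):
    """HPP/NPP = probable home/named page; ?? = unknown."""
    words = set(" ".join([title, header, anchor]).lower().split())
    if {"welcome", "home"} & words:
        return "HPP"
    if re.search(r"\b(19|20)\d{2}\b", title + " " + anchor):
        return "NPP"
    if NP_LEXICON & words:
        return "NP"
    if HP_LEXICON & words:
        return "HP"
    last = title.split()[-1] if title.split() else ""
    if last.isalpha() and last.isupper():             # ends in all caps -> HP
        return "HP"
    return "??"
```

The URL and page types feed the re-ranking module as features, which is why the heuristics err on the side of cheap, high-precision cues rather than full page classification.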