Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
AIRS2005
OUTLINE
• Introduction
• WIDIT in TREC Web Track
• Results & Discussion
Introduction
Web IR Challenges
• Size, heterogeneity, quality of data
• Diversity of user tasks, interests, characteristics
Web IR Opportunities
• Diverse sources of evidence
• Data abundance
WIDIT Approach to Web IR
• Leverage multiple sources of evidence
• Utilize multiple methods
• Apply fusion
Research Questions
• What to combine?
• How to combine?
WIDIT in Web Track 2004
Data
• Documents: 1.25 million .gov Web pages (18 GB)
• Topics (i.e. queries): 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
Task
• Retrieve relevant documents given a mixed set of query types (QT)
Main Strategy
• Fusion of multiple data representations: Static Tuning (QT-independent)
• Fusion of multiple sources of evidence: Dynamic Tuning (QT-specific)
WIDIT: Web IR System Architecture
[Architecture diagram: Documents feed the Indexing Module, which builds sub-indexes (Body Index, Anchor Index, Header Index); Topics become queries (simple and expanded). The Retrieval Module searches the sub-indexes with each query form; the Fusion Module combines the search results (Static Tuning) into a fusion result; the Re-ranking Module (Dynamic Tuning), guided by query types from the Query Classification Module, produces the final result.]
WIDIT: Indexing Module
Document Indexing
1. Strip HTML tags
• extract title, meta keywords & description, emphasized words
• parse out hyperlinks (URL & anchor texts)
2. Create surrogate documents
• anchor texts of inlinks
• header texts (title, meta text, emphasized text)
3. Create subcollection indexes
• stop & stem (Simple, Combo stemmer)
• compute SMART & Okapi term weights
4. Compute whole-collection term statistics
Query Indexing
• stop & stem
• identify nouns, phrases
• expand acronyms
• mine synonyms and definitions from Web search
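The field extraction in step 1 can be sketched with Python's standard html.parser. This is an illustrative sketch, not WIDIT's actual code; the class and helper names are invented for the example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "br", "img", "link", "hr", "input", "base"}
EMPH_TAGS = {"b", "i", "em", "strong", "h1", "h2", "h3"}

class PageParser(HTMLParser):
    """Extracts the fields used to build surrogate documents: title,
    meta keywords/description, emphasized words, and hyperlinks with
    their anchor texts (illustrative sketch, not WIDIT's parser)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = []        # meta keywords & description content
        self.emphasized = []  # emphasized words (bold, italic, headings)
        self.links = []       # {"url": ..., "anchor": ...} per hyperlink
        self._stack = []      # open-tag stack for context

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() in ("keywords", "description"):
            self.meta.append(attrs.get("content", "") or "")
        elif tag == "a" and attrs.get("href"):
            self.links.append({"url": attrs["href"], "anchor": ""})
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.title += data
        elif top in EMPH_TAGS:
            self.emphasized.append(data.strip())
        elif "a" in self._stack and self.links:
            self.links[-1]["anchor"] += data

def header_text(page):
    """Surrogate 'header' document: title + meta text + emphasized text."""
    return " ".join([page.title] + page.meta + page.emphasized)
```

The extracted anchor texts would be aggregated per link target to build the anchor-text surrogate documents.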
WIDIT: Retrieval Module
1. Parallel searching
• Multiple document indexes
o body text (title, body)
o anchor text (title, inlink anchor text)
o header text (title, meta keywords & description, first heading, emphasized words)
• Multiple query formulations
o stemming (Simple, Combo)
o expanded query (acronym, noun)
• Multiple subcollections
o for search speed and scalability
o search each subcollection using whole-collection term statistics
2. Merge subcollection search results
• merge & sort by document score
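Step 2's merge is straightforward once subcollection scores are comparable, which they are because every subcollection is scored with whole-collection term statistics. A minimal Python sketch, with an illustrative function name:

```python
import heapq
from itertools import islice

def merge_subcollection_results(result_sets, k=1000):
    """Merge ranked lists from independently searched subcollections.

    Because each subcollection is scored with whole-collection term
    statistics, scores are directly comparable across subcollections,
    so the merge is a simple sort by document score.

    Each result set is a list of (doc_id, score) pairs, already sorted
    by descending score; the top k fused pairs are returned."""
    merged = heapq.merge(*result_sets, key=lambda pair: pair[1], reverse=True)
    return list(islice(merged, k))
```

heapq.merge exploits the fact that each input list is already sorted, so the merge is linear in the number of results rather than requiring a full re-sort.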
WIDIT: Fusion Module
Fusion Formula: Weighted Sum
• FS_ws = Σ_i (w_i · NS_i)
where:
w_i = weight of system i (relative contribution of each system)
NS_i = normalized score of a document by system i = (S_i − S_min) / (S_max − S_min)
Select candidate systems to combine
• Top performers in each category (e.g. best stemmer, query expansion, document index)
• Diverse systems (e.g. content-based, link-based)
• One-time brute-force combinations to validate the complementary-strength effect
Determine system weights (w_i): Static Tuning
• Evaluate fusion formulas over a fixed set of weight values (e.g. 0.1..1.0) on training data
• Select the formulas with the best performance
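The weighted-sum formula with min-max normalization can be sketched as follows. Function names are illustrative, and the convention that a document unretrieved by a system contributes 0 for that system is an assumption.

```python
def min_max_normalize(scores):
    """NS_i = (S_i - S_min) / (S_max - S_min), per system."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a constant-score result list
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_sum_fusion(systems, weights):
    """FS_ws = sum_i w_i * NS_i over every document retrieved by any system.

    systems: system name -> {doc_id: raw score}
    weights: system name -> w_i (relative contribution of that system)
    Documents missing from a system contribute 0 for it (an assumption)."""
    fused = {}
    for name, scores in systems.items():
        w = weights[name]
        for doc, ns in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: -kv[1])
```

Min-max normalization matters here because SMART and Okapi scores live on different scales; without it the weights w_i would not reflect each system's true contribution.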
WIDIT: Query Classification Module
Statistical Classification (SC)
• Classifiers: Naïve Bayes, SVM
• Training data: titles of 2003 topics (50 TD, 150 HP, 150 NP), with and without stemming (Combo stemmer)
• Training-data enrichment for the TD class: added top-level Yahoo Government category labels
Linguistic Classification (LC)
• Word cues: create HP and NP lexicons
• Ad-hoc heuristics, e.g. HP if the topic ends in all caps, NP if it contains YYYY, TD if it is a short topic
Combination (more ad-hoc heuristics)
• if strong word cue, LC
• else if single word, TD
• else SC
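The combination rule can be sketched as below. The cue lexicons are small illustrative samples (drawn from the lexicons on the Webpage Type Identification slide), and `statistical_classifier` stands in for the trained Naïve Bayes/SVM classifier.

```python
import re

# Illustrative word-cue samples; the real HP/NP lexicons are larger.
HP_CUES = {"office", "bureau", "department", "center", "agency", "commission"}
NP_CUES = {"about", "annual", "report", "guide", "history"}

def classify_query(topic, statistical_classifier):
    """Combination rule: strong word cue -> linguistic classification (LC);
    else single word -> TD; else statistical classification (SC)."""
    tokens = topic.split()
    lower = {t.lower() for t in tokens}
    # Linguistic cues (LC)
    if re.search(r"\b(19|20)\d{2}\b", topic):        # contains "YYYY" -> NP
        return "NP"
    if tokens and tokens[-1].isalpha() and tokens[-1].isupper():  # ends in all caps -> HP
        return "HP"
    if HP_CUES & lower:
        return "HP"
    if NP_CUES & lower:
        return "NP"
    # Fallbacks
    if len(tokens) == 1:                             # single word -> topic distillation
        return "TD"
    return statistical_classifier(topic)             # trained Naive Bayes / SVM
```

Reading "YYYY" as a literal four-digit year and "ends in all caps" as an all-caps final token are interpretive assumptions about the slide's shorthand.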
WIDIT: Re-ranking Module
Re-ranking Features
• Field-specific match: query words, acronyms, phrases in URL, title, header, anchor text
• Exact match: title, header text, anchor text; body text
• Indegree & outdegree
• URL type (root, subroot, path, file): based on URL ending and slash count
• Page type (HPP, HP, NPP, NP, ??): based on word cues & heuristics
Re-ranking Formula
• Weighted sum of re-ranking features
Dynamic Tuning
• Dynamic/interactive optimization of the QT-specific re-ranking formula
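A minimal sketch of the re-ranking step. All names are illustrative, and the additive combination of the fusion score with the feature boost is an assumption; the slide only states that the formula is a weighted sum of re-ranking features.

```python
def rerank(results, features, weights):
    """Re-rank fusion results by a weighted sum of re-ranking features.

    results : list of (doc_id, fusion_score), sorted descending
    features: doc_id -> {feature name: value, assumed scaled to [0, 1]}
    weights : QT-specific feature weights (found by dynamic tuning)

    Adding the feature boost to the fusion score is an assumption."""
    def boosted(doc, score):
        f = features.get(doc, {})
        return score + sum(w * f.get(name, 0.0) for name, w in weights.items())
    return sorted(((d, boosted(d, s)) for d, s in results), key=lambda kv: -kv[1])
```

Because the weights are QT-specific, a separate weight dictionary would be tuned for TD, HP, and NP queries via the dynamic tuning interface.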
WIDIT: Dynamic Tuning Interface
[Screenshot of the Web-based dynamic tuning interface]
Dynamic Tuning: Observations
Effective re-ranking factors
• HP: indegree, outdegree, exact match, URL/page type; minimum outdegree = 1
• NP: indegree, outdegree, URL type (about 1/3 the impact of HP)
• TD: acronym, outdegree, URL type; minimum outdegree = 10
Strengths
• Combines human intelligence (pattern recognition) with the computational power of the machine
• Good for tuning systems with many parameters
• Facilitates failure analysis
Weaknesses
• Over-tuning
• Sensitivity to initial results & re-ranking parameter selection
Results
Run Descriptions
• Best fusion run F3: 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)
• Dynamic re-ranking run (DR_o): uses the official QT classification
Observations
• Dynamic tuning works well: significant improvement over the baseline for TD and HP
• NP re-ranking needs to be optimized: relatively small improvement from re-ranking

Run            MAP (TD)          MRR (NP)          MRR (HP)
DR_o           0.1349 (+38.5%)   0.6545 (+6.7%)    0.6265 (+47.2%)
F3 (baseline)  0.0974            0.6134            0.4256
TREC Median    0.1010            0.5888            0.5838
Discussion: Web IR Methods
What worked?
• Fusion: combining multiple sources of evidence (MSE)
• Dynamic tuning: helps multi-parameter tuning & failure analysis
What next?
• Expanded MSE mining: Web server and search engine logs
• Enhanced re-ranking feature selection & scoring: modified PageRank/HITS; link-noise reduction based on page layout
• Streamlined fusion optimization
Discussion: Fusion Optimization
Conventional fusion optimization approaches
• Exhaustive parameter combination
o step-wise search of the whole solution space
o computationally demanding when the number of parameters is large
• Parameter combination based on past evidence
o targeted search of a restricted solution space, i.e., parameter ranges estimated from training data
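Exhaustive parameter combination can be sketched as a grid search over fusion weights. The constraint that weights sum to 1 is an assumption made to keep the example small; the function and grid names are illustrative.

```python
from itertools import product

GRID = tuple(round(i / 10, 1) for i in range(11))  # 0.0, 0.1, ..., 1.0

def static_tuning(system_names, evaluate, grid=GRID):
    """Exhaustive step-wise search of fusion weights over a fixed grid.

    evaluate(weights) must return a training-set effectiveness score
    (e.g. MAP). Cost grows as len(grid) ** len(system_names), which is
    the 'computationally demanding' case noted above."""
    best_weights, best_score = None, float("-inf")
    for combo in product(grid, repeat=len(system_names)):
        if abs(sum(combo) - 1.0) > 1e-9:  # assumption: weights lie on the simplex
            continue
        weights = dict(zip(system_names, combo))
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

With an 11-point grid and five systems this already means 11^5 ≈ 161,000 evaluations, which is why restricted or evidence-guided searches become attractive.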
Next-generation fusion optimization approaches
• Non-linear transformation of re-ranking feature scores
o e.g. log transformation to compensate for the power-law distribution of PageRank
• Hybrid fusion optimization
o semi-automatic dynamic tuning
o automatic fusion optimization by category
Automatic Fusion Optimization
[Flowchart: from a results pool, result sets are fetched for different system categories (Category 1: top 10 systems; Category 2: top system for each query length; ... Category n); automatic fusion optimization is run on each; if the performance gain exceeds a threshold, the cycle repeats; otherwise the optimized fusion formula is output.]
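Under one reading of the flowchart, the optimization loop might be sketched as follows; every name and the exact control flow are assumptions rather than the authors' implementation.

```python
def automatic_fusion_optimization(categories, fetch_results, optimize,
                                  baseline, threshold):
    """Loop sketched by the flowchart: fetch result sets per category,
    optimize a fusion formula on each, and keep only formulas whose
    performance gain over the current best exceeds the threshold.
    (Interpretive sketch; control flow is an assumption.)"""
    best_formula, best_score = None, baseline
    for category in categories:
        pool = fetch_results(category)   # e.g. top-10 systems, or top system per query length
        formula, score = optimize(pool)
        if score - best_score > threshold:
            best_formula, best_score = formula, score
    return best_formula, best_score
```

The threshold acts as a stopping criterion: once no category yields a large enough gain, the current formula is accepted as the optimized output.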
Resources
WIDIT (Web Information Discovery Integrated Tool) Lab
• http://widit.slis.indiana.edu/
• http://elvis.slis.indiana.edu/
WIDIT projects: TREC Web track
• Dynamic Tuning Interface (Web track): http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
WIDIT projects: TREC HARD track
• Dynamic Tuning Interface (HARD track): http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm

Thank you! Questions?
SMART
Length-Normalized Term Weights
• SMART lnu weight for document terms
• SMART ltc weight for query terms

$$ d_{ik} = \frac{\log(f_{ik}) + 1}{\sqrt{\sum_{j=1}^{t} \left( \log(f_{ij}) + 1 \right)^2}} \qquad q_k = \frac{(\log(f_k) + 1) \cdot idf_k}{\sqrt{\sum_{j=1}^{t} \left[ (\log(f_j) + 1) \cdot idf_j \right]^2}} $$

where:
f_ik = number of times term k appears in document i
idf_k = inverse document frequency of term k
t = number of terms in the document/query

Document Score
• inner product of the document and query vectors:

$$ score(q, d_i) = \sum_{k=1}^{t} q_k \, d_{ik} = \mathbf{q}^T \mathbf{d} $$

where:
q_k = weight of term k in the query
d_ik = weight of term k in document i
t = number of terms common to the query & document
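The term weights above translate directly into code; a minimal Python sketch with illustrative function names:

```python
import math

def smart_doc_weights(tf):
    """d_ik = (log(f_ik) + 1) / sqrt(sum_j (log(f_ij) + 1)^2).
    tf maps each term in the document to its raw frequency."""
    raw = {term: math.log(f) + 1 for term, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

def smart_query_weights(tf, idf):
    """q_k = (log(f_k) + 1) * idf_k / sqrt(sum_j ((log(f_j) + 1) * idf_j)^2)."""
    raw = {term: (math.log(f) + 1) * idf[term] for term, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

def doc_score(query_weights, doc_weights):
    """Inner product of the query and document weight vectors."""
    return sum(qw * doc_weights.get(term, 0.0)
               for term, qw in query_weights.items())
```

Because both vectors are cosine-normalized, the inner product is bounded by 1, which also makes the resulting scores convenient inputs to the fusion module.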
Okapi
Document Ranking

$$ \sum_{T \in Q} w_{RS} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf} $$

where (k_1 + 1)tf / (K + tf) is the document term weight and (k_3 + 1)qtf / (k_3 + qtf) is the query term weight.

The Robertson-Sparck Jones weight w_RS is

$$ w_{RS} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} $$

which, with no relevance information, reduces to the simplified formula

$$ w_{RS} = \log \frac{N - n + 0.5}{n + 0.5} $$

where:
Q = query containing terms T
K = k_1 * ((1 - b) + b * (doc_length / avg_doc_length))
tf = term frequency in a document
qtf = term frequency in the query
k_1, b, k_3 = parameters (1.2, 0.75, 7..1000)
N = total number of documents in the collection
n = number of documents in which the term occurs
R = total number of relevant documents in the collection
r = number of relevant documents in which the term occurs
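A minimal Python sketch of the Okapi score with the simplified w_RS; the function name is illustrative, and the default parameter values are the ones listed above.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi document score with the simplified (no relevance
    information) w_RS = log((N - n + 0.5) / (n + 0.5)).

    query_tf/doc_tf: term -> frequency; df: term -> document frequency n;
    N: collection size."""
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = df[term]
        w_rs = math.log((N - n + 0.5) / (n + 0.5))
        score += (w_rs
                  * ((k1 + 1) * tf) / (K + tf)
                  * ((k3 + 1) * qtf) / (k3 + qtf))
    return score
```

The length factor K penalizes documents longer than average, so a term match in a short page counts for more than the same match in a long one.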
Webpage Type Identification
URL Type (Tomlinson, 2003; Kraaij et al., 2002)
• Heuristic
o root: slash_cnt = 0, or (HP_end & slash_cnt = 1)
o subroot: HP_end & slash_cnt = 2
o path: HP_end & slash_cnt >= 3
o file: rest
(HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
Page Type
• Heuristic
o if "welcome" or "home" in title, header, or anchor text: HPP
o else if "YYYY" in title or anchor: NPP
o else if NP-lexicon word: NP
o else if HP-lexicon word: HP
o else if ends in all caps: HP
o else: ??
• NP lexicon
o about, annual, report, guide, studies, history, new, how
• HP lexicon
o office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
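Both heuristics are mechanical enough to sketch directly. The lexicons are copied from the slide; the slash-counting convention (slashes in the path after the host) and the reading of "YYYY" as a literal four-digit year are assumptions.

```python
import re

HP_ENDINGS = ("index.htm", "index.html", "default.htm", "default.html", "/")
NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history", "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center", "committee",
              "agency", "administration", "council", "society", "service",
              "corporation", "commission", "board", "division", "museum", "library",
              "project", "group", "program", "laboratory", "site", "authority",
              "study", "industry"}

def url_type(url):
    """root / subroot / path / file, from the URL ending and slash count.
    Slashes are counted in the path after the host (an assumption)."""
    path = re.sub(r"^[a-z]+://[^/]*", "", url.lower())
    hp_end = path == "" or path.endswith(HP_ENDINGS)
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

def page_type(title, header, anchor):
    """HPP/NPP = probable home/named page; ?? = unknown."""
    words = set(" ".join([title, header, anchor]).lower().split())
    if {"welcome", "home"} & words:
        return "HPP"
    if re.search(r"\b(19|20)\d{2}\b", title + " " + anchor):
        return "NPP"
    if NP_LEXICON & words:
        return "NP"
    if HP_LEXICON & words:
        return "HP"
    last = title.split()[-1] if title.split() else ""
    if last.isalpha() and last.isupper():             # ends in all caps -> HP
        return "HP"
    return "??"
```

The URL and page types feed the re-ranking module as features, which is why the heuristics err on the side of cheap, high-precision cues rather than full page classification.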