Presented by Yunyao at SIGIR 2012
Automatic Suggestion of Query-Rewrite Rules for Enterprise Search
Benny Kimelfeld, IBM Research – Almaden
Zhuowei Bao, University of Pennsylvania
Yunyao Li, IBM Research – Almaden
SIGIR 2012, Portland, Oregon, USA
Challenges in Enterprise Search
• Continually changing terminology
– Product names change overnight: "Network Station Manager" search → "Thin Client Manager"
• Domain-specific meaning
– "Paula Summa" search → bring Paula Summa from employee directories
– "popcorn" search → conference call!
• Domain-specific redundancy
– "per diem" search:
  Result 1: IBM Travel: Per Diem
  Result 2: IBM Travel: Per Diem Rates
  Result 3: IBM Travel: National perdiems
  …
  Result 25: IBM Travel: Per Diem Policy
An enterprise search engine is managed by admins who are domain experts. Not search experts!
Programmable Search in IBM
• Programmable Search: A philosophy and design of enterprise search [Vaithyanathan, SIGIR’11]
– Backend analysis includes Information Extraction, categorization, and domain-specific variant generation [Zhu et al., WWW’07]
– Search programmable by runtime rules [Agarwal et al., WWW’10, Szpektor et al., WWW’11, Fagin et al., PODS’10/11]
• Gumshoe, a Programmable Search engine, today powering IBM internal & external portals
Background: IBM deployed traditional, black-box search solutions
• Quality degraded over time
• Exposed knobs were insufficient and opaque
Engine Architecture & Rules
[Diagram: the user sends a query to the front end. Query rewriting produces a set of new queries for the back end (index); result aggregation turns the ranked results into the final results. The user reports complaints to the search admin, who maintains the runtime rules.]
Runtime rules: Query pattern → Action
• The query pattern is matched against the query
• The action is performed on a match
Grouping Rules
• Group results of specified categories
• Necessary: domain-specific redundancy
Re-ranking Rules
• Re-rank results by specified categories
• Semantics-based “top” and “bottom” matches
Rewrite Rules (focus here)
• Create new queries to augment/replace the original query
Rewrite Rules in Gumshoe
• EQUALS: $x [in PRODUCT] info → $x
– $x is in the product dictionary
• CONTAINS: lotus $x(presentations|spreadsheets) → lotus symphony
– $x matches a regex
• CONTAINS: msn search →+ bing
– →+: prefer the new query over old
• Similar to the query-template rules of Agarwal et al. [WWW 2010]
• The only type of rules considered here
• Abbreviation: s → t
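A minimal sketch of how an abbreviated rule s → t could be applied, assuming CONTAINS semantics over whitespace tokens and no variables, dictionaries, or regexes. The function names `apply_contains_rule` and `rewrite` are hypothetical and are not part of Gumshoe.

```python
def apply_contains_rule(query, s, t):
    """Replace the first occurrence of token sequence s in query by t.

    Returns the rewritten query, or None if s does not occur.
    Assumption: matching is case-insensitive and token-based.
    """
    q_tokens = query.lower().split()
    s_tokens = s.lower().split()
    n = len(s_tokens)
    if n == 0:
        return None  # an empty pattern never matches
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == s_tokens:
            return " ".join(q_tokens[:i] + t.lower().split() + q_tokens[i + n:])
    return None


def rewrite(query, rules):
    """Return the original query followed by every distinct rewriting.

    Assumption: each rule is applied to the original query independently,
    and the new queries augment (rather than replace) the original one.
    """
    queries = [query]
    for s, t in rules:
        new_query = apply_contains_rule(query, s, t)
        if new_query is not None and new_query not in queries:
            queries.append(new_query)
    return queries
```

For example, `rewrite("stock spreadsheet", [("stock", "stock market"), ("spreadsheet", "symphony")])` yields the original query plus "stock market spreadsheet" and "stock symphony".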
Aiding Administrators
Enterprise Users
• Bad results for query …
• I’m missing the golden URL…
• Result 22 should be ranked much higher!
Query Logs
• Query “global campus” seems unsatisfying
Search Admin
• Terminology mismatch: user queries vs. docs.
– Rule needed
• The rule should push desired results to the top
• Devise, deploy, test
• How are other cases affected by my new rule?
– Revisit my old rules?
This paper
Gumshoe Maintenance Toolkit
CIKM 2012 Demo [Bao et al.]
Outline
• Introduction & Background
• Suggesting Natural Rules (single-rule setting)
• Optimizing Rule Selection (multiple-rule setting)
• Concluding Remarks
Problem Description
• Input: Example (q,d) of a query and a desired match
• Goal: Devise an effective and natural rewrite rule
– Effective: push the desired match to the top
– Natural: should correspond to a semantically coherent replacement of terms
– The kind of rules the administrator would devise herself
Examples:
• seasonal flu → avian flu (temporarily correct)
• management change → SCIP (organization reconstruction)
• download → ISSI tool (main software access for IBMers)
Algorithm
Input: Query q, desired match (doc/URL) d
• Candidates for s: n-grams of q
• Candidates for t: n-grams of high-quality fields of d
• Candidates for s → t: cross product (×) of the two candidate sets
Output: Suggested rewrite rules s → t
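The candidate-generation step can be sketched as follows. This is an illustration only: the whitespace tokenization, the `max_n` bound on n-gram length, and the function names are assumptions, not details from the talk.

```python
from itertools import product


def ngrams(text, max_n=3):
    """All word n-grams of text with 1 <= n <= max_n (max_n is an assumed bound)."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def candidate_rules(query, high_quality_fields, max_n=3):
    """Cross product of n-grams of q (for s) and n-grams of d's fields (for t)."""
    s_candidates = ngrams(query, max_n)
    t_candidates = set()
    for field_text in high_quality_fields:
        t_candidates |= ngrams(field_text, max_n)
    # Drop the trivial rules s -> s, which would not change the query.
    return {(s, t) for s, t in product(s_candidates, t_candidates) if s != t}
```

Even this toy version shows why the candidate set grows quickly: a two-word query and a two-word title already give eight non-trivial candidates.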
High-Quality Fields of a Document
• HTML title
• URL (fragments)
• Visual title
But… Many Candidates
[Slide illustration: long lists of candidate rules s → t. For the query "seasonal flu", the left-hand sides "seasonal flu" and "seasonal" are paired with right-hand sides such as "avian flu" and "h5n1". For the query "change management", the left-hand sides "change management" and "management" are paired with right-hand sides such as "internal practice" and "scip".]
We often get ≈ 100 candidates, sometimes ≈ 1000
Algorithm
Input: Query q, desired match (doc/URL) d
• Candidates for s: n-grams of q
• Candidates for t: n-grams of high-quality fields of d
• Candidates for s → t: cross product (×) of the two candidate sets
• Effectiveness filter
• Classifier: natural/unnatural rules (next)
Output: Suggested rewrite rules s → t
Classification Features
Rule: s → t
• Syntactic features
– Whether s (resp., t) begins with a stop word
– Whether s (resp., t) ends with a stop word
– Number of tokens in s (resp., t)
• Corpus statistics
– Logarithm of the frequency of s (resp., t)
– Logarithm of the co-occurrence frequency of s and t
– Logarithm of the frequency of s (resp., t) in titles
• Query-log statistics
– Logarithm of the s-to-t reformulation frequency
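The feature list above could be turned into a feature vector roughly as follows. This is a sketch under several assumptions: the stop-word list is a tiny hypothetical one, counts are passed in as plain dictionaries, and `log(1 + count)` is used so that zero counts are defined (the talk does not say how zero counts are handled).

```python
import math

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}


def log_count(count):
    """Logarithm of a count; the +1 smoothing is an assumption."""
    return math.log(1 + count)


def rule_features(s, t, corpus_freq, cooc_freq, title_freq, reform_freq):
    """Feature vector for a candidate rule s -> t (s and t must be non-empty)."""
    features = []
    # Syntactic features, first for s and then for t.
    for phrase in (s, t):
        tokens = phrase.split()
        features.append(float(tokens[0] in STOP_WORDS))   # begins with a stop word
        features.append(float(tokens[-1] in STOP_WORDS))  # ends with a stop word
        features.append(float(len(tokens)))               # number of tokens
    # Corpus statistics.
    features.append(log_count(corpus_freq.get(s, 0)))
    features.append(log_count(corpus_freq.get(t, 0)))
    features.append(log_count(cooc_freq.get((s, t), 0)))
    features.append(log_count(title_freq.get(s, 0)))
    features.append(log_count(title_freq.get(t, 0)))
    # Query-log statistics: how often users reformulated s into t.
    features.append(log_count(reform_freq.get((s, t), 0)))
    return features
```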
Classification Models
• We take an approach similar to Kraft & Zien [2004] that explored a problem of a similar flavor
• SVM: a linear classifier
• rDTLC: Decision Tree with Linear-Combination splits [Loh & Shih, 1988]
– Bound the tree depth (3 in our implementation)
– Use univariate splits on non-leaf nodes
[Tree diagram, each test with a Yes and a No branch:
  Level 1: f_{i0} < τ0 ?
  Level 2: f_{i1} < τ1 ?   |   f_{i2} < τ2 ?
  Level 3: ∑ a_i f_i < τ3 ?   |   ∑ b_i f_i < τ4 ?   |   ∑ c_i f_i < τ5 ?   |   ∑ d_i f_i < τ6 ?]
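To make the tree shape concrete, here is a minimal sketch of evaluating a depth-3 tree of this form: two levels of univariate splits followed by a linear-combination test. The class names are hypothetical, training is not shown, and which answer of the final test means "natural" is an assumption left to the caller.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UnivariateSplit:
    feature: int       # index i of the feature f_i that is tested
    threshold: float   # tau


@dataclass
class LinearSplit:
    weights: Sequence[float]  # coefficients of the linear combination
    threshold: float          # tau


@dataclass
class DepthThreeTree:
    root: UnivariateSplit          # level 1
    inner: List[UnivariateSplit]   # level 2: [yes-child, no-child]
    last: List[LinearSplit]        # level 3: four linear-combination tests

    def predict(self, f: Sequence[float]) -> bool:
        """Return the answer (Yes = True) of the level-3 test reached by f."""
        first = 0 if f[self.root.feature] < self.root.threshold else 1
        node = self.inner[first]
        second = 0 if f[node.feature] < node.threshold else 1
        test = self.last[2 * first + second]
        score = sum(w * x for w, x in zip(test.weights, f))
        return score < test.threshold
```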
Experimental Setting
• Experiments over IBM Intranet search
• 1894 suggested matches (q,d) provided by IBM CIO Office
– These are usually matches for highly frequent queries
– 11907 effective candidate rules generated
• “Effective” = pushes d from outside top k for q to inside
• Manually labeled ~1200 candidate rewrite rules
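The definition of "effective" can be read as a simple check. The sketch below assumes a hypothetical `search` callable that returns a ranked list of document identifiers, and assumes that the rewritten query is evaluated on its own; how the engine merges results of the original and rewritten queries is not specified on the slide.

```python
def in_top_k(ranked_docs, d, k):
    """True if document d is among the first k ranked results."""
    return d in ranked_docs[:k]


def is_effective(q, d, rewritten_q, search, k=5):
    """A rule is effective for (q, d) if it moves d from outside the top k to inside.

    search: hypothetical callable mapping a query string to a ranked list of doc ids.
    k=5 is an arbitrary default chosen for this sketch.
    """
    before = in_top_k(search(q), d, k)
    after = in_top_k(search(rewritten_q), d, k)
    return (not before) and after
```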
Experimental Results
[Figure: classification accuracy (%, axis from 60 to 100) and ranking by classifier score (MRR, and nDCGk at top-1, top-3, top-5; axes from 0.2 to 1) for SVM and DecisionTree/SVM against a random baseline, including variants with rules weighted by query frequency and with weighted training.]
Optimizing Rule Selection
Rules: stock → stock market; spreadsheet → symphony
[Slide illustration: "spreadsheet tutorial" is rewritten to "symphony tutorial"; "stock spreadsheet" is rewritten to "stock market spreadsheet" and "stock symphony"; "excel spreadsheet" is rewritten to "excel symphony". The slide carries a "Negative effect" label.]
• A rule can negatively affect performance on desired matches
• A rule can interfere with other rules
• Idea: Optimize rule selection
Formal Optimization Problem
[Diagram: queries q1, q2, q3, …, qn are linked by rewrite rules (r1, r2, …) to rewritten queries p1, p2, p3, …, pn, which are linked to documents by edges labeled with scores (s1, …, s6).]
• desired(q_i): desired doc. matches for q_i
• top_k(q_i): k docs. reachable w/ highest score
• quality(desired(q_i), top_k(q_i)): quality measure per q_i
Goal: Find a subset of the rewrite rules that maximizes
$\sum_{i=1}^{n} \mathrm{quality}(\mathrm{desired}(q_i), \mathrm{top}_k(q_i))$
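The objective can be sketched in code. Everything about the data layout here is an assumption made for illustration: rules are opaque identifiers, `rewrites` maps a rule to the (query, rewritten query) pairs it produces, `scores` maps a query string to scored documents, the original query always stays reachable, and reciprocal rank stands in for the quality measure.

```python
def top_k(q, selected, rewrites, scores, k):
    """The k documents reachable from q with the highest score.

    A document reachable through several rewritten queries keeps its best score.
    """
    reachable = [q] + [p for r in selected for (src, p) in rewrites.get(r, []) if src == q]
    best = {}
    for p in reachable:
        for doc, score in scores.get(p, {}).items():
            if score > best.get(doc, float("-inf")):
                best[doc] = score
    # Sort by descending score; break ties by document id for determinism.
    return sorted(best, key=lambda doc: (-best[doc], doc))[:k]


def reciprocal_rank(desired, ranked):
    """One possible quality measure: 1 / position of the first desired document."""
    for position, doc in enumerate(ranked, start=1):
        if doc in desired:
            return 1.0 / position
    return 0.0


def objective(selected, benchmark, rewrites, scores, k, quality=reciprocal_rank):
    """Sum over benchmark queries of quality(desired(q), top_k(q))."""
    return sum(
        quality(desired, top_k(q, selected, rewrites, scores, k))
        for q, desired in benchmark.items()
    )
```

With this layout, adding a rule can lower the objective as well as raise it, because a newly reachable high-scoring document can push a desired one down.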
Hardness & Heuristics
Theorem:
• Finding an optimal set of rules is NP-hard
• So is finding any constant-factor approx.
• Holds already for k=1
• Holds for every quality measure (e.g., DCG, precision@k, etc.), assuming a very basic well-behavedness property
• Reduction from maximum independent set
We propose 2 simple heuristic algorithms:
Greedy Algorithms (High Level)
Globally Greedy Algorithm
R ← empty set
While(change) {
  r ← rule w/ max quality for R+{r}
  If(R+{r} is better than R) {
    R ← R+{r}
  }
}
Return R
Locally Greedy Algorithm
R ← empty set
For all benchmark pairs (q,d) {
  REL ← the rules relevant to (q,d)
  r ← rule w/ max quality for R+{r} among REL
  If(R+{r} is better than R) {
    R ← R+{r}
  }
}
Return R
In the paper: weighted versions + running-time optimizations
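A runnable sketch of the two heuristics, following the pseudocode above. The `quality` callable (mapping a set of rules to a number) and the `relevant` callable (mapping a benchmark pair to its relevant rules) are hypothetical interfaces introduced for this example; the weighted versions and running-time optimizations from the paper are not reproduced, and "better" is taken to mean strictly higher overall quality.

```python
def globally_greedy(rules, quality):
    """Repeatedly add the single best rule while it strictly improves quality."""
    selected = set()
    best = quality(selected)
    changed = True
    while changed:
        changed = False
        candidates = [r for r in rules if r not in selected]
        if not candidates:
            break
        rule = max(candidates, key=lambda r: quality(selected | {r}))
        value = quality(selected | {rule})
        if value > best:
            selected.add(rule)
            best = value
            changed = True
    return selected


def locally_greedy(benchmark_pairs, relevant, quality):
    """One pass over the benchmark; per pair, consider only its relevant rules."""
    selected = set()
    best = quality(selected)
    for pair in benchmark_pairs:
        candidates = [r for r in relevant(pair) if r not in selected]
        if not candidates:
            continue
        rule = max(candidates, key=lambda r: quality(selected | {r}))
        value = quality(selected | {rule})
        if value > best:
            selected.add(rule)
            best = value
    return selected
```

The global variant evaluates every remaining rule in each round, while the local variant evaluates only the rules relevant to the current pair, which is one plausible reason for the large running-time gap between the two.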
Experiments over IBM Intranet Search
[Figure: achieved accuracy for nDCGk (unweighted; top-1, top-3, top-5) and for MRR (weighted), on axes from 0.4 to 1, comparing All Rules, Random (L), Random (G), Greedy (L), Greedy (G), and Bound. The baselines are marked, and results are also shown with the benchmark added.]
More in the paper:
• Additional combinations of measure, weight, and data
• Large number of rules
• Running times
– Local is two orders of magnitude faster than Global
Summary and Future Work
• In programmable search, domain knowledge of the enterprise is introduced by means of rules
• Studied 2 problems of facilitating rule management
– Suggesting natural rules
  • Candidate generation, classifier for identifying natural rules
– Optimizing rule selection
  • Unfortunately, the problem quickly gets NP-hard
  • Presented simple heuristics + optimizations thereof
• Conducted experiments over real data from IBM Intranet search, provided by IBM search administrators
• CIKM 2012 demo
• Various challenges remain for future work
– Improving the quality and efficiency of rule suggestion
  • In particular, indexing, “learning to rank”
– Extending the framework into a richer class of rules
  • Using dictionaries, regular expressions, etc.
Questions?