Query Expansion with Locally-Trained Word Embeddings
Fernando Diaz, Bhaskar Mitra, Nick Craswell (Microsoft)
[Figure: background distribution p(d) vs. query-conditioned distribution p(d|q) over documents d]
[Figure: t-SNE projection of the top words by p̃(d|q) for the query terms *cut and *gas under a global vs. a local embedding (blue: query; red: top words by p(d|q)); nearby words include cutting, slash, reduce, budget, deficit, tax]
• global: trained using full corpus
• local: trained using topically-relevant documents
• local term clustering [Lesk, 1968; Attar and Fraenkel, 1977]
• local latent semantic analysis [Hull, 1994; Hull, 1995; Schutze et al., 1995; Singhal et al., 1997]
• local document clustering [Willett, 1985; Tombros and van Rijsbergen, 2001; Tombros et al., 2002]
• one sense per discourse [Gale et al., 1992]
[Diagram: query issued against the target corpus, producing results]
query = gas tax
q = [gas:1.0 tax:1.0 petroleum:0.0 tariff:0.0 …]
d = [gas:0.0 tax:0.0 petroleum:0.7 tariff:0.5 …]
The relevant document d shares no terms with the raw query q. Multiplying q by a term-similarity matrix W spreads query weight onto related terms:
W = [ gas: petroleum:0.9, indigestion:0.6, … ; tax: tariff:0.7, strain:0.4, … ]
qW = [gas:1.0 tax:1.0 petroleum:0.8 tariff:0.6 …]
W = UUᵀ, where U is an m × k embedding matrix
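The expansion step can be sketched on a toy vocabulary. The weights mirror the slide's "gas tax" example; the term-similarity matrix W is hand-set here rather than computed as UUᵀ from a learned embedding (a minimal sketch, not the paper's implementation):

```python
import numpy as np

# Toy vocabulary; weights mirror the slide's "gas tax" example.
vocab = ["gas", "tax", "petroleum", "tariff"]
q = np.array([1.0, 1.0, 0.0, 0.0])   # query: "gas tax"
d = np.array([0.0, 0.0, 0.7, 0.5])   # relevant doc shares no query terms

# Term-term similarity matrix W (hand-set; in general W = U @ U.T
# for an m x k embedding matrix U).
W = np.array([
    [1.0, 0.0, 0.8, 0.0],   # gas  ~ petroleum
    [0.0, 1.0, 0.0, 0.6],   # tax  ~ tariff
    [0.8, 0.0, 1.0, 0.0],
    [0.0, 0.6, 0.0, 1.0],
])

q_expanded = q @ W                      # spread query mass onto similar terms
print("score before:", q @ d)           # 0.0 -- no term overlap
print("score after :", q_expanded @ d)  # 0.86 -- overlap via expansion terms
```

The expanded query now matches the relevant document through petroleum and tariff even though neither appears in the raw query.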
[Figure: true relevance distribution p(d|q) vs. its approximation p̃(d|q) over documents d, against the background p(d)]
[Diagram: the query run against the target corpus and against an external corpus, each producing results]
U is trained under one of four conditions:
• uniform p(d) on the target corpus
• uniform p(d) on an external corpus
• p(d|q) on the target corpus
• p(d|q) on an external corpus
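One way to realize the p(d|q) conditions is to retrieve documents for the query, keep the top-ranked ones as a proxy for p(d|q), and train the embedding on that subset only. Below is a minimal sketch that uses a truncated SVD of a term co-occurrence matrix as a stand-in for the paper's word2vec training; the documents, retrieval scores, top_n, and k are all toy assumptions:

```python
import numpy as np
from itertools import combinations

def local_embedding(docs, scores, top_n=2, k=2):
    """Train an embedding only on the top-scoring docs for the query."""
    top_docs = [d for _, d in sorted(zip(scores, docs), reverse=True)[:top_n]]
    vocab = sorted({w for d in top_docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    # Symmetric within-document co-occurrence counts.
    C = np.zeros((len(vocab), len(vocab)))
    for d in top_docs:
        for a, b in combinations(d.split(), 2):
            C[idx[a], idx[b]] += 1
            C[idx[b], idx[a]] += 1
    # Truncated SVD yields the m x k embedding matrix U.
    u, s, _ = np.linalg.svd(C)
    U = u[:, :k] * np.sqrt(s[:k])
    return vocab, U

docs = ["gas tax petroleum price",
        "tariff tax petroleum import",
        "stomach gas indigestion remedy"]   # the other sense of "gas"
scores = [0.9, 0.7, 0.1]                    # stand-in for p(d|q)
vocab, U = local_embedding(docs, scores)
W = U @ U.T    # term-term similarity from the local embedding
```

Because the low-scoring "indigestion" document is excluded, the local embedding only sees the taxation sense of gas, which is exactly the disambiguation effect the local condition is after.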
           docs         words        queries
trec12     469,949      438,338      150
robust     528,155      665,128      250
web        50,220,423   90,411,624   200
global                 local
target                 target
wikipedia+gigaword*    gigaword†
google news*           wikipedia†
*publicly available embedding; †publicly available external corpus
[Figure: NDCG@10 (0.0–0.5) on trec12, robust, and web; expansion conditions: none, global, local]
[Figure: NDCG@10 (0.0–0.5) on trec12, robust, and web; local embedding trained on different corpora: target, gigaword, wikipedia]
• local embedding provides a stronger representation than global embedding
• potential impact for other topic-specific natural language processing tasks
• future work
  • effectiveness improvements
  • efficiency improvements