Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
joint work with Ingmar Weber

at Google in Mountain View, USA, August 14
Basic Autocompletion
– saves typing
– no more information than necessary
– find out about formulations used (googlism, googlearchy)
– error correction (googel)
It's useful …
It's more useful …
Complete to phrases
– phrase mountain view → add word mountain_view to index
Complete to subwords
– compound word eigenproblem → add word problem to index
Complete to category names
– author Edleno Moura → add words moura:edleno:author and edleno:moura:author
Faceted search
– add ct:conference:sigir
– add ct:author:edleno_moura
– add ct:year:2005
all via the same mechanism
Related Engines
Basic Problem Definition
Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of last word)
Answer
– all documents D' from D, containing a word from W
– all words W' from W, contained in a document from D
Extensions (see paper at SIGIR'06)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
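The answer definition above can be made concrete with a brute-force reference implementation (a sketch over made-up toy data; real engines use the index structures discussed on the next slides):

```python
def autocomplete(docs, D, W):
    """Brute-force answer to the basic problem definition.

    docs: dict mapping doc id -> set of words in that document
    D:    set of doc ids (hits for the first part of the query)
    W:    set of words (potential completions of the last word)

    Returns (D', W').
    """
    # D': all documents from D containing a word from W
    D_prime = {d for d in D if docs[d] & W}
    # W': all words from W contained in a document from D
    W_prime = {w for w in W if any(w in docs[d] for d in D)}
    return D_prime, W_prime

# Toy example: three documents, last query word matches goog*
docs = {1: {"google", "mountain"}, 2: {"googling", "view"}, 3: {"maps"}}
D = {1, 2, 3}
W = {"google", "googling", "googlism"}
print(autocomplete(docs, D, W))  # ({1, 2}, {'google', 'googling'})
```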
First try: inverted index (INV)
Processing 1-word queries with INV
For example, goog*
D all documents
W all words matching goog*
Iterate over all words from W
google Doc.18, Doc. 53, Doc. 591, ...
googlearchy Doc. 3, Doc. 66, Doc. 765, ...
googles Doc. 25, Doc. 98, Doc. 221, ...
googling Doc. 67, Doc. 189, Doc. 221, ...
googlism Doc. 16, Doc. 110, Doc. 141, ...
Merge the documents lists
D' Doc. 3, Doc. 16, Doc. 18, Doc. 25, …
Output all words from range as completions
W' google, googlearchy, googles, …
Expensive!
Trivial for 1-word queries
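For a 1-word query the completion output really is trivial: W' is simply the whole word range, and D' is a k-way merge of the (already sorted) posting lists. A sketch with the toy postings from the slide:

```python
import heapq

# Hypothetical posting lists for all words matching goog* (sorted doc ids)
postings = {
    "google":      [18, 53, 591],
    "googlearchy": [3, 66, 765],
    "googles":     [25, 98, 221],
    "googling":    [67, 189, 221],
    "googlism":    [16, 110, 141],
}

# W' = all words in the range; D' = k-way merge of their sorted lists
W_prime = sorted(postings)
D_prime = sorted(set(heapq.merge(*postings.values())))

print(W_prime)        # ['google', 'googlearchy', 'googles', 'googling', 'googlism']
print(D_prime[:4])    # [3, 16, 18, 25]
```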
Processing multi-word queries with INV
For example, goog* mou*
D Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for goog*)
W all words matching mou*
Iterate over all words from W
mould Doc. 8, Doc. 23, Doc. 291, ...
mount Doc. 24, Doc. 36, Doc. 165, ...
mountain Doc. 3, Doc. 18, Doc. 66, ...
mounting Doc. 56, Doc. 129, Doc. 251, ...
moura Doc. 18, Doc. 21, Doc. 25, ...
Intersect each list with D, then merge
D' Doc. 3, Doc. 18, Doc. 25, …
Output all words with non-empty intersection
W' mountain, moura
Most intersections are empty, but INV has to compute them all!
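The per-completion intersections can be sketched as follows (toy postings as on the slide); note how INV must touch every candidate word even though most intersections come out empty:

```python
# Hits for goog* so far, and hypothetical posting lists for words matching mou*
D = {3, 16, 18, 25}
postings = {
    "mould":    [8, 23, 291],
    "mount":    [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura":    [18, 21, 25],
}

D_prime, W_prime = set(), []
for word, plist in postings.items():   # one intersection per candidate word
    common = D.intersection(plist)
    if common:                          # ... and most of them are empty
        W_prime.append(word)
        D_prime |= common

print(sorted(D_prime), W_prime)  # [3, 18, 25] ['mountain', 'moura']
```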
INV — Problems
Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)
Still hard to beat INV in practice
– highly compressible
half the space on disk means half the time to read it
– INV has very good locality of access
the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory
– simple code
instruction cache, branch prediction, etc.
A Hybrid Index (HYB)
But this looks very wasteful
Basic Idea: have lists for ranges of words
mould – moura   Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...

Problem: not enough to show completions
Solution: store the word(s) along with each doc id

mould – moura   Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
                mould, moura, mount, mould, mountain, mounting, moura
HYB — Details
HYB has a block for each word range, conceptually:

doc ids:  1   3   3   5   5   6   7   8   8   9  11  11  11  12  13  15
words:    D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A

Replace doc ids by gaps and words by frequency ranks:

gaps:    +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
ranks:  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks with a prefix code, so that a value occurring with relative frequency x takes about log2(1/x) bits:

gap codes:   +0 → 0    +1 → 10    +2 → 110
rank codes:  1st (A) → 0    2nd (C) → 10    3rd (D) → 111    4th (B) → 110

encoded gaps:   10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks:  111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
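The gap/rank transformation can be sketched as follows (toy block from the slide; the prefix-code step that follows is separate):

```python
# Toy HYB block: parallel lists of doc ids and the words occurring in them
doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words   = list("DACABACADAABCACA")

# Doc ids -> gaps to the previous id (the first gap is counted from 0)
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

# Words -> frequency ranks: most frequent word gets rank 1,
# ties broken by first occurrence in the block
by_freq = sorted(set(words), key=lambda w: (-words.count(w), words.index(w)))
ranks = [by_freq.index(w) + 1 for w in words]

print(gaps)   # [1, 2, 0, 2, 0, 1, 1, 1, 0, 1, 2, 0, 0, 1, 1, 2]
print(ranks)  # [3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 1, 4, 2, 1, 2, 1]
```

Small gaps and small ranks dominate, which is exactly what makes the subsequent prefix coding effective.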
An actual block of HYB
How well does it compress? Which block size?
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
    Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: The empirical entropy of HYB with block size ε∙n is
    Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
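The two bounds differ only in the ε/ln 2 term per posting, so HYB's overhead over INV stays small for small ε. A quick numeric check (made-up Zipf-like document frequencies, all names hypothetical):

```python
import math

def inv_bits(ns, n):
    # Empirical entropy of INV: sum of n_i * (1/ln 2 + log2(n/n_i))
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in ns)

def hyb_bits(ns, n, eps):
    # Empirical entropy of HYB with block size eps*n
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in ns)

n = 1_000_000                               # number of documents
ns = [n // (i * i) for i in range(1, 1000)]  # hypothetical word doc-frequencies

for eps in (0.01, 0.1, 1.0):
    print(eps, hyb_bits(ns, n, eps) / inv_bits(ns, n))
```

For ε → 0 the two formulas coincide, matching the nearly identical index sizes in the table below.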
              MEDICINE           WIKIPEDIA           TREC .GOV
              44,015 docs        2,866,503 docs      25,204,013 docs
              263,817 words      6,700,119 words     25,263,176 words
              with positions     with positions      no positions

raw size      452 MB             7.4 GB              426 GB
INV           13 MB              0.48 GB             4.6 GB
HYB           14 MB              0.51 GB             4.9 GB

Nice match of theory and practice
(ni = number of documents containing the i-th word, n = total number of documents)
INV vs. HYB — Query Time

Experiment: type ordinary queries from left to right
– go, goo, goog, googl, google, google mo, google mou, ...

              MEDICINE              WIKIPEDIA             TREC .GOV
              44,015 docs           2,866,503 docs        25,204,013 docs
              263,817 words         6,700,119 words       25,263,176 words
              5,732 real queries    100 random queries    50 TREC queries
              with proximity        with proximity        no proximity

INV    avg    0.03 secs             0.17 secs             0.58 secs
       max    0.38 secs             2.27 secs             16.83 secs
HYB    avg    0.003 secs            0.05 secs             0.11 secs
       max    0.06 secs             0.49 secs             0.86 secs

Theoretical analysis: see paper at SIGIR'06
HYB better by an order of magnitude
System Design — High Level View
Compute Server (C++)  ↔  Web Server (PHP)  ↔  User Client (JavaScript)

Debugging such an application is hell!
Summary of Results
Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)
Autocompletion and more
– phrase and subword completion, semantic completion, XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod] (new)
all with one and the same (efficient) mechanism