Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
joint work with Ingmar Weber

at Google in Mountain View, USA, August 14
Basic Autocompletion
– saves typing
– no more information than necessary
– find out about formulations used (googlism, googlearchy)
– error correction (googel)
It's useful …
It's more useful …
Complete to phrases
– phrase mountain view → add word mountain_view to index
Complete to subwords
– compound word eigenproblem → add word problem to index
Complete to category names
– author Edleno Moura → add words moura:edleno:author and edleno:moura:author
Faceted search
– add ct:conference:sigir
– add ct:author:edleno_moura
– add ct:year:2005
all via the same mechanism
Related Engines
Basic Problem Definition
Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of last word)
Answer
– all documents D' from D, containing a word from W
– all words W' from W, contained in a document from D
Extensions (see paper at SIGIR'06)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
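The answer definition above can be made concrete with a brute-force reference implementation (a sketch over made-up toy data; real engines use the index structures discussed on the next slides):

```python
def autocomplete(docs, D, W):
    """Brute-force answer to the basic problem definition.

    docs: dict mapping doc id -> set of words in that document
    D:    set of doc ids (hits for the first part of the query)
    W:    set of words (potential completions of the last word)

    Returns (D', W').
    """
    # D': all documents from D containing a word from W
    D_prime = {d for d in D if docs[d] & W}
    # W': all words from W contained in a document from D
    W_prime = {w for w in W if any(w in docs[d] for d in D)}
    return D_prime, W_prime

# Toy example: three documents, last query word matches goog*
docs = {1: {"google", "mountain"}, 2: {"googling", "view"}, 3: {"maps"}}
D = {1, 2, 3}
W = {"google", "googling", "googlism"}
print(autocomplete(docs, D, W))  # ({1, 2}, {'google', 'googling'})
```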
First try: inverted index (INV)
Processing 1-word queries with INV
For example, goog*
D all documents
W all words matching goog*
Iterate over all words from W
google Doc.18, Doc. 53, Doc. 591, ...
googlearchy Doc. 3, Doc. 66, Doc. 765, ...
googles Doc. 25, Doc. 98, Doc. 221, ...
googling Doc. 67, Doc. 189, Doc. 221, ...
googlism Doc. 16, Doc. 110, Doc. 141, ...
Merge the documents lists
D' Doc. 3, Doc. 16, Doc. 18, Doc. 25, …
Output all words from range as completions
W' google, googlearchy, googles, …
Expensive!
Trivial for 1-word queries
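For a 1-word query the completion output really is trivial: W' is simply the whole word range, and D' is a k-way merge of the (already sorted) posting lists. A sketch with the toy postings from the slide:

```python
import heapq

# Hypothetical posting lists for all words matching goog* (sorted doc ids)
postings = {
    "google":      [18, 53, 591],
    "googlearchy": [3, 66, 765],
    "googles":     [25, 98, 221],
    "googling":    [67, 189, 221],
    "googlism":    [16, 110, 141],
}

# W' = all words in the range; D' = k-way merge of their sorted lists
W_prime = sorted(postings)
D_prime = sorted(set(heapq.merge(*postings.values())))

print(W_prime)        # ['google', 'googlearchy', 'googles', 'googling', 'googlism']
print(D_prime[:4])    # [3, 16, 18, 25]
```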
Processing multi-word queries with INV
For example, goog* mou*
D Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for goog*)
W all words matching mou*
Iterate over all words from W
mould Doc. 8, Doc. 23, Doc. 291, ...
mount Doc. 24, Doc. 36, Doc. 165, ...
mountain Doc. 3, Doc. 18, Doc. 66, ...
mounting Doc. 56, Doc. 129, Doc. 251, ...
moura Doc. 18, Doc. 21, Doc. 25, ...
Intersect each list with D, then merge
D' Doc. 3, Doc. 18, Doc. 25, …
Output all words with non-empty intersection
W' mountain, moura
Most intersections are empty, but INV has to compute them all!
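The per-completion intersections can be sketched as follows (toy postings as on the slide); note how INV must touch every candidate word even though most intersections come out empty:

```python
# Hits for goog* so far, and hypothetical posting lists for words matching mou*
D = {3, 16, 18, 25}
postings = {
    "mould":    [8, 23, 291],
    "mount":    [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura":    [18, 21, 25],
}

D_prime, W_prime = set(), []
for word, plist in postings.items():   # one intersection per candidate word
    common = D.intersection(plist)
    if common:                          # ... and most of them are empty
        W_prime.append(word)
        D_prime |= common

print(sorted(D_prime), W_prime)  # [3, 18, 25] ['mountain', 'moura']
```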
INV — Problems
Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)
Still hard to beat INV in practice
– highly compressible
half the space on disk means half the time to read it
– INV has very good locality of access
the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory
– simple code
instruction cache, branch prediction, etc.
A Hybrid Index (HYB)
But this looks very wasteful
Basic Idea: have lists for ranges of words
mould – moura   Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...

Problem: not enough to show completions
Solution: store the word(s) along with each doc id

mould – moura   Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
                mould, moura, mount, mould, mountain, mounting, moura
HYB — Details
HYB has a block for each word range, conceptually:

doc ids:  1   3   3   5   5   6   7   8   8   9  11  11  11  12  13  15
words:    D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A

Replace doc ids by gaps and words by frequency ranks:

gaps:    +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
ranks:  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks with a prefix code, so that a value occurring with relative frequency x takes about log2(1/x) bits:

gap codes:   +0 → 0    +1 → 10    +2 → 110
rank codes:  1st (A) → 0    2nd (C) → 10    3rd (D) → 111    4th (B) → 110

encoded gaps:   10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks:  111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
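The gap/rank transformation can be sketched as follows (toy block from the slide; the prefix-code step that follows is separate):

```python
# Toy HYB block: parallel lists of doc ids and the words occurring in them
doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words   = list("DACABACADAABCACA")

# Doc ids -> gaps to the previous id (the first gap is counted from 0)
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

# Words -> frequency ranks: most frequent word gets rank 1,
# ties broken by first occurrence in the block
by_freq = sorted(set(words), key=lambda w: (-words.count(w), words.index(w)))
ranks = [by_freq.index(w) + 1 for w in words]

print(gaps)   # [1, 2, 0, 2, 0, 1, 1, 1, 0, 1, 2, 0, 0, 1, 1, 2]
print(ranks)  # [3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 1, 4, 2, 1, 2, 1]
```

Small gaps and small ranks dominate, which is exactly what makes the subsequent prefix coding effective.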
An actual block of HYB
How well does it compress? Which block size?
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
    Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: The empirical entropy of HYB with block size ε∙n is
    Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
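The two bounds differ only in the ε/ln 2 term per posting, so HYB's overhead over INV stays small for small ε. A quick numeric check (made-up Zipf-like document frequencies, all names hypothetical):

```python
import math

def inv_bits(ns, n):
    # Empirical entropy of INV: sum of n_i * (1/ln 2 + log2(n/n_i))
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in ns)

def hyb_bits(ns, n, eps):
    # Empirical entropy of HYB with block size eps*n
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in ns)

n = 1_000_000                               # number of documents
ns = [n // (i * i) for i in range(1, 1000)]  # hypothetical word doc-frequencies

for eps in (0.01, 0.1, 1.0):
    print(eps, hyb_bits(ns, n, eps) / inv_bits(ns, n))
```

For ε → 0 the two formulas coincide, matching the nearly identical index sizes in the table below.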
              MEDICINE           WIKIPEDIA           TREC .GOV
              44,015 docs        2,866,503 docs      25,204,013 docs
              263,817 words      6,700,119 words     25,263,176 words
              with positions     with positions      no positions

raw size      452 MB             7.4 GB              426 GB
INV           13 MB              0.48 GB             4.6 GB
HYB           14 MB              0.51 GB             4.9 GB

Nice match of theory and practice
(ni = number of documents containing the i-th word, n = total number of documents)
INV vs. HYB — Query Time

Experiment: type ordinary queries from left to right
– go, goo, goog, googl, google, google mo, google mou, ...

              MEDICINE              WIKIPEDIA             TREC .GOV
              44,015 docs           2,866,503 docs        25,204,013 docs
              263,817 words         6,700,119 words       25,263,176 words
              5,732 real queries    100 random queries    50 TREC queries
              with proximity        with proximity        no proximity

INV    avg    0.03 secs             0.17 secs             0.58 secs
       max    0.38 secs             2.27 secs             16.83 secs
HYB    avg    0.003 secs            0.05 secs             0.11 secs
       max    0.06 secs             0.49 secs             0.86 secs

Theoretical analysis: see paper at SIGIR'06
HYB better by an order of magnitude
System Design — High Level View
Compute Server (C++)  ↔  Web Server (PHP)  ↔  User Client (JavaScript)

Debugging such an application is hell!
Summary of Results
Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)
Autocompletion and more
– phrase and subword completion, semantic completion, XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod] (new)
all with one and the same (efficient) mechanism