117
Elias-Fano Encoding Giulio Ermanno Pibiri [email protected] Computer Science Department University of Pisa 21/06/2016 Succinct representation of monotone integer sequences with search operations 1

Elias-Fano Encoding

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Elias-Fano Encoding

Elias-Fano Encoding

Giulio Ermanno Pibiri [email protected]

Computer Science Department University of Pisa

21/06/2016

Succinct representation of monotone integer sequences with search operations

1

Page 2: Elias-Fano Encoding

2

Problem

Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i-1] ≤ S[i] for 1 ≤ i ≤ n-1, possibly repeated.

How to represent it as a bit vector in which each original integer is self-delimited, using as few as possible bits?

Page 3: Elias-Fano Encoding

2

Problem

Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i-1] ≤ S[i] for 1 ≤ i ≤ n-1, possibly repeated.

How to represent it as a bit vector in which each original integer is self-delimited, using as few as possible bits?

Huge research corpora describing different space/time trade-offs.

• Elias gamma/delta [Salomon-2007] • Variable Byte [Salomon-2007] • Varint-G8IU [Stepanov et al.-2011] • Simple-9/16 [Anh and Moffat 2005-2010] • PForDelta (PFD) [Zukowski et al.-2006] • OptPFD [Yan et al.-2009] • Binary Interpolative Coding [Moffat and Stuiver-2000]

Page 4: Elias-Fano Encoding

3

Inverted Indexes

For each term t in T we store in a list Lt the identifiers of the documents in which t appears.

Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.

The collection of all inverted lists {Lt1,…,LtT} is the inverted index.

1

Page 5: Elias-Fano Encoding

3

Inverted Indexes

For each term t in T we store in a list Lt the identifiers of the documents in which t appears.

Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.

The collection of all inverted lists {Lt1,…,LtT} is the inverted index.

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

1

Page 6: Elias-Fano Encoding

3

Inverted Indexes

For each term t in T we store in a list Lt the identifiers of the documents in which t appears.

Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.

The collection of all inverted lists {Lt1,…,LtT} is the inverted index.

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

1

Page 7: Elias-Fano Encoding

3

Inverted Indexes

For each term t in T we store in a list Lt the identifiers of the documents in which t appears.

Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.

The collection of all inverted lists {Lt1,…,LtT} is the inverted index.

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

21

3

45

T = {always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

1

Page 8: Elias-Fano Encoding

3

Inverted Indexes

For each term t in T we store in a list Lt the identifiers of the documents in which t appears.

Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T.

The collection of all inverted lists {Lt1,…,LtT} is the inverted index.

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

21

3

45

T = {always, boy, good, house, hungry, is, red, the}t1 t2 t3 t4 t5 t6 t7 t8

1

Lt1=[1, 3]Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

Page 9: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

Page 10: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

Page 11: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

Page 12: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

Page 13: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

Page 14: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

q = {good, hungry}

Page 15: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

q = {good, hungry}

Page 16: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

q = {good, hungry}

Page 17: Elias-Fano Encoding

4

Inverted Indexes

Inverted Indexes owe their popularity to the efficient resolution of queries, such as: “return me all documents in which terms {t1,…,tk} occur”.

2

q = {boy, is, the}

houseis

red

redis

alwaysgood

the

the

isboy

hungryis

boy

redhouseis

the

alwayshungry

T = {always, boy, good, house, hungry, is, red, the}21

3

45

Lt1=[1, 3]

t1 t2 t3 t4 t5 t6 t7 t8

Lt2=[4, 5]Lt3=[1]Lt4=[2, 3]Lt5=[3, 5]Lt6=[1, 2, 3, 4, 5]Lt7=[1, 2, 4]Lt8=[2, 3, 5]

q = {good, hungry}

inverted lists intersection

Page 18: Elias-Fano Encoding

5

Genesis - 1970s

Peter Elias [1923 - 2001]

Robert Fano [1917 -]

Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).

Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2, 246–260 (1974).

Page 19: Elias-Fano Encoding

5

Genesis - 1970s

Peter Elias [1923 - 2001]

Robert Fano [1917 -]

Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).

Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2, 246–260 (1974).

Sebastiano Vigna. Quasi-succinct indices.

In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

40 years later!

Page 20: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143

1

2

3

4

5

6

7

8

Page 21: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143u =

1

2

3

4

5

6

7

8

Page 22: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143u =

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 23: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 24: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 25: Elias-Fano Encoding

6

Elias-Fano solution

3471314152143

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 26: Elias-Fano Encoding

3

6

Elias-Fano solution

3471314152143

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 27: Elias-Fano Encoding

3

3

6

Elias-Fano solution

3471314152143

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 28: Elias-Fano Encoding

1

3

3

6

Elias-Fano solution

3471314152143

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 29: Elias-Fano Encoding

1

1

3

3

6

Elias-Fano solution

3471314152143

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 30: Elias-Fano Encoding

1

1

3

3

6

Elias-Fano solution

3471314152143

0

0

0 1 11 0 0

missingbuckets

L = 011100111101110111101011

u =

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 31: Elias-Fano Encoding

1

1

3

3

6

Elias-Fano solution

3471314152143

0

0

0 1 11 0 0

missingbuckets

L = 011100111101110111101011

u =1 1 01 1 1

0

0

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 32: Elias-Fano Encoding

1

1

3

3

6

Elias-Fano solution

3 3 1 0 0 1 0 0

3471314152143

0

0

0 1 11 0 0

missingbuckets

L = 011100111101110111101011

u =1 1 01 1 1

0

0

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 33: Elias-Fano Encoding

1

1

3

3

6

Elias-Fano solution

3 3 1 0 0 1 0 0

3471314152143

0

0

0 1 11 0 0

missingbuckets

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

u =1 1 01 1 1

0

0

high lowlg n lg(u/n)

1

2

3

4

5

6

7

8

0 0 0 0 1 10 0 0 1 0 00 0 0 1 1 10 0 1 1 0 10 0 1 1 1 00 0 1 1 1 10 1 0 1 0 11 0 1 0 1 1

Page 34: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Page 35: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

Page 36: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

Page 37: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

n ones

Page 38: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

We store a 0 whenever we change bucket.

n ones

Page 39: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

We store a 0 whenever we change bucket.

n ones2 lg n zeros

Page 40: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn bits+ 2n

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

We store a 0 whenever we change bucket.

n ones2 lg n zeros

Page 41: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn bits+ 2n

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

n ones2 lg n zeros

Page 42: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn bits+ 2n

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

n ones

Page 43: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn bits+ 2n

H = 1110 1110 10 0 0 10 0 0L = 011100111101110111101011

lg(u/n)

Page 44: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

unlgn bits+ 2n

Page 45: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Page 46: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

Page 47: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

Page 48: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

000000000000000000

Page 49: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003

Page 50: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003 6

000100100000000000

Page 51: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003 6

00010010000000000010

000100100010000000

Page 52: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003 6

00010010000000000010

00010010001000000011

000100100011000000

Page 53: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003 6

00010010000000000010

00010010001000000011

00010010001100000017

000100100011000001

Page 54: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003 6

00010010000000000010

00010010001000000011

00010010001100000017

000100100011000001

With possible repetitions! (weak monotonicity)

Page 55: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003

= ( (u+nn

6000100100000000000

10000100100010000000

11000100100011000000

17000100100011000001

With possible repetitions! (weak monotonicity)

Page 56: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003

= ( (u+nn

6000100100000000000

10000100100010000000

11000100100011000000

17000100100011000001 lg ≈ u+n

nlgn( (u+nn

With possible repetitions! (weak monotonicity)

Page 57: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003

= ( (u+nn

6000100100000000000

10000100100010000000

11000100100011000000

17000100100011000001

unlgn + 2n

lg ≈ u+nnlgn( (u+n

nWith possible repetitions!

(weak monotonicity)

Page 58: Elias-Fano Encoding

7

Properties - Space

EF(S[0,n)) = ?

1

Is it good or not?

unlgn bits+ 2n

Information Theoretic Lower BoundThe minimum number of bits needed to describe a set X is

lg X bits.

X is the set of all monotone sequence of length n drawn from a universe u.

?X

0000000000000000000001000000000000003

= ( (u+nn

(less than half a bit away [Elias-1974])

6000100100000000000

10000100100010000000

11000100100011000000

17000100100011000001

unlgn + 2n

lg ≈ u+nnlgn( (u+n

nWith possible repetitions!

(weak monotonicity)

Page 59: Elias-Fano Encoding

8

Properties - Operations 2

Page 60: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

2

Page 61: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

2

predecessor(x) = max{S[i] | S[i] < x}successor(x) = min{S[i] | S[i] ≥ x}

un

lg( (O worst-casequeries in

Page 62: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

2

predecessor(x) = max{S[i] | S[i] < x}successor(x) = min{S[i] | S[i] ≥ x}

un

lg( (O worst-casequeries in

Page 63: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

but…

2

predecessor(x) = max{S[i] | S[i] < x}successor(x) = min{S[i] | S[i] ≥ x}

un

lg( (O worst-casequeries in

Page 64: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

they need o(n) bits more space in order to support fast rank/select primitives on bitvector H

but…

2

predecessor(x) = max{S[i] | S[i] < x}successor(x) = min{S[i] | S[i] ≥ x}

un

lg( (O worst-casequeries in

Page 65: Elias-Fano Encoding

8

Properties - Operations

access to each S[i] in O(1) worst-case

they need o(n) bits more space in order to support fast rank/select primitives on bitvector H

but…

2

predecessor(x) = max{S[i] | S[i] < x}successor(x) = min{S[i] | S[i] ≥ x}

un

lg( (O worst-casequeries in

Page 66: Elias-Fano Encoding

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

1

Page 67: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

1

Page 68: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

rank0(5) = 2

1

Page 69: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

rank0(5) = 2rank1(7) = 4

1

Page 70: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

rank0(5) = 2rank1(7) = 4

select0(5) = 10

1

Page 71: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

rank0(5) = 2rank1(7) = 4

select0(5) = 10select1(7) = 11

1

Page 72: Elias-Fano Encoding

B = 101011010101111010110101Examples

rank0/1(i) = # of 0/1 in [0,i)Given a bitvector B of n bits:

select0/1(i) = position of i-th 0/1

Succinct rank/select

9

Definition

rank0(5) = 2rank1(7) = 4

select0(5) = 10select1(7) = 11

rank0/1(select0/1(i)) = i-1rank0/1(i) + rank1/0(i) = i

rank1/0(select0/1(i)) = select0/1(i) - iRelations

1

Page 73: Elias-Fano Encoding

10

Succinct rank/select

O(1)-solutions with o(n) bits

(multi)-layered index + precomputed table

three-level directory tree

rank

select

[Jacobson-1989]

[Clark-1996]

2

Page 74: Elias-Fano Encoding

10

Succinct rank/select

O(1)-solutions with o(n) bits

(multi)-layered index + precomputed table

three-level directory tree

rank

select

[Jacobson-1989]

[Clark-1996]

230 bits ~67% more bits!

2

Page 75: Elias-Fano Encoding

10

Succinct rank/select

O(1)-solutions with o(n) bits

(multi)-layered index + precomputed table

three-level directory tree

rank

select

[Jacobson-1989]

[Clark-1996]

230 bits ~67% more bits!

230 bits ~60% more bits!

2

Page 76: Elias-Fano Encoding

10

Succinct rank/select

O(1)-solutions with o(n) bits

(multi)-layered index + precomputed table

three-level directory tree

rank

select

[Jacobson-1989]

[Clark-1996]

230 bits ~67% more bits!

230 bits ~60% more bits!

Nowadays practical solutions are based on [Vigna-2008, Zhou et al.-2013]: • broadword programming • interleaving • Intel hardware popcnt instruction:

Long().bitCount(x) in Java

__builtin_popcountl(x) in C/C++

2

Page 77: Elias-Fano Encoding

10

Succinct rank/select

O(1)-solutions with o(n) bits

(multi)-layered index + precomputed table

three-level directory tree

rank

select

[Jacobson-1989]

[Clark-1996]

230 bits ~67% more bits!

230 bits ~60% more bits!

rank ~3% more bitsselect ~0.39% more bits

with practical constant-time selection

Nowadays practical solutions are based on [Vigna-2008, Zhou et al.-2013]: • broadword programming • interleaving • Intel hardware popcnt instruction:

Long().bitCount(x) in Java

__builtin_popcountl(x) in C/C++

2

Page 78: Elias-Fano Encoding

11

access example

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 79: Elias-Fano Encoding

11

access example

access(4) = S[4] = ?

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 80: Elias-Fano Encoding

11

access example

access(4) = S[4] = ?

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

Page 81: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

Page 82: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

Page 83: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

access(i) = select1(i)

Page 84: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)

Page 85: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)

Page 86: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)

Page 87: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

Page 88: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 89: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 90: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

access(7) = S[7] = ?

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 91: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

access(7) = S[7] = ?

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 92: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

access(7) = S[7] = ?010000

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 93: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

access(7) = S[7] = ?010000101

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Page 94: Elias-Fano Encoding

11

access example

access(4) = S[4] = ? Recall: we store a 0 whenever we change bucket.

001000

select1(i) - i=

access(7) = S[7] = ?010000101

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

H = 1110111010001000L = 011100111101110111101011k = lg(u/n)

rank0( )access(i) = select1(i)select1(i) - i

101

<< k | L[(i-1)k,ik)

Complexity: O(1)

Page 95: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 96: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 97: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 98: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 99: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 100: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 101: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 102: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

p1 p2

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 103: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

p1 p2

binary search in [p1,p2)

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 104: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

13 p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

p1 p2

binary search in [p1,p2)

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Page 105: Elias-Fano Encoding

H = 1110111010001000L = 011100111101110111101011

12

successor example

successor(12) = ?001100h12 =

13 p1 = select0(hx)-hxp2 = select0(hx+1)-hx-1

p1 p2

binary search in [p1,p2)

S = [3, 4, 7, 13, 14, 15, 21, 43]1 2 3 4 5 6 7 8

Complexity: unlg( (O

Page 106: Elias-Fano Encoding

13

Performance 1

n u access successor iterated successor iterator

~2.4x106 ~1.76x109 27.6 ns 0.24 µs 7.61 ns 2.34 ns

~10.5x106 ~7.83x109 41.4 ns 0.29 µs 7.61 ns 2.36 ns

n uncompressed sequence bytes Elias-Fano bytes compression

ratio

~2.4x106 18,787,288 3,530,704 532%

~10.5x106 83,565,504 15,704,680 532%

4 Intel i7-4790K cores (8 threads) clocked at 4Ghz, with 32 GB RAM, running Linux 4.2.0, 64 bits

C++11, compiled with gcc 5.3.0 with the highest optimisation setting

Page 107: Elias-Fano Encoding

14

Performance

Datasets

Space

AND queries (timings are in milliseconds)

Numbers from [Ottaviano and Venturini-2014].

2

24 Intel Xeon E5-2697 Ivy Bridge cores (48 threads) clocked at 2.70Ghz, with 64 GB RAM, running Linux 3.12.7, 64 bits

C++11, compiled with gcc 4.9 with the highest optimisation setting

Page 108: Elias-Fano Encoding

15

Killer applications

1. Inverted IndexesSebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM

International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and

Development in Information Retrieval (SIGIR), 273-282 (2014).

Page 109: Elias-Fano Encoding

15

Killer applications

1. Inverted Indexes

2. Social Networks

Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and

Development in Information Retrieval (SIGIR), 273-282 (2014).

Page 110: Elias-Fano Encoding

15

Killer applications

1. Inverted Indexes

2. Social Networks

Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and

Development in Information Retrieval (SIGIR), 273-282 (2014).

Page 111: Elias-Fano Encoding

15

Killer applications

1. Inverted Indexes

2. Social Networks

Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and

Development in Information Retrieval (SIGIR), 273-282 (2014).

Page 112: Elias-Fano Encoding

15

Killer applications

1. Inverted Indexes

2. Social Networks

Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and

Development in Information Retrieval (SIGIR), 273-282 (2014).

Page 113: Elias-Fano Encoding

Available Implementations

16

Library Author(s) Link Language

folly Facebook, Inc.https://

github.com/facebook/folly

C++

sdsl Simon Goghttps://

github.com/simongog/sdsl-lite

C++

ds2iGiuseppe Ottaviano Rossano Venturini Nicola Tonellotto

https://github.com/ot/ds2i C++

Sux Sebastiano Vigna http://sux.di.unimi.it Java/C++

Page 114: Elias-Fano Encoding

Summary

17

Elias-Fano encodes monotone integer sequences in space close to the information theoretic minimum, while allowing powerful search operations, namely

predecessor/successor queries and random access.

Successfully applied to crucial problems, such as inverted indexes and social graphs representation.

Several optimized software implementations are available.

Page 115: Elias-Fano Encoding

References

18

Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).

Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2, 246–260 (1974).

Guy Jacobson. Succinct Static Data Structures. Ph.D. Thesis, Carnegie Mellon University (1989).

David Clark. Compact Pat Trees. Ph.D. Thesis, University of Waterloo (1996).

[Fano-1971]

[Elias-1974]

[Jacobson-1989]

[Clark-1996]

1

Vo Ngoc Anh and Alistair Moffat. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval Journal 8, 1, 151–166 (2005).

Alistair Moffat and Lang Stuiver. Binary Interpolative Coding for Effective Index Compression. Information Retrieval Journal 3, 1, 25–47 (2000).[Moffat and Stuiver-2000]

[Anh and Moffat-2005]

David Salomon. Variable-length Codes for Data Compression. Springer (2007).[Salomon-2007]

Sebastiano Vigna. Broadword implementation of rank/select queries. In Workshop in Experimental Algorithms (WEA), 154-168 (2008).[Vigna-2008]

Page 116: Elias-Fano Encoding

Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM). 317–326 (2011).

Vo Ngoc Anh and Alistair Moffat. Index compression using 64-bit words. In Software: Practice and Experience 40, 2, 131–147 (2010).

Marcin Zukowski, Sandor Hèman, Niels Nes, and Peter Boncz. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). 59–70 (2006).

Hao Yan, Shuai Ding, and Torsten Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW). 401–410 (2009).

References 2

19

[Yan et al.-2009]

[Stepanov et al.-2011]

[Anh and Moffat-2010]

[Zukowski et al.-2010]

Michael Curtiss et al. Unicorn: A System for Searching the Social Graph. In Proceedings of the Very Large Database Endowment (PVLDB), 1150-1161 (2013).

Giuseppe Ottaviano, Rossano Venturini. Partitioned Elias-Fano Indexes. In Proceedings of the 37-th ACM International Conference on Research and Development in Information Retrieval (SIGIR), 273-282 (2014).

[Curtiss et al.-2013]

[Ottaviano and Venturini-2014]

Dong Zhou, David Andersen, Michael Kaminsky. Space-Efficient, High-Performance Rank and Select Structures on Uncompressed Bit Sequences. In Proceedings of the 12-nd International Symposium on Experimental Algorithms (SEA), 151-163 (2013).

Sebastiano Vigna. Quasi-succinct indices. In Proceedings of the 6-th ACM International Conference on Web Search and Data Mining (WSDM), 83-92 (2013).[Vigna-2013]

[Zhou et al.-2013]

Page 117: Elias-Fano Encoding

Thanks for your attention, time, patience!

Any questions?

20