All About Bitmap Indexes... And Sorting Them

Daniel Lemire

http://www.daniel-lemire.com/

Joint work (presented at BDA’08 and DOLAP’08) with Owen Kaser (UNB) and Kamel Aouiche (post-doc).

February 12, 2009


Database Indexes

Databases use precomputed indexes (auxiliary data structures) to speed processing.

An index costs memory and can hurt update speed.

Improving indexes is practically important.


What makes indexes fast?

We are going to use these three ideas:

Expect specific queries? Avoid a full scan!

Data is not random? Compress it!

A specific computer architecture? Tailor your code for it!


Bitmap indexes

SELECT * FROM T WHERE x=a AND y=b;

Bitmap indexes have a long history. (1972 at IBM.)

Long history with DW & OLAP. (Sybase IQ since mid 1990s.)

Main competition: B-trees.

To answer the query above, compute

{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}


Bitmaps and fast AND/OR operations

Computing the union of two sets of integers between 1 and 64 (e.g., row ids of a trivial table)... E.g., {1, 5, 8} ∪ {1, 3, 5}?

Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)

Extend to sets from 1..N using ⌈N/64⌉ operations.

To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]:
a0, ..., a63 BitwiseOR b0, ..., b63;
a64, ..., a127 BitwiseOR b64, ..., b127;
a128, ..., a191 BitwiseOR b128, ..., b191;
...
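The word-by-word union can be sketched in a few lines of Python. This is only an illustration: the helper names (`to_words`, `union`, `to_row_ids`) are made up for this example, and Python integers stand in for 64-bit machine words.

```python
WORD_BITS = 64  # assumed machine-word size

def to_words(row_ids, n):
    """Pack a set of row ids in 1..n into a list of 64-bit words."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)  # ceil(n / 64) words
    for r in row_ids:
        idx, bit = divmod(r - 1, WORD_BITS)
        words[idx] |= 1 << bit
    return words

def union(a_words, b_words):
    """One bitwise OR per pair of words."""
    return [a | b for a, b in zip(a_words, b_words)]

def to_row_ids(words):
    """Unpack the words back into a set of row ids."""
    return {i * WORD_BITS + bit + 1
            for i, w in enumerate(words)
            for bit in range(WORD_BITS) if (w >> bit) & 1}

result = to_row_ids(union(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
```

With N = 64 the union costs a single OR; with N = 200 it costs four.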


Common applications of the bitmaps

The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.)

Search engines use bitmaps to filter queries, e.g., Apache Lucene.


Bitmap compression

[Figure: a column x with n rows and L distinct values, shown next to its index bitmaps (x=1, x=2, x=3, ...).]

A column with n rows and L distinct values ⇒ nL bits

E.g., n = 10^6, L = 10^4 → 10 Gbits

Uncompressed bitmaps are often impractical.

Moreover, bitmaps often contain long streams of zeroes...

Logical operations over these zeroes are a waste of CPU cycles.
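To make the nL count concrete, here is a minimal sketch of an uncompressed bitmap index over one column. The function name `build_bitmap_index` and the toy column are assumptions made for this example.

```python
def build_bitmap_index(column):
    """Return {value: list of 0/1 bits}, one uncompressed bitmap per distinct value."""
    index = {v: [0] * len(column) for v in set(column)}
    for row, v in enumerate(column):
        index[v][row] = 1  # exactly one bitmap gets a 1 for each row
    return index

column = [1, 3, 2, 1, 3]                          # n = 5 rows, L = 3 distinct values
index = build_bitmap_index(column)
total_bits = sum(len(b) for b in index.values())  # n * L bits in total
```

Each row contributes a single 1 and L − 1 zeroes, which is why the zeroes dominate as L grows.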


How to compress bitmaps?

Must handle long streams of zeroes efficiently ⇒ Run-length encoding (RLE)?

Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...

So just encode the run lengths, e.g.,

0001111100010111 → 3, 5, 3, 1, 1, 3
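A minimal sketch of this encoding, assuming the bitmap is given as a string of '0' and '1' characters and that the first run always counts zeroes (so a bitmap starting with 1 gets a leading run of length 0). Both conventions are choices made for this example.

```python
from itertools import groupby

def rle_encode(bits):
    """Return the run lengths, starting with a (possibly empty) run of 0s."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    if bits and bits[0] == "1":
        runs.insert(0, 0)  # convention: the first run counts zeroes
    return runs

def rle_decode(runs):
    """Rebuild the bit string: even positions are runs of 0s, odd ones runs of 1s."""
    return "".join(("0" if i % 2 == 0 else "1") * n for i, n in enumerate(runs))

runs = rle_encode("0001111100010111")
```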


Compressing better with delta codes

RLE can make things worse. E.g., with 8-bit counters, 11 may become 000000101.

How many bits to use for the counters?

Universal codes such as delta codes use no more than c log x bits to represent the value x.

Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc.

Delta codes build on Gamma codes. Two steps:

Write x = 2^N + (x mod 2^N).

Write N − 1 as a gamma code; write x mod 2^N as an N-bit number.

E.g., 17 = 2^4 + 1: 001 followed by 0001, i.e., 0010001
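The two steps can be sketched as an encoder. This follows the simplified description on the slide literally, not the standard Elias delta code, and it writes the remainder on N bits so that it reproduces the example for 17. The names `slide_gamma` and `slide_delta` are hypothetical, and decoding is not addressed.

```python
def slide_gamma(v):
    """Gamma code as tabulated on the slide: 0 -> '0', 1 -> '1', 2 -> '01', 3 -> '001', ..."""
    return "0" if v == 0 else "0" * (v - 1) + "1"

def slide_delta(x):
    """Encode x >= 2 as gamma(N - 1) followed by x mod 2^N written on N bits."""
    if x < 2:
        raise ValueError("this sketch only covers x >= 2")
    n = x.bit_length() - 1        # largest N such that 2^N <= x
    remainder = x - (1 << n)      # x mod 2^N
    return slide_gamma(n - 1) + format(remainder, "0{}b".format(n))

code = slide_delta(17)            # N = 4: '001' then '0001'
```

The code length grows like 2 log x here, compared with a fixed counter width that wastes bits on short runs and overflows on long ones.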


RLE with delta codes is pretty good

In some (weak) sense, RLE compression with delta codes is optimal!

Theorem

A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.


Is the compression rate what matters?

There is endless debate about whether more compression is better:

Solid-State Drives (SSD) have 10× the bandwidth? All problems are CPU-bound!

Multi-core CPUs? All problems are I/O-bound!

Store your indexes in RAM? All problems are CPU-bound!

...

No definitive answer on whether more compression is better. It depends!


Byte/Word-aligned RLE

RLE variants can focus on runs that align with machine-word boundaries.

Trade compression for speed.

That is what Oracle is doing.

Variants: BBC (byte aligned), WAH (word aligned)

Our EWAH extends Wu et al.’s (was known to Wu as WBC) word-aligned hybrid.

0101000000000000 000...000 000...000 0011111111111100 ...
⇒ dirty word, run of 2 “clean 0” words, dirty word, ...
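The dirty/clean classification can be sketched as follows. This is not the actual EWAH or WAH storage format, only the grouping step; 16-bit words are assumed to match the example above, and the function name and output shape are made up for illustration.

```python
def word_aligned_runs(bits, word_size=16):
    """Split a bit string into words and group them into
    ('dirty', word) items and ('clean0' or 'clean1', run_length) items."""
    out = []
    for i in range(0, len(bits), word_size):
        word = bits[i:i + word_size].ljust(word_size, "0")  # pad the last word
        if word == "0" * word_size:
            kind = "clean0"
        elif word == "1" * word_size:
            kind = "clean1"
        else:
            out.append(("dirty", word))   # mixed words are stored verbatim
            continue
        if out and out[-1][0] == kind:
            out[-1] = (kind, out[-1][1] + 1)  # extend the current clean run
        else:
            out.append((kind, 1))
    return out

bits = "0101000000000000" + "0" * 32 + "0011111111111100"
runs = word_aligned_runs(bits)
```

Only whole clean words are merged into runs, so logical operations can proceed one machine word at a time.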


Computational and storage bounds

n → number of rows, c → number of 1s per row;

Model storage cost as #(dirty words) + #(clean words, 0x00)

Storage is in O(nc);

Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps.)

Construction time is proportional to index size. (Data is written sequentially on disk.)

Implementation scales to millions of bitmaps.


What about other compression types?

Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding, ...)?

With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1| + |B2|).

We don’t know how to do this using the other compression techniques!

Hence, with RLE, compression saves both storage and CPU cycles!
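A sketch of why the bound holds for plain RLE: the OR is computed directly on the run lengths, without decompressing. It assumes both bitmaps are lists of alternating run lengths starting with a run of 0s (possibly of length 0) and covering the same number of bits. The names `rle_or` and `expand` are made up for this example.

```python
def rle_or(a, b):
    """OR two RLE bitmaps; the loop runs at most len(a) + len(b) times.
    Returns a list of [bit, length] runs."""
    out = []
    i = j = 0
    left_a = a[0] if a else 0
    left_b = b[0] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)
        bit = (i % 2) | (j % 2)          # odd positions are runs of 1s
        if step:
            if out and out[-1][0] == bit:
                out[-1][1] += step       # merge with the previous output run
            else:
                out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:                  # current run of a is used up
            i += 1
            left_a = a[i] if i < len(a) else 0
        if left_b == 0:                  # current run of b is used up
            j += 1
            left_b = b[j] if j < len(b) else 0
    return out

def expand(runs):
    """Decompress [bit, length] runs, only to check the result."""
    return "".join(str(bit) * n for bit, n in runs)

# 0001111100010111 OR 1100000000000001
result = rle_or([3, 5, 3, 1, 1, 3], [0, 2, 13, 1])
```

Each iteration consumes at least one input run, which gives the O(|B1| + |B2|) running time; an AND works the same way with `&` in place of `|`.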


What happens when you have many bitmaps?

Consider B1 ∨ B2 ∨ ... ∨ BN.

First compute the first two: B1 ∨ B2 in time O(|B1| + |B2|).

|B3 ∨ B4| is in O(|B3| + |B4|).

Thus (B1 ∨ B2) ∨ (B3 ∨ B4) takes O(2 ∑_i |Bi|)...

Total is in O(∑_{i=1}^{N} |Bi| log N) [Lemire et al., 2009].
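The pairwise scheme can be sketched as rounds of ORs over neighbouring bitmaps. Python integers stand in for compressed bitmaps here, which is an assumption of this example; with real compressed bitmaps each `|` would cost O(|Bi| + |Bj|), and there are about log N rounds.

```python
def or_many(bitmaps):
    """OR a list of bitmaps pairwise, in about log2(N) rounds."""
    if not bitmaps:
        return 0
    level = list(bitmaps)
    while len(level) > 1:
        # OR neighbours two by two
        nxt = [level[k] | level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd one out moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

result = or_many([0b0001, 0b0100, 0b0100, 0b1000, 0b0010])
```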


Improving compression by sorting the table

RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better;

But finding the best row ordering is NP-hard [Lemire et al., 2009].

Lexicographic row sorting is

fast, even for very large tables;

easy: sort is a Unix staple.

Substantial index-size reductions (often 2.5 times)
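A toy illustration of the order sensitivity: count the runs in every bitmap of a simple index (one bitmap per attribute value) before and after sorting the rows. The table and the helper names are made up for this example, and the run count is used as a rough proxy for the compressed size.

```python
def count_runs(bits):
    """Number of maximal runs of identical bits."""
    return sum(1 for k, b in enumerate(bits) if k == 0 or b != bits[k - 1])

def index_runs(rows):
    """Total number of runs over all bitmaps of a one-bitmap-per-value index."""
    total = 0
    for col in range(len(rows[0])):
        column = [r[col] for r in rows]
        for v in set(column):
            total += count_runs([int(x == v) for x in column])
    return total

rows = [("red", "ford"), ("blue", "honda"), ("red", "ford"),
        ("blue", "honda"), ("red", "ford"), ("blue", "honda")]

before = index_runs(rows)          # rows alternate: every bitmap flips at each row
after = index_runs(sorted(rows))   # lexicographic sort groups equal rows together
```

On this table the run count drops from 24 to 8; real tables will show smaller or larger gains depending on their columns.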


Improving compression via k-of-N encoding

value   1-of-N   2-of-N
cat     100000   1100
dog     010000   1010
dish    001000   1001
fish    000100   0110
cow     000010   0101
cat     100000   1100
pony    000001   0011

With L bitmaps, you can represent L values by mapping each value to one bitmap;

Alternatively, you can represent (L choose 2) = L(L − 1)/2 values by mapping each value to a pair of bitmaps;

More generally, you can represent (L choose k) values by mapping each value to a k-tuple of bitmaps;

At query time, you need to load k bitmaps in a look-up for one value;

You trade query-time performance for fewer bitmaps;

Often, fewer bitmaps translate into a smaller index, created faster.
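A sketch of k-of-N encoding and of the k-bitmap look-up. Codes are handed out in alphabetical order of the values, which is an arbitrary choice for this example and differs from the allocation in the table above; the function names are hypothetical.

```python
from itertools import combinations

def k_of_n_codes(values, num_bitmaps, k):
    """Assign each distinct value a distinct k-subset of bitmap positions."""
    distinct = sorted(set(values))
    codes = dict(zip(distinct, combinations(range(num_bitmaps), k)))
    if len(codes) < len(distinct):
        raise ValueError("need (num_bitmaps choose k) >= number of distinct values")
    return codes

def build_index(values, codes, num_bitmaps):
    """bitmaps[j][row] is 1 when bitmap j belongs to the code of values[row]."""
    bitmaps = [[0] * len(values) for _ in range(num_bitmaps)]
    for row, v in enumerate(values):
        for j in codes[v]:
            bitmaps[j][row] = 1
    return bitmaps

def lookup(bitmaps, code):
    """Rows matching one value: the AND of its k bitmaps."""
    n = len(bitmaps[0])
    return [row for row in range(n) if all(bitmaps[j][row] for j in code)]

values = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
codes = k_of_n_codes(values, num_bitmaps=4, k=2)   # (4 choose 2) = 6 values fit
bitmaps = build_index(values, codes, num_bitmaps=4)
cat_rows = lookup(bitmaps, codes["cat"])
```

Six distinct values need six bitmaps with 1-of-N but only four with 2-of-N, at the price of reading two bitmaps per look-up.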


Encode then sort? Or vice versa?

Two different conceptual approaches:

1. Encode attributes in table, obtaining an uncompressed index; sort the index rows; compress each column.

   paint   maker
   red     ford          1 1 0 1 0 1        0 1 1 1 0 1
   blue    honda    ⇒    1 0 1 0 1 1   ⇒    1 0 1 0 1 1
   green   ford          0 1 1 1 0 1        1 1 0 1 0 1
   ...     ...           ...                ...

2. Sort the table rows; encode attributes in table, build compressed index on-the-fly.

   paint   maker         paint   maker
   red     ford          blue    honda         1 0 1 0 1 1
   blue    honda    ⇒    green   ford     ⇒    1 1 0 1 0 1
   green   ford          red     ford          0 1 1 1 0 1
   ...     ...           ...     ...           ...
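The two pipelines can be compared on the three rows of the figure. This sketch uses a plain 1-of-N encoding per attribute, which is an assumption (the figure appears to use different codes), and it leaves out the compression step entirely.

```python
def encode_rows(rows):
    """1-of-N encode each attribute; return one tuple of bits per row."""
    domains = [sorted({r[c] for r in rows}) for c in range(len(rows[0]))]
    return [tuple(int(r[c] == v) for c, dom in enumerate(domains) for v in dom)
            for r in rows]

rows = [("red", "ford"), ("blue", "honda"), ("green", "ford")]

# Approach 1: encode, then sort the index rows.
encode_then_sort = sorted(encode_rows(rows))

# Approach 2: sort the table rows, then encode.
sort_then_encode = encode_rows(sorted(rows))
```

Both results hold the same index rows, but not in the same order, so the two approaches can lead to different run lengths once each column is compressed.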

Gray-code order

Lex. order    Gray-code
0 1 1         0 1 1
0 1 1         0 1 1
1 0 1         1 1 0
1 0 1         1 1 0
1 1 0         1 1 1
1 1 0         1 1 1
1 1 1         1 1 1
1 1 1         1 0 1
1 1 1         1 0 1

Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays);

May improve compression more than lex. sort (k > 1);

[Pinar et al., 2005] process an uncompressed bitmap index. Slow, if uncompressed index does not fit in RAM.

GC order is not supported by DBMSes or Unix utilities.
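To make the comparison concrete, here is a minimal Python sketch that sorts the nine bit rows of the example both ways and counts runs of identical bits per column. It assumes the reflected binary Gray code; `gray_rank` and `count_runs` are hypothetical helper names, and the run count is only a rough proxy for compressed size, not the measure used in the experiments.

```python
def gray_rank(bits):
    """Position of a bit row in reflected Gray-code order (prefix-XOR decoding)."""
    rank, parity = 0, 0
    for b in bits:
        parity ^= b
        rank = (rank << 1) | parity
    return rank


def count_runs(rows):
    """Number of runs of identical bits, summed over all columns."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:]))
               for col in zip(*rows))


rows = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (1, 0, 1), (1, 1, 0),
        (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]
lex_sorted = sorted(rows)                 # lexicographical order
gc_sorted = sorted(rows, key=gray_rank)   # Gray-code order
```

On this example the Gray-code order yields 7 runs against 8 for the lexicographical order.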

Gray-code sorting, cheaply

Size improvement is small (usually < 4%), but it’s essentially free:

1 What Pinar et al. do: expensive GC sort after encoding;
e.g.: [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]);

2 Instead, sort the table lexicographically, comparing values alphabetically or by frequency (easy);
e.g.: [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax]

3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100;
Lex ascending sequence: Cat, Dog, Girl, Tax.
GC ascending sequence: 0011, 0110, 0101, 1100 for codes;
e.g.: [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting).

4 Easily extended for > 1 columns.

In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.
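Steps 2 and 3 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the helper names (`gray_rank`, `gc_ordered_codes`, `assign_codes`) are hypothetical, and the reflected binary Gray code is assumed. Codes are enumerated once in Gray-code order and handed out to the values in lexicographic order, so an ordinary sort of the table yields GC-sorted index rows.

```python
from itertools import combinations


def gray_rank(bits):
    """Position of a bit row in reflected Gray-code order (prefix-XOR decoding)."""
    rank, parity = 0, 0
    for b in bits:
        parity ^= b
        rank = (rank << 1) | parity
    return rank


def gc_ordered_codes(n, k):
    """All k-of-n codes, listed in ascending Gray-code order."""
    codes = [tuple(1 if i in ones else 0 for i in range(n))
             for ones in combinations(range(n), k)]
    return sorted(codes, key=gray_rank)


def assign_codes(sorted_values, n, k):
    """Map the i-th value (in lexicographic order) to the i-th code in GC order."""
    codes = gc_ordered_codes(n, k)
    if len(sorted_values) > len(codes):
        raise ValueError("not enough k-of-n codes for this many values")
    return dict(zip(sorted_values, codes))


code_of = assign_codes(["Cat", "Dog", "Girl", "Tax"], n=4, k=2)
column = sorted(["Tax", "Cat", "Girl", "Cat"])   # cheap lexicographic sort
encoded = [code_of[v] for v in column]           # already in Gray-code order
```

Only the small list of distinct codes is ever sorted in Gray-code order; the table itself goes through a standard sort.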

What about “other” Gray-codes?

Define a Gray code to be a way to list all bit vectors while minimizing the Hamming distance between successive vectors [Knuth, 2005, § 7.2.1.1].

There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995].

Our tests suggest traditional Gray codes are best.
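For reference, a short Python sketch of the reflected binary Gray code, assuming that this is what "traditional" refers to here. The names `reflected_gray` and `hamming` are hypothetical. The check confirms that successive codes differ in exactly one bit.

```python
def reflected_gray(i):
    """i-th codeword of the reflected binary Gray code, as an integer."""
    return i ^ (i >> 1)


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


codes = [reflected_gray(i) for i in range(8)]   # all 3-bit vectors
distances = [hamming(a, b) for a, b in zip(codes, codes[1:])]
```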

Test data sets

Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets?

Our tests: Mix of real and synthetic data, up to 877 M rows, 22 GB, 4 M attribute values, using 4–10 columns/dimensions.

When sorting, column order matters

The first column(s) gain more from the sort (column 1 is primary sort key);

Its bitmaps (first 11 in example) are compressed well, compared to a “random sort”. (Red above green)

Least important column’s bitmaps (43–49) don’t gain much (red vs green)

[Figure: Compression on TWEED-4d. x-axis: bitmap rank (ticks at 11, 18, 43, 49); y-axis: 1 − C/N (0 to 1); series: Gray, Random-sort.]

When sorting, column order matters

Conceptually, we may wish to reorder columns, e.g. swap columns 1 & 3.

Column order is crucial (to successful sorting).

Finding the best ordering quickly remains open.

[Figure: Netflix: 24 column orderings. x-axis: column permutation (4321, 4312, ..., 1243, 1234); y-axis: index size (1e+08 to 5.5e+08); series: 1-of-N encoding, 4-of-N encoding.]
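With only a handful of columns, every ordering can simply be tried. The Python sketch below does this on a toy table, using the number of runs of equal values per column after sorting as a crude stand-in for index size. That proxy is an assumption made for illustration: it ignores word alignment and the encoding, so it does not reproduce the measured sizes. The names `count_runs`, `runs_for_order` and `best_order` are hypothetical.

```python
from itertools import permutations


def count_runs(rows):
    """Number of runs of equal values, summed over all columns."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:]))
               for col in zip(*rows))


def runs_for_order(rows, order):
    """Run count after permuting the columns by `order` and sorting the rows."""
    reordered = [tuple(r[c] for c in order) for r in rows]
    return count_runs(sorted(reordered))


def best_order(rows):
    """Exhaustive search over all column permutations (feasible for few columns)."""
    n_cols = len(rows[0])
    return min(permutations(range(n_cols)),
               key=lambda order: runs_for_order(rows, order))


# Toy table: three columns with 2, 3 and 5 distinct values, all 30 combinations.
rows = [(i % 2, i % 3, i % 5) for i in range(30)]
scores = {order: runs_for_order(rows, order) for order in permutations(range(3))}
```

Exhaustive search grows factorially with the number of columns, which is why a quick way to pick the ordering is of interest.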

Progress toward choosing column order

Paper models “gain” of putting a given column first.

Idea: order columns greedily (by max gain).

Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain.

Factors:

skews of columns
number of distinct values
k
density of column’s bitmaps

What usually works for dimension ordering?: k=1

For 1-of-N bitmaps, a density-based approach was okay:

Ordering rule, k = 1 : “sparse but not too sparse”

Order columns by decreasing $\min\left(\frac{1}{n_i}, \frac{1 - 1/n_i}{4w - 1}\right)$, where

$n_i$ → the number of distinct values in column i,

w → the word size.

[Figure: plot; x-axis: distinct values in column (50 to 300).]

See 30–40% size reduction, merely knowing dimension sizes ($n_i$).
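The ordering rule is easy to evaluate. Below is a minimal Python sketch; the function names, the default word size w = 32 and the column cardinalities are assumptions chosen for illustration. Note that the two arguments of the min are equal when $n_i = 4w$, so columns with roughly 4w distinct values come first: "sparse but not too sparse".

```python
def ordering_key(n_i, w=32):
    """min(1/n_i, (1 - 1/n_i) / (4w - 1)) for a column with n_i distinct values."""
    return min(1.0 / n_i, (1.0 - 1.0 / n_i) / (4 * w - 1))


def order_columns(cardinalities, w=32):
    """Column names sorted by decreasing ordering key."""
    return sorted(cardinalities,
                  key=lambda name: ordering_key(cardinalities[name], w),
                  reverse=True)


# Hypothetical columns and their numbers of distinct values.
cardinalities = {"a": 2, "b": 128, "c": 1000, "d": 10}
order = order_columns(cardinalities)
```

With w = 32, the column with 128 distinct values is placed first and the one with 1000 distinct values last.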

What usually works for dimension ordering?: k > 1

Density formula ($n_i \to \sqrt[k]{n_i}$) recommends poorly when k > 1. Our experiments on synthetic data give some guidance:

When k > 1, order columns by

1 descending skew

2 descending size

(And do the reverse when k = 1.)

Open issues, k > 1

1 How do we balance skew & size factors?

2 What other properties of the histograms are needed?

Bitmap-by-bitmap reordering

One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006].

Our best implementation of this is ≈ 100 times slower, cannot handle larger data sets.

We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%.

Canahuate et al. suggest ordering does not matter much, but we see factor-of-2 differences (??)

Seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps)

Index size versus block-wise sorting

[Figure: Netflix. x-axis: number of blocks (0 to 700); y-axis: index size in MB (100 to 900); series: k=1, k=2, k=3, k=4.]

Instead of fully sorting the table, we sorted it block-wise;

Fewer blocks means a more complete sort;

Larger k means smaller index (in this case);

Index size diminishes drastically with sorting.
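Block-wise sorting can be sketched as follows in Python. The helper names are hypothetical, the table is a toy example, and the run count is only a rough proxy for index size (an assumption, not the measure plotted above). The table is cut into consecutive blocks and each block is sorted on its own; one block is a full sort.

```python
def count_runs(rows):
    """Number of runs of equal values, summed over all columns."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:]))
               for col in zip(*rows))


def blockwise_sort(rows, num_blocks):
    """Split rows into num_blocks consecutive blocks and sort each one separately."""
    if not rows:
        return []
    block_size = -(-len(rows) // num_blocks)   # ceiling division
    out = []
    for start in range(0, len(rows), block_size):
        out.extend(sorted(rows[start:start + block_size]))
    return out


# Toy table with 16 rows and 2 columns (hypothetical data).
rows = [((3 * i) % 4, i % 2) for i in range(16)]
runs = {b: count_runs(blockwise_sort(rows, b)) for b in (1, 2, 4)}
```

On this toy table the run count doubles each time the number of blocks doubles, in line with the observation that fewer blocks give a more complete sort.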

How do 64-bit words compare to 32-bit words?

We implemented EWAH using 16-bit, 32-bit and 64-bit words;

Only 32-bit and 64-bit are efficient;

64-bit indexes are nearly twice as large;

64-bit indexes are between 5% and 40% faster (despite higher I/O costs).

Open Source Software?

Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/.

JavaEWAH: A compressed alternative to the Java BitSet class: http://code.google.com/p/javaewah/.

Future direction?

Need better mathematical modelling of bitmap compressed size in sorted tables.

Questions?

Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http://hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-12-15).

Goddyn, L. and Gvozdjak, P. (2003). Binary gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1–10.

Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.

Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. Available from http://arxiv.org/abs/0901.3751.

Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321.

Savage, C. and Winkler, P. (1995). Monotone gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2):230–248.

Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38.