All About Bitmap Indexes. . . And Sorting Them
Daniel Lemire
http://www.daniel-lemire.com/
Joint work (presented at BDA’08 and DOLAP’08) with Owen Kaser (UNB) and Kamel Aouiche (post-doc).
February 12, 2009
Daniel Lemire All About Bitmap Indexes. . . And Sorting Them
Database Indexes
Databases use precomputed indexes (auxiliary data structures) to speed processing.
An index costs memory, can hurt update speed.
Improving indexes is practically important.
What makes indexes fast?
We are going to use these three ideas:
Expect specific queries? Avoid a full scan!
Data is not random? Compress it!
A specific computer architecture? Tailor your code for it!
Bitmap indexes
SELECT * FROM T WHERE x=a AND y=b;
Bitmap indexes have a long history. (1972 at IBM.)
Long history with DW & OLAP. (Sybase IQ since mid 1990s.)
Main competition: B-trees.
Above, compute
{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
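The intersection above can be sketched in a few lines of Python. This is only an illustration under assumptions: Python integers stand in for uncompressed bitmaps, the table contents are made up, and the helper names (build_bitmap_index, row_ids) are hypothetical rather than taken from any library.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit r is set when row r holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index


def row_ids(bitmap):
    """List the positions of the set bits, i.e. the matching row ids."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


# Hypothetical table T with columns x and y (values invented for illustration).
x = ["a", "c", "a", "a", "b"]
y = ["b", "b", "a", "b", "b"]
x_index = build_bitmap_index(x)
y_index = build_bitmap_index(y)

# WHERE x = 'a' AND y = 'b': intersect the two row-id sets with one bitwise AND.
matches = row_ids(x_index["a"] & y_index["b"])
```

No row of the table is scanned at query time: the answer comes from the two precomputed bitmaps alone.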
Bitmaps and fast AND/OR operations
Computing the union of two sets of integers between 1 and 64 (e.g., row ids, trivial table). . . E.g., {1, 5, 8} ∪ {1, 3, 5}?
Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)
Extend to sets from 1..N using ⌈N/64⌉ operations.
To compute [a0, . . . , aN−1] ∨ [b0, b1, . . . , bN−1]:
a0, . . . , a63 BitwiseOR b0, . . . , b63;
a64, . . . , a127 BitwiseOR b64, . . . , b127;
a128, . . . , a191 BitwiseOR b128, . . . , b191;
. . .
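A minimal sketch of the word-by-word union, assuming bitmaps are stored as Python lists of 64-bit words. The helper names (to_words, union, to_set) are hypothetical, and the bit order inside a word is an arbitrary choice made for this example.

```python
WORD_BITS = 64


def to_words(members, universe_size):
    """Pack a set of integers from 1..universe_size into ceil(N/64) words."""
    n_words = (universe_size + WORD_BITS - 1) // WORD_BITS
    words = [0] * n_words
    for m in members:
        pos = m - 1  # the slide numbers elements from 1
        words[pos // WORD_BITS] |= 1 << (pos % WORD_BITS)
    return words


def union(a_words, b_words):
    """One bitwise OR per pair of words (both inputs must have equal length)."""
    return [a | b for a, b in zip(a_words, b_words)]


def to_set(words):
    """Unpack the words back into a set of integers, for checking."""
    return {
        i * WORD_BITS + bit + 1
        for i, w in enumerate(words)
        for bit in range(WORD_BITS)
        if (w >> bit) & 1
    }


small = to_set(union(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
large = to_set(union(to_words({1, 200}, 200), to_words({65}, 200)))
```

With N = 200 the lists hold four words, so the union costs four OR operations regardless of how many elements the sets contain.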
Common applications of the bitmaps
The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.)
Search engines use bitmaps to filter queries, e.g., Apache Lucene.
Bitmap compression
[Figure: a column x with n rows, shown next to its L index bitmaps (x=1, x=2, x=3, . . .); each bitmap has one bit per row.]
A column with n rows and L distinct values ⇒ nL bits
E.g., n = 10^6, L = 10^4 → 10^10 bits = 10 Gbits
Uncompressed bitmaps are often impractical
Moreover, bitmaps often contain long streams of zeroes. . .
Logical operations over these zeroes are a waste of CPU cycles.
How to compress bitmaps?
Must handle long streams of zeroes efficiently ⇒ Run-length encoding? (RLE)
Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . .
So just encode the run lengths, e.g.,
0001111100010111 → 3, 5, 3, 1, 1, 3
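The run-length idea can be sketched as follows. The convention that the first run always counts 0s (and is 0 when the bitmap starts with a 1) is an assumption made for this example, and the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Return the run lengths of a bit string, starting with the run of 0s."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    if bits and bits[0] == "1":
        runs.insert(0, 0)  # empty leading run of 0s keeps the alternation
    return runs


def rle_decode(runs):
    """Rebuild the bit string: even positions are runs of 0s, odd ones runs of 1s."""
    return "".join(
        ("0" if i % 2 == 0 else "1") * length for i, length in enumerate(runs)
    )


runs = rle_encode("0001111100010111")
```

Decoding only needs the run lengths, because the bit value alternates from one run to the next.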
Compressing better with delta codes
RLE can make things worse.
E.g., use 8-bit counters, then 11 may become 000000101.
How many bits to use for the counters?
Universal codes such as delta codes use no more than c log x bits to represent value x.
Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc.
Delta codes build on Gamma codes.
Encoding has two steps, starting from x = 2^N + (x mod 2^N):
Write N − 1 as gamma code;
write x mod 2^N as an N-bit number.
E.g., 17 = 2^4 + 1, so N = 4: 001 (the code for 3) then 0001 (1 on 4 bits) gives 0010001.
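The listing on this slide is a simplified illustration. As a point of comparison, here is a sketch of the standard Elias gamma and delta codes, swapped in because they are precisely defined and prefix-free. Under these standard definitions the bit strings differ from the ones shown on the slide (17 becomes 001010001), while the c log x size behaviour is the same. Function names are hypothetical.

```python
def elias_gamma(x):
    """Standard Elias gamma code of x >= 1: floor(log2 x) zeros, then x in binary."""
    if x < 1:
        raise ValueError("Elias gamma is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(x):
    """Standard Elias delta code: gamma code of the bit length, then the low bits."""
    if x < 1:
        raise ValueError("Elias delta is defined for x >= 1")
    binary = bin(x)[2:]
    # The leading 1 of x is implied by the length, so it is not stored again.
    return elias_gamma(len(binary)) + binary[1:]


def elias_delta_decode(code):
    """Decode a single Elias delta codeword back to its integer."""
    zeros = 0
    while code[zeros] == "0":
        zeros += 1
    length = int(code[zeros:2 * zeros + 1], 2)  # bit length of x
    low_bits = code[2 * zeros + 1:2 * zeros + length]
    return int("1" + low_bits, 2)


code_17 = elias_delta(17)
```

Short runs get short codes and long runs get codes whose length grows only logarithmically, which is why no fixed counter width has to be chosen.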
RLE with delta codes is pretty good
In some (weak) sense, RLE compression with delta codes is optimal!
Theorem
A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits.
Is the compression rate what matters?
There is endless debate about whether more compression is better:
Solid-State Drives (SSD) have 10× the bandwidth? All problems are CPU-bound!
Multi-core CPUs? All problems are I/O-bound!
Store your indexes in RAM? All problems are CPU-bound!
. . .
No definitive answer on whether more compression is better. It depends!
Byte/Word-aligned RLE
RLE variants can focus on runs that align with machine-word boundaries.
Trade compression for speed.
That is what Oracle is doing.
Variants: BBC (byte aligned), WAH
Our EWAH (which was known to Wu as WBC) extends Wu et al.’s word-aligned hybrid (WAH).
0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . .
⇒ dirty word, run of 2 “clean 0” words, dirty word. . .
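The dirty/clean distinction can be sketched as below. This is not the actual WAH or EWAH storage format, only the classification step, assuming 16-bit words as in the example above; classify_words is a hypothetical name.

```python
def classify_words(bits, word_size=16):
    """Split a bit string into words and group them into (kind, count) segments.

    Consecutive clean words of the same kind are merged into one run;
    every dirty word is kept as its own segment.
    """
    segments = []
    for start in range(0, len(bits), word_size):
        word = bits[start:start + word_size]
        if set(word) == {"0"}:
            kind = "clean 0"
        elif set(word) == {"1"}:
            kind = "clean 1"
        else:
            kind = "dirty"
        if kind != "dirty" and segments and segments[-1][0] == kind:
            segments[-1] = (kind, segments[-1][1] + 1)
        else:
            segments.append((kind, 1))
    return segments


# The four words from the slide: dirty, two all-zero words, dirty.
first = "0101" + "0" * 12
last = "00" + "1" * 12 + "00"
segments = classify_words(first + "0" * 16 + "0" * 16 + last)
```

A run of clean words can then be stored as a single counter, while dirty words are stored verbatim, so logical operations can skip whole runs at once.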
Computational and storage bounds
n → number of rows, c → number of 1s per row;
Model storage cost as #(dirty words) + #(clean words, 0x00)
Storage is in O(nc);
Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps.)
Construction time is proportional to index size. (Data is written sequentially on disk.)
Implementation scales to millions of bitmaps.
What about other compression types?
Why not compress using other techniques (Huffman, LZ77, Arithmetic Coding, . . . )?
With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1| + |B2|).
We don’t know how to do this using the other compression techniques!
Hence, with RLE, compression saves both storage and CPU cycles!
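To see why the O(|B1| + |B2|) bound holds, here is a sketch of a logical OR computed directly on run-length encoded bitmaps, without decompressing them. The representation (a list of (bit, run_length) pairs) and the names rle_or and expand are assumptions for this example.

```python
def rle_or(a, b):
    """OR two bitmaps given as lists of (bit, run_length) pairs.

    Both inputs must cover the same number of bits. Every loop iteration
    finishes at least one input run, so the loop runs at most
    len(a) + len(b) times.
    """
    result = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] | b[j][0]
        if step > 0:
            if result and result[-1][0] == bit:
                result[-1] = (bit, result[-1][1] + step)  # extend the last run
            else:
                result.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return result


def expand(runs):
    """Decompress to a plain bit string (used only to check the result)."""
    return "".join(str(bit) * length for bit, length in runs)


b1 = [(0, 3), (1, 5), (0, 8)]
b2 = [(0, 6), (1, 4), (0, 6)]
b1_or_b2 = rle_or(b1, b2)
```

The work depends on the number of runs, not on the number of bits, so long runs of zeroes cost almost nothing. Replacing `|` with `&` gives the AND in the same way.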
What happens when you have many bitmaps?
Consider B1 ∨ B2 ∨ . . . ∨ BN.
First compute the first two: B1 ∨ B2 in time O(|B1| + |B2|).
|B3 ∨ B4| is in O(|B3| + |B4|).
Thus (B1 ∨ B2) ∨ (B3 ∨ B4) takes O(2 ∑_i |Bi|). . .
Total is in O(∑_{i=1}^{N} |Bi| log N) [Lemire et al., 2009].
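The pairwise scheme can be sketched as a level-by-level reduction. The function or_many is a hypothetical helper; it takes the two-bitmap OR as a parameter, and the example uses plain Python integers as stand-ins for compressed bitmaps.

```python
import operator


def or_many(bitmaps, or_two):
    """OR a list of bitmaps by combining neighbours pairwise, level by level.

    With N inputs there are about log2(N) levels, and each level reads
    every intermediate result once.
    """
    if not bitmaps:
        raise ValueError("need at least one bitmap")
    level = list(bitmaps)
    while len(level) > 1:
        merged = [
            or_two(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # the odd one out moves up unchanged
        level = merged
    return level[0]


# Integers stand in for bitmaps here; any two-argument OR can be plugged in.
combined = or_many([0b0001, 0b0100, 0b1000, 0b0001, 0b0010], operator.or_)
```

Since the size of an OR result is bounded by the sum of its input sizes, each level costs O(∑ |Bi|), which gives the log N factor in the total.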
Improving compression by sorting the table
RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better;
But finding the best row ordering is NP-hard [Lemire et al., 2009].
Lexicographic row sorting is fast, even for very large tables, and easy: sort is a Unix staple.
Substantial index-size reductions (often 2.5 times)
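A small experiment illustrates the effect of sorting. The table below is invented for this sketch, and the number of runs summed over all bitmaps is used as a rough proxy for the size of an RLE-compressed index. That proxy is an assumption of the example, not the measure used in the cited work.

```python
def count_runs(rows):
    """Total number of runs over all bitmaps (one bitmap per column value)."""
    total = 0
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        for v in set(values):
            bits = [1 if x == v else 0 for x in values]
            # A new run starts at every position where the bit changes.
            total += 1 + sum(1 for p, q in zip(bits, bits[1:]) if p != q)
    return total


# Hypothetical two-column table (paint, maker).
table = [
    ("red", "ford"), ("blue", "honda"), ("red", "honda"),
    ("blue", "ford"), ("red", "ford"), ("blue", "honda"),
]
unsorted_runs = count_runs(table)
sorted_runs = count_runs(sorted(table))  # lexicographic row sort
```

On this toy table the sort groups equal values in the first column, which is where lexicographic sorting helps most. Later columns benefit less, as they do here.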
Improving compression via k-of-N encoding
value   1-of-N   2-of-N
cat     100000   1100
dog     010000   1010
dish    001000   1001
fish    000100   0110
cow     000010   0101
cat     100000   1100
pony    000001   0011
With L bitmaps, you can represent L values by mapping each value to one bitmap;
Alternatively, you can represent (L choose 2) = L(L − 1)/2 values by mapping each value to a pair of bitmaps;
More generally, you can represent (L choose k) values by mapping each value to a k-tuple of bitmaps;
At query time, you need to load k bitmaps in a look-up for one value;
You trade query-time performance for fewer bitmaps;
Often, fewer bitmaps translate into a smaller index, created faster.
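A sketch of k-of-N encoding on the animal column from the slide. The assignment of values to k-subsets is arbitrary here (alphabetical order, via itertools.combinations), so the codes will not match the ones on the slide. All function names are hypothetical, and Python integers stand in for bitmaps.

```python
from itertools import combinations


def k_of_n_codes(values, n_bitmaps, k):
    """Assign each distinct value a distinct k-subset of the bitmap ids."""
    subsets = combinations(range(n_bitmaps), k)
    codes = {}
    for value in sorted(set(values)):
        try:
            codes[value] = next(subsets)
        except StopIteration:
            raise ValueError("not enough bitmaps for this many values") from None
    return codes


def build_index(column, codes, n_bitmaps):
    """bitmaps[b] has bit r set when the code of row r includes bitmap b."""
    bitmaps = [0] * n_bitmaps
    for row_id, value in enumerate(column):
        for b in codes[value]:
            bitmaps[b] |= 1 << row_id
    return bitmaps


def lookup(value, codes, bitmaps, n_rows):
    """AND the k bitmaps of a value and list the matching row ids."""
    result = (1 << n_rows) - 1
    for b in codes[value]:
        result &= bitmaps[b]
    return [r for r in range(n_rows) if (result >> r) & 1]


column = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
codes = k_of_n_codes(column, n_bitmaps=4, k=2)  # 6 values fit in C(4, 2) = 6
bitmaps = build_index(column, codes, n_bitmaps=4)
cat_rows = lookup("cat", codes, bitmaps, len(column))
```

The AND is exact because two different values always differ in at least one bitmap of their k-subsets. The price is that every look-up touches k bitmaps instead of one.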
Encode then sort? Or vice versa?
Two different conceptual approaches:
1 Encode attributes in table, obtaining an uncompressed index
  Sort the index rows
  Compress each column
2 Sort the table rows
  Encode attributes in table, build compressed index on-the-fly.

Approach 1:
paint  maker        index rows         sorted index rows
red    ford         1 1 0 1 0 1        0 1 1 1 0 1
blue   honda   ⇒    1 0 1 0 1 1   ⇒    1 0 1 0 1 1
green  ford         0 1 1 1 0 1        1 1 0 1 0 1
. . .  . . .        . . .              . . .

Approach 2:
paint  maker        paint  maker       index rows
red    ford         blue   honda       1 0 1 0 1 1
blue   honda   ⇒    green  ford   ⇒    1 1 0 1 0 1
green  ford         red    ford        0 1 1 1 0 1
. . .  . . .        . . .  . . .       . . .
Gray-code order
Lex. order    Gray-code order
0 1 1         0 1 1
0 1 1         0 1 1
1 0 1         1 1 0
1 0 1         1 1 0
1 1 0         1 1 1
1 1 0         1 1 1
1 1 1         1 1 1
1 1 1         1 0 1
1 1 1         1 0 1
Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays);
May improve compression more than lex. sort (k > 1);
[Pinar et al., 2005] process an uncompressed bitmap index.
Slow, if uncompressed index does not fit in RAM.
GC order is not supported by DBMSes or Unix utilities.
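As a small illustration of what Gray-code order means, the sketch below sorts bit arrays by their position in the reflected binary Gray code. That choice of Gray code is an assumption, though it does reproduce the Gray-code ordering of the nine example rows. The helper name `gray_rank` is hypothetical.

```python
def gray_rank(bits):
    """Position of a bit array in the reflected binary Gray code.

    Bit i of the rank is the parity (XOR) of input bits 0..i.
    """
    rank = 0
    parity = 0
    for b in bits:
        parity ^= b
        rank = (rank << 1) | parity
    return rank


rows = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (1, 0, 1), (1, 1, 0),
        (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]   # lexicographic order
gc_sorted = sorted(rows, key=gray_rank)
for row in gc_sorted:
    print(*row)
```

In Gray-code order the rows 1 1 1 come before 1 0 1, so the last column changes value less often than in lexicographic order.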
Gray-code sorting, cheaply
Size improvement is small (usually < 4%), but it’s essentially free:
1 What Pinar et al. do: expensive GC sort after encoding
  e.g.: [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]);
2 Instead, sort the table lexicographically, comparing values alphabetically or by frequency (easy);
  e.g.: [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax]
3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100;
  Lex ascending sequence: Cat, Dog, Girl, Tax.
  GC ascending sequence: 0011, 0110, 0101, 1100 for codes
  e.g.: [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100]
  (generates a GC-sorted result without expensive GC sorting).
4 Easily extended for > 1 columns.
In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.
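Steps 2 and 3 can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the reflected binary Gray code, enumerates every k-of-N code, and hands them out to the sorted distinct values in Gray-code order. The function names are hypothetical.

```python
from itertools import combinations


def gray_rank(bits):
    """Position of a bit array in the reflected binary Gray code."""
    rank = parity = 0
    for b in bits:
        parity ^= b
        rank = (rank << 1) | parity
    return rank


def k_of_n_codes(k, n):
    """All n-bit codes with exactly k ones, in ascending Gray-code order."""
    codes = [tuple(1 if i in ones else 0 for i in range(n))
             for ones in combinations(range(n), k)]
    return sorted(codes, key=gray_rank)


def assign_codes(values, k, n):
    """Give the i-th smallest distinct value the i-th code in Gray-code order."""
    distinct = sorted(set(values))
    codes = k_of_n_codes(k, n)
    if len(distinct) > len(codes):
        raise ValueError("not enough k-of-n codes for the distinct values")
    return dict(zip(distinct, codes))


def fmt(code):
    return "".join(map(str, code))


mapping = assign_codes(["Cat", "Dog", "Girl", "Tax"], k=2, n=4)
column = sorted(["Tax", "Cat", "Girl", "Cat"])      # cheap lexicographic sort
encoded = [mapping[v] for v in column]
print([fmt(c) for c in encoded])                    # ['0011', '0011', '0101', '1100']
```

Since the value-to-code mapping is monotone with respect to Gray-code order, sorting the values is enough: the encoded rows come out GC-sorted without ever comparing bit arrays.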
What about “other” Gray-codes?
Define Gray-code to be a way to list all bit vectors while minimizing the Hamming distances between successive vectors [Knuth, 2005, § 7.2.1.1]
There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995].
Our tests suggest traditional Gray codes are best.
Test data sets
Previous studies used data sets where the uncompressed index would fit in RAM. Do their results apply to more realistic data sets?
Our tests: Mix of real and synthetic data,
up to 877 M rows, 22 GB, 4 M attribute values,
using 4–10 columns/dimensions.
When sorting, column order matters
The first column(s) gain more from the sort (column 1 is primary sort key);
Its bitmaps (first 11 in example) are compressed well, compared to a “random sort”. (Red above green)
Least important column’s bitmaps (43–49) don’t gain much (red vs green)
[Figure: Compression on TWEED-4d. x-axis: bitmap rank (ticks at 11, 18, 43, 49); y-axis: 1 − C/N, from 0 to 1; series: Gray, Random-sort.]
When sorting, column order matters
Conceptually, we may wish to reorder columns, e.g. swap columns 1 & 3.
Column order is crucial (to successful sorting).
Finding the best ordering quickly remains open.
[Figure: Netflix: 24 column orderings. x-axis: column permutation (4321, 4312, 4231, ..., 1243, 1234); y-axis: index size, from 1e+08 to 5.5e+08; series: 1-of-N encoding, 4-of-N encoding.]
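With only four columns, all 24 orderings can simply be tried. The sketch below does this on an in-memory table. It is an assumption-laden illustration: it scores each ordering by the number of value runs per column after sorting, which is only a rough stand-in for compressed index size and not the measurement plotted in the figure. The function names are hypothetical.

```python
from itertools import permutations


def count_runs(rows, order):
    """Sort rows with `order` as the column priority, then count value runs per column."""
    ordered = sorted(rows, key=lambda row: tuple(row[c] for c in order))
    runs = 0
    for c in order:
        previous = object()              # sentinel that equals no column value
        for row in ordered:
            if row[c] != previous:
                runs += 1
                previous = row[c]
    return runs


def rank_column_orders(rows):
    """Return (runs, order) pairs for every column permutation, fewest runs first."""
    n_cols = len(rows[0])
    return sorted((count_runs(rows, order), order)
                  for order in permutations(range(n_cols)))


# Column 0 has 2 distinct values, column 1 has 6.
rows = [(i % 2, i) for i in range(6)]
for runs, order in rank_column_orders(rows):
    print(order, runs)
```

Exhaustive search grows factorially with the number of columns, which is why a quick way to pick a good ordering matters.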
Progress toward choosing column order
Paper models “gain” of putting a given column first.
Idea: order columns greedily (by max gain).
Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain.
Factors:
skews of columns
number of distinct values
k
density of column’s bitmaps
What usually works for dimension ordering?: k=1
For 1-of-N bitmaps, a density-based approach was okay:
Ordering rule, k = 1 : “sparse but not too sparse”
Order columns by decreasing
$\min\left(\frac{1}{n_i}, \frac{1 - 1/n_i}{4w - 1}\right)$, where
$n_i$ → the number of distinct values in column $i$,
$w$ → the word size.
[Figure: the ordering function plotted against the number of distinct values in the column (x-axis from 50 to 300).]
See 30–40% size reduction, merely knowing dimension sizes ($n_i$).
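The ordering rule is easy to apply once the number of distinct values per column is known. The sketch below is a minimal illustration; the function names and the default word size of 32 bits are assumptions made for the example.

```python
def density_key(n_distinct, word_size=32):
    """min(1/n, (1 - 1/n) / (4w - 1)) for a column with n distinct values."""
    inv = 1.0 / n_distinct
    return min(inv, (1.0 - inv) / (4 * word_size - 1))


def order_columns(cardinalities, word_size=32):
    """Column indexes sorted by decreasing key; the first one is the primary sort key."""
    return sorted(range(len(cardinalities)),
                  key=lambda i: density_key(cardinalities[i], word_size),
                  reverse=True)


cards = [2, 1000, 128, 10]        # distinct values in columns 0..3
print(order_columns(cards))       # [2, 3, 0, 1]
```

The two arguments of the min are equal when the column has exactly 4w distinct values, so the rule puts columns with roughly 4w distinct values first: sparse, but not too sparse.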
What usually works for dimension ordering?: k > 1
Density formula (with $n_i \to \sqrt[k]{n_i}$) recommends poorly when k > 1. Our experiments on synthetic data give some guidance:
When k > 1, order columns by
1 descending skew
2 descending size
(And do the reverse when k = 1.)
Open issues, k > 1
1 How do we balance skew & size factors?
2 What other properties of the histograms are needed?
Bitmap-by-bitmap reordering
One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006].
Our best implementation of this is ≈ 100 times slower and cannot handle larger data sets.
We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%.
Canahuate et al. suggest ordering does not matter much, but we see factor-of-2 differences (??)
Seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps)
Index size versus block-wise sorting
[Figure: Netflix. x-axis: number of blocks (0 to 700); y-axis: index size (MB), from 100 to 900; series: k=1, k=2, k=3, k=4.]
Instead of fully sorting the table, we sorted it block-wise;
Fewer blocks means a more complete sort;
Larger k means smaller index (in this case);
Index size diminishes drastically with sorting.
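Block-wise sorting can be sketched as below. This is a simplified in-memory illustration, not the experimental setup behind the figure: rows are cut into contiguous blocks of equal length and each block is sorted independently. The function names and the run-count proxy are assumptions.

```python
def blockwise_sort(rows, n_blocks):
    """Cut rows into at most n_blocks contiguous blocks and sort each block on its own."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    block_len = max(1, -(-len(rows) // n_blocks))    # ceiling division
    out = []
    for start in range(0, len(rows), block_len):
        out.extend(sorted(rows[start:start + block_len]))
    return out


def count_runs(values):
    """Number of maximal runs of equal consecutive values."""
    return sum(1 for i, v in enumerate(values) if i == 0 or v != values[i - 1])


rows = [((2 * i) % 5,) for i in range(20)]           # one column, 5 distinct values
for n_blocks in (1, 2, 4):
    column = [row[0] for row in blockwise_sort(rows, n_blocks)]
    print(n_blocks, count_runs(column))
```

On this toy column the run count doubles each time the number of blocks doubles, which mirrors the trend in the figure: fewer blocks, longer runs, smaller index.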
How do 64-bit words compare to 32-bit words?
We implemented EWAH using 16-bit, 32-bit and 64-bit words;
Only 32-bit and 64-bit are efficient;
64-bit indexes are nearly twice as large;
64-bit indexes are between 5% and 40% faster (despite higher I/O costs).
Open Source Software?
Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/.
JavaEWAH: A compressed alternative to the Java BitSet class: http://code.google.com/p/javaewah/.
Future direction?
Need better mathematical modelling of the compressed size of bitmaps in sorted tables.
Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http://hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-12-15).
Goddyn, L. and Gvozdjak, P. (2003). Binary gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1–10.
Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.
Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. Available from http://arxiv.org/abs/0901.3751.
Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321.
Savage, C. and Winkler, P. (1995). Monotone gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2):230–248.
Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38.