Sorting Improves Bitmap Indexes
Owen Kaser
Joint work (presented at BDA’08 and DOLAP’08) with Daniel Lemire and Kamel Aouiche, UQAM.
December 4, 2008
Owen Kaser Sorting Improves Bitmap Indexes
Database Indexes
Databases use precomputed indexes (auxiliary data structures) to speed processing.
An index costs memory and can hurt update speed.
Who makes the tradeoff for a given database?
  - the system itself?
  - the database designer/administrator?
Improving indexes is practically important.
Bitmap indexes
SELECT * FROM T WHERE x=a AND y=b;
Bitmap indexes have a long history (1972 at IBM).
Long history with DW & OLAP (Sybase IQ since the mid-1990s).
Main competition: B-trees.
For the query above, compute
{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
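The intersection above can be sketched with one bitmap per distinct value, here held in Python integers. This is only an illustration: the helper names and the sample columns x and y are assumptions, not part of the talk.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit r is set when row r holds it."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps


def row_ids(bitmap):
    """List the positions of the 1 bits, i.e. the matching row ids."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


# Hypothetical two-column table T(x, y), stored column by column.
x = ["a", "c", "a", "b", "a"]
y = ["b", "b", "d", "b", "b"]

x_index = build_bitmaps(x)
y_index = build_bitmaps(y)

# WHERE x = a AND y = b  ->  one bitwise AND of two bitmaps.
matches = row_ids(x_index["a"] & y_index["b"])
```

The query touches only two bitmaps, whatever the number of distinct values in each column.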
Bitmaps and fast AND/OR operations
Computing the union of two sets of integers between 1 and 64 (e.g., row ids of a trivial table)...
E.g., {1, 5, 8} ∪ {1, 3, 5}?
Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000) = 10101001, i.e., {1, 3, 5, 8}
Extend to sets from 1..N using $\lceil N/64 \rceil$ operations.
To compute $[a_0, \ldots, a_{N-1}] \lor [b_0, \ldots, b_{N-1}]$:
  - $a_0, \ldots, a_{63}$ BitwiseOR $b_0, \ldots, b_{63}$;
  - $a_{64}, \ldots, a_{127}$ BitwiseOR $b_{64}, \ldots, b_{127}$;
  - $a_{128}, \ldots, a_{191}$ BitwiseOR $b_{128}, \ldots, b_{191}$;
  - ...
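A minimal sketch of the word-by-word union, assuming 64-bit words stored as Python integers in a list. The function names are illustrative; a real implementation would use native machine words.

```python
WORD_BITS = 64


def to_words(members, n):
    """Pack a set of integers from 1..n into ceil(n / 64) words."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)
    for m in members:
        index, bit = divmod(m - 1, WORD_BITS)
        words[index] |= 1 << bit
    return words


def union(a_words, b_words):
    """One bitwise OR per pair of words."""
    return [a | b for a, b in zip(a_words, b_words)]


def from_words(words):
    """Recover the set of integers from its packed form."""
    return {
        i * WORD_BITS + bit + 1
        for i, word in enumerate(words)
        for bit in range(WORD_BITS)
        if (word >> bit) & 1
    }


result = from_words(union(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
```

For N = 64 the union costs a single OR; for N = 200 the lists hold four words, so four ORs.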
Bitmap compression
[Figure: a column x with n rows (values 1, 3, 1, 2, ...) and its index of L bitmaps (x=1, x=2, x=3, ...); each row sets one bit, in the bitmap of its value.]
A column with n rows and L distinct values ⇒ nL bits
E.g., $n = 10^6$, $L = 10^4$ → 10 Gbits
Uncompressed bitmaps are often impractical.
Moreover, bitmaps often contain long streams of zeroes...
Logical operations over these zeroes are a waste of CPU cycles.
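The arithmetic behind the example, as a small sketch. The function name is an assumption; the density figure follows from each row setting exactly one bit in a 1-of-N index.

```python
def uncompressed_bits(n_rows, n_distinct):
    """An uncompressed index stores one bit per (row, distinct value) pair."""
    return n_rows * n_distinct


n, L = 10**6, 10**4
bits = uncompressed_bits(n, L)
gigabits = bits / 10**9

# Each row sets exactly one bit, so only n of the n * L bits are 1s.
density = n / bits
```

With these numbers only one bit in ten thousand is set, which is why long streams of zeroes dominate.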
How to compress bitmaps?
Must handle long streams of zeroes efficiently ⇒ run-length encoding (RLE)?
Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...
So just encode the run lengths, e.g.,
0001111100010111 → 3, 5, 3, 1, 1, 3
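A minimal sketch of this plain run-length encoding, with bitmaps written as strings of '0' and '1'. The convention that the first run counts 0s (possibly with length 0) is an assumption made here so that decoding is unambiguous.

```python
from itertools import groupby


def rle_encode(bits):
    """Return the run lengths, alternating 0-runs and 1-runs, starting with 0s."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    if bits and bits[0] == "1":
        runs.insert(0, 0)  # empty leading run of 0s
    return runs


def rle_decode(runs):
    """Rebuild the bit string from alternating run lengths."""
    return "".join(("0" if i % 2 == 0 else "1") * n for i, n in enumerate(runs))


encoded = rle_encode("0001111100010111")
```

Decoding the run lengths gives back the original 16 bits.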
Byte/Word-aligned RLE
RLE variants can focus on runs that align with machine-word boundaries.
Trade compression for speed.
Variants: BBC (byte-aligned), WAH
Our EWAH extends Wu et al.’s word-aligned hybrid.
0101000000000000 000...000 000...000 0011111111111100 ...
⇒ dirty word, run of 2 “clean 0” words, dirty word, ...
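The clean/dirty classification can be sketched as below, using 16-bit words as in the example. This shows only the word-aligned idea; the token layout is an assumption and is not the actual WAH or EWAH storage format.

```python
def word_aligned_runs(words, word_bits=16):
    """Turn a list of words into ('dirty', word) and ('clean0' | 'clean1', count) tokens."""
    all_ones = (1 << word_bits) - 1
    tokens = []
    for word in words:
        if word == 0:
            kind = "clean0"
        elif word == all_ones:
            kind = "clean1"
        else:
            tokens.append(("dirty", word))  # mixed bits: stored verbatim
            continue
        if tokens and tokens[-1][0] == kind:
            tokens[-1] = (kind, tokens[-1][1] + 1)  # extend the current clean run
        else:
            tokens.append((kind, 1))
    return tokens


words = [0b0101000000000000, 0, 0, 0b0011111111111100]
tokens = word_aligned_runs(words)
```

Logical operations can then skip a whole run of clean words in one step instead of visiting each word.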
Computational and storage bounds
n → number of rows, c → number of 1s per row;
Total size of the bitmaps is in O(nc);
Bounds do not depend on the number of bitmaps (assuming O(n) bitmaps).
Implementation scales to millions of bitmaps.
Improving compression by sorting the table
RLE, BBC, WAH and EWAH are order-sensitive: they compress sorted tables better;
But finding the best row ordering is NP-hard.
Lexicographic row sorting is
  - fast, even for very large tables;
  - easy: sort is a Unix staple.
Substantial index-size reductions (often by a factor of 2.5).
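To see why row order matters, one can count the runs in every 1-of-N bitmap before and after a lexicographic sort. The toy table below is made up for illustration, and the run count is only a rough proxy for compressed size.

```python
from itertools import groupby


def total_runs(rows):
    """Count runs of identical bits, summed over all 1-of-N bitmaps of the table."""
    runs = 0
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        for value in set(column):
            bitmap = [cell == value for cell in column]
            runs += sum(1 for _ in groupby(bitmap))
    return runs


# Hypothetical table with columns (paint, maker).
table = [
    ("red", "ford"), ("blue", "honda"), ("green", "ford"),
    ("red", "honda"), ("blue", "ford"), ("red", "ford"),
]

unsorted_runs = total_runs(table)
sorted_runs = total_runs(sorted(table))  # lexicographic row sort
```

On this table the sort cuts the total from 23 runs to 15, and most of the gain is in the first column.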
Improving compression via k-of-N encoding
value   1-of-N   2-of-N
cat     100000   1100
dog     010000   1010
dish    001000   1001
fish    000100   0110
cow     000010   0101
cat     100000   1100
pony    000001   0011

With L bitmaps, you can represent L values by mapping each value to one bitmap;
Alternatively, you can represent $\binom{L}{2} = L(L-1)/2$ values by mapping each value to a pair of bitmaps;
More generally, you can represent $\binom{L}{k}$ values by mapping each value to a k-tuple of bitmaps;
At query time, you need to load k bitmaps in a look-up for one value;
You trade query-time performance for fewer bitmaps;
Often, fewer bitmaps translate into a smaller index, created faster.
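A small sketch of k-of-N encoding and look-up. The function names are assumptions, and codes are assigned in an arbitrary order here; the talk later argues for a specific assignment order.

```python
from itertools import combinations
from math import comb


def bitmaps_needed(n_values, k):
    """Smallest L such that C(L, k) >= n_values."""
    L = k
    while comb(L, k) < n_values:
        L += 1
    return L


def build_index(column, k):
    """Give each distinct value k bitmap positions, then fill the L bitmaps."""
    distinct = sorted(set(column))
    L = bitmaps_needed(len(distinct), k)
    codes = dict(zip(distinct, combinations(range(L), k)))
    bitmaps = [0] * L
    for row_id, value in enumerate(column):
        for position in codes[value]:
            bitmaps[position] |= 1 << row_id
    return codes, bitmaps


def rows_equal_to(value, codes, bitmaps):
    """A look-up loads the k bitmaps of the value and ANDs them together."""
    first, *rest = codes[value]
    result = bitmaps[first]
    for position in rest:
        result &= bitmaps[position]
    return [r for r in range(result.bit_length()) if (result >> r) & 1]


column = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
codes, bitmaps = build_index(column, k=2)
cat_rows = rows_equal_to("cat", codes, bitmaps)
```

Six distinct values need six bitmaps when k = 1 but only four when k = 2, at the price of one extra AND per look-up.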
Encode then sort? Or vice versa?
Two different conceptual approaches:
1. Encode attributes in table, obtaining an uncompressed index;
   sort the index rows;
   compress each column.
2. Sort the table rows;
   encode attributes in table, building the compressed index on-the-fly.

Approach 1:
paint  maker
red    ford        1 1 0 1 0 1       0 1 1 1 0 1
blue   honda   ⇒   1 0 1 0 1 1   ⇒   1 0 1 0 1 1
green  ford        0 1 1 1 0 1       1 1 0 1 0 1
...    ...         ...               ...

Approach 2:
paint  maker       paint  maker
red    ford        blue   honda       1 0 1 0 1 1
blue   honda   ⇒   green  ford    ⇒   1 1 0 1 0 1
green  ford        red    ford        0 1 1 1 0 1
...    ...         ...    ...         ...
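The second approach can be sketched as a single pass over the sorted table that never materializes an uncompressed index. For simplicity this sketch assumes a 1-of-N encoding and stores each bitmap as a list of [start, end) row ranges of 1s; both choices are assumptions, not the format used in the talk.

```python
def build_rle_index(sorted_rows):
    """Stream over sorted rows; each (column, value) bitmap keeps only its runs of 1s."""
    index = {}  # (column number, value) -> list of [start, end) row ranges
    for row_id, row in enumerate(sorted_rows):
        for col, value in enumerate(row):
            runs = index.setdefault((col, value), [])
            if runs and runs[-1][1] == row_id:
                runs[-1][1] = row_id + 1  # extend the current run of 1s
            else:
                runs.append([row_id, row_id + 1])  # start a new run
    return index


table = [("red", "ford"), ("blue", "honda"), ("green", "ford")]
index = build_rle_index(sorted(table))
```

Memory use is proportional to the number of runs, so the better the sort, the smaller the structure built on-the-fly.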
Gray-code order
Lex. order   Gray-code order
0 1 1        0 1 1
0 1 1        0 1 1
1 0 1        1 1 0
1 0 1        1 1 0
1 1 0        1 1 1
1 1 0        1 1 1
1 1 1        1 1 1
1 1 1        1 0 1
1 1 1        1 0 1

Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays);
May improve compression more than lex. sort (k > 1);
[Pinar et al., 2005] process an uncompressed bitmap index.
Slow, if the uncompressed index does not fit in RAM.
GC order is not supported by DBMSes or Unix utilities.
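A sketch of Gray-code ordering for short bit strings: each row is ranked by its position in the reflected Gray-code sequence, obtained as the running XOR of its bits. The function name is an assumption, and this in-memory sort is only meant to reproduce the small example above.

```python
def gray_rank(bits):
    """Position of a bit string in the reflected Gray-code sequence."""
    rank = 0
    parity = 0
    for b in bits:
        parity ^= int(b)  # running XOR of the bits seen so far
        rank = (rank << 1) | parity
    return rank


rows = ["011", "011", "101", "101", "110", "110", "111", "111", "111"]
gc_sorted = sorted(rows, key=gray_rank)
```

In the Gray-code result, consecutive distinct rows differ in a single bit (011, 110 aside), which tends to lengthen runs in the later bitmaps.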
Gray-code sorting, cheaply
Size improvement is small (usually < 4%), but it’s essentially free:
1. What Pinar et al. do: expensive GC sort after encoding;
   e.g., [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]);
2. Instead, sort the table lexicographically, comparing values alphabetically or by frequency (easy);
   e.g., [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax]
3. Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100;
   Lex ascending sequence: Cat, Dog, Girl, Tax.
   GC ascending sequence for the codes: 0011, 0110, 0101, 1100.
   e.g., [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100]
   (generates a GC-sorted result without expensive GC sorting).
4. Easily extended to more than one column.
In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.
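Steps 2 and 3 can be sketched as follows: enumerate the k-of-N codes, put them in Gray-code order, and hand them out to the values in lexicographic order. The helper names are assumptions; the example reuses the four values and the 2-of-4 codes from the list above.

```python
from itertools import combinations


def gray_rank(bits):
    """Position of a bit string in the reflected Gray-code sequence."""
    rank = 0
    parity = 0
    for b in bits:
        parity ^= int(b)
        rank = (rank << 1) | parity
    return rank


def gc_ordered_codes(n_bitmaps, k):
    """All bit strings with exactly k ones, in ascending Gray-code order."""
    codes = [
        "".join("1" if i in ones else "0" for i in range(n_bitmaps))
        for ones in combinations(range(n_bitmaps), k)
    ]
    return sorted(codes, key=gray_rank)


def assign_codes(values, n_bitmaps, k):
    """Pair the i-th smallest value with the i-th code in Gray-code order."""
    return dict(zip(sorted(set(values)), gc_ordered_codes(n_bitmaps, k)))


codes = assign_codes(["Cat", "Dog", "Girl", "Tax"], n_bitmaps=4, k=2)

# An ordinary lexicographic sort of the values, followed by encoding.
encoded = [codes[v] for v in sorted(["Tax", "Cat", "Girl", "Cat"])]
```

Only the small list of codes is ever sorted by Gray-code rank; the table itself is sorted with a plain lexicographic sort.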
Test data sets
Previous studies used data sets where the uncompressed index would fit in RAM.
Do their results apply to more realistic data sets?
Our tests: mix of real and synthetic data,
  - up to 877 M rows, 22 GB, 4 M attribute values;
  - using 4–10 columns/dimensions.
When sorting, column order matters
The first column(s) gain more from the sort (column 1 is the primary sort key);
Its bitmaps (first 11 in the example) are compressed well, compared to a “random sort” (red above green).
The least important column’s bitmaps (43–49) don’t gain much (red vs. green).
[Figure: Compression on TWEED-4d. x-axis: bitmap rank (ticks at 11, 18, 43, 49); y-axis: 1 − C/N, from 0 to 1; series: Gray, Random-sort.]
When sorting, column order matters
Conceptually, we may wish to reorder columns, e.g., swap columns 1 & 3.
Column order is crucial (to successful sorting).
Finding the best ordering quickly remains open.
[Figure: Netflix: 24 column orderings. x-axis: column permutation (all 24 orderings of columns 1 to 4, from 4321 to 1234); y-axis: index size, from 1e+08 to 5.5e+08; series: k=1, k=4.]
Progress toward choosing column order
The paper models the “gain” of putting a given column first.
Idea: order columns greedily (by max gain).
Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain.
Factors:
  - skews of columns
  - number of distinct values
  - k
  - density of column’s bitmaps
What usually works for dimension ordering?: k=1
For 1-of-N bitmaps, a density-based approach was okay:
Ordering rule, k = 1: “sparse but not too sparse”
Order columns by decreasing
$\min\left(\frac{1}{n_i}, \frac{1 - 1/n_i}{4w - 1}\right)$, where
  - $n_i$ → the number of distinct values in column i,
  - w → the word size.
[Figure: plot with x-axis “distinct values in column”, from 50 to 300.]
We see a 30–40% size reduction, merely knowing the dimension sizes ($n_i$).
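The ordering rule is easy to apply once the dimension sizes are known. The sketch below assumes a 32-bit word size and made-up dimension sizes; the function names are illustrative.

```python
def density_key(n_distinct, word_bits=32):
    """min(1/n_i, (1 - 1/n_i) / (4w - 1)) for a column with n_i distinct values."""
    return min(1 / n_distinct, (1 - 1 / n_distinct) / (4 * word_bits - 1))


def order_columns(dimension_sizes, word_bits=32):
    """Column numbers sorted by decreasing key: the first is the primary sort key."""
    return sorted(
        range(len(dimension_sizes)),
        key=lambda i: density_key(dimension_sizes[i], word_bits),
        reverse=True,
    )


# Hypothetical table whose four columns have 2, 128, 1000 and 50 distinct values.
order = order_columns([2, 128, 1000, 50])
```

With w = 32 the two terms of the minimum meet at n_i = 4w = 128, so columns with about 128 distinct values come first, ahead of both very dense columns (2 values) and very sparse ones (1000 values).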
What usually works for dimension ordering?: k > 1
The density formula (with $n_i$ replaced by $\sqrt[k]{n_i}$) recommends poorly when k > 1. Our experiments on synthetic data give some guidance:
When k > 1, order columns by
1 descending skew
2 descending size
(And do the reverse when k = 1.)
Open issues, k > 1
1 How do we balance skew & size factors?
2 What other properties of the histograms are needed?
Bitmap-by-bitmap reordering
One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006].
Our best implementation of this is ≈ 100 times slower and cannot handle larger data sets.
We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%.
Canahuate et al. suggest ordering does not matter much, but we see factor-of-2 differences (??)
Seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps)
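For readers unfamiliar with Gray-code (GC) sorting, here is a minimal Python sketch of ordering the rows of a small bit matrix in reflected Gray-code order. It is not the implementation of Canahuate et al., nor the one benchmarked in this talk; the names `gray_rank` and `gray_code_sort` are hypothetical, and rows are assumed to be tuples of 0/1 values, one entry per bitmap.

```python
def gray_rank(bits):
    """Position of a 0/1 tuple in reflected Gray-code order.

    The binary digit at each position is the XOR (parity) of all Gray
    digits seen so far, which is the standard Gray-to-binary conversion.
    """
    rank = 0
    parity = 0
    for g in bits:
        parity ^= g
        rank = (rank << 1) | parity
    return rank


def gray_code_sort(rows):
    """Return the rows sorted in reflected Gray-code order."""
    return sorted(rows, key=gray_rank)


if __name__ == "__main__":
    rows = [(1, 0), (0, 1), (1, 1), (0, 0)]
    for row in gray_code_sort(rows):
        print(row)
```

When every bit pattern is present, consecutive rows in this order differ in exactly one bit. Building one large integer key per row is simple but costly for wide indexes, which is one reason a sketch like this would not scale to large data sets.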
Index size versus block-wise sorting
[Figure: Netflix data set. Index size (MB), from 100 to 900, versus number of blocks, from 0 to 700, with one curve each for k=1, k=2, k=3 and k=4.]
Instead of fully sorting the table, we sorted it block-wise;
Fewer blocks means a more complete sort;
Larger k means smaller index (in this case);
Index size diminishes drastically with sorting.
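Block-wise sorting can be sketched as follows. This is an illustrative Python version under assumptions of my own: the table is a list of row tuples held in memory, blocks are contiguous and of near-equal size, and each block is sorted lexicographically. The function name `blockwise_sort` is hypothetical, and the talk does not specify how blocks were formed.

```python
def blockwise_sort(rows, num_blocks):
    """Sort each contiguous block of rows lexicographically.

    The table is cut into at most num_blocks contiguous blocks of equal
    size (the last one may be shorter). With num_blocks == 1 this is a
    full sort of the table.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if not rows:
        return []
    block_size = -(-len(rows) // num_blocks)  # ceiling division
    result = []
    for start in range(0, len(rows), block_size):
        result.extend(sorted(rows[start:start + block_size]))
    return result


if __name__ == "__main__":
    table = [(3, 1), (1, 2), (2, 0), (0, 5)]
    print(blockwise_sort(table, 2))  # two blocks, each sorted on its own
    print(blockwise_sort(table, 1))  # one block: a full sort
```

Using fewer blocks brings the result closer to a full sort, which matches the observation above that fewer blocks means a more complete sort.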
Future directions
Need better mathematical modelling of bitmap compressed size in sorted tables;
Study the effect of word length (16, 32, 64, 128 bits);
Investigate “Long run Gray code” (discussed by Knuth).
Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http://hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-05-30).
Chan, C. Y. and Ioannidis, Y. E. (1999). An efficient bitmap encoding scheme for selection queries. In SIGMOD’99, pages 215–226.
Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321.
Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38.