Sorting Improves Bitmap Indexes

Owen Kaser

Joint work (presented at BDA’08 and DOLAP’08) with Daniel Lemire and Kamel Aouiche, UQAM.

December 4, 2008

Owen Kaser Sorting Improves Bitmap Indexes

Database Indexes

Databases use precomputed indexes (auxiliary data structures) to speed processing.

An index costs memory, can hurt update speed.

Who makes the tradeoff for a given database?

system itself?

database designer/administrator?

Improving indexes is practically important.

Bitmap indexes

SELECT * FROM T WHERE x=a AND y=b;

Bitmap indexes have a long history. (1972 at IBM.)

Long history with DW & OLAP. (Sybase IQ since mid 1990s.)

Main competition: B-trees.

Above, compute

{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
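The intersection above can be sketched with one bitmap per predicate. This is a minimal illustration only: the toy table, the 0-based row ids, and the helper name `build_bitmap` are assumptions made for the example, and a Python integer stands in for the bitmap.

```python
# Toy table T with two columns, x and y (hypothetical data).
rows = [("a", "b"), ("a", "c"), ("d", "b"), ("a", "b")]


def build_bitmap(rows, col, value):
    """Return an int whose bit r is set when rows[r][col] == value."""
    bitmap = 0
    for r, row in enumerate(rows):
        if row[col] == value:
            bitmap |= 1 << r
    return bitmap


x_is_a = build_bitmap(rows, 0, "a")   # rows where x = a
y_is_b = build_bitmap(rows, 1, "b")   # rows where y = b
matches = x_is_a & y_is_b             # set intersection as a bitwise AND

# Recover the row ids (0-based here) of the matching rows.
row_ids = [r for r in range(len(rows)) if (matches >> r) & 1]
```

Here `row_ids` holds the rows that satisfy both predicates; the table itself is only touched to fetch those rows.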

Bitmaps and fast AND/OR operations

Computing the union of two sets of integers between 1 and 64 (e.g., row ids in a trivial table): e.g., {1, 5, 8} ∪ {1, 3, 5}?

Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)

Extend to sets from 1..N using ⌈N/64⌉ operations.

To compute $[a_0, \ldots, a_{N-1}] \lor [b_0, \ldots, b_{N-1}]$:

$a_0, \ldots, a_{63}$ BitwiseOR $b_0, \ldots, b_{63}$;

$a_{64}, \ldots, a_{127}$ BitwiseOR $b_{64}, \ldots, b_{127}$;

$a_{128}, \ldots, a_{191}$ BitwiseOR $b_{128}, \ldots, b_{191}$;

. . .
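A small sketch of the word-by-word union, assuming 64-bit words stored as Python integers. The function names (`to_words`, `union`, `from_words`) and the bit layout (integer i goes to bit (i − 1) mod 64 of word ⌊(i − 1)/64⌋) are choices made for this illustration.

```python
WORD_BITS = 64


def to_words(members, n):
    """Pack a set of integers from 1..n into ceil(n / 64) words."""
    words = [0] * ((n + WORD_BITS - 1) // WORD_BITS)
    for i in members:
        words[(i - 1) // WORD_BITS] |= 1 << ((i - 1) % WORD_BITS)
    return words


def union(a_words, b_words):
    """One bitwise OR per pair of words."""
    return [a | b for a, b in zip(a_words, b_words)]


def from_words(words):
    """Unpack the words back into a set of integers."""
    return {
        w * WORD_BITS + b + 1
        for w, word in enumerate(words)
        for b in range(WORD_BITS)
        if (word >> b) & 1
    }


small = from_words(union(to_words({1, 5, 8}, 64), to_words({1, 3, 5}, 64)))
large = union(to_words({1, 100}, 130), to_words({65, 130}, 130))
```

The first call needs a single OR; the second covers 1..130 and therefore uses three words.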

Bitmap compression

[Figure: a column x with n rows (values 1, 3, 1, 2, ...) shown beside its L index bitmaps (x=1, x=2, x=3, ...), one bit per row in each bitmap.]

A column with n rows and L distinct values ⇒ nL bits

E.g., $n = 10^6$, $L = 10^4$ → $10^{10}$ bits = 10 Gbits

Uncompressed bitmaps are often impractical

Moreover, bitmaps often contain long streams of zeroes. . .

Logical operations over these zeroes are a waste of CPU cycles.
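The nL count can be checked on a tiny column. The sketch below is illustrative only; `index_bitmaps` is a hypothetical helper that stores each bitmap as a plain list of 0/1 values.

```python
def index_bitmaps(column):
    """Map each distinct value to a list of 0/1 bits, one bit per row."""
    values = sorted(set(column))
    return {v: [1 if x == v else 0 for x in column] for v in values}


column = [1, 3, 1, 2]                      # n = 4 rows, L = 3 distinct values
bitmaps = index_bitmaps(column)

total_bits = sum(len(bits) for bits in bitmaps.values())   # n * L
ones = sum(sum(bits) for bits in bitmaps.values())         # one 1 per row
zero_fraction = 1 - ones / total_bits

# The slide's example: 10**6 rows and 10**4 distinct values.
slide_bits = 10**6 * 10**4
```

With one bitmap per value, every row contributes exactly one 1, so the share of zeroes grows with L.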

How to compress bitmaps?

Must handle long streams of zeroes efficiently ⇒ Run-length encoding? (RLE)

Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . .

So just encode the run lengths, e.g.,

0001111100010111 → 3, 5, 3, 1, 1, 3
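A minimal sketch of this encoding on bit strings. The function names are hypothetical, and the decoder assumes the value of the first bit is stored alongside the run lengths.

```python
from itertools import groupby


def run_lengths(bits):
    """Return the lengths of the maximal runs in a bit string, left to right."""
    return [len(list(group)) for _, group in groupby(bits)]


def decode(first_bit, lengths):
    """Rebuild the bit string from its first bit and its run lengths."""
    out, bit = [], first_bit
    for n in lengths:
        out.append(bit * n)
        bit = "1" if bit == "0" else "0"   # runs alternate between 0s and 1s
    return "".join(out)


bitmap = "0001111100010111"
lengths = run_lengths(bitmap)
restored = decode(bitmap[0], lengths)
```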

Byte/Word-aligned RLE

RLE variants can focus on runs that align with machine-word boundaries.

Trade compression for speed.

Variants: BBC (byte aligned), WAH

Our EWAH extends Wu et al.’s word-aligned hybrid (WAH).

0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . . ⇒ dirty word, run of 2 “clean 0” words, dirty word, . . .
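The dirty/clean distinction can be sketched as follows. This is a simplified illustration, not the WAH or EWAH storage format: it only groups words into dirty words and runs of clean words, using 16-bit words to match the example, and the names `classify` and `summarize` are hypothetical.

```python
from itertools import groupby


def classify(word, width):
    """A word is clean when all its bits are equal, dirty otherwise."""
    if word == 0:
        return "clean 0"
    if word == (1 << width) - 1:
        return "clean 1"
    return "dirty"


def summarize(words, width=16):
    """Keep dirty words verbatim; replace each run of clean words by its length."""
    out = []
    for kind, group in groupby(words, key=lambda w: classify(w, width)):
        group = list(group)
        if kind == "dirty":
            out.extend(("dirty", w) for w in group)
        else:
            out.append((kind, len(group)))
    return out


words = [0b0101000000000000, 0, 0, 0b0011111111111100]
summary = summarize(words)
```

Only runs that fill whole words are compressed, which is the compression-for-speed trade mentioned above.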

Computational and storage bounds

n → number of rows, c → number of 1s per row;

Total size of bitmaps is in O(nc);

Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps.)

Implementation scales to millions of bitmaps.

Improving compression by sorting the table

RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better;

But finding the best row ordering is NP-hard.

Lexicographic row sorting is

fast, even for very large tables.

easy: sort is a Unix staple.

Substantial index-size reductions (often 2.5 times)
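The effect of row order can be seen by counting runs, since run-length schemes store roughly one item per run. The sketch below uses a made-up six-row table and 1-of-N bitmaps; `count_runs` and `index_runs` are hypothetical helpers, and the run count is only a proxy for compressed size.

```python
def count_runs(bits):
    """Number of maximal runs of identical bits."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])


def index_runs(table):
    """Total number of runs over the 1-of-N bitmaps of every column."""
    total = 0
    for col in range(len(table[0])):
        column = [row[col] for row in table]
        for value in set(column):
            total += count_runs([x == value for x in column])
    return total


table = [
    ("red", "ford"), ("blue", "honda"), ("red", "ford"),
    ("blue", "honda"), ("red", "honda"), ("blue", "ford"),
]
unsorted_runs = index_runs(table)
sorted_runs = index_runs(sorted(table))   # lexicographic row sort
```

On this toy input the sorted table needs 12 runs instead of 22, and most of the saving comes from the first column.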

Improving compression via k-of-N encoding

1-of-N (6 bitmaps):

value   code
cat     100000
dog     010000
dish    001000
fish    000100
cow     000010
cat     100000
pony    000001

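2-of-N (4 bitmaps), for the rows cat, dog, dish, fish, cow, cat, pony:

value   code
cat     1100
dog     1010
dish    1001
fish    0110
cow     0101
cat     1100
pony    0011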

With L bitmaps, you can represent L values by mapping each value to one bitmap;

Alternatively, you can represent $\binom{L}{2} = L(L-1)/2$ values by mapping each value to a pair of bitmaps;

More generally, you can represent $\binom{L}{k}$ values by mapping each value to a k-tuple of bitmaps;

At query time, you need to load k bitmaps in a look-up for one value;

You trade query-time performance for fewer bitmaps;

Often, fewer bitmaps translate into a smaller index, created faster.
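A short sketch of k-of-N code assignment, assuming codes are handed out in the order produced by `itertools.combinations` and values are numbered by first appearance. The helper names are hypothetical; the talk does not prescribe this particular assignment here.

```python
from itertools import combinations


def k_of_n_codes(n_bitmaps, k):
    """All bit strings of length n_bitmaps with exactly k ones."""
    codes = []
    for positions in combinations(range(n_bitmaps), k):
        codes.append("".join("1" if i in positions else "0" for i in range(n_bitmaps)))
    return codes


def encode(values, n_bitmaps, k):
    """Give each distinct value its own k-of-N code, in order of first appearance."""
    codes = k_of_n_codes(n_bitmaps, k)
    distinct = list(dict.fromkeys(values))
    if len(distinct) > len(codes):
        raise ValueError("not enough bitmaps for this many distinct values")
    return dict(zip(distinct, codes))


values = ["cat", "dog", "dish", "fish", "cow", "cat", "pony"]
one_of_six = encode(values, 6, 1)
two_of_four = encode(values, 4, 2)
```

Six distinct values need six bitmaps when k = 1 but only four when k = 2; a look-up for one value then ANDs two bitmaps instead of reading one.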

Encode then sort? Or vice versa?

Two different conceptual approaches:

1 Encode attributes in table, obtaining an uncompressed index

Sort the index rows

Compress each column

2 Sort the table rows

Encode attributes in table, build compressed index on-the-fly.

Approach 1 (encode, then sort the index rows):

paint   maker         encoded             sorted
red     ford          1 1 0 1 0 1         0 1 1 1 0 1
blue    honda    ⇒    1 0 1 0 1 1    ⇒    1 0 1 0 1 1
green   ford          0 1 1 1 0 1         1 1 0 1 0 1
. . .   . . .         . . .               . . .

Approach 2 (sort the table rows, then encode):

paint   maker         paint   maker         encoded
red     ford          blue    honda         1 0 1 0 1 1
blue    honda    ⇒    green   ford     ⇒    1 1 0 1 0 1
green   ford          red     ford          0 1 1 1 0 1
. . .   . . .         . . .   . . .         . . .
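The second approach can be sketched as a single pass over the sorted rows that never materializes the uncompressed index. The sketch assumes 1-of-N encoding and plain run-length lists rather than a word-aligned format; `build_sorted_index` and its data layout are hypothetical.

```python
def build_sorted_index(table):
    """Sort the rows, then build run-length-encoded 1-of-N bitmaps in one pass.

    Each bitmap is a list of run lengths alternating zeros, ones, zeros, ...
    It always starts with the (possibly empty) run of zeros, and trailing
    zeros are left implicit.
    """
    rows = sorted(table)
    index = {}      # (column number, value) -> list of run lengths
    last_row = {}   # (column number, value) -> last row where the bit was 1
    for r, row in enumerate(rows):
        for col, value in enumerate(row):
            key = (col, value)
            if key not in index:
                index[key] = [r, 1]            # r zeros, then the first 1
            elif last_row[key] == r - 1:
                index[key][-1] += 1            # extend the current run of ones
            else:
                gap = r - last_row[key] - 1
                index[key] += [gap, 1]         # a run of zeros, then a new 1
            last_row[key] = r
    return rows, index


table = [("red", "ford"), ("blue", "honda"), ("green", "ford")]
rows, index = build_sorted_index(table)
```

Memory use here depends on the number of runs, not on n times the number of bitmaps, which is why this route remains practical when the uncompressed index would not fit in RAM.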

Gray-code order

Lex. order    Gray-code order
0 1 1         0 1 1
0 1 1         0 1 1
1 0 1         1 1 0
1 0 1         1 1 0
1 1 0         1 1 1
1 1 0         1 1 1
1 1 1         1 1 1
1 1 1         1 0 1
1 1 1         1 0 1

Gray-code (GC) order is an alternative to lexicographical order (defined only for bit arrays);

May improve compression more than lex. sort (k > 1);

[Pinar et al., 2005] process an uncompressed bitmap index.

Slow, if uncompressed index does not fit in RAM.

GC order is not supported by DBMSes or Unix utilities.
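The two orders in the example can be reproduced with a sort key that returns a bit string's position in the reflected Gray-code sequence. This is a sketch on in-memory strings; `gray_rank` and `total_runs` are hypothetical helper names, and the run count is only a rough proxy for compressed size.

```python
def gray_rank(bits):
    """Position of a bit string in the reflected Gray-code sequence."""
    g = int(bits, 2)
    rank = 0
    while g:
        rank ^= g      # XOR of all right shifts converts Gray code to binary
        g >>= 1
    return rank


def total_runs(rows):
    """Runs of identical bits, summed over the columns (one column per bitmap)."""
    runs = 0
    for column in zip(*rows):
        runs += 1 + sum(a != b for a, b in zip(column, column[1:]))
    return runs


rows = ["111", "011", "101", "110", "111", "011", "101", "110", "111"]
lex_sorted = sorted(rows)
gc_sorted = sorted(rows, key=gray_rank)
```

For these nine rows the Gray-code order yields 7 runs across the three bitmaps, against 8 for the lexicographic order.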

Gray-code sorting, cheaply

Size improvement is small (usually < 4%), but it’s essentially free:

1 What Pinar et al. do: expensive GC sort after encoding

e.g.: [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]);

2 Instead, sort the table lexicographically, comparing values alphabetically or by frequency (easy);

e.g.: [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax]

3 Map ordered values to k-tuples of bitmaps ordered as Gray codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100;

Lex ascending sequence: Cat, Dog, Girl, Tax.

GC ascending sequence: 0011, 0110, 0101, 1100 for codes

e.g.: [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting).

4 Easily extended for > 1 columns.

In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier.
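Steps 2 and 3 can be sketched for a single column: enumerate the k-of-N codes in Gray-code order, hand them to the values in sorted order, then encode the already-sorted table. The helper names are hypothetical and alphabetical comparison is assumed.

```python
from itertools import combinations


def gray_rank(bits):
    """Position of a bit string in the reflected Gray-code sequence."""
    g, rank = int(bits, 2), 0
    while g:
        rank ^= g
        g >>= 1
    return rank


def codes_in_gc_order(n_bitmaps, k):
    """All k-of-N codes, listed in ascending Gray-code order."""
    codes = [
        "".join("1" if i in pos else "0" for i in range(n_bitmaps))
        for pos in combinations(range(n_bitmaps), k)
    ]
    return sorted(codes, key=gray_rank)


def assign_codes(distinct_values, n_bitmaps, k):
    """The i-th smallest value gets the i-th code in Gray-code order."""
    return dict(zip(sorted(distinct_values), codes_in_gc_order(n_bitmaps, k)))


codes = assign_codes(["Tax", "Cat", "Girl", "Dog"], 4, 2)
column = sorted(["Tax", "Cat", "Girl", "Cat"])      # cheap lexicographic sort
encoded = [codes[v] for v in column]
```

Only the handful of distinct codes is sorted by Gray-code rank; the rows themselves are sorted with an ordinary comparison.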

Test data sets

Previous studies used data sets where the uncompressed index would fit in RAM.

Do their results apply to more realistic data sets?

Our tests: Mix of real and synthetic data,

up to 877 M rows, 22 GB, 4 M attribute values.

using 4–10 columns/dimensions

When sorting, column order matters

The first column(s) gain more from the sort (column 1 is primary sort key);

Its bitmaps (first 11 in example) are compressed well, compared to a “random sort”. (Red above green)

Least important column’s bitmaps (43–49) don’t gain much (red vs green)

[Figure: Compression on TWEED-4d. x-axis: bitmap rank (ticks at 11, 18, 43, 49); y-axis: 1 − C/N (0 to 1); series: Gray, Random-sort.]

When sorting, column order matters

Conceptually, we may wish to reorder columns, e.g., swap columns 1 & 3.

Column order is crucial (to successful sorting).

Finding the best ordering quickly remains open.

[Figure: Netflix: 24 column orderings. x-axis: column permutation (4321, 4312, ..., 1243, 1234); y-axis: index size (1e+08 to 5.5e+08); series: k=1, k=4.]

Progress toward choosing column order

Paper models “gain” of putting a given column first.

Idea: order columns greedily (by max gain).

Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain.

Factors:

skews of columns

number of distinct values

k

density of column’s bitmaps

What usually works for dimension ordering?: k=1

For 1-of-N bitmaps, a density-based approach was okay:

Ordering rule, k = 1 : “sparse but not too sparse”

Order columns by decreasing

$$\min\left(\frac{1}{n_i},\ \frac{1 - 1/n_i}{4w - 1}\right),$$

where $n_i$ is the number of distinct values in column $i$ and $w$ is the word size.

[Figure: this quantity plotted against the number of distinct values in the column (x-axis, 50 to 300).]

See 30–40% size reduction, merely knowing dimension sizes ($n_i$).
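A minimal sketch of this ordering rule follows. The function names and the default word size of 32 are assumptions made for illustration; only the formula itself comes from the slide.

```python
def density_key(n_distinct, word_size=32):
    """Compute min(1/n_i, (1 - 1/n_i) / (4w - 1)) for a column with n_i distinct values."""
    inv = 1.0 / n_distinct
    return min(inv, (1.0 - inv) / (4 * word_size - 1))


def order_columns_k1(cardinalities, word_size=32):
    """Return column indexes sorted by decreasing key; the first index is sorted on first."""
    return sorted(
        range(len(cardinalities)),
        key=lambda i: density_key(cardinalities[i], word_size),
        reverse=True,
    )
```

The two arguments of the minimum are equal when $n_i = 4w$, so with $w = 32$ the key peaks at 128 distinct values: columns that are sparse, but not too sparse, come first.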


What usually works for dimension ordering?: k > 1

The density formula (with $n_i \to \sqrt[k]{n_i}$) recommends poorly when k > 1. Our experiments on synthetic data give some guidance:

When k > 1, order columns by

1 descending skew

2 descending size

(And do the reverse when k = 1.)
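One possible reading of this guidance is sketched below, treating skew as the primary sort key and size (number of distinct values) as a tie-breaker. Both choices are assumptions: the slides do not say how the two factors are combined, and the skew measure used here (the frequency share of the most common value) is a hypothetical stand-in.

```python
from collections import Counter


def skew(values):
    """Hypothetical skew measure: fraction of rows holding the most common value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


def order_columns_k_gt_1(columns):
    """Order column indexes by descending skew, then by descending distinct-value count.

    `columns` is a list of value lists, one list per column.
    """
    stats = [(skew(col), len(set(col))) for col in columns]
    return sorted(range(len(columns)), key=lambda i: stats[i], reverse=True)
```

Reversing the returned list gives the corresponding order for k = 1 under the same assumptions.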

Open issues, k > 1

1 How do we balance skew & size factors?

2 What other properties of the histograms are needed?


Bitmap-by-bitmap reordering

One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006].

Our best implementation of this is ≈ 100 times slower and cannot handle larger data sets.

We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%.

Canahuate suggests ordering does not matter much, but we see factor-of-2 differences (??)

It seems sufficient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps).
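For reference, sorting bitmap rows in Gray-code order can be sketched as follows. This uses the standard reflected binary Gray code and hypothetical function names; it illustrates the ordering idea and is not claimed to match the implementation of Canahuate et al. or the one timed above.

```python
def gray_rank(bits):
    """Position of a bit tuple (most significant bit first) in the reflected Gray code."""
    rank = 0
    parity = 0
    for b in bits:
        parity ^= b  # the running XOR of Gray-code bits yields the binary digits
        rank = (rank << 1) | parity
    return rank


def gray_code_sort(rows):
    """Sort rows of 0/1 tuples so that they follow reflected Gray-code order."""
    return sorted(rows, key=gray_rank)
```

When every bit pattern is present, consecutive rows in this order differ in exactly one bit, which is what tends to lengthen runs within each bitmap.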


Index size versus block-wise sorting

[Figure: Netflix. Index size (MB; y-axis, 100 to 900) versus number of blocks (x-axis, 0 to 700), with one curve each for k=1, k=2, k=3 and k=4.]

Instead of fully sorting the table, we sorted it block-wise;

Fewer blocks means a more complete sort;

Larger k means smaller index (in this case);

Index size diminishes drastically with sorting.
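Block-wise sorting can be sketched as below, assuming the table is cut into contiguous blocks of roughly equal length and each block is sorted independently. How the blocks were actually formed in the experiments is not stated on the slide, so the splitting rule and the function name are assumptions.

```python
def blockwise_sort(rows, n_blocks):
    """Split rows into at most n_blocks contiguous blocks and sort each block on its own."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    if not rows:
        return []
    block_len = -(-len(rows) // n_blocks)  # ceiling division
    out = []
    for start in range(0, len(rows), block_len):
        out.extend(sorted(rows[start:start + block_len]))
    return out
```

With `n_blocks=1` this is a full sort; as `n_blocks` grows toward the number of rows, the table is left closer to its original order.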


Future directions

Need better mathematical modelling of bitmap compressed size in sorted tables;

Study the effect of word length (16, 32, 64, 128 bits);

Investigate “Long run Gray code” (discussed by Knuth).


Questions?

?


Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http://hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-05-30).

Chan, C. Y. and Ioannidis, Y. E. (1999). An efficient bitmap encoding scheme for selection queries. In SIGMOD’99, pages 215–226.

Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321.

Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1–38.


Avoiding column order altogether

We tried a “frequent component” approach, but it did not work well and was also slower.
