Online Bigtable merge compaction - UCRneal/Slides/bigtable_merge_compaction.pdf · 2015-09-22 ·...


Online Bigtable merge compaction

work in progress

Claire Mathieu (CNRS Paris) · Carl Staelin (Google Haifa) [instigator] · Neal E. Young¹ (UC Riverside) [me] · Arman Yousefia (UCLA) [my student]

you are here, this is now: Northeastern University, September 17, 2015

¹ funded by faculty re$earch award

BIGTABLE — data storage at Google

Google Maps, Search/Crawl, Gmail, ... use BIGTABLE to store data:

- 24,500 Bigtable servers
- 1.2 million requests per second
- 16 GB/s of outgoing RPC traffic
- over a petabyte of data just for Google Crawl and Analytics
- (these figures are from 2006)

Similar to other "NoSQL" databases: Accumulo, AsterixDB, Cassandra, HBase, Hypertable, Spanner, ...

Used by Adobe, Ebay, Facebook, GitHub, Meetup, Netflix, Twitter, ...

"Log-structured merge tree" architecture — for high-volume, highly reliable, distributed, real-time data storage.

BIGTABLE — implements the dictionary data type

operations supported by a Bigtable instance:

- write(key, value)
- read(key) — return the most recent value written for key
- ... there's more, but not today ...

BIGTABLE — writes and flushes

write(key, value):

1. Store the key/value pair in a cache (e.g. a hash table in RAM).

The environment periodically forces a flush of the cache to a new immutable disk file.

Example

write(1, a); write(2, b); write(3, c); write(4, d); flush();
write(5, e); write(6, f); write(7, g); flush();
write(8, h); write(9, i); flush();

cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d) [from 1st flush] · (5, e) (6, f) (7, g) [from 2nd flush] · (8, h) (9, i) [from 3rd flush]

The environment forces flushes at arbitrary times.
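The write/flush mechanics above can be sketched in a few lines. This is a toy model (class and method names are mine, not Google's code): writes land in an in-memory cache, and a flush appends the cache's contents to the file sequence as a new immutable file.

```python
# Toy model of Bigtable's write path: a RAM cache plus a sequence of
# immutable "files" (here, plain dicts), one per flush.

class Bigtable:
    def __init__(self):
        self.cache = {}   # in-RAM key/value cache
        self.files = []   # immutable files, oldest first

    def write(self, key, value):
        self.cache[key] = value          # step 1: store the pair in the cache

    def flush(self):
        # the environment forces this at arbitrary times
        if self.cache:
            self.files.append(dict(self.cache))  # new immutable disk file
            self.cache = {}

t = Bigtable()
for key, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    t.write(key, value)
t.flush()
for key, value in [(5, "e"), (6, "f"), (7, "g")]:
    t.write(key, value)
t.flush()
print([sorted(f) for f in t.files])   # [[1, 2, 3, 4], [5, 6, 7]]
```

After the two flushes the file sequence holds one file per flush, exactly as in the example above.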

BIGTABLE — reads and compactions

read(key):

1. Check the cache for key.
2. If not found, check the files (most recent first). cost = O(#files)

compaction(): asynchronous background process, to reduce read costs.
Periodically select files to merge. cost = O(SIZE of merged files) !!

Example: merging the files from the 2nd and 3rd flushes above gives

cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d) [from 1st flush] · (5, e) (6, f) (7, g) (8, h) (9, i) [merge of 2nd and 3rd]

goals: (i) keep read costs low; (ii) keep compaction costs low

constraint: each merge must merge a contiguous subsequence of files
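A sketch of the two cost models above (function names are mine): a read pays one unit per file examined, newest first; a merge of a contiguous subsequence pays the total size of the files merged, with newer values winning on duplicate keys.

```python
# read cost = number of files examined; compaction cost = size of merged files.

def read(cache, files, key):
    if key in cache:
        return cache[key], 0
    for cost, f in enumerate(reversed(files), start=1):  # most recent first
        if key in f:
            return f[key], cost
    return None, len(files)

def merge(files, i, j):
    """Merge the contiguous subsequence files[i:j] into one file."""
    merged = {}
    for f in files[i:j]:       # older entries first...
        merged.update(f)       # ...so newer values win
    cost = sum(len(f) for f in files[i:j])
    return files[:i] + [merged] + files[j:], cost

files = [{1: "a", 2: "b", 3: "c", 4: "d"},
         {5: "e", 6: "f", 7: "g"},
         {8: "h", 9: "i"}]
files, cost = merge(files, 1, 3)   # merge the 2nd and 3rd files
print(len(files), cost)            # 2 5
```

After the merge, read(…, 9) finds its value in the first (newest) file, while read(…, 1) must look in two.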

Bigtable Merge Compaction (bmc) — formal definition

given: Sequence x1, x2, ..., xn. (xt is the size of the file resulting from flush t.)
Integer k > 0, tuned to the workload; typically 3–40.

choose: Compactions. Ensure the number of files never exceeds k.

objective: Minimize total compaction cost.

If k = ∞, the problem is easy — never merge:

after flush 1: one file; after flush 2: two files; after flush 3: three files; ...
Total compaction cost = 0.

If k = 1, the problem is easy — must merge everything after each flush (otherwise, too many files!):

after flush 2: compaction cost x1 + x2
after flush 3: compaction cost x1 + x2 + x3
...
after flush n: compaction cost x1 + ··· + xn

Total compaction cost = Σ_{i=2}^{n} (x1 + x2 + ··· + xi) ≈ Σ_{i=1}^{n} (n − i + 1)·xi.
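The k = 1 total above can be checked numerically. A small sketch (the helper is mine): the simulation sums the prefix-sum costs, and the closed form is exact once one notes the slide's ≈ only hides that x1 participates in n − 1 merges rather than n.

```python
# k = 1: after every flush t >= 2, merge everything at cost x1 + ... + xt.

def total_cost_k1(x):
    n = len(x)
    return sum(sum(x[:t]) for t in range(2, n + 1))

x = [3, 1, 4, 1, 5]
n = len(x)
# exact closed form: x1 appears in n - 1 merges; xi (i >= 2) in n - i + 1
exact = x[0] * (n - 1) + sum((n - i + 1) * x[i - 1] for i in range(2, n + 1))
print(total_cost_k1(x), exact)   # 35 35
```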

Google's default compaction algorithm:

Merge the minimal suffix so as to maintain (i) #files ≤ k and (ii) each file's size exceeds the total size of the files to its right.

Example: k = 2, on uniform input x = 1, 1, 1, ...:

[picture: the first five steps of the resulting merge schedule]

Total compaction cost = Θ(n²). (For general k, the cost is Θ(n²/3^(k−1)).)
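The default policy is short enough to simulate. This is my reading of the rule above, not Google's source: after each flush, merge the shortest suffix of the file stack that restores both invariants.

```python
# Default policy: after each flush, merge the minimal suffix so that
# (i) #files <= k and (ii) each file exceeds the total size to its right.

def default_compaction(xs, k):
    stack, total_cost = [], 0
    for x in xs:
        stack.append(x)
        for j in range(1, len(stack) + 1):   # suffix length 1 = no merge
            cand = stack[:-j] + [sum(stack[-j:])] if j > 1 else list(stack)
            ok = len(cand) <= k and all(
                cand[i] > sum(cand[i + 1:]) for i in range(len(cand) - 1))
            if ok:
                if j > 1:
                    total_cost += cand[-1]   # cost = size of the merged file
                stack = cand
                break
    return total_cost

# on uniform input with k = 2, doubling n roughly triples the cost:
print([default_compaction([1] * n, 2) for n in (4, 8, 16)])
```

The quadratic growth on uniform input is visible already at small n.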

OPTIMAL solution for k = 2, uniform x = 1, 1, 1, ...:

[picture: the merge schedule]

"big" merges: O(√n) of them, each of size O(n)
"small" merges: O(n) of them, each of size O(√n)

Total compaction cost = O(n^(3/2)). (For general k, the opt cost is Θ(k·n^(1+1/k)).)
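A schedule in this spirit is easy to simulate. This is a sketch under my own modeling assumptions (not the paper's exact schedule): for uniform unit-size flushes with k = 2, do a "big" merge every m ≈ √n flushes and "small" merges in between.

```python
import math

# k = 2, uniform unit flushes: big merge every m ~ sqrt(n) steps,
# small merges in between.  Total cost grows like n^(3/2).

def sqrt_schedule_cost(n):
    m = max(2, round(math.sqrt(n)))
    big = small = cost = 0
    for t in range(1, n + 1):
        if t % m == 0:            # big merge: everything into one file
            cost += big + small + 1
            big, small = big + small + 1, 0
        elif small == 0:          # second slot is free; no merge needed
            small = 1
        else:                     # small merge: fold the new flush in
            cost += small + 1
            small += 1
    return cost

for n in (100, 1000, 10000):
    print(n, sqrt_schedule_cost(n))
```

The printed totals scale roughly like n^(3/2), far below the default policy's quadratic cost.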

Definition: c-competitive online algorithm

A compaction algorithm is c-competitive if, on any input (k, x), its solution costs at most c times the optimal cost.

A compaction algorithm is online if its choice of merge after flush t depends only on k and x1, x2, ..., xt (the files flushed so far).

- Default's cost can be n times the opt cost (for any k).
- So default is no better than n-competitive.
  → It may have high compaction cost even for "easy" inputs.

Theorem 1. There is a k-competitive online algorithm for bmc. [today]

Theorem 2. No deterministic online algorithm is less than k-competitive.

Idea behind 2-competitive online algorithm (for k = 2)...

Q: At each step, do a "big" merge or a small merge?
A: Do a big merge when the cost C of the previous big merge ≈ the total cost of the small merges since then. (So the algorithm's cost during the interval between big merges is 2C.)

Why 2-competitive? Focus on a time interval between two big merges.

case 1 (during this interval, opt does a big merge): Opt's cost for the big merge during the interval is at least C.

case 2 (during this interval, opt does no big merge): Opt's cost for small merges during the interval is at least C.
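The rent-or-buy rule can be sketched directly (variable names and tie-breaking are my assumptions, not the paper's): fold each flush into the newer "small" file until those small merges have cost about C, the cost of the previous big merge; then do the next big merge.

```python
# k = 2 rent-or-buy: "rent" with small merges until their cost reaches C,
# the cost of the previous big merge; then "buy" with a big merge.

def rent_or_buy_k2(xs):
    big = small = 0   # sizes of the (at most) two current files
    C = 0             # cost of the previous "big" merge
    spent = 0         # small-merge cost paid since that big merge
    total = 0
    for x in xs:
        if small == 0:                  # a slot is free: no merge needed
            small = x
        elif spent + small + x >= C:    # small merges caught up with C: buy
            total += big + small + x
            C = big = big + small + x
            small = spent = 0
        else:                           # keep renting: a small merge
            total += small + x
            spent += small + x
            small += x
    return total

print(rent_or_buy_k2([1] * 11))   # 31
```

On uniform input the intervals between big merges grow, and the total cost tracks the O(n^(3/2)) optimum up to a constant factor.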

Idea behind k-competitive online algorithm for general k

Idea: Do a big merge, then recurse with k − 1.

Q: When to do the next big merge?
A: When the cost of the previous big merge ≈ (cost for the recursion)/(k − 1).

Recurse with k − 1 to handle the files flushed since the last big merge.

"Balanced rent-or-buy algorithm (brb)"

Recap of analyses in the worst-case model

Bigtable default is at best n-competitive...

Theorem 1. Brb is a k-competitive online algorithm for bmc. [today]

Theorem 2. No deterministic online algorithm is less than k-competitive.

What about “typical” inputs?

Preliminary benchmarks (one example with k = 5)

[Two plots of cost per step vs. n: one comparing Default, BRB, and Optimal; one comparing Default and BRB at larger n.]

The xt's are i.i.d. from a log-normal distribution.

Conjectures

1. Brb and Opt cost per time step ∼ x̄·k·n^(1/k)/e.
2. Default cost per time step ∼ x̄·n/(2·3^(k−1)).

Lots of work in progress

theoretical:

- average-case analyses: absolute and relative costs on i.i.d. inputs
- randomized online algorithms (o(k)-competitive?)
- optimal compaction schedules ≡ optimal binary search trees

practical:

- realistic testing ... on AsterixDB, then at Google

problem variants:

- allow expiration/deletion of key/value pairs (done)
- allowing k to vary — bmc with read costs ... (open!)

Working paper available on arxiv.org. (Search the web for "bigtable merge compaction".)

Bmc with read costs (geometric interpretation)

given: Staircase step-lengths and step-heights (x1, y1), (x2, y2), ....

do: Partition the region below the staircase into axis-parallel rectangles.

objective: Minimize the sum of the widths and heights of the rectangles.

[Figure: a staircase with steps (x1, y1) through (x7, y7).]

open problem: is there an O(1)-competitive online algorithm?

Thank you

A geometric interpretation of bmc

given: Uneven staircase with step-lengths x1, x2, ..., xn. Integer k > 0.

do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.

objective: Minimize the sum of the widths of the rectangles.

input: [figure] an uneven staircase with 10 steps; k = 2.

solution: [figure showing a valid partition]

not a solution: [figure] this partition is cheaper... but not valid for k = 2.
