Online Bigtable merge compaction - UCRneal/Slides/bigtable_merge_compaction.pdf · 2015-09-22 ·...


Online Bigtable merge compaction

work in progress

Claire Mathieu (CNRS Paris) · Carl Staelin (Google Haifa) [instigator] · Neal E. Young¹ (UC Riverside) [me] · Arman Yousefia (UCLA) [my student]

you are here, this is now: Northeastern University, September 17, 2015

¹ funded by faculty re$earch award

BIGTABLE — data storage at Google

Google Maps, Search/Crawl, Gmail, ... use BIGTABLE to store data:

- 24,500 Bigtable servers
- 1.2 million requests per second
- 16 GB/s of outgoing RPC traffic
- over a petabyte of data just for Google Crawl and Analytics
- (these figures are from 2006)

Similar to other "NoSQL" databases: Accumulo, AsterixDB, Cassandra, HBase, Hypertable, Spanner, ...

Used by Adobe, Ebay, Facebook, GitHub, Meetup, Netflix, Twitter, ...

"Log-structured merge tree" architecture — for high-volume, highly reliable, distributed, real-time data storage.

BIGTABLE — implements the dictionary data type

operations supported by a Bigtable instance:

- write(key, value)
- read(key) — return the most recent value written for key
- ... there's more, but not today ...

BIGTABLE — writes and flushes

write(key, value):

1. Store the key/value pair in a cache (e.g. a hash table in RAM).

The environment periodically forces a flush of the cache to a new immutable disk file.

Example

write(1, a); write(2, b); write(3, c); write(4, d); flush();
write(5, e); write(6, f); write(7, g); flush();
write(8, h); write(9, i); flush();

cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d) [from 1st flush] · (5, e) (6, f) (7, g) [from 2nd flush] · (8, h) (9, i) [from 3rd flush]

The environment forces flushes at arbitrary times.
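The write/flush mechanics above can be sketched in a few lines. This is a toy model (class and method names are mine, not Google's code): writes land in an in-memory cache, and a flush appends the cache's contents to the file sequence as a new immutable file.

```python
# Toy model of Bigtable's write path: a RAM cache plus a sequence of
# immutable "files" (here, plain dicts), one per flush.

class Bigtable:
    def __init__(self):
        self.cache = {}   # in-RAM key/value cache
        self.files = []   # immutable files, oldest first

    def write(self, key, value):
        self.cache[key] = value          # step 1: store the pair in the cache

    def flush(self):
        # the environment forces this at arbitrary times
        if self.cache:
            self.files.append(dict(self.cache))  # new immutable disk file
            self.cache = {}

t = Bigtable()
for key, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    t.write(key, value)
t.flush()
for key, value in [(5, "e"), (6, "f"), (7, "g")]:
    t.write(key, value)
t.flush()
print([sorted(f) for f in t.files])   # [[1, 2, 3, 4], [5, 6, 7]]
```

After the two flushes the file sequence holds one file per flush, exactly as in the example above.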

BIGTABLE — reads and compactions

read(key):

1. Check the cache for key.
2. If not found, check the files (most recent first). cost = O(#files)

compaction(): asynchronous background process, to reduce read costs.
Periodically select files to merge. cost = O(SIZE of merged files) !!

Example: merging the files from the 2nd and 3rd flushes above gives

cache: –empty–
file sequence: (1, a) (2, b) (3, c) (4, d) [from 1st flush] · (5, e) (6, f) (7, g) (8, h) (9, i) [merge of 2nd and 3rd]

goals: (i) keep read costs low; (ii) keep compaction costs low

constraint: each merge must merge a contiguous subsequence of files
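A sketch of the two cost models above (function names are mine): a read pays one unit per file examined, newest first; a merge of a contiguous subsequence pays the total size of the files merged, with newer values winning on duplicate keys.

```python
# read cost = number of files examined; compaction cost = size of merged files.

def read(cache, files, key):
    if key in cache:
        return cache[key], 0
    for cost, f in enumerate(reversed(files), start=1):  # most recent first
        if key in f:
            return f[key], cost
    return None, len(files)

def merge(files, i, j):
    """Merge the contiguous subsequence files[i:j] into one file."""
    merged = {}
    for f in files[i:j]:       # older entries first...
        merged.update(f)       # ...so newer values win
    cost = sum(len(f) for f in files[i:j])
    return files[:i] + [merged] + files[j:], cost

files = [{1: "a", 2: "b", 3: "c", 4: "d"},
         {5: "e", 6: "f", 7: "g"},
         {8: "h", 9: "i"}]
files, cost = merge(files, 1, 3)   # merge the 2nd and 3rd files
print(len(files), cost)            # 2 5
```

After the merge, read(…, 9) finds its value in the first (newest) file, while read(…, 1) must look in two.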

Bigtable Merge Compaction (bmc) — formal definition

given: Sequence x1, x2, ..., xn. (xt is the size of the file resulting from flush t.)
Integer k > 0, tuned to the workload; typically 3–40.

choose: Compactions. Ensure the number of files never exceeds k.

objective: Minimize total compaction cost.

If k = ∞, the problem is easy — never merge:

after flush 1: one file; after flush 2: two files; after flush 3: three files; ...
Total compaction cost = 0.

If k = 1, the problem is easy — must merge everything after each flush (otherwise, too many files!):

after flush 2: compaction cost x1 + x2
after flush 3: compaction cost x1 + x2 + x3
...
after flush n: compaction cost x1 + ··· + xn

Total compaction cost = Σ_{i=2}^{n} (x1 + x2 + ··· + xi) ≈ Σ_{i=1}^{n} (n − i + 1)·xi.
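The k = 1 total above can be checked numerically. A small sketch (the helper is mine): the simulation sums the prefix-sum costs, and the closed form is exact once one notes the slide's ≈ only hides that x1 participates in n − 1 merges rather than n.

```python
# k = 1: after every flush t >= 2, merge everything at cost x1 + ... + xt.

def total_cost_k1(x):
    n = len(x)
    return sum(sum(x[:t]) for t in range(2, n + 1))

x = [3, 1, 4, 1, 5]
n = len(x)
# exact closed form: x1 appears in n - 1 merges; xi (i >= 2) in n - i + 1
exact = x[0] * (n - 1) + sum((n - i + 1) * x[i - 1] for i in range(2, n + 1))
print(total_cost_k1(x), exact)   # 35 35
```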

Google's default compaction algorithm:

Merge the minimal suffix so as to maintain (i) #files ≤ k and (ii) each file's size exceeds the total size of the files to its right.

Example: k = 2, on uniform input x = 1, 1, 1, ...:

[picture: the first five steps of the resulting merge schedule]

Total compaction cost = Θ(n²). (For general k, the cost is Θ(n²/3^(k−1)).)
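The default policy is short enough to simulate. This is my reading of the rule above, not Google's source: after each flush, merge the shortest suffix of the file stack that restores both invariants.

```python
# Default policy: after each flush, merge the minimal suffix so that
# (i) #files <= k and (ii) each file exceeds the total size to its right.

def default_compaction(xs, k):
    stack, total_cost = [], 0
    for x in xs:
        stack.append(x)
        for j in range(1, len(stack) + 1):   # suffix length 1 = no merge
            cand = stack[:-j] + [sum(stack[-j:])] if j > 1 else list(stack)
            ok = len(cand) <= k and all(
                cand[i] > sum(cand[i + 1:]) for i in range(len(cand) - 1))
            if ok:
                if j > 1:
                    total_cost += cand[-1]   # cost = size of the merged file
                stack = cand
                break
    return total_cost

# on uniform input with k = 2, doubling n roughly triples the cost:
print([default_compaction([1] * n, 2) for n in (4, 8, 16)])
```

The quadratic growth on uniform input is visible already at small n.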

OPTIMAL solution for k = 2, uniform x = 1, 1, 1, ...:

[picture: the merge schedule]

"big" merges: O(√n) of them, each of size O(n)
"small" merges: O(n) of them, each of size O(√n)

Total compaction cost = O(n^(3/2)). (For general k, the opt cost is Θ(k·n^(1+1/k)).)
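A schedule in this spirit is easy to simulate. This is a sketch under my own modeling assumptions (not the paper's exact schedule): for uniform unit-size flushes with k = 2, do a "big" merge every m ≈ √n flushes and "small" merges in between.

```python
import math

# k = 2, uniform unit flushes: big merge every m ~ sqrt(n) steps,
# small merges in between.  Total cost grows like n^(3/2).

def sqrt_schedule_cost(n):
    m = max(2, round(math.sqrt(n)))
    big = small = cost = 0
    for t in range(1, n + 1):
        if t % m == 0:            # big merge: everything into one file
            cost += big + small + 1
            big, small = big + small + 1, 0
        elif small == 0:          # second slot is free; no merge needed
            small = 1
        else:                     # small merge: fold the new flush in
            cost += small + 1
            small += 1
    return cost

for n in (100, 1000, 10000):
    print(n, sqrt_schedule_cost(n))
```

The printed totals scale roughly like n^(3/2), far below the default policy's quadratic cost.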

Definition: c-competitive online algorithm

A compaction algorithm is c-competitive if, on any input (k, x), its solution costs at most c times the optimal cost.

A compaction algorithm is online if its choice of merge after flush t depends only on k and x1, x2, ..., xt (the files flushed so far).

- Default's cost can be n times the opt cost (for any k).
- So default is no better than n-competitive.
  → It may have high compaction cost even for "easy" inputs.

Theorem 1. There is a k-competitive online algorithm for bmc. [today]

Theorem 2. No deterministic online algorithm is less than k-competitive.

Idea behind 2-competitive online algorithm (for k = 2)...

Q: At each step, do a "big" merge or a small merge?
A: Do a big merge when the cost C of the previous big merge ≈ the total cost of the small merges since then. (So the algorithm's cost during the interval between big merges is 2C.)

Why 2-competitive? Focus on a time interval between two big merges.

case 1 (during this interval, opt does a big merge): Opt's cost for the big merge during the interval is at least C.

case 2 (during this interval, opt does no big merge): Opt's cost for small merges during the interval is at least C.
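The rent-or-buy rule can be sketched directly (variable names and tie-breaking are my assumptions, not the paper's): fold each flush into the newer "small" file until those small merges have cost about C, the cost of the previous big merge; then do the next big merge.

```python
# k = 2 rent-or-buy: "rent" with small merges until their cost reaches C,
# the cost of the previous big merge; then "buy" with a big merge.

def rent_or_buy_k2(xs):
    big = small = 0   # sizes of the (at most) two current files
    C = 0             # cost of the previous "big" merge
    spent = 0         # small-merge cost paid since that big merge
    total = 0
    for x in xs:
        if small == 0:                  # a slot is free: no merge needed
            small = x
        elif spent + small + x >= C:    # small merges caught up with C: buy
            total += big + small + x
            C = big = big + small + x
            small = spent = 0
        else:                           # keep renting: a small merge
            total += small + x
            spent += small + x
            small += x
    return total

print(rent_or_buy_k2([1] * 11))   # 31
```

On uniform input the intervals between big merges grow, and the total cost tracks the O(n^(3/2)) optimum up to a constant factor.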

Idea behind k-competitive online algorithm for general k

Idea: Do a big merge, then recurse with k − 1.

Q: When to do the next big merge?
A: When the cost of the previous big merge ≈ (cost for the recursion)/(k − 1).

Recurse with k − 1 to handle the files flushed since the last big merge.

"Balanced rent-or-buy algorithm (brb)"

Recap of analyses in the worst-case model

Bigtable default is at best n-competitive...

Theorem 1. Brb is a k-competitive online algorithm for bmc. [today]

Theorem 2. No deterministic online algorithm is less than k-competitive.

What about “typical” inputs?

Preliminary benchmarks (one example with k = 5)

[Two plots of cost per step vs. n: one comparing Default, BRB, and Optimal; one comparing Default and BRB at larger n.]

The xt's are i.i.d. from a log-normal distribution.

Conjectures

1. Brb and Opt cost per time step ∼ x̄·k·n^(1/k)/e.
2. Default cost per time step ∼ x̄·n/(2·3^(k−1)).

Lots of work in progress

theoretical:

- average-case analyses: absolute and relative costs on i.i.d. inputs
- randomized online algorithms (o(k)-competitive?)
- optimal compaction schedules ≡ optimal binary search trees

practical:

- realistic testing ... on AsterixDB, then at Google

problem variants:

- allow expiration/deletion of key/value pairs (done)
- allowing k to vary — bmc with read costs ... (open!)

Working paper available on arxiv.org. (Search the web for "bigtable merge compaction".)

Bmc with read costs (geometric interpretation)

given: Staircase step-lengths and step-heights (x1, y1), (x2, y2), ....

do: Partition the region below the staircase into axis-parallel rectangles.

objective: Minimize the sum of the widths and heights of the rectangles.

[Figure: a staircase with steps (x1, y1) through (x7, y7).]

open problem: is there an O(1)-competitive online algorithm?

Thank you

A geometric interpretation of bmc

given: Uneven staircase with step-lengths x1, x2, ..., xn. Integer k > 0.

do: Partition the region below the staircase into axis-parallel rectangles, so that no row has more than k rectangles.

objective: Minimize the sum of the widths of the rectangles.

input: [figure] an uneven staircase with 10 steps; k = 2.

solution: [figure showing a valid partition]

not a solution: [figure] this partition is cheaper... but not valid for k = 2.
