Implementing a key/value store

50 AVENUE DES CHAMPS-ÉLYSÉES 75008 PARIS > FRANCE > WWW.OCTO.COM

Implementing a Key / Value Store

BluckDB

BJC - BOF - 15/12/16

github.com/BenJoyenConseil/bluckdb

@BenJoyeConseil

Why ...

◉ Understand the mechanisms of modern databases

◉ Explore algorithms and data structures

◉ Do some “low-level” work

◉ Learn Go

◉ Not be caught off guard at the HBase training

Reinventing the wheel?

>01 Situating the KV store

Often presented like this

Storage engine!

◉ LevelDB (LSM-Tree)

◉ RocksDB (LSM-Tree)

◉ WiredTiger (B-Tree, LSM-Tree)

◉ ForestDB (HB+Trie)

◉ InnoDB (B+Tree)

◉ BoltDB (B+Tree)

◉ Kyoto Cabinet (Hashtable)

◉ BluckDB (Hashtable)

◉ ...

Examples of KV stores used as storage engines

database

Server: MongoDB / MySQL / Riak / Lucene / ...

Storage engine: LevelDB ... InnoDB

File system

CockroachDB uses RocksDB as its storage engine

Quotable quote

All models are wrong but some are useful

— George Box

Models

>02 Deep Dive

The design

1. Data storage abstraction

2. Data structure (index)

3. Memory management (page / block management, free space)

4. String / byte slice

5. Iterator / Cursor

6. Lock management

7. Comparator

Top 7 components of a KV store (ref: topito article)

The interface

type KVStore interface {
    Get(k string) string
    Put(k, v string)
    Delete(k string)
}
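As an illustration of the interface, here is a minimal in-memory sketch using separate chaining. The names ChainedMap, entry and NewChainedMap are hypothetical, and FNV-1a is an assumed hash function; none of this is taken from BluckDB.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// KVStore mirrors the interface shown on the slide.
type KVStore interface {
	Get(k string) string
	Put(k, v string)
	Delete(k string)
}

// entry is one node of a bucket's linked list (the "chain").
type entry struct {
	key, val string
	next     *entry
}

// ChainedMap is a hypothetical fixed-size hash table with separate chaining.
type ChainedMap struct {
	buckets []*entry
}

func NewChainedMap(n int) *ChainedMap {
	return &ChainedMap{buckets: make([]*entry, n)}
}

// index hashes the key with FNV-1a (an assumption) and maps it to a bucket.
func (m *ChainedMap) index(k string) int {
	h := fnv.New32a()
	h.Write([]byte(k))
	return int(h.Sum32() % uint32(len(m.buckets)))
}

// Get walks the chain; the interface has no "not found" signal, so it returns "".
func (m *ChainedMap) Get(k string) string {
	for e := m.buckets[m.index(k)]; e != nil; e = e.next {
		if e.key == k {
			return e.val
		}
	}
	return ""
}

// Put updates an existing entry or prepends a new one to the chain.
func (m *ChainedMap) Put(k, v string) {
	i := m.index(k)
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == k {
			e.val = v
			return
		}
	}
	m.buckets[i] = &entry{key: k, val: v, next: m.buckets[i]}
}

// Delete unlinks the matching entry, if any.
func (m *ChainedMap) Delete(k string) {
	i := m.index(k)
	for p := &m.buckets[i]; *p != nil; p = &(*p).next {
		if (*p).key == k {
			*p = (*p).next
			return
		}
	}
}

func main() {
	var s KVStore = NewChainedMap(8)
	s.Put("hello", "world")
	fmt.Println(s.Get("hello")) // world
	s.Delete("hello")
	fmt.Println(s.Get("hello") == "") // true
}
```

Because Get returns a plain string, a missing key and an empty value look the same; a real store would probably return an error or a boolean as well.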

First implementation

◉ Simple hashmap (separate chaining)

◉ Persistent store :

> Put -> append to file

> Get -> foreach line, split(‘:’)

MVP
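The naive persistent store (Put appends, Get scans every line and splits on ':') could look like the sketch below. NaiveDiskStore and newTempStore are hypothetical names, and the "key:value" line format is an assumption based on the split(':') hint.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// NaiveDiskStore (hypothetical name) appends "key:value" lines to one file.
type NaiveDiskStore struct {
	path string
}

// Put appends one line; older values for the same key stay in the file.
func (s NaiveDiskStore) Put(k, v string) error {
	f, err := os.OpenFile(s.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s:%s\n", k, v)
	return err
}

// Get scans the whole file and keeps the last match, so the latest Put wins.
func (s NaiveDiskStore) Get(k string) (string, error) {
	f, err := os.Open(s.path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	found := ""
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) == 2 && parts[0] == k {
			found = parts[1]
		}
	}
	return found, sc.Err()
}

// newTempStore creates a store backed by a file in a fresh temporary directory.
func newTempStore() (NaiveDiskStore, error) {
	dir, err := os.MkdirTemp("", "naivekv")
	if err != nil {
		return NaiveDiskStore{}, err
	}
	return NaiveDiskStore{path: filepath.Join(dir, "data.txt")}, nil
}

func main() {
	s, err := newTempStore()
	if err != nil {
		panic(err)
	}
	s.Put("lang", "go")
	s.Put("lang", "Go")
	v, _ := s.Get("lang")
	fmt.Println(v) // Go
}
```

Put is a single append while Get reads the entire file, which is consistent with the gap between write and read latency in a design like this. Keys containing ':' or newlines would break this format.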

bucket files (fixed number)

hash(k) % numBucket (static hashing)

File bucket 1

File bucket 2

File bucket 3 append

hash(k) % 3
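Static hashing over a fixed number of bucket files can be sketched as follows. The FNV-1a hash and the "bucket-N.dat" file naming are assumptions made for the example only.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor returns the index of the bucket that owns key k: hash(k) % numBuckets.
func bucketFor(k string, numBuckets int) int {
	h := fnv.New32a()
	h.Write([]byte(k))
	return int(h.Sum32() % uint32(numBuckets))
}

// bucketFile builds a hypothetical file name for the bucket that owns k.
func bucketFile(k string, numBuckets int) string {
	return fmt.Sprintf("bucket-%d.dat", bucketFor(k, numBuckets))
}

func main() {
	for _, k := range []string{"alice", "bob", "carol"} {
		fmt.Println(k, "->", bucketFile(k, 3))
	}
}
```

The number of buckets is baked into every placement decision: changing it generally sends existing keys to different files, which is the limitation that dynamic hashing schemes address.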

First implementation

◉ Benchmarks: persistent store

BenchmarkPutNaiveDiskKVStore-4 200000 6250 ns/op -> 6.25 µs

BenchmarkGetNaiveDiskKVStore-4 30 44017416 ns/op -> 44 ms

◉ Benchmarks: in-memory hashmap

BenchmarkPutHashMap-4 1000000 1385 ns/op -> 1.4 µs

BenchmarkGetHashMap-4 2000000 711 ns/op -> 0.7 µs
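Lines in this format come from Go's testing package. The sketch below shows one way to produce comparable numbers outside of "go test", using Go's built-in map as a stand-in for the store; benchmarkPut and runPutBenchmark are hypothetical names and the measured code is not BluckDB's.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// keys is a small, precomputed key set so the benchmark measures the store, not strconv.
var keys = func() []string {
	ks := make([]string, 1024)
	for i := range ks {
		ks[i] = strconv.Itoa(i)
	}
	return ks
}()

// benchmarkPut measures writes into Go's built-in map, standing in for a KV store.
func benchmarkPut(b *testing.B) {
	m := make(map[string]string)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		m[keys[i%len(keys)]] = "value"
	}
}

// runPutBenchmark runs the benchmark without the "go test" command.
func runPutBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(benchmarkPut)
}

func main() {
	r := runPutBenchmark()
	// Prints the iteration count and ns/op, followed by B/op and allocs/op.
	fmt.Println(r.String(), r.MemString())
}
```

In a regular project the same function would live in a _test.go file as BenchmarkPut and be run with "go test -bench".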

MVP

Quotable quote

There’s clearly a trade-off between reads and writes, and it’s the mixing of the two that causes all of the interesting challenges

— Adrian Colyer

https://twitter.com/adriancolyer

>03 OK, shall we do a real design now?

Hashtables are arguably the single most important data structure known to mankind.

— Steve Yegge

The design

1. Data storage abstraction -> SSD Page 4k + Record

2. Data structure (index) -> Hashtable (extendible hash)

3. Memory management (page / block management, free space) -> mmap + custom

4. String / byte slice -> native Go conversion from string to []byte

5. Iterator / Cursor -> Iterator pattern

6. Lock management -> handled outside the store

7. Comparator -> multi-level comparator (key length > hash > byte)

Top 7 components of a KV store (ref: topito article)
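Item 7 can be read as "run the cheapest check first". The sketch below is one possible interpretation, assuming each key carries a precomputed hash; hashedKey, newHashedKey and equalKeys are hypothetical names and FNV-1a is an assumed hash function.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/fnv"
)

// hashedKey pairs a key with its precomputed hash (an assumption of this sketch).
type hashedKey struct {
	raw  []byte
	hash uint32
}

func newHashedKey(s string) hashedKey {
	h := fnv.New32a()
	h.Write([]byte(s))
	return hashedKey{raw: []byte(s), hash: h.Sum32()}
}

// equalKeys compares in three levels: key length, then hash, then raw bytes.
// Only keys that pass the two cheap checks pay for the byte-by-byte comparison.
func equalKeys(a, b hashedKey) bool {
	if len(a.raw) != len(b.raw) {
		return false
	}
	if a.hash != b.hash {
		return false
	}
	return bytes.Equal(a.raw, b.raw)
}

func main() {
	k := newHashedKey("user:42")
	fmt.Println(equalKeys(k, newHashedKey("user:42"))) // true
	fmt.Println(equalKeys(k, newHashedKey("user:43"))) // false
	fmt.Println(equalKeys(k, newHashedKey("user")))    // false
}
```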

Record layout

type Record interface {
    key() []byte
    val() []byte
    valLen() uint16 // min 0
    keyLen() uint16 // max 65535
}

type ByteRecord []byte

r := ByteRecord(byteArray[204 : 249])

overhead: 4 bytes

Record layout

func (r ByteRecord) Write(key, val string) {
    ...
    copy(r[:], key)
    copy(r[lenKey:], val)
    binary.LittleEndian.PutUint16(r[total:], lenVal)
    binary.LittleEndian.PutUint16(r[total+RECORD_HEADER_SIZE:], lenKey)
}

serialization

... k e y v a l u e 0x5 0x0 0x3 0x0 ...
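The serialized layout above (key, value, then the value length and key length as little-endian uint16) can be reproduced with the round-trip sketch below. The constants lenFieldSize and headerSize are names chosen for this example, and the 2-byte spacing between the two length fields is taken from the byte sequence shown, not from the BluckDB source.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	lenFieldSize = 2                // each length is a uint16 (assumed name)
	headerSize   = 2 * lenFieldSize // the 4-byte overhead mentioned on the slide
)

type ByteRecord []byte

// Write lays out key, value, then lenVal and lenKey as little-endian uint16.
func (r ByteRecord) Write(key, val string) {
	lenKey, lenVal := len(key), len(val)
	total := lenKey + lenVal
	copy(r, key)
	copy(r[lenKey:], val)
	binary.LittleEndian.PutUint16(r[total:], uint16(lenVal))
	binary.LittleEndian.PutUint16(r[total+lenFieldSize:], uint16(lenKey))
}

// The lengths sit in the last 4 bytes, so a record is decoded from its end.
func (r ByteRecord) valLen() uint16 { return binary.LittleEndian.Uint16(r[len(r)-headerSize:]) }
func (r ByteRecord) keyLen() uint16 { return binary.LittleEndian.Uint16(r[len(r)-lenFieldSize:]) }
func (r ByteRecord) key() []byte    { return r[:r.keyLen()] }

func (r ByteRecord) val() []byte {
	k := int(r.keyLen())
	return r[k : k+int(r.valLen())]
}

func main() {
	r := make(ByteRecord, len("key")+len("value")+headerSize)
	r.Write("key", "value")
	fmt.Printf("% x\n", []byte(r))        // 6b 65 79 76 61 6c 75 65 05 00 03 00
	fmt.Printf("%s=%s\n", r.key(), r.val()) // key=value
}
```

Putting the lengths after the payload means a reader that knows where a record ends can walk records backwards, newest first.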

Page layout

type Page []byte

const (
    PAGE_SIZE               = 4096
    PAGE_USE_OFFSET         = 4094
    PAGE_LOCAL_DEPTH_OFFSET = 4092
)

func (p Page) Use() int {
    return int(binary.LittleEndian.Uint16(p[PAGE_USE_OFFSET:]))
}

...

| Record1 | Record2 | Record3 | Record1 v2 | ... | LD | USE |
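A possible reading of this layout: records are appended from the start of the page, while the last 4 bytes hold the local depth (LD) and the number of bytes in use (USE). The sketch below reuses the constants from the slide; setUse, setLd, Ld and appendBytes are hypothetical helpers, not BluckDB's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	PAGE_SIZE               = 4096
	PAGE_USE_OFFSET         = 4094 // last 2 bytes: number of bytes used
	PAGE_LOCAL_DEPTH_OFFSET = 4092 // the 2 bytes before: local depth (LD)
)

type Page []byte

func (p Page) Use() int { return int(binary.LittleEndian.Uint16(p[PAGE_USE_OFFSET:])) }
func (p Page) Ld() int  { return int(binary.LittleEndian.Uint16(p[PAGE_LOCAL_DEPTH_OFFSET:])) }

func (p Page) setUse(n int) { binary.LittleEndian.PutUint16(p[PAGE_USE_OFFSET:], uint16(n)) }
func (p Page) setLd(n int)  { binary.LittleEndian.PutUint16(p[PAGE_LOCAL_DEPTH_OFFSET:], uint16(n)) }

// appendBytes (hypothetical) copies an encoded record at the current USE
// offset and advances USE, refusing to overwrite the LD/USE trailer.
func (p Page) appendBytes(rec []byte) error {
	use := p.Use()
	if use+len(rec) > PAGE_LOCAL_DEPTH_OFFSET {
		return errors.New("page full")
	}
	copy(p[use:], rec)
	p.setUse(use + len(rec))
	return nil
}

func main() {
	p := make(Page, PAGE_SIZE)
	p.setLd(1)
	_ = p.appendBytes([]byte("record1"))
	_ = p.appendBytes([]byte("record1 v2")) // an update is appended, not rewritten in place
	fmt.Println(p.Use(), p.Ld())            // 17 1
}
```

The "Record1 v2" entry in the layout suggests that an update appends a new version after the older records; the sketch follows that reading.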

Extendible Hashing algorithm

dynamic hashing

◉ A hash function whose results span a large range, typically an int32

◉ Only some bits of the hash result are used to compute the index into the address table (the directory code masks the GD low-order bits).

◉ Several entries of the address table can point to the same page


GD=2

Page addresses:
..00 -> p0
..01 -> p0
..10 -> p1
..11 -> p2

page 1 (ld=1): ...0100 key, value | ...1101 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value


After the split

GD=2

Page addresses:
..00 -> p0
..01 -> p4
..10 -> p1
..11 -> p2

page 1 (ld=2): ...0100 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value
page 4 (ld=2): ...1101 key, value


After the expand

GD=3

Page addresses:
.000 -> p0
.001 -> p4
.010 -> p1
.011 -> p2
.100 -> p0
.101 -> p4
.110 -> p1
.111 -> p2

page 1 (ld=2): ...0100 key, value
page 2 (ld=2): ...0110 key, value | ...1110 key, value
page 3 (ld=2): ...0111 key, value | ...1011 key, value
page 4 (ld=2): ...1101 key, value
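The split and expand steps pictured above can be sketched in memory as follows. This is an illustration under stated assumptions, not BluckDB's implementation: pages are Go maps with a tiny capacity (pageCap), keys are given directly as hash values, and the directory is indexed with the GD low-order bits, so the grouping of records into pages can differ from the figures.

```go
package main

import "fmt"

const pageCap = 2 // tiny capacity so that splits happen quickly (assumption)

type page struct {
	ld   uint              // local depth
	data map[uint32]string // hash -> value; keys are pre-hashed for brevity
}

type directory struct {
	gd    uint    // global depth
	table []*page // 1<<gd entries; several entries may share one page
}

func newPage(ld uint) *page { return &page{ld: ld, data: map[uint32]string{}} }

func newDirectory() *directory { return &directory{table: []*page{newPage(0)}} }

// index keeps the gd low-order bits of the hash.
func (d *directory) index(h uint32) uint32 { return h & (uint32(1)<<d.gd - 1) }

func (d *directory) get(h uint32) (string, bool) {
	v, ok := d.table[d.index(h)].data[h]
	return v, ok
}

// put inserts h, splitting the target page (and doubling the table when the
// page's local depth already equals the global depth) until there is room.
func (d *directory) put(h uint32, v string) {
	for {
		p := d.table[d.index(h)]
		if _, exists := p.data[h]; exists || len(p.data) < pageCap {
			p.data[h] = v
			return
		}
		if p.ld == d.gd {
			d.expand()
		}
		d.split(p)
	}
}

// expand doubles the table; the new half points to the same pages as the old half.
func (d *directory) expand() {
	d.table = append(d.table, d.table...)
	d.gd++
}

// split moves the records whose bit number p.ld is set to a new page and
// redirects the matching directory entries to it.
func (d *directory) split(p *page) {
	bit := uint32(1) << p.ld
	p.ld++
	np := newPage(p.ld)
	for h, v := range p.data {
		if h&bit != 0 {
			np.data[h] = v
			delete(p.data, h)
		}
	}
	for i := range d.table {
		if d.table[i] == p && uint32(i)&bit != 0 {
			d.table[i] = np
		}
	}
}

func main() {
	d := newDirectory()
	for _, h := range []uint32{0b0100, 0b1101, 0b0110, 0b1110, 0b0111, 0b1011} {
		d.put(h, "value")
	}
	fmt.Printf("GD=%d entries=%d\n", d.gd, len(d.table)) // GD=2 entries=4
}
```

Only the overflowing page is rewritten during a split; the directory doubles only when that page is already as deep as the directory itself.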

Directory layout

func (dir *Directory) extendibleHash(k util.Hashable) int {
    return k.Hash() & ((1 << dir.Gd) - 1)
}

func (dir *Directory) getPage(k string) Page {
    hash := dir.extendibleHash(util.Key(k))
    id := dir.Table[hash]
    offset := id * PAGE_SIZE
    return Page(dir.data[offset : offset+PAGE_SIZE])
}

dir := &Directory{
    Table:      []int{0, 1, 3, 2, 0, 1, 3, 2},
    Gd:         3, // 8 table entries = 1 << Gd
    LastPageId: 3,
    data:       []byte{...},

}

Design

Bluck server: http.Listen
Bluckstore: open(), close()
Directory / Mmap: persistMeta()
KVStore: Get(k), Put(k,v), Delete(k)
Page: GC(), ld()
Record: Write()
Iterator
File
Lock Management: RWMutex
Labels: Mmap, Core, Data Storage, Memory Management

>04 Shall we talk about performance?

Memory maps are the best thing known to mankind after hash tables.

— Emmanuel Goossaert

Memory mapping

◉ Mmap file -> []byte

> Zero copy: no switching from user space to kernel space

> No buffer to manage for flushing

> No block cache

◉ Pre-allocating pages to speed up mmap
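A minimal sketch of mapping a pre-allocated file to a []byte, assuming a Unix-like system where syscall.Mmap is available. mapFile and roundTrip are hypothetical helpers, and growing the file with Truncate is just one simple way to pre-allocate pages.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const pageSize = 4096

// mapFile grows the file at path to numPages pages and maps it read/write.
func mapFile(path string, numPages int) ([]byte, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping stays valid after the descriptor is closed
	size := numPages * pageSize
	if err := f.Truncate(int64(size)); err != nil {
		return nil, err
	}
	return syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
}

// roundTrip writes through the mapping, unmaps it, then reads the file back
// with a regular read to show that the bytes reached the file.
func roundTrip() (string, error) {
	f, err := os.CreateTemp("", "mmap-sketch")
	if err != nil {
		return "", err
	}
	path := f.Name()
	f.Close()
	defer os.Remove(path)

	data, err := mapFile(path, 4)
	if err != nil {
		return "", err
	}
	copy(data[pageSize:], "hello") // plain memory write at the start of page 1
	if err := syscall.Munmap(data); err != nil {
		return "", err
	}

	raw, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(raw[pageSize : pageSize+5]), nil
}

func main() {
	s, err := roundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // hello
}
```

Writes are ordinary slice assignments and the operating system decides when dirty pages reach the disk, which is why there is no application-level buffer or block cache to manage.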

Optimizations

Timeline of changes

Benchmarks (Go), ns/op

Step             Get     Put
Mmap             2344    5452
Update           2881   10532
Iterator         8796   11102
ByteRecord       1819    9982
GOB serde        1874    3206
Reverse          1406    1529
Flush Meta       1398    2786
Pre-allocation   1408    1359

The features that BLOW everything up!!!

◉ Update (in place ?)

◉ Metadata (consistency)

◉ Delete (shift? scan ?)

◉ Concurrency & Isolation

◉ Big record

$ go tool pprof

Before:

Benchmark => iterations: 200,000 8796 ns/op 3376 B/op 106 allocs/op

(pprof) top10
1590ms of 1890ms total (84.13%)
Showing top 10 nodes out of 77 (cum >= 50ms)
      flat  flat%   sum%        cum   cum%
     500ms 26.46% 26.46%      920ms 48.68%  runtime.mallocgc
     260ms 13.76% 40.21%      260ms 13.76%  runtime.heapBitsSetType
     140ms  7.41% 47.62%     1630ms 86.24%  github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get
     140ms  7.41% 55.03%     1430ms 75.66%  runtime.convT2I
     120ms  6.35% 61.38%      120ms  6.35%  runtime.memclr
     120ms  6.35% 67.72%      120ms  6.35%  runtime.memmove

CPU profile graph: github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get

$ go tool pprof

After:

Benchmark => iterations: 2,000,000 756 ns/op 16 B/op 1 allocs/op

(pprof) top10
2.33s of 2.36s total (98.73%)
Dropped 9 nodes (cum <= 0.01s)
Showing top 10 nodes out of 12 (cum >= 2.36s)
      flat  flat%   sum%        cum   cum%
     2.21s 93.64% 93.64%      2.33s 98.73%  github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get
     0.06s  2.54% 96.19%      0.11s  4.66%  runtime.mallocgc
     0.03s  1.27% 97.46%      0.03s  1.27%  runtime.heapBitsSetType
     0.02s  0.85% 98.31%      0.02s  0.85%  runtime.scanobject
     0.01s  0.42% 98.73%      0.12s  5.08%  runtime.newobject

CPU profile graph: github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get

Recap of the trade-offs

◉ Mmap with page pre-allocation: deleting records does not free disk space

◉ Extendible hashing & hashtable: fast access, but persisting the metadata separately is very costly and risky (from a consistency standpoint)

◉ Delete: marking KVs as “deleted” strongly increases write amplification, whereas shifting moves a lot of data

◉ Update: “in place” requires handling fragmentation, whereas append-only requires garbage collection

◉ Concurrency: making a hashtable thread-safe makes the code much more complex (Mutex)
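One way to keep lock management outside the store, as the design list suggests, is a wrapper that guards any KVStore with a sync.RWMutex. LockedStore, mapStore and fill are hypothetical names for this sketch; the wrapped store here is a plain Go map, not BluckDB.

```go
package main

import (
	"fmt"
	"sync"
)

type KVStore interface {
	Get(k string) string
	Put(k, v string)
	Delete(k string)
}

// mapStore is a stand-in store that is not safe for concurrent use.
type mapStore map[string]string

func (m mapStore) Get(k string) string { return m[k] }
func (m mapStore) Put(k, v string)     { m[k] = v }
func (m mapStore) Delete(k string)     { delete(m, k) }

// LockedStore keeps the locking outside of the wrapped store.
type LockedStore struct {
	mu    sync.RWMutex
	inner KVStore
}

// Get takes the read lock, so several readers can proceed in parallel.
func (s *LockedStore) Get(k string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.inner.Get(k)
}

// Put and Delete take the write lock and exclude all other callers.
func (s *LockedStore) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inner.Put(k, v)
}

func (s *LockedStore) Delete(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inner.Delete(k)
}

// fill writes n keys ("k0", "k1", ...) from n concurrent goroutines.
func fill(s *LockedStore, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Put(fmt.Sprint("k", i), "v")
		}(i)
	}
	wg.Wait()
}

func main() {
	s := &LockedStore{inner: mapStore{}}
	fill(s, 8)
	fmt.Println(s.Get("k3")) // v
}
```

A single store-wide lock keeps the hashtable code simple at the cost of serializing all writers; finer-grained locking is where the extra complexity comes from.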

Recap of the trade-offs

◉ The data structure (Hashtable, LSM-Tree, B+Tree, Trie, etc.) determines the trade-off(s)

> Read latency

> Write latency

> Range scan

> Isolation level

> Consistency

> High availability

◉ Choose!!

BluckDB

github.com/BenJoyenConseil/bluckdb

@BenJoyeConseil

Fork it
