
Implementing a key/value store


Page 1: Implementing a key/value store

50 AVENUE DES CHAMPS-ÉLYSÉES 75008 PARIS > FRANCE > WWW.OCTO.COM

Implementing a Key / Value Store

BluckDB

BJC - BOF - 15/12/16

github.com/BenJoyenConseil/bluckdb

@BenJoyeConseil

Page 2: Implementing a key/value store

Why ...

◉ Understand the internals of modern databases

◉ Explore the algorithms and data structures

◉ Do some "low-level" work

◉ Learn Go

◉ Not show up naked to the HBase training

Reinvent the wheel?

Page 3: Implementing a key/value store

>01 Where the kv store fits

Page 4: Implementing a key/value store

Often presented like this

Storage engine!

Page 5: Implementing a key/value store

◉ LevelDB (LSM-Tree)

◉ RocksDB (LSM-Tree)

◉ WiredTiger (B-Tree / LSM-Tree)

◉ ForestDB (HB+Trie)

◉ InnoDB (B+Tree)

◉ BoltDB (B+Tree)

◉ Kyoto Cabinet (Hashtable)

◉ BluckDB (Hashtable)

◉ ...

Examples of KV stores used as storage engines

[Diagram: layered view of a database: a server layer (MongoDB / MySQL / Riak / Lucene / ...) sits on top of a storage engine (LevelDB ... InnoDB), which sits on top of the file system.]

Page 6: Implementing a key/value store

CockroachDB uses RocksDB as its storage engine

Page 7: Implementing a key/value store

Quotable quote

All models are wrong but some are useful

— George Box

Models

Page 8: Implementing a key/value store
Page 9: Implementing a key/value store

>02 Deep Dive

Page 10: Implementing a key/value store

The design

1. Data storage abstraction

2. Data structure (index)

3. Memory management (page / block management, free space)

4. String / byte slice

5. Iterator / Cursor

6. Lock management

7. Comparator

Top 7 components of a kv store (ref: Topito article)

Page 11: Implementing a key/value store

The interface

type KVStore interface {
	Get(k string) string
	Put(k, v string)
	Delete(k string)
}
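
Just to make this contract concrete, here is a toy, map-backed implementation and a small usage example; memStore is purely illustrative and is not part of BluckDB:

// memStore: a toy in-memory implementation of the KVStore interface.
type memStore struct {
	m map[string]string
}

func newMemStore() *memStore {
	return &memStore{m: map[string]string{}}
}

func (s *memStore) Get(k string) string { return s.m[k] }
func (s *memStore) Put(k, v string)     { s.m[k] = v }
func (s *memStore) Delete(k string)     { delete(s.m, k) }

func main() {
	var store KVStore = newMemStore()
	store.Put("user:42", "alice")
	println(store.Get("user:42")) // alice
	store.Delete("user:42")
}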

Page 12: Implementing a key/value store

First implem’

◉ Simple hashmap (separate chaining)

◉ Persistent store:

> Put -> append to file

> Get -> foreach line, split(':')

MVP

[Diagram: static hashing with a fixed number of bucket files: hash(k) % numBucket (here, hash(k) % 3) picks File bucket 1, 2 or 3, and Put appends at the end of that file.]
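
A compact sketch of this MVP, assuming only what the slide states: a fixed number of append-only bucket files, hash(k) % 3 to route a key, Put appending "key:value" lines and Get re-reading the whole bucket. The naiveStore type and the FNV hash are illustrative choices, not the actual BluckDB code:

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
	"strings"
)

type naiveStore struct {
	buckets [3]*os.File // fixed number of bucket files, opened for append + read
}

func (s *naiveStore) bucket(k string) *os.File {
	h := fnv.New32a()
	h.Write([]byte(k))
	return s.buckets[h.Sum32()%3] // static hashing: hash(k) % 3
}

func (s *naiveStore) Put(k, v string) {
	// Put -> append "key:value" at the end of the bucket file
	fmt.Fprintf(s.bucket(k), "%s:%s\n", k, v)
}

func (s *naiveStore) Get(k string) string {
	f := s.bucket(k)
	f.Seek(0, 0) // Get -> re-read the whole bucket from the start
	var val string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// foreach line, split(':'); the last match wins (updates are appended)
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) == 2 && parts[0] == k {
			val = parts[1]
		}
	}
	return val
}

This is exactly why the Get benchmark on the next slide is so slow: every lookup is a full scan of its bucket file.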

Page 13: Implementing a key/value store

First implem’

◉ Benchmarks persistent store

BenchmarkPutNaiveDiskKVStore-4    200000       6250 ns/op -> 6.2 µs

BenchmarkGetNaiveDiskKVStore-4        30   44017416 ns/op -> 44 ms

◉ Benchmarks in-memory hashmap

BenchmarkPutHashMap-4            1000000       1385 ns/op -> 1.4 µs

BenchmarkGetHashMap-4            2000000        711 ns/op -> 0.7 µs

MVP

Page 14: Implementing a key/value store

Quotable quote

There’s clearly a trade-off between reads and writes, and it’s the mixing of the two that causes all of the interesting challenges

— Adrian Colyer

Page 15: Implementing a key/value store

>03 OK, shall we do a real design now?

Hashtables are arguably the single most important data structure known to mankind.

— Steve Yegge

Page 16: Implementing a key/value store

The design

1. Data storage abstraction -> 4 KB Page (SSD-friendly) + Record

2. Data structure (index) -> Hashtable (extendible hashing)

3. Memory management (page / block management, free space) -> mmap + custom

4. String / byte slice -> native Go string conversion to a []byte slice

5. Iterator / Cursor -> Iterator pattern

6. Lock management -> handled outside the core

7. Comparator -> multi-level comparator (key length > hash > byte); see the sketch below

Top 7 components of a kv store (ref: Topito article)
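
On item 7: comparing the key length first, then a hash, and only then the raw bytes keeps most key comparisons cheap. A minimal sketch under that reading of the slide; the function names and the choice of FNV-1a are illustrative:

import (
	"bytes"
	"hash/fnv"
)

// compare orders two keys on three levels: length first, then a 32-bit
// hash, and only on a collision a full byte-by-byte comparison.
func compare(a, b []byte) int {
	// level 1: key length (cheapest check)
	if d := len(a) - len(b); d != 0 {
		return d
	}
	// level 2: hash of the key (cheap, avoids most byte comparisons)
	ha, hb := hash32(a), hash32(b)
	if ha != hb {
		if ha < hb {
			return -1
		}
		return 1
	}
	// level 3: full byte comparison (same length and same hash)
	return bytes.Compare(a, b)
}

func hash32(k []byte) uint32 {
	h := fnv.New32a()
	h.Write(k)
	return h.Sum32()
}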

Page 17: Implementing a key/value store

Record layout

type Record interface {
	key() []byte
	val() []byte
	valLen() uint16 // min 0
	keyLen() uint16 // max 65535
}

type ByteRecord []byte

r := ByteRecord(byteArray[204 : 249])

overhead: 4 bytes

Page 18: Implementing a key/value store

Record layout

func (r ByteRecord) Write(key, val string) {
	...
	copy(r[:], key)
	copy(r[lenKey:], val)
	binary.LittleEndian.PutUint16(r[total:], lenVal)
	binary.LittleEndian.PutUint16(r[total+RECORD_HEADER_SIZE:], lenKey)
}

serialization

... k e y v a l u e 0x5 0x0 0x3 0x0 ...
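
The read side is the mirror image: the two little-endian uint16 values at the end of the record are enough to slice the key and the value back out. A sketch of what the interface's accessors could look like, assuming (as in the r := ByteRecord(byteArray[204 : 249]) example above) that the slice covers exactly one record; this is not necessarily the real BluckDB code:

import "encoding/binary"

// trailer (see the byte layout above): ... | valLen (2 bytes) | keyLen (2 bytes)
func (r ByteRecord) keyLen() uint16 {
	return binary.LittleEndian.Uint16(r[len(r)-2:])
}

func (r ByteRecord) valLen() uint16 {
	return binary.LittleEndian.Uint16(r[len(r)-4:])
}

func (r ByteRecord) key() []byte {
	return r[:r.keyLen()]
}

func (r ByteRecord) val() []byte {
	return r[r.keyLen() : len(r)-4]
}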

Page 19: Implementing a key/value store

Page layout

type Page []byte

const (
	PAGE_SIZE               = 4096
	PAGE_USE_OFFSET         = 4094
	PAGE_LOCAL_DEPTH_OFFSET = 4092
)

func (p Page) Use() int {
	return int(binary.LittleEndian.Uint16(p[PAGE_USE_OFFSET:]))
}

...

[Page layout: Record1 | Record2 | Record3 | Record1 v2 | ... | LD | USE]
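
In the same spirit as Use(), the local depth comes from the other trailer field, and the free space falls out of the two. A small sketch; Free() and its assumption that Use() counts the bytes occupied by records from the start of the page are mine, not taken from the slides:

import "encoding/binary"

func (p Page) LocalDepth() int {
	return int(binary.LittleEndian.Uint16(p[PAGE_LOCAL_DEPTH_OFFSET:]))
}

// Free: bytes left between the last record and the 4-byte page trailer.
func (p Page) Free() int {
	return PAGE_LOCAL_DEPTH_OFFSET - p.Use()
}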

Page 20: Implementing a key/value store

Extendible Hashing algorithm

dynamic hashing

◉ A hash function that produces values over a large range, typically an int32

◉ A prefix of the hash value (in practice here, its GD low-order bits) is used to compute the index into the address table

◉ Several entries in the address table may point to the same page

Page 21: Implementing a key/value store

Extendible Hashing algorithm

[Diagram: GD=2; entries ..00 and ..01 both point to page 1 (ld=1, records ...0100 and ...1101); ..10 points to page 2 (ld=2, records ...0110 and ...1110); ..11 points to page 3 (ld=2, records ...0111 and ...1011).]

Page 22: Implementing a key/value store

Extendible Hashing algorithm

After the split

[Diagram: GD=2; ..00 points to page 1 (ld=2, record ...0100), ..01 points to the new page 4 (ld=2, record ...1101), ..10 still points to page 2 (ld=2, records ...0110 and ...1110) and ..11 to page 3 (ld=2, records ...0111 and ...1011).]

Page 23: Implementing a key/value store

Extendible Hashing algorithm

After the directory expansion

[Diagram: GD=3; the address table doubles to eight entries (.000 through .111) while the pages are unchanged (all ld=2), so two entries now point to each page: .000 and .100 to page 1 (...0100), .001 and .101 to page 4 (...1101), .010 and .110 to page 2 (...0110, ...1110), .011 and .111 to page 3 (...0111, ...1011).]
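
To make the split and expand steps concrete, here is a self-contained toy version in memory: a "page" holds at most two records, the key is reduced to its 32-bit hash, and the address table is a slice of page pointers. It is not the BluckDB code (which works on 4 KB mmap'd pages), but the mechanics are the same: index by the low GD bits, double the table when ld == GD, split on the next bit.

// toyPage holds at most 2 records; toyDir is the address table.
type toyPage struct {
	ld      int               // local depth
	records map[uint32]string // hash -> value
}

type toyDir struct {
	gd    int        // global depth
	table []*toyPage // 2^gd entries
}

func newToyDir() *toyDir {
	return &toyDir{gd: 0, table: []*toyPage{{ld: 0, records: map[uint32]string{}}}}
}

func (d *toyDir) put(h uint32, v string) {
	p := d.table[h&((1<<d.gd)-1)] // index = low gd bits of the hash
	if _, exists := p.records[h]; exists || len(p.records) < 2 {
		p.records[h] = v // room left (or overwrite): just store it
		return
	}
	if p.ld == d.gd {
		// the page is already addressed by all gd bits:
		// expand, i.e. double the address table; pages are untouched
		d.table = append(d.table, d.table...)
		d.gd++
	}
	// split: a new page takes the records whose next address bit is 1
	p.ld++
	twin := &toyPage{ld: p.ld, records: map[uint32]string{}}
	bit := uint32(1) << (p.ld - 1)
	for hh, vv := range p.records {
		if hh&bit != 0 {
			twin.records[hh] = vv
			delete(p.records, hh)
		}
	}
	// rewire the directory entries that now belong to the twin page
	for i := range d.table {
		if d.table[i] == p && uint32(i)&bit != 0 {
			d.table[i] = twin
		}
	}
	d.put(h, v) // retry; may split again if everything landed on one side
}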

Page 24: Implementing a key/value store

Directory layout

func (dir *Directory) extendibleHash(k util.Hashable) int {
	return k.Hash() & ((1 << dir.Gd) - 1)
}

func (dir *Directory) getPage(k string) Page {
	hash := dir.extendibleHash(util.Key(k))
	id := dir.Table[hash]
	offset := id * PAGE_SIZE
	return Page(dir.data[offset : offset+PAGE_SIZE])
}

dir := &Directory{
	Table:      []int{0, 1, 3, 2, 0, 1, 3, 2},
	Gd:         3,
	LastPageId: 3,
	data:       []byte{...}, // the mmap'd file
}
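
getPage() locates the 4 KB page; the other half of a Get is finding the key inside that page. One possible scan, walking records backwards from Use() using the trailer format from the Record layout slides, so the newest version of a key is found first; the real mmap.Page.get that shows up in the CPU profiles below may differ from this sketch:

import (
	"bytes"
	"encoding/binary"
)

func (p Page) get(k string) string {
	key := []byte(k)
	end := p.Use() // records occupy p[0:Use()]
	for end > 0 {
		// each record ends with valLen then keyLen (little-endian uint16)
		kl := int(binary.LittleEndian.Uint16(p[end-2:]))
		vl := int(binary.LittleEndian.Uint16(p[end-4:]))
		start := end - 4 - kl - vl
		if bytes.Equal(p[start:start+kl], key) {
			return string(p[start+kl : start+kl+vl])
		}
		end = start // jump to the previous record
	}
	return ""
}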

Page 25: Implementing a key/value store

Design

[Diagram: component overview, grouping the design into Core (KVStore: Get(k), Put(k,v), Delete(k); DirectoryMmap: open(), close(), persistMeta(), PageGC(), ld()), Data Storage (File, RecordWrite(), Iterator), Memory Management (mmap) and Lock Management (RWMutex), fronted by the Bluck server (http.Listen, Bluckstore).]

Page 26: Implementing a key/value store

>04 Shall we talk about performance?

Memory maps are the best thing known to mankind after hash tables.

— Emmanuel Goossaert

Page 27: Implementing a key/value store

Memory mapping

◉ Mmap file -> []byte

> Zero copy: data is read and written without copying between kernel space and user space buffers

> No buffer to manage for flushing

> No block cache to implement

◉ Pages are pre-allocated to speed up mmap

Optimizations
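
A minimal sketch of the mapping itself on Linux or macOS, using the standard syscall package; the real store wraps this differently, and error handling is trimmed:

import (
	"os"
	"syscall"
)

// mapFile maps the whole file read-write and shared, so writes to the
// returned []byte land in the page cache and end up in the file.
func mapFile(path string, size int) ([]byte, *os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return nil, nil, err
	}
	// pre-allocation: growing the file up front avoids re-mmapping later
	if err := f.Truncate(int64(size)); err != nil {
		return nil, nil, err
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	return data, f, err
}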

Page 28: Implementing a key/value store

Timeline of the changes

Benchmarks (Go)

                  Get (ns/op)   Put (ns/op)
Mmap                     2344          5452
Update                   2881         10532
Iterator                 8796         11102
ByteRecord               1819          9982
GOB serde                1874          3206
Reverse                  1406          1529
Flush Meta               1398          2786
Pre-allocation           1408          1359

Page 29: Implementing a key/value store

The features that make everything BLOW UP!!!

◉ Update (in place?)

◉ Metadata (consistency)

◉ Delete (shift? scan?)

◉ Concurrency & isolation

◉ Big records

Page 30: Implementing a key/value store

$ go tool pprof

Before:

Benchmark => iterations: 200,000    8796 ns/op    3376 B/op    106 allocs/op

(pprof) top10

1590ms of 1890ms total (84.13%)

Showing top 10 nodes out of 77 (cum >= 50ms)

flat flat% sum% cum cum%

500ms 26.46% 26.46% 920ms 48.68% runtime.mallocgc

260ms 13.76% 40.21% 260ms 13.76% runtime.heapBitsSetType

140ms 7.41% 47.62% 1630ms 86.24% github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get

140ms 7.41% 55.03% 1430ms 75.66% runtime.convT2I

120ms 6.35% 61.38% 120ms 6.35% runtime.memclr

120ms 6.35% 67.72% 120ms 6.35% runtime.memmove

cpu

Page 31: Implementing a key/value store

$ go tool pprof

After:

Benchmark => iterations: 2,000,000    756 ns/op    16 B/op    1 allocs/op

(pprof) top10

2.33s of 2.36s total (98.73%)

Dropped 9 nodes (cum <= 0.01s)

Showing top 10 nodes out of 12 (cum >= 2.36s)

flat flat% sum% cum cum%

2.21s 93.64% 93.64% 2.33s 98.73% github.com/BenJoyenConseil/bluckdb/bluckstore/mmap.Page.get

0.06s 2.54% 96.19% 0.11s 4.66% runtime.mallocgc

0.03s 1.27% 97.46% 0.03s 1.27% runtime.heapBitsSetType

0.02s 0.85% 98.31% 0.02s 0.85% runtime.scanobject

0.01s 0.42% 98.73% 0.12s 5.08% runtime.newobject

cpu
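
For reference, numbers like 756 ns/op, 16 B/op and 1 allocs/op come from Go's benchmark harness with the -benchmem flag, and the profiles above from -cpuprofile. A sketch of how they are produced, reusing the toy memStore from the interface slide just to have something measurable:

import "testing"

func BenchmarkGet(b *testing.B) {
	store := newMemStore()
	store.Put("key", "value")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		store.Get("key")
	}
}

// Run and profile with:
//   go test -bench=Get -benchmem -cpuprofile cpu.out
//   go tool pprof cpu.out     (then `top10` at the (pprof) prompt)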

Page 32: Implementing a key/value store

Recap of the trade-offs

◉ Mmap with page pre-allocation: deleting records does not free disk space

◉ Extendible hashing & hashtable: fast access, but persisting the metadata separately is very costly and risky (from a consistency point of view)

◉ Delete: marking kv pairs as "deleted" strongly increases write amplification, whereas shifting moves a lot of data around

◉ Update: "in place" means managing defragmentation, whereas append-only means doing GC

◉ Concurrency: making the hashtable thread-safe adds a lot of complexity to the code (Mutex)
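
On that last point: since lock management was deliberately kept outside the core (see the design slide), thread safety can be layered on top of any KVStore. A minimal sketch with sync.RWMutex; the lockedStore wrapper is illustrative:

import "sync"

// lockedStore serializes writers and lets readers proceed in parallel.
type lockedStore struct {
	mu    sync.RWMutex
	inner KVStore
}

func (s *lockedStore) Get(k string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.inner.Get(k)
}

func (s *lockedStore) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inner.Put(k, v)
}

func (s *lockedStore) Delete(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inner.Delete(k)
}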

Page 33: Implementing a key/value store

Recap of the trade-offs

◉ The data structure (hashtable, LSM-tree, B+tree, trie, etc.) defines the trade-off(s):

> Read latency

> Write latency

> Range scans

> Isolation level

> Consistency

> High availability

◉ Make your choice!!

Page 34: Implementing a key/value store

BluckDB

github.com/BenJoyenConseil/bluckdb

@BenJoyeConseil

Fork it