Nibiru: Building your own NoSQL store


Building a NoSQL from scratch
Let them know what they are missing!

#ddtx16 @edwardcapriolo @HuffPostCode

If you are looking for

A battle tested NoSQL data store
That scales up to 1 million transactions a second
Allows you to query data from your IoT sensors in real time
You are at the wrong talk!

This is a presentation about Nibiru, an open source database I work on in my spare time. But you should stay anyway...

Motivations

Why do this? How did it get started? What did it morph into?

Many NoSQL databases came out of an industry specific use case, and as a result they have baked-in assumptions. With clean interfaces and good abstractions we can build a better general tool with fewer forced choices, potentially supporting a majority of the use cases in one tool.


A friend asked

Won't this make Nibiru have all the bugs of all the systems?


My response

Jerk!


You might want to follow along with a local copy

There are a lot of slides that have a fair amount of code:
https://github.com/edwardcapriolo/nibiru/blob/master/hexagons.ppt
http://bit.ly/1NcAoEO

Basics


Terminology

Keyspace: a logical grouping of store(s)
Store: a structure that holds data (avoided: Column Family, Table, Collection, etc.)
Node: a system
Cluster: a group of nodes

Assumptions & Design notes

A store is of a specific type: Key Value, Column Family, etc.
The API of the store is dictated by its type
Ample gotchas from a one-man, after-work project
Wire components together, not into a large context
Using String (for now) instead of byte[] to ease debugging

Server ID

We need to uniquely identify each node. Hostname/IP is not a good solution:

Systems have multiple
They can change
We should be able to run N copies on a single node

Implementation

On first init(), create a GUID and persist it; reuse it on every subsequent start
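A minimal sketch of that idea in Java (the ServerId class and id-file handling here are illustrative, not Nibiru's actual code):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.UUID;

    // Illustrative sketch: generate a node id on first start, persist it,
    // and reuse it on every restart.
    public class ServerId {
        private final Path idFile;

        public ServerId(Path idFile) {
            this.idFile = idFile;
        }

        public String getOrCreate() throws IOException {
            if (Files.exists(idFile)) {
                return new String(Files.readAllBytes(idFile)).trim();
            }
            String id = UUID.randomUUID().toString();
            Files.write(idFile, id.getBytes()); // persist so restarts keep the same id
            return id;
        }
    }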


Cluster Membership

What is the list of nodes in the cluster? What is the up/down state of each node?

Static Membership
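This was presumably a code slide in the deck; as a stand-in, here is a minimal sketch of a static, configuration-driven membership. The ClusterMembership interface and class names are assumptions for illustration, not Nibiru's actual API:

    import java.util.List;

    // What the rest of the system asks of any membership implementation.
    interface ClusterMembership {
        List<String> getAllMembers();   // every known node id
        List<String> getLiveMembers();  // nodes currently considered up
    }

    // Static membership: the node list comes from configuration, never changes,
    // and every configured node is assumed to be up.
    class StaticClusterMembership implements ClusterMembership {
        private final List<String> nodes;

        StaticClusterMembership(List<String> configuredNodes) {
            this.nodes = List.copyOf(configuredNodes);
        }

        @Override public List<String> getAllMembers() { return nodes; }
        @Override public List<String> getLiveMembers() { return nodes; }
    }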


Different cluster membership models

Consensus/Gossip: Cassandra, Elasticsearch
Master node/Someone else's problem: HBase (ZooKeeper)

Gossip

Image: http://www.joshclemm.com/projects/


Teknek Gossip

Licensed Apache v2
Forked from a Google Code project
Available from Maven: groupId io.teknek, artifactId gossip
Great tool for building a peer-to-peer service

Cluster Membership using Gossip


Get Live Members
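A hedged sketch of the idea behind this code slide, assuming the gossip layer tracks a last-heartbeat timestamp per member. This heartbeat map is hypothetical; it is not the teknek-gossip API:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Gossip-backed membership: a node is "live" if we heard from it recently.
    class GossipMembership implements ClusterMembership {
        private final Map<String, Long> lastHeartbeat; // node id -> last heartbeat millis
        private final long deadAfterMs;

        GossipMembership(Map<String, Long> lastHeartbeat, long deadAfterMs) {
            this.lastHeartbeat = lastHeartbeat;
            this.deadAfterMs = deadAfterMs;
        }

        @Override public List<String> getAllMembers() {
            return List.copyOf(lastHeartbeat.keySet());
        }

        @Override public List<String> getLiveMembers() {
            long now = System.currentTimeMillis();
            return lastHeartbeat.entrySet().stream()
                    .filter(e -> now - e.getValue() < deadAfterMs)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }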


Gut check

Did clean abstractions hurt the design here? Does it seem possible we could add a ZooKeeper/etcd backend implementation? Any takers? :)

Request Routing


Some options

So you have a bunch of nodes in a cluster, but where the heck does the data go?

Client dictated: like a sharded memcache|mysql|whatever
HBase: sharding with a leader election
Dynamo style: ring topology with token ownership

Router & Partitioners
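Another code slide; a sketch of how routing could be split into two small abstractions (interface names illustrative): a Partitioner that turns a row key into a token, and a Router that turns a key into destination node(s):

    import java.util.List;

    // Maps a row key onto a token in the ring.
    interface Partitioner {
        long token(String rowKey);
    }

    // Decides which node(s) a request for a key should be sent to.
    interface Router {
        List<String> route(String rowKey, ClusterMembership membership, Partitioner partitioner);
    }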


Pick your poison: no hot spots or key locality :)


Quick example LocalPartitioner
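A sketch of the local case: everything routes to the local node, with no hashing and no ring, which is all a single-node store needs (the exact Nibiru classes may differ):

    import java.util.Collections;
    import java.util.List;

    // Degenerate router: all data lives on this node.
    class LocalRouter implements Router {
        private final String localId;

        LocalRouter(String localId) {
            this.localId = localId;
        }

        @Override
        public List<String> route(String rowKey, ClusterMembership membership, Partitioner partitioner) {
            return Collections.singletonList(localId);
        }
    }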


Scenario: using a Dynamo-ish router

Construct a three node topology
Give each an id
Give them each a token
Test that requests route properly

Cluster and Token information


Unit Test


Token Router
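A hedged sketch of the Dynamo-style idea behind this code slide: each node owns a token, and a key routes to the first node whose token is greater than or equal to the key's token, wrapping around the ring. Class names and the example tokens are illustrative:

    import java.util.Collections;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Ring of token -> node id; a key goes to the first token >= its own.
    class TokenRouter implements Router {
        private final TreeMap<Long, String> ring = new TreeMap<>();

        void addNode(long token, String nodeId) {
            ring.put(token, nodeId);
        }

        @Override
        public List<String> route(String rowKey, ClusterMembership membership, Partitioner partitioner) {
            SortedMap<Long, String> tail = ring.tailMap(partitioner.token(rowKey));
            // Past the highest token? Wrap around to the first node on the ring.
            Long owner = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            return Collections.singletonList(ring.get(owner));
        }
    }

    // Mirroring the three node scenario above (illustrative tokens):
    // router.addNode(10, "node-a"); router.addNode(20, "node-b"); router.addNode(30, "node-c");
    // A key with token 15 routes to node-b; token 35 wraps around to node-a.

For the replication variant on the next slide, the natural extension is to return the owner plus the next N-1 distinct nodes walking clockwise around the ring.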


Do the Damn Thing!


Do the Damn Thing! With Replication


Storage Layer


Basic Data Storage: SSTables

SS = Sorted String
{ 'a', $PAYLOAD$ }, { 'b', $PAYLOAD$ }

LevelDB SSTable payload

Key Value implementation
SortedMap<byte[], byte[]>

{ 'a', '1' }, { 'b', '2' }

Cassandra SSTable Implementation

Key Value in which the value is a map with last-update-wins versioning
SortedMap<byte[], SortedMap<byte[], Val<byte[], long>>>

{ 'a', { 'col': { 'val', 1 } } },
{ 'b', { 'col1': { 'val', 1 },
         'col2': { 'val2', 2 } } }

HBase SSTable Implementation

Key Value in which the value is a map with multi-versioning
SortedMap<byte[], SortedMap<byte[], Val<byte[], long>>>

{ 'a', { 'col': { 'val', 1 } } },
{ 'b', { 'col1': { 'val', 1 },
         'col1': { 'valb', 2 },
         'col2': { 'val2', 2 } } }

Column Family Store high level


Operations to support


One possible memtable implementation

Holy generics, Batman! Isn't it just a map of maps?

Unfortunately no!

Imagine two requests arrive in this order:
set people [edward] [age]='34' (Time 2)
set people [edward] [age]='35' (Time 1)

What should the final value be? We need to deal with events landing out of order. Deletes are also writes, recorded as markers known as tombstones.
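A sketch of a last-write-wins cell, in the spirit of the Val value-plus-timestamp pairs from the SSTable slides (all names illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Memtable with last-write-wins resolution: rowKey -> (column -> value + time).
    class Memtable {
        static class Val {
            final String value; // null means tombstone (delete marker)
            final long time;
            Val(String value, long time) { this.value = value; this.time = time; }
        }

        private final Map<String, Map<String, Val>> rows = new ConcurrentHashMap<>();

        void set(String rowKey, String column, String value, long time) {
            Map<String, Val> row = rows.computeIfAbsent(rowKey, k -> new ConcurrentHashMap<>());
            // Keep the write with the highest timestamp, so events landing
            // out of order still resolve to Time 2's value.
            row.merge(column, new Val(value, time),
                    (oldV, newV) -> newV.time > oldV.time ? newV : oldV);
        }

        void delete(String rowKey, String column, long time) {
            set(rowKey, column, null, time); // tombstone, resolved like any write
        }
    }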

And then, there is concurrency

Multiple threads manipulate the memtable at the same time. Proposed solution (which I think is correct): do not compare-and-swap the value; instead, append to a queue and take a second pass to optimize.
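A sketch of that proposal as I read it: writers only ever append, and resolution happens on read plus a second optimizing pass (all names illustrative):

    import java.util.Comparator;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Writers never contend on one value slot: they append versions to a queue.
    class VersionedCell {
        private final ConcurrentLinkedQueue<Memtable.Val> versions = new ConcurrentLinkedQueue<>();

        void write(Memtable.Val v) {
            versions.add(v); // no compare-and-swap loop, just an append
        }

        Memtable.Val read() {
            // Resolve on read: highest timestamp wins.
            return versions.stream()
                    .max(Comparator.comparingLong(v -> v.time))
                    .orElse(null);
        }

        // Second pass: drop versions that can no longer win.
        void optimize() {
            Memtable.Val winner = read();
            if (winner != null) {
                versions.removeIf(v -> v.time < winner.time);
            }
        }
    }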


Optimization 1: BloomFilters

Use Guava. Smart!
Audience: make a disappointed 'aww' sound because Ed did not write it himself
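Minimal Guava usage for the SSTable case: ask "is this key definitely absent?" before touching disk. The expected-insertion count and false-positive rate below are made-up illustrative values:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class BloomExample {
        public static void main(String[] args) {
            // Sized for 10,000 keys with a 1% false-positive rate (illustrative).
            BloomFilter<CharSequence> filter =
                    BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);

            filter.put("edward");

            filter.mightContain("edward"); // true
            filter.mightContain("nadia");  // false means definitely absent: skip the disk read
        }
    }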

Optimization 2: IndexWriter

Seeking on disk is not like seeking in memory: we need an index so reads do not scan the whole file.
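One common approach, sketched here as an assumption about what an IndexWriter does (not necessarily Nibiru's exact design): record the file offset of every Nth key while writing the SSTable, then seek to the nearest indexed key and scan a short stretch:

    import java.util.Map;
    import java.util.TreeMap;

    // Sparse index: every Nth key -> byte offset in the SSTable file.
    class IndexWriter {
        private final TreeMap<String, Long> index = new TreeMap<>();
        private final int interval;
        private long count = 0;

        IndexWriter(int interval) {
            this.interval = interval;
        }

        void onKeyWritten(String key, long fileOffset) {
            if (count++ % interval == 0) {
                index.put(key, fileOffset);
            }
        }

        // Start scanning from the nearest indexed key at or before the target.
        long seekOffsetFor(String key) {
            Map.Entry<String, Long> entry = index.floorEntry(key);
            return entry == null ? 0L : entry.getValue();
        }
    }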

Consistency


Multinode Consistency

Replication: the number of places data lives
Active/Active
Master/Slave (with takeover)
Resolving conflicted data

Quorum Consistency: an Active/Active Implementation

Message dispatched


Asynchronous Responses T1


Asynchronous Responses T2


Logic to merge results


Breakdown of components

Start & deadline: the maximum time to wait for responses
Message: the read/write request sent to each destination
Merger: turns multiple responses into a single result
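Pulling those pieces together, a hedged sketch of a quorum coordinator: dispatch the message, collect asynchronous responses until enough arrive or the deadline passes, then hand them to the merger. The Merger interface and timing logic are illustrative, not Nibiru's actual classes:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Turn multiple responses into a single result.
    interface Merger<R> {
        R merge(List<R> responses);
    }

    class QuorumCoordinator<R> {
        private final LinkedBlockingQueue<R> responses = new LinkedBlockingQueue<>();

        // Called by the messaging layer as each destination answers (T1, T2, ...).
        void onResponse(R response) {
            responses.add(response);
        }

        // Block until `needed` responses arrive or the deadline passes.
        R await(int needed, long deadlineMs, Merger<R> merger)
                throws InterruptedException, TimeoutException {
            long start = System.currentTimeMillis();
            List<R> collected = new ArrayList<>();
            while (collected.size() < needed) {
                long remaining = (start + deadlineMs) - System.currentTimeMillis();
                if (remaining <= 0) {
                    throw new TimeoutException("quorum not reached before deadline");
                }
                R r = responses.poll(remaining, TimeUnit.MILLISECONDS);
                if (r != null) {
                    collected.add(r);
                }
            }
            return merger.merge(collected);
        }
    }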

Testing


Challenges of timing in testing

Target goal is ~80% unit, 20% integration (e2e) testing
Performance varies locally vs on travis-ci
Hard to test something that typically happens in milliseconds but in the worst case can take seconds
Lazy half-solution: Thread.sleep() statements sized for the worst case
Definitely a slippery slope
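The alternative to sleeping for the worst case is to poll until the condition holds or a deadline passes, which is the itch TUnit (next slide) scratches. A generic sketch of the pattern, not TUnit's actual API:

    import java.util.function.Supplier;

    // Poll until the condition is true or the deadline passes. Fast machines
    // finish in milliseconds; slow CI machines get the full worst-case budget.
    class Await {
        static void until(Supplier<Boolean> condition, long timeoutMs) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!condition.get()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new AssertionError("condition not met within " + timeoutMs + " ms");
                }
                Thread.sleep(10); // brief backoff between checks
            }
        }
    }

    // Usage in a test:
    // Await.until(() -> membership.getLiveMembers().size() == 3, 10_000);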

Introducing TUnit

https://github.com/edwardcapriolo/tunit


The End
