SCYLLA: NoSQL at Ludicrous Speed

Duarte Nunes @duarte_nunes


AGENDA

❏ Introducing ScyllaDB
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

ScyllaDB

● Clustered NoSQL database compatible with Apache Cassandra

● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self-tuning
● Mechanically sympathetic C++14

YCSB Benchmark: 3-node Scylla cluster vs. 3, 9, 15, and 30 Cassandra machines

(Chart: throughput of 3 Scylla nodes compared with 3 and 30 Cassandra nodes)

Scylla vs Cassandra - CL:LOCAL_QUORUM, Outbrain Case Study

Scylla and Cassandra handling the full load (peak of ~12M RPM)

(Chart: Cassandra at ~200ms vs. Scylla at ~10ms: 20x lower latency)

Scylla benchmark by Samsung

(Chart: throughput in op/s)

Full report: http://tinyurl.com/msl-scylladb

Dynamo-based system

Data model

(Diagram: a partition key maps to rows identified by clustering keys; rows within a partition are sorted by primary key)

CREATE TABLE playlists (
    id int,
    song_id int,
    title text,
    PRIMARY KEY (id, song_id)
);
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
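The layout above can be sketched with ordinary containers. This is an illustrative model only, not Scylla's storage format; the `insert` helper and the container choices are assumptions:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Illustrative model: the partition key selects a partition; within a
// partition, rows are kept sorted by clustering key, so slicing a
// partition is an ordered scan.
using clustering_key = int;                        // song_id in the example
using row = std::string;                           // title
using partition = std::map<clustering_key, row>;   // sorted by clustering key
using table = std::unordered_map<int, partition>;  // keyed by partition key

// Mirrors: INSERT INTO playlists (id, song_id, title) VALUES (pk, ck, r);
void insert(table& t, int pk, clustering_key ck, const row& r) {
    t[pk][ck] = r;
}
```

Rows inserted in any order come back sorted by clustering key when the partition is scanned.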

Log-Structured Merge Tree

(Diagram, over time: memtable flushes produce SSTables 1, 2, 3, …; a background job compacts SSTables 1+2+3 into a single table while new SSTables 4 and 5 continue to arrive. Legend: foreground job vs. background job.)
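The read and compaction behavior above can be modeled in a few lines. This is a sketch under the assumption that each SSTable is a sorted map; `lsm_read` and `compact` are hypothetical helpers, not Scylla's code:

```cpp
#include <map>
#include <string>
#include <vector>

// Model each SSTable as an immutable sorted map. Newer tables shadow
// older ones, so a read probes newest-first.
using sstable = std::map<int, std::string>;

// Returns a pointer to the value, or nullptr on a miss.
const std::string* lsm_read(const std::vector<sstable>& tables, int key) {
    // `tables` is ordered oldest -> newest; scan newest first.
    for (auto it = tables.rbegin(); it != tables.rend(); ++it) {
        auto hit = it->find(key);
        if (hit != it->end()) {
            return &hit->second;
        }
    }
    return nullptr;
}

// Background compaction merges several tables into one; newer values win.
sstable compact(const std::vector<sstable>& tables) {
    sstable merged;
    for (const auto& t : tables) {         // iterate oldest -> newest
        for (const auto& kv : t) {
            merged[kv.first] = kv.second;  // newer tables overwrite older values
        }
    }
    return merged;
}
```

Compacting SSTables 1+2+3 this way yields one table with the same read results, which is why the merge can safely run in the background.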

Request path

(Diagram: writes go to the commit log and to the memtable; reads are served from the memtable and the SSTables.)
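A minimal sketch of that write path, assuming a toy `node` with a fixed `memtable_limit` (both names are illustrative, not Scylla's actual structures):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: a write is first appended to the commit log (for durability),
// then applied to the in-memory memtable; a full memtable is sealed into
// an immutable SSTable, after which the logged writes can be discarded.
struct node {
    std::vector<std::string> commit_log;            // append-only durability log
    std::map<int, std::string> memtable;            // in-memory sorted write buffer
    std::vector<std::map<int, std::string>> sstables;
    size_t memtable_limit = 2;                      // tiny, for illustration

    void write(int key, const std::string& value) {
        commit_log.push_back(std::to_string(key) + ":" + value);  // 1. log it
        memtable[key] = value;                                    // 2. buffer it
        if (memtable.size() >= memtable_limit) {
            flush();                                              // 3. spill to disk
        }
    }

    void flush() {
        sstables.push_back(memtable);  // seal the memtable as a new SSTable
        memtable.clear();
        commit_log.clear();            // logged writes are now persisted
    }
};
```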

Implementation Goals

● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control:
○ Spend the cycles on what we want, when we want

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Enter Seastar www.seastar-project.org

● Thread-per-core design (shard)
○ No blocking. Ever.
● Asynchronous networking, file I/O, multicore
● Future/promise based APIs
● Usermode TCP/IP stack included in the box
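What thread-per-core sharding means in practice can be sketched as follows. This is a simplification in which shards are plain structs and mailboxes are vectors; `owner_of` and `submit_to` here are stand-ins for Seastar's sharding and `smp::submit_to`, not its API:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Every key is owned by exactly one shard, so a shard touches its own data
// without locks; work destined for another shard is delivered as a message
// rather than through shared memory.
struct shard {
    std::vector<std::function<void()>> mailbox;  // pending cross-shard messages
};

// Deterministic owner: the same key always maps to the same shard.
size_t owner_of(int key, size_t n_shards) {
    return std::hash<int>{}(key) % n_shards;
}

// Run `fn` on the owning shard: immediately if we are already there,
// otherwise enqueue it on that shard's mailbox.
void submit_to(std::vector<shard>& shards, size_t current, int key,
               std::function<void()> fn) {
    size_t owner = owner_of(key, shards.size());
    if (owner == current) {
        fn();
    } else {
        shards[owner].mailbox.push_back(std::move(fn));
    }
}
```

Because only the owner ever touches its data, there is nothing to block on: cross-shard work queues up and the owner drains its mailbox on its own schedule.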

Seastar task scheduler

Traditional stack: a thread is a function pointer; a stack is a byte array from 64k to megabytes. A scheduler multiplexes threads onto the CPU. Context switch cost is high, and large stacks pollute the caches.

Seastar stack: a promise is a pointer to an eventually computed value; a task is a pointer to a lambda function. The scheduler runs tasks directly on the CPU: no sharing, millions of parallel events.

Seastar memcached

Pedis https://github.com/fastio/pedis

Futures

future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    return smp::submit_to(core, [id] {
        return lookup(id);
    }).then([this] (sstring result) {
        return _conn->write(result);
    });
});

No escaping the monad

future<> f = …;
f.get(); // not allowed

Unless...

future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});

Seastar memory allocator

● Non-thread-safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ Allocator calls a callback when low on memory
○ Scylla evicts cache in response
● Inter-core free() through message passing
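Inter-core free() through message passing can be sketched like this. It is an assumed simplification in which cores are indices and the owner of each allocation is tracked explicitly:

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Each core has a private pool; memory allocated on core A and freed on
// core B is not returned in place, but posted back to A's queue, so the
// pools themselves never need locks.
struct core_pool {
    std::vector<void*> free_queue;   // pointers posted back by other cores
    size_t live_allocs = 0;

    void* alloc(size_t bytes) {
        ++live_allocs;
        return ::operator new(bytes);
    }
    void free_local(void* p) {
        --live_allocs;
        ::operator delete(p);
    }
    void drain() {                   // the owner periodically processes the queue
        for (void* p : free_queue) {
            free_local(p);
        }
        free_queue.clear();
    }
};

// Called on core `current`; if the memory belongs to another core, message it.
void cross_core_free(std::vector<core_pool>& cores, size_t current,
                     size_t owner, void* p) {
    if (owner == current) {
        cores[owner].free_local(p);
    } else {
        cores[owner].free_queue.push_back(p);  // message passing, no locks
    }
}
```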

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Usermode I/O scheduler

(Diagram: the traditional path goes from the application through the filesystem and block layer to storage, with the kernel's disk I/O scheduler arbitrating between classes A and B. In Seastar, Query, Commitlog, and Compaction each get their own queue in front of a userspace I/O scheduler, which dispatches to the disk.)
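One way a userspace scheduler can arbitrate between such classes is share-based accounting. This is a sketch; the share values and the charging rule are assumptions, not Scylla's exact policy:

```cpp
#include <queue>
#include <string>
#include <vector>

// Each class (query, commitlog, compaction) gets a share; dispatch picks the
// class with the lowest accumulated normalized cost, so high-share classes
// are served proportionally more often.
struct io_class {
    std::string name;
    double shares;              // relative priority
    double accumulated = 0;     // cost charged so far, normalized by shares
    std::queue<int> requests;   // pending request ids
};

// Returns the id of the dispatched request, or -1 if everything is idle.
int dispatch(std::vector<io_class>& classes) {
    io_class* best = nullptr;
    for (auto& c : classes) {
        if (!c.requests.empty() && (!best || c.accumulated < best->accumulated)) {
            best = &c;
        }
    }
    if (!best) return -1;
    int id = best->requests.front();
    best->requests.pop();
    best->accumulated += 1.0 / best->shares;  // charge one unit of disk work
    return id;
}
```

With shares 4:1, the query class gets roughly four dispatches for every compaction dispatch while both queues are backed up.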

Figuring out optimal disk concurrency

(Chart: throughput vs. concurrency, flattening at the max useful disk concurrency)

Cassandra cache

(Diagram: the Linux page cache sits between Cassandra and the SSTables)

● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand:
○ Exists
○ Hundreds of man-years
○ Handling lots of edge cases

Cassandra cache

● Parasitic rows

(Diagram: an SSTable page is 4k even when your data is only 300b)

Cassandra cache

● Page faults

(Diagram: the app thread page-faults and is suspended; the kernel initiates I/O, with a context switch; the SSD completes the I/O, raising an interrupt and another context switch; the kernel maps the page and resumes the thread.)

Cassandra cache

● Complex tuning

(Diagram: an on-heap/off-heap key cache and row cache inside Cassandra, plus the Linux page cache over the SSTables)

Scylla cache

(Diagram: a single unified cache over the SSTables)

Probabilistic Cache Warmup

● A replica with a cold cache should be sent fewer requests
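A possible mechanism for this (the ramp policy here is an assumption, not necessarily Scylla's) is to pick read replicas with probability proportional to cache warmth:

```cpp
#include <cstddef>
#include <vector>

// A restarted replica starts with a small probability of being chosen for
// reads and earns more traffic as its cache warms, instead of being hit
// at full load while cold.
struct replica {
    double warmth;  // 0.0 = cold cache, 1.0 = fully warmed
};

// Pick a replica with probability proportional to its warmth.
// `r` is a uniform random number in [0, 1).
size_t pick_replica(const std::vector<replica>& replicas, double r) {
    double total = 0;
    for (const auto& rep : replicas) total += rep.warmth;
    double x = r * total;
    for (size_t i = 0; i < replicas.size(); ++i) {
        x -= replicas[i].warmth;
        if (x < 0) return i;
    }
    return replicas.size() - 1;
}
```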

Yet another allocator (Problems with malloc/free)

● Memory gets fragmented over time
○ If the workload changes sizes of allocated objects
○ Allocating a large contiguous block requires evicting most of the cache

(Animation: allocations of varying sizes fragment memory until a large contiguous allocation fails - OOM :( - despite enough total free space.)

Log-structured memory allocation

● Bump-pointer allocation to current segment
● Frees leave holes in segments
● Compaction will try to solve this
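The three bullets above can be sketched as follows; the segment size and the bookkeeping are simplified assumptions, not Scylla's LSA internals:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// An allocation just advances a pointer in the current segment; a free only
// records a hole, which compaction reclaims later by evacuating the
// sparsest segments.
const size_t segment_size = 1024;

struct segment {
    size_t used = 0;   // bump pointer: next free offset
    size_t live = 0;   // bytes still alive (used minus holes)
};

struct lsa {
    std::vector<segment> segments;

    // Returns (segment index, offset) of the new object.
    std::pair<size_t, size_t> alloc(size_t bytes) {
        if (segments.empty() || segments.back().used + bytes > segment_size) {
            segments.push_back(segment{});   // current segment full: open a new one
        }
        segment& s = segments.back();
        size_t off = s.used;
        s.used += bytes;                     // bump the pointer
        s.live += bytes;
        return std::make_pair(segments.size() - 1, off);
    }

    // A free just leaves a hole; the space comes back via compaction.
    void free(size_t seg, size_t bytes) {
        segments[seg].live -= bytes;
    }
};
```

A segment whose `live` count drops far below `used` is mostly holes, which is exactly what makes it a cheap compaction target.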

Compacting LSA

● Teach allocator how to move objects around
○ Updating references
● Garbage collect
○ Starting with the most sparse segments
○ Lock to pin objects
● Used mostly for the cache
○ Large majority of memory allocated
○ Small subset of allocation sites

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Workload Conditioning

● Internal feedback loops to balance competing loads
○ Consume what you export

(Diagram: the Seastar scheduler runs Compaction, Query, Repair, Memtable, and Commitlog work over the SSD, WAN, and CPU. A compaction backlog monitor and a memory monitor observe the system and feed back into the scheduler to adjust priorities.)
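One such feedback loop can be sketched as a controller that maps compaction backlog to scheduler shares. The linear gain and the bounds are made-up numbers for illustration, not Scylla's actual controller:

```cpp
#include <algorithm>

// The compaction backlog monitor reports how far compaction has fallen
// behind; the controller turns that into scheduler shares, so background
// work speeds up under backlog and yields to queries when caught up.
struct compaction_controller {
    double min_shares = 50;     // compaction stays polite when caught up
    double max_shares = 1000;   // compaction dominates when close to danger

    // backlog: 0.0 = fully caught up, 1.0 = at the danger point.
    double shares_for(double backlog) const {
        double s = min_shares + backlog * (max_shares - min_shares);
        return std::min(std::max(s, min_shares), max_shares);
    }
};
```

Because the scheduler consumes what the monitors export, no operator tuning knob is needed: the loop settles wherever the workload pushes it.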

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Workload Conditioning
❏ Closing

Conclusions

● Careful system design and control of the software stack can maximize throughput
● Without sacrificing latency
● Without requiring complex end-user tuning
● While having a lot of fun

How to interact

● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: [email protected]

SCYLLA, NoSQL at Ludicrous Speed

Thank you.

@duarte_nunes