SCYLLA: NoSQL at Ludicrous Speed

Duarte Nunes @duarte_nunes


AGENDA

❏ Introducing ScyllaDB
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

ScyllaDB

● Clustered NoSQL database compatible with Apache Cassandra

● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self-tuning
● Mechanically sympathetic C++14

YCSB Benchmark: 3-node Scylla cluster vs. 3, 9, 15, and 30 Cassandra machines

(Chart: throughput of 3 Scylla nodes compared with 3 and 30 Cassandra nodes)

Scylla vs Cassandra - CL:LOCAL_QUORUM, Outbrain Case Study

Scylla and Cassandra handling the full load (peak of ~12M RPM)

(Chart: Cassandra at ~200ms vs. Scylla at ~10ms: 20x lower latency)

Scylla benchmark by Samsung

(Chart: throughput in op/s)

Full report: http://tinyurl.com/msl-scylladb

Dynamo-based system

Data model

(Diagram: a partition key maps to rows identified by clustering keys; rows within a partition are sorted by primary key)

CREATE TABLE playlists (
    id int,
    song_id int,
    title text,
    PRIMARY KEY (id, song_id)
);
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
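The layout above can be sketched with ordinary containers. This is an illustrative model only, not Scylla's storage format; the `insert` helper and the container choices are assumptions:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Illustrative model: the partition key selects a partition; within a
// partition, rows are kept sorted by clustering key, so slicing a
// partition is an ordered scan.
using clustering_key = int;                        // song_id in the example
using row = std::string;                           // title
using partition = std::map<clustering_key, row>;   // sorted by clustering key
using table = std::unordered_map<int, partition>;  // keyed by partition key

// Mirrors: INSERT INTO playlists (id, song_id, title) VALUES (pk, ck, r);
void insert(table& t, int pk, clustering_key ck, const row& r) {
    t[pk][ck] = r;
}
```

Rows inserted in any order come back sorted by clustering key when the partition is scanned.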

Log-Structured Merge Tree

(Diagram, over time: memtable flushes produce SSTables 1, 2, 3, …; a background job compacts SSTables 1+2+3 into a single table while new SSTables 4 and 5 continue to arrive. Legend: foreground job vs. background job.)
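The read and compaction behavior above can be modeled in a few lines. This is a sketch under the assumption that each SSTable is a sorted map; `lsm_read` and `compact` are hypothetical helpers, not Scylla's code:

```cpp
#include <map>
#include <string>
#include <vector>

// Model each SSTable as an immutable sorted map. Newer tables shadow
// older ones, so a read probes newest-first.
using sstable = std::map<int, std::string>;

// Returns a pointer to the value, or nullptr on a miss.
const std::string* lsm_read(const std::vector<sstable>& tables, int key) {
    // `tables` is ordered oldest -> newest; scan newest first.
    for (auto it = tables.rbegin(); it != tables.rend(); ++it) {
        auto hit = it->find(key);
        if (hit != it->end()) {
            return &hit->second;
        }
    }
    return nullptr;
}

// Background compaction merges several tables into one; newer values win.
sstable compact(const std::vector<sstable>& tables) {
    sstable merged;
    for (const auto& t : tables) {         // iterate oldest -> newest
        for (const auto& kv : t) {
            merged[kv.first] = kv.second;  // newer tables overwrite older values
        }
    }
    return merged;
}
```

Compacting SSTables 1+2+3 this way yields one table with the same read results, which is why the merge can safely run in the background.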

Request path

(Diagram: writes go to the commit log and to the memtable; reads are served from the memtable and the SSTables.)
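A minimal sketch of that write path, assuming a toy `node` with a fixed `memtable_limit` (both names are illustrative, not Scylla's actual structures):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: a write is first appended to the commit log (for durability),
// then applied to the in-memory memtable; a full memtable is sealed into
// an immutable SSTable, after which the logged writes can be discarded.
struct node {
    std::vector<std::string> commit_log;            // append-only durability log
    std::map<int, std::string> memtable;            // in-memory sorted write buffer
    std::vector<std::map<int, std::string>> sstables;
    size_t memtable_limit = 2;                      // tiny, for illustration

    void write(int key, const std::string& value) {
        commit_log.push_back(std::to_string(key) + ":" + value);  // 1. log it
        memtable[key] = value;                                    // 2. buffer it
        if (memtable.size() >= memtable_limit) {
            flush();                                              // 3. spill to disk
        }
    }

    void flush() {
        sstables.push_back(memtable);  // seal the memtable as a new SSTable
        memtable.clear();
        commit_log.clear();            // logged writes are now persisted
    }
};
```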

Implementation Goals

● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control:
○ Spend the cycles on what we want, when we want

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Enter Seastar www.seastar-project.org

● Thread-per-core design (shard)
○ No blocking. Ever.
● Asynchronous networking, file I/O, multicore
● Future/promise based APIs
● Usermode TCP/IP stack included in the box
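What thread-per-core sharding means in practice can be sketched as follows. This is a simplification in which shards are plain structs and mailboxes are vectors; `owner_of` and `submit_to` here are stand-ins for Seastar's sharding and `smp::submit_to`, not its API:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Every key is owned by exactly one shard, so a shard touches its own data
// without locks; work destined for another shard is delivered as a message
// rather than through shared memory.
struct shard {
    std::vector<std::function<void()>> mailbox;  // pending cross-shard messages
};

// Deterministic owner: the same key always maps to the same shard.
size_t owner_of(int key, size_t n_shards) {
    return std::hash<int>{}(key) % n_shards;
}

// Run `fn` on the owning shard: immediately if we are already there,
// otherwise enqueue it on that shard's mailbox.
void submit_to(std::vector<shard>& shards, size_t current, int key,
               std::function<void()> fn) {
    size_t owner = owner_of(key, shards.size());
    if (owner == current) {
        fn();
    } else {
        shards[owner].mailbox.push_back(std::move(fn));
    }
}
```

Because only the owner ever touches its data, there is nothing to block on: cross-shard work queues up and the owner drains its mailbox on its own schedule.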

Seastar task scheduler

Traditional stack: a thread is a function pointer; a stack is a byte array from 64k to megabytes. A scheduler multiplexes threads onto the CPU. Context switch cost is high, and large stacks pollute the caches.

Seastar stack: a promise is a pointer to an eventually computed value; a task is a pointer to a lambda function. The scheduler runs tasks directly on the CPU: no sharing, millions of parallel events.

Seastar memcached

Pedis https://github.com/fastio/pedis

Futures

future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    return smp::submit_to(core, [id] {
        return lookup(id);
    }).then([this] (sstring result) {
        return _conn->write(result);
    });
});

No escaping the monad

future<> f = …;
f.get(); // not allowed

Unless...

future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});

Seastar memory allocator

● Non-thread-safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ Allocator calls a callback when low on memory
○ Scylla evicts cache in response
● Inter-core free() through message passing
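Inter-core free() through message passing can be sketched like this. It is an assumed simplification in which cores are indices and the owner of each allocation is tracked explicitly:

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Each core has a private pool; memory allocated on core A and freed on
// core B is not returned in place, but posted back to A's queue, so the
// pools themselves never need locks.
struct core_pool {
    std::vector<void*> free_queue;   // pointers posted back by other cores
    size_t live_allocs = 0;

    void* alloc(size_t bytes) {
        ++live_allocs;
        return ::operator new(bytes);
    }
    void free_local(void* p) {
        --live_allocs;
        ::operator delete(p);
    }
    void drain() {                   // the owner periodically processes the queue
        for (void* p : free_queue) {
            free_local(p);
        }
        free_queue.clear();
    }
};

// Called on core `current`; if the memory belongs to another core, message it.
void cross_core_free(std::vector<core_pool>& cores, size_t current,
                     size_t owner, void* p) {
    if (owner == current) {
        cores[owner].free_local(p);
    } else {
        cores[owner].free_queue.push_back(p);  // message passing, no locks
    }
}
```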

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Usermode I/O scheduler

(Diagram: the traditional path goes from the application through the filesystem and block layer to storage, with the kernel's disk I/O scheduler arbitrating between classes A and B. In Seastar, Query, Commitlog, and Compaction each get their own queue in front of a userspace I/O scheduler, which dispatches to the disk.)
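One way a userspace scheduler can arbitrate between such classes is share-based accounting. This is a sketch; the share values and the charging rule are assumptions, not Scylla's exact policy:

```cpp
#include <queue>
#include <string>
#include <vector>

// Each class (query, commitlog, compaction) gets a share; dispatch picks the
// class with the lowest accumulated normalized cost, so high-share classes
// are served proportionally more often.
struct io_class {
    std::string name;
    double shares;              // relative priority
    double accumulated = 0;     // cost charged so far, normalized by shares
    std::queue<int> requests;   // pending request ids
};

// Returns the id of the dispatched request, or -1 if everything is idle.
int dispatch(std::vector<io_class>& classes) {
    io_class* best = nullptr;
    for (auto& c : classes) {
        if (!c.requests.empty() && (!best || c.accumulated < best->accumulated)) {
            best = &c;
        }
    }
    if (!best) return -1;
    int id = best->requests.front();
    best->requests.pop();
    best->accumulated += 1.0 / best->shares;  // charge one unit of disk work
    return id;
}
```

With shares 4:1, the query class gets roughly four dispatches for every compaction dispatch while both queues are backed up.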

Figuring out optimal disk concurrency

(Chart: throughput vs. concurrency, flattening at the max useful disk concurrency)

Cassandra cache

(Diagram: the Linux page cache sits between Cassandra and the SSTables)

● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand:
○ Exists
○ Hundreds of man-years
○ Handling lots of edge cases

Cassandra cache

● Parasitic rows

(Diagram: an SSTable page is 4k even when your data is only 300b)

Cassandra cache

● Page faults

(Diagram: the app thread page-faults and is suspended; the kernel initiates I/O, with a context switch; the SSD completes the I/O, raising an interrupt and another context switch; the kernel maps the page and resumes the thread.)

Cassandra cache

● Complex tuning

(Diagram: an on-heap/off-heap key cache and row cache inside Cassandra, plus the Linux page cache over the SSTables)

Scylla cache

(Diagram: a single unified cache over the SSTables)

Probabilistic Cache Warmup

● A replica with a cold cache should be sent fewer requests
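A possible mechanism for this (the ramp policy here is an assumption, not necessarily Scylla's) is to pick read replicas with probability proportional to cache warmth:

```cpp
#include <cstddef>
#include <vector>

// A restarted replica starts with a small probability of being chosen for
// reads and earns more traffic as its cache warms, instead of being hit
// at full load while cold.
struct replica {
    double warmth;  // 0.0 = cold cache, 1.0 = fully warmed
};

// Pick a replica with probability proportional to its warmth.
// `r` is a uniform random number in [0, 1).
size_t pick_replica(const std::vector<replica>& replicas, double r) {
    double total = 0;
    for (const auto& rep : replicas) total += rep.warmth;
    double x = r * total;
    for (size_t i = 0; i < replicas.size(); ++i) {
        x -= replicas[i].warmth;
        if (x < 0) return i;
    }
    return replicas.size() - 1;
}
```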

Yet another allocator (Problems with malloc/free)

● Memory gets fragmented over time
○ If the workload changes sizes of allocated objects
○ Allocating a large contiguous block requires evicting most of the cache

(Animation: allocations of varying sizes fragment memory until a large contiguous allocation fails - OOM :( - despite enough total free space.)

Log-structured memory allocation

● Bump-pointer allocation to current segment
● Frees leave holes in segments
● Compaction will try to solve this
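The three bullets above can be sketched as follows; the segment size and the bookkeeping are simplified assumptions, not Scylla's LSA internals:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// An allocation just advances a pointer in the current segment; a free only
// records a hole, which compaction reclaims later by evacuating the
// sparsest segments.
const size_t segment_size = 1024;

struct segment {
    size_t used = 0;   // bump pointer: next free offset
    size_t live = 0;   // bytes still alive (used minus holes)
};

struct lsa {
    std::vector<segment> segments;

    // Returns (segment index, offset) of the new object.
    std::pair<size_t, size_t> alloc(size_t bytes) {
        if (segments.empty() || segments.back().used + bytes > segment_size) {
            segments.push_back(segment{});   // current segment full: open a new one
        }
        segment& s = segments.back();
        size_t off = s.used;
        s.used += bytes;                     // bump the pointer
        s.live += bytes;
        return std::make_pair(segments.size() - 1, off);
    }

    // A free just leaves a hole; the space comes back via compaction.
    void free(size_t seg, size_t bytes) {
        segments[seg].live -= bytes;
    }
};
```

A segment whose `live` count drops far below `used` is mostly holes, which is exactly what makes it a cheap compaction target.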

Compacting LSA

● Teach allocator how to move objects around
○ Updating references
● Garbage collect
○ Starting with the most sparse segments
○ Lock to pin objects
● Used mostly for the cache
○ Large majority of memory allocated
○ Small subset of allocation sites

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing

Workload Conditioning

● Internal feedback loops to balance competing loads
○ Consume what you export

(Diagram: the Seastar scheduler runs Compaction, Query, Repair, Memtable, and Commitlog work over the SSD, WAN, and CPU. A compaction backlog monitor and a memory monitor observe the system and feed back into the scheduler to adjust priorities.)
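One such feedback loop can be sketched as a controller that maps compaction backlog to scheduler shares. The linear gain and the bounds are made-up numbers for illustration, not Scylla's actual controller:

```cpp
#include <algorithm>

// The compaction backlog monitor reports how far compaction has fallen
// behind; the controller turns that into scheduler shares, so background
// work speeds up under backlog and yields to queries when caught up.
struct compaction_controller {
    double min_shares = 50;     // compaction stays polite when caught up
    double max_shares = 1000;   // compaction dominates when close to danger

    // backlog: 0.0 = fully caught up, 1.0 = at the danger point.
    double shares_for(double backlog) const {
        double s = min_shares + backlog * (max_shares - min_shares);
        return std::min(std::max(s, min_shares), max_shares);
    }
};
```

Because the scheduler consumes what the monitors export, no operator tuning knob is needed: the loop settles wherever the workload pushes it.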

AGENDA

❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Workload Conditioning
❏ Closing

Conclusions

● Careful system design and control of the software stack can maximize throughput
● Without sacrificing latency
● Without requiring complex end-user tuning
● While having a lot of fun

How to interact

● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: [email protected]

SCYLLA, NoSQL at Ludicrous Speed

Thank you.

@duarte_nunes