Cloudius Systems presents - Percona · Cloudius Systems presents: NoSQL goes NATIVE Avi Kivity...

Preview:

Citation preview

Cloudius Systems presents:

NoSQL goes NATIVEAvi Kivity (@AviKivity)

Percona Live 2016

Capable of 1,000,000 operations per secondPER NODE

With predictable, low latenciesCompatible with Apache Cassandra

Scylla: A new NoSQL Database

BACKGROUND

SQL: Structured, no scale

Document store: No structureSome scale

Column store: Some structureScale outAwesome HA/DR

Key-value: SimpleScaleNot a real DB

YESSCALE UP OR SCALE OUT?

CPUs AND DISKS ARE NETWORK DEVICES

THE POWER OFCASSANDRA AT THE SPEED OF REDIS

SOLUTION: SCYLLA DB

AWESOME REDUNDANCY & HA

+ Multi DC

+ Spark

+ CQL

+ Auto sharding

+ Wide rows

RESULTS: THROUGHPUT/SCALE UP

LATENCY - AVG

LATENCY - P99

FULLY COMPATIBLE

SELF TUNING

No need to guess thread pool sizes

No need to guess compaction/streaming bandwidth

No fiddling with heap sizes, on/off heap

No endless tweaking of Garbage Collector parameters

WHAT WOULD YOU DO WITH 1 MILLION TPS?

Shrink your cluster by a factor of X10

Handle 10X traffic spikes on Black Friday

Model your data instead of use Cassandra as K/V

Get the most out of your data - Run more queries

Maintain your clusters while serving

Stop using caches in front of the database

TECHNOLOGY:HOW IT WORKS

SeaStar

Before: Thread model After: SeaStar shards

SCYLLA IS QUITE DIFFERENT

Shard-per-core, no locks, no threads, zero-copy

Reactor programing with C++14

Our own efficient,DB-aware cache, not using Linux page cache

Better storage engine than cassandra

Max out all HW resources - NUMA friendly, multiqueue NICs, etc

SCYLLA DB: NETWORK COMPARISON

Kernel

Cassandra

TCP/IPScheduler

queuequeuequeuequeuequeue

threads

NIC

Queues

Kernel

Traditional stack SeaStar’s sharded stack

Memory

Application

TCP/I

P

Task Scheduler

queuequeuequeuequeuequeue

smp queue

NIC

Queue

DPDK

Kernel

(isn’t

involved)

Userspace

Application

TCP/I

P

Task Scheduler

queuequeuequeuequeuequeue

smp queue

NIC

Queue

DPDK

Kernel

(isn’t

involved)

Userspace

Application

TCP/I

P

Task Scheduler

queuequeuequeuequeuequeue

smp queue

NIC

Queue

DPDK

Kernel

(isn’t

involved)

Userspace

Core

Database

TCP/I

P

Task Scheduler

queuequeuequeuequeuequeue

smp queue

NIC

Queue

DPDK

Kernel

(isn’t

involved)

Userspace

Scylla has its own task schedulerTraditional stack Scylla’s stack

Promise

Task

Promise

Task

Promise

Task

Promise

Task

CPU

Promise

Task

Promise

Task

Promise

Task

Promise

Task

CPU

Promise

Task

Promise

Task

Promise

Task

Promise

Task

CPU

Promise

Task

Promise

Task

Promise

Task

Promise

Task

CPU

Promise

Task

Promise

Task

Promise

Task

Promise

Task

CPU

Promise is a

pointer to

eventually

computed value

Task is a

pointer to a

lambda function

Scheduler

CPU

Scheduler

CPU

Scheduler

CPU

Scheduler

CPU

Scheduler

CPU

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Threa

d

Stack

Thread is a

function pointer

Stack is a byte

array from 64k

to megabytes

Unified cacheCassandra Scylla

Key cache

Row cache

On-heap /Off-heap

Linux page cache

SSTables

Unified cache

SSTables

TuningParasitic rowsPage faults

total, 14825839, 25002, 25002, 25002, 0.5, 0.3, 0.5, 5.0, 12.8, 22.2, 592.6, 0.00076

total, 14851605, 24980, 24980, 24980, 0.5, 0.3, 0.6, 6.7, 12.9, 21.6, 593.6, 0.00076

total, 14877443, 25004, 25004, 25004, 0.5, 0.3, 0.5, 6.5, 17.5, 38.8, 594.7, 0.00076

total, 14903361, 25017, 25017, 25017, 0.5, 0.3, 0.5, 6.9, 29.9, 39.9, 595.7, 0.00076

total, 14927655, 23553, 23553, 23553, 4.6, 0.3, 34.3, 66.8, 203.0, 255.9, 596.7, 0.00076

total, 14956055, 26384, 26384, 26384, 5.0, 0.4, 27.2, 53.9, 81.5, 99.9, 597.8, 0.00077

total, 14981910, 24987, 24987, 24987, 0.5, 0.3, 0.7, 6.2, 13.5, 25.0, 598.8, 0.00077

total, 15007673, 25003, 25003, 25003, 0.4, 0.3, 0.5, 3.7, 12.5, 24.5, 599.9, 0.00077

total, 15033484, 25006, 25006, 25006, 0.4, 0.3, 0.5, 3.8, 12.4, 32.8, 600.9, 0.00077

total, 15059256, 25004, 25004, 25004, 0.4, 0.3, 0.5, 2.2, 14.9, 33.0, 601.9, 0.00076

total, 15085126, 24994, 24994, 24994, 0.4, 0.3, 0.5, 2.4, 10.4, 19.4, 603.0, 0.00076

total, 15110948, 24988, 24988, 24988, 0.5, 0.3, 0.6, 4.1, 10.3, 19.9, 604.0, 0.00076

Compatibility (and speed): Repair

Jan 13 17:33:44 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbbf8dc1-ba1b-11e5-a51a-00000000000b ID#0]

Creating new streaming plan for repair-in

Jan 13 17:33:44 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbbf8dc1-ba1b-11e5-a51a-00000000000b ID#0]

Received streaming plan for repair-in

Jan 13 17:33:45 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbc02a01-ba1b-11e5-a51a-00000000000b ID#0]

Creating new streaming plan for repair-in

Jan 13 17:33:45 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbc02a01-ba1b-11e5-a51a-00000000000b ID#0]

Received streaming plan for repair-in

I/O Scheduling

Query

Commitlog

Compaction

Queue

Queue

Queue

I/O

SchedulerDisk

STATUS

1.0 Released

Major CQL feature support

Good match for upcoming Non Volatile Memory technology

High availability with gossip and flexible replication

Runs everywhere: Physical, virtual, containers

Can be integrated with microservices with its own httpd

❏Build a community❏Core database improvements❏VERTICAL: Spark, Solr, distributed SQL engines❏HORIZONTAL: Microservice integration, more

protocols

WHAT’S NEXT?

SCYLLA, NoSQL GOES NATIVEThank you.

Basic model

■ Futures

■ Promises

■ Continuations

F-P-C defined: Future

A future is a result of a computation

that may not be available yet. ■ Data buffer from the network

■ Timer expiration

■ Completion of a disk write

■ Result computation that requires the values from one or

more other futures.

F-P-C defined: Promise

A promise is an object or function

that provides you with a future, with

the expectation that it will fulfil the

future.

F-P-C defined: Continuation

A continuation is a computation that

is executed when a future becomes

ready (yielding a new future).

Basic future/promise

future<int> get(); // promises an int will be produced eventuallyfuture<> put(int) // promises to store an int

void f() {get().then([] (int value) {

put(value + 1).then([] {std::cout << "value stored successfully\n";

});});

}

Chaining

future<int> get(); // promises an int will be produced eventuallyfuture<> put(int) // promises to store an int

void f() {get().then([] (int value) {

return put(value + 1);}).then([] {

std::cout << "value stored successfully\n";});

}

Parallelism

void f() {

std::cout << "Sleeping... " << std::flush;

using namespace std::chrono_literals;

sleep(200ms).then([] { std::cout << "200ms " << std::flush; });

sleep(100ms).then([] { std::cout << "100ms " << std::flush; });

sleep(1s).then([] { std::cout << "Done.\n"; engine_exit(); });

}

Zero copy friendly

future<temporary_buffer> connected_socket::read(size_t n);

■ temporary_buffer points at driver-provided pages if possible

■ discarded after use

Zero copy friendly (2)

future<size_t>connected_socket::write(temporary_buffer);

■ Future becomes ready when TCP window allows sending more data (usually immediately)

■ temporary_buffer discarded after data is ACKed■ can call delete[] or decrement a reference count

Dual Networking Stack

Networking API

Seastar (native) Stack POSIX (hosted) stack

Linux kernel (sockets)

User-space TCP/IP

Interface layer

DPDK

Virtio Xen

igb ixgb

Disk I/O

■ Zero copy using Linux AIO and O_DIRECT■ Some operations using worker threads (open()

etc.)■ Plans for direct NVMe support

Recommended