
Modern Databases: The Return of SQL with the Scaling Power of NoSQL

Dhruba Borthakur / CTO and Co-founder of Rockset

DeveloperWeek 2020

About Me

● CTO and Co-Founder of Rockset
● Founding eng of RocksDB @ Facebook
● Founding eng of Hadoop FS @ Yahoo
● HBase & Hive developer
● Andrew File System (AFS) developer
● Univ of Wisconsin, Madison student

Two flavors of databases for analytics

SQL databases

● well-defined standard query language
● structured data
● complex analytics
● requires substantial data cleaning and preparation

NoSQL databases

● scalable
● fast
● no upfront schemas
● not tuned for complex analytics
● may need to learn a new query language

The modern database takes the best aspects of SQL and NoSQL

SQL on NoSQL

Modern databases have these characteristics

Schema with every record rather than every table, to handle semi-structured data

Declarative way to express complex logic, for query complexity and widespread adoption

Distributed systems, for scale and efficiency

Low latency, for fast insights on fresh data


There are 2 additional considerations for modern databases

1. Data complexity: indexing for low query latency

2. Data volume: decoupling ingest and query compute for low data latency

Enter Rockset: The Real-time Database in the

This talk will cover the architectural decisions behind Rockset including:

1. Relational document model

2. Smart schemas: A schema with every record

3. Converged indexing: 3-way indexes on your data

4. Aggregator Leaf Tailer Architecture (ALT): Cloud-native scaling for low data latency

Relational document model

Document vs. relational models

Document Model

Relational Model

Relational diagram taken from Adam Adamowicz “How to create ER diagram for existing SQL server database with SSMS”

Rockset’s relational document data model

Design characteristics of the relational document data model

● No upfront schema required before adding records to the database
● No data denormalization (natively supports the semi-structured formats JSON, Parquet, XML, CSV, and TSV)
● All data is indexed
● Custom SQL extensions for dealing with nested documents and arrays

Converged Indexing for low query latency

Traditional Databases: 1 Database = 1 Index

● Druid: columnar index
● Elastic: inverted index
● Redis: row index

[Diagram labels: Data, ETL, Query]

ISSUES

1. ETL pain
2. Consistency battles
3. Replication across 3 systems
4. Query across 3 systems
5. Manage indexes

What is converged indexing?

Rockset stores every column of every document in a row-based index, column-based index, AND an inverted index.

● No need to configure and maintain indexes
● No slow queries because of missing indexes

INDEX ALL FIELDS

Converged indexing under the hood

● Columnar and inverted indexes in the same system
● Built on top of key-value store abstraction
● Each document maps to many key-value pairs

<doc 0> { "name": "Igor" }
<doc 1> { "name": "Dhruba" }

Key               Value     Index
R.0.name          Igor      Row store
R.1.name          Dhruba    Row store
C.name.0          Igor      Column store
C.name.1          Dhruba    Column store
S.name.Dhruba.1             Inverted index
S.name.Igor.0               Inverted index
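The key layout above can be sketched in a few lines of Python. This is only an illustration of the R./C./S. key prefixes shown on the slide, for flat documents; the function name `document_to_kv` is hypothetical and this is not Rockset's actual encoding.

```python
def document_to_kv(doc_id, doc):
    """Return the key-value pairs one flat document contributes to each index."""
    pairs = []
    for field, value in doc.items():
        pairs.append((f"R.{doc_id}.{field}", value))          # row store
        pairs.append((f"C.{field}.{doc_id}", value))          # column store
        pairs.append((f"S.{field}.{value}.{doc_id}", None))   # inverted index: key only
    return pairs


docs = [{"name": "Igor"}, {"name": "Dhruba"}]

# Sort by key, as an ordered key-value store would keep them.
kv = sorted(
    (pair for doc_id, doc in enumerate(docs) for pair in document_to_kv(doc_id, doc)),
    key=lambda pair: pair[0],
)

for key, value in kv:
    print(key, "" if value is None else value)
```

Because keys with the same prefix sort together, a scan over the `C.name.` prefix reads one column, while a scan over `S.name.Igor.` finds the matching document IDs.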

Inverted index for point lookups

● For each value, store documents containing that value (posting list)
● Quickly retrieve a list of document IDs that match a predicate

<doc 0> { "name": "Igor", "interests": ["databases", "snowboarding"], "last_active": 2019/3/15 }
<doc 1> { "name": "Dhruba", "interests": ["cars", "databases"], "last_active": 2019/3/22 }

"name"
  Dhruba          1
  Igor            0

"interests"
  databases       0.0; 1.1
  cars            1.0
  snowboarding    0.1

"last_active"
  2019/3/15       0
  2019/3/22       1
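A minimal Python sketch of posting lists, assuming the two example documents from the slide. Unlike the slide, which also records array positions (for example 0.0 and 1.1), this sketch keeps only document IDs. The helper names `build_inverted_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

docs = {
    0: {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    1: {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
}


def build_inverted_index(docs):
    """Map each (field, value) pair to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(field, v)].add(doc_id)
    return index


def lookup(index, **predicates):
    """Return IDs matching every field = value predicate by intersecting posting lists."""
    postings = [index.get((field, value), set()) for field, value in predicates.items()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(docs)
print(lookup(index, interests="databases"))
print(lookup(index, interests="databases", name="Igor"))
```

A conjunction of equality predicates becomes an intersection of posting lists, which is why highly selective lookups stay cheap.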

Columnar index for aggregations

● Store each column separately
● Great compression
● Only fetch columns the query needs

<doc 0> { "name": "Igor", "interests": ["databases", "snowboarding"], "last_active": 2019/3/15 }
<doc 1> { "name": "Dhruba", "interests": ["cars", "databases"], "last_active": 2019/3/22 }

"name"
  0      Igor
  1      Dhruba

"interests"
  0.0    databases
  0.1    snowboarding
  1.0    cars
  1.1    databases

"last_active"
  0      2019/3/15
  1      2019/3/22
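As a rough illustration of "only fetch columns the query needs", the sketch below splits the same two documents into per-field columns and runs an aggregation that reads a single column. The layout is a plain Python dict chosen for clarity; `build_column_store` is a hypothetical name, not a Rockset API.

```python
from collections import Counter

docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
]


def build_column_store(docs):
    """Store each field separately as {doc_id: value}."""
    columns = {}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            columns.setdefault(field, {})[doc_id] = value
    return columns


columns = build_column_store(docs)

# Aggregation over one column: "name" and "last_active" are never touched.
interest_counts = Counter(
    interest for interests in columns["interests"].values() for interest in interests
)
print(interest_counts.most_common())
```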

Query optimizer selects the index

● Low latency for both highly selective queries and large scans
● Optimizer picks between the columnar store and the inverted index

Inverted index (for highly selective queries):

SELECT * FROM search_logs WHERE keyword = 'hpts' AND locale = 'en'

Columnar store (for large scans):

SELECT keyword, count(*) FROM search_logs GROUP BY keyword ORDER BY count(*) DESC
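One way to picture the optimizer's choice is a selectivity rule: use the inverted index when few rows are expected to match, and the columnar store otherwise. The sketch below is an assumption made for illustration only. The function `choose_index`, the row estimates, and the 5% threshold are all hypothetical, and the talk does not describe how Rockset's optimizer actually decides.

```python
def choose_index(estimated_matching_rows, total_rows, threshold=0.05):
    """Pick an access path from estimated selectivity (hypothetical rule)."""
    if total_rows == 0:
        return "column store"
    selectivity = estimated_matching_rows / total_rows
    # Few matching rows: follow posting lists. Many matching rows: scan the column.
    return "inverted index" if selectivity <= threshold else "column store"


# WHERE keyword = 'hpts' AND locale = 'en' is expected to match very few rows.
print(choose_index(estimated_matching_rows=120, total_rows=10_000_000))

# GROUP BY keyword has no filter, so every row is read.
print(choose_index(estimated_matching_rows=10_000_000, total_rows=10_000_000))
```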

Smart schemas for schematizing every record rather than every table

Schemas change with time in the modern world

The modern world needs to:

1) handle schema changes automatically
2) avoid data cleaning with schemas

What are smart schemas?

Automatic generation of a schema based on the exact fields and types present at the time of ingest. There is no need to ingest the data with a predefined schema (i.e., schemaless ingest).

● No need to manage pipelines that cause data latency
● Handles semi-structured data (adopting best practices from NoSQL)
● Makes it easy to read the data (using standard SQL)
● Filters the data at query time (no need for data cleaning)
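A small sketch of schema generation at ingest time: each record is accepted as-is, and the schema is simply the set of types observed for every field. The sample records and the function `infer_schema` are hypothetical, and Python type names stand in for whatever type system the database uses.

```python
from collections import defaultdict


def infer_schema(records):
    """Collect, per field, the type names seen across all ingested records."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


records = [
    {"name": "Jonathan", "age": 35},
    # A later record adds a field and changes the type of "age"; ingest still succeeds.
    {"name": "Katherine", "age": "25", "city": "Rancho Santa Margarita"},
]

schema = infer_schema(records)
print(schema)
```

A field that holds more than one type is recorded as such, so mixed data can be filtered at query time instead of being cleaned before ingest.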

Smart schemas under the hood

● Type information stored with values not columns

● Strongly typed queries on dynamically typed fields

● Designed for nested semi-structured data
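To illustrate "type information stored with values not columns", the sketch below tags every value with a one-letter type code and then runs a strongly typed aggregation that only considers integer values. The tag letters follow the I/S notation used later in the slides; the `B` tag, the sample data, and the helpers `tag` and `typed_sum` are assumptions for this example.

```python
def tag(value):
    """Attach a type tag to a single value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return ("B", value)
    if isinstance(value, int):
        return ("I", value)
    if isinstance(value, str):
        return ("S", value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


# One dynamically typed field, as it might arrive from messy input.
age_column = [tag(v) for v in [35, "25", 41, "unknown"]]


def typed_sum(column):
    """Sum only the values whose tag is integer; other types are skipped."""
    return sum(value for type_tag, value in column if type_tag == "I")


print(typed_sum(age_column))
```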

Challenges with Smart Schemas

1. Requires increased disk space to store types
2. Requires additional CPU usage for processing queries

Field interning reduces the space required to store schemas

Strict Schema (Relational DB)

Schema:
  City: String
  name: String
  age: Int

Data storage:
  Rancho Santa Margarita    Jonathan     35
  Rancho Santa Margarita    Katherine    25

Schemaless (JSON DB)

  "City": "Rancho Santa Margarita", "name": "Jonathan", "age": 35
  "City": "Rancho Santa Margarita", "name": "Katherine", "age": 25

Rockset (with field interning), ~30% overhead

  0: S "City"
  1: S "Jonathan"
  2: S "Katherine"
  3: I 35
  4: I 25

  City: 0    name: 1    age: 3
  City: 0    name: 2    age: 4
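The interning idea in the figure can be sketched as a shared table of typed entries, with each record storing only small integer indexes into it. This is a simplified reading of the slide and assumes the table holds the typed values. The function `intern_records` is hypothetical, and the index numbers depend on insertion order, so they differ from the figure.

```python
def intern_records(records):
    """Replace each value with an index into a shared table of (type tag, value) entries."""
    table = []       # index -> (type tag, value)
    positions = {}   # (type tag, value) -> index
    encoded = []
    for record in records:
        row = {}
        for field, value in record.items():
            entry = ("I" if isinstance(value, int) else "S", value)
            if entry not in positions:
                positions[entry] = len(table)
                table.append(entry)
            row[field] = positions[entry]
        encoded.append(row)
    return table, encoded


records = [
    {"City": "Rancho Santa Margarita", "name": "Jonathan", "age": 35},
    {"City": "Rancho Santa Margarita", "name": "Katherine", "age": 25},
]

table, encoded = intern_records(records)
print(table)
print(encoded)
```

The repeated city string and its type tag are stored once, and both records point at the same table entry.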

Type hoisting reduces CPU required to run queries

Rockset query performance is on par with strict schema systems

Strict schema

  1 10 7 4 5
  a b c d e

Schemaless

  I 1   I 10   I 7   I 4   I 5
  S a   S b    S c   S d   S e

Rockset (with type hoisting)

  I   1 10 7 4 5
  S   a b c d e
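A minimal sketch of the hoisting step shown in the figure: when every value in a run carries the same type tag, the tag is stored once in front of the run instead of once per value. The function name `hoist` and the fallback for mixed-type runs are assumptions; the talk does not specify how mixed runs are handled.

```python
def hoist(tagged_values):
    """Turn [("I", 1), ("I", 10)] into ("I", [1, 10]) when all tags agree."""
    tags = {type_tag for type_tag, _ in tagged_values}
    if len(tags) == 1:
        return (tags.pop(), [value for _, value in tagged_values])
    return tagged_values  # mixed types: keep the per-value tags


ints = [("I", 1), ("I", 10), ("I", 7), ("I", 4), ("I", 5)]
strs = [("S", "a"), ("S", "b"), ("S", "c"), ("S", "d"), ("S", "e")]

print(hoist(ints))
print(hoist(strs))
```

With the tag hoisted, a query can check the type once per run and then process the values in a tight loop, much as a strict-schema system would.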

Aggregator Leaf Tailer Architecture for cloud scaling AND low data latency

The economics of the cloud

Cost of 1 CPU for 100 minutes = Cost of 100 CPU for 1 minute

➔ Without the cloud: statically provision for peak demand

➔ Without the cloud: If query is slow then add more hardware

➔ With the cloud: dynamically provision for current demand

➔ With the cloud: If query is slow then write software
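The cost equality can be checked with simple arithmetic, assuming purely usage-based pricing. The price per CPU-minute below is a made-up number, and `compute_cost` is a hypothetical helper.

```python
def compute_cost(cpus, minutes, price_per_cpu_minute=0.001):
    """Cost under usage-based pricing: pay for CPU-minutes, however they are arranged."""
    return cpus * minutes * price_per_cpu_minute


serial = compute_cost(cpus=1, minutes=100)
parallel = compute_cost(cpus=100, minutes=1)
print(serial == parallel)
```

Both runs consume 100 CPU-minutes, so the bill is the same, but the parallel run returns its answer in 1 minute instead of 100. That is the incentive to write software that can spread a query across many CPUs.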

Aggregator Leaf Tailer (ALT) Architecture

https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/

How does ALT auto-scale?

Each tailer, leaf, or aggregator can be independently scaled up and down as needed. Tailers scale when there is more data to ingest, leaves scale when the data size grows, and aggregators scale when query volume increases.

● No provisioning
● Pay for usage
● No need to provision for peak capacity
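The independent scaling described above can be sketched as three separate sizing rules, one per tier, each driven by a different signal. The per-node capacities, the `ClusterLoad` fields, and `desired_nodes` are all hypothetical values and names chosen for illustration; they do not come from Rockset.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterLoad:
    ingest_mb_per_s: float   # drives the number of tailers
    data_size_gb: float      # drives the number of leaves
    queries_per_s: float     # drives the number of aggregators


# Hypothetical capacity of a single node in each tier.
TAILER_MB_PER_S = 50.0
LEAF_GB = 500.0
AGGREGATOR_QPS = 100.0


def desired_nodes(load):
    """Size each tier from its own signal, independently of the other two."""
    return {
        "tailers": max(1, math.ceil(load.ingest_mb_per_s / TAILER_MB_PER_S)),
        "leaves": max(1, math.ceil(load.data_size_gb / LEAF_GB)),
        "aggregators": max(1, math.ceil(load.queries_per_s / AGGREGATOR_QPS)),
    }


print(desired_nodes(ClusterLoad(ingest_mb_per_s=120, data_size_gb=1200, queries_per_s=50)))
```

A burst of queries changes only the aggregator count, and a burst of incoming data changes only the tailer count, so no tier has to be provisioned for another tier's peak.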


How leaf nodes scale

Scaling tailers and aggregators is easy because they are stateless. Scaling leaf nodes is tricky because they are stateful.

Scale down leaves: Use durability of cloud storage

[Diagram: Rockset SQL API, two Aggregators, a Tailer, and three leaf nodes, each running RocksDB on RocksDB-Cloud, with SST files stored in Object Storage (AWS S3, GCS, ...)]

Data files are uploaded to the cloud storage

Durability is achieved by keeping data in cloud storage

No data loss even if all leaf servers crash and burn!

Scale up new leaves without peer-to-peer replication

[Diagram: Rockset SQL API, two Aggregators, a Tailer, three leaf nodes running RocksDB on RocksDB-Cloud, and one additional leaf node (RocksDB on RocksDB-Cloud), with SST files stored in Object Storage (AWS S3, GCS, ...)]
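The durability and scale-up ideas can be sketched with an in-memory stand-in for object storage: a leaf uploads every SST file it writes, so a new leaf can be built from storage alone, without copying from a peer. The classes `ObjectStore` and `Leaf`, their methods, and the file names are hypothetical; this does not use or imitate the RocksDB-Cloud API.

```python
class ObjectStore:
    """In-memory stand-in for durable object storage such as S3 or GCS."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def list(self, prefix):
        return {k: v for k, v in self._objects.items() if k.startswith(prefix)}


class Leaf:
    """A leaf node holding local SST files for one shard."""

    def __init__(self, shard, store):
        self.shard = shard
        self.store = store
        self.sst_files = {}

    def flush(self, name, data):
        # Write locally, then upload so the file outlives this node.
        self.sst_files[name] = data
        self.store.put(f"{self.shard}/{name}", data)

    @classmethod
    def from_storage(cls, shard, store):
        # A new leaf loads the shard's files from object storage, not from a peer.
        leaf = cls(shard, store)
        prefix = f"{shard}/"
        for key, data in store.list(prefix).items():
            leaf.sst_files[key[len(prefix):]] = data
        return leaf


store = ObjectStore()
old_leaf = Leaf("shard-0", store)
old_leaf.flush("000001.sst", b"rows 0-99")
old_leaf.flush("000002.sst", b"rows 100-199")

del old_leaf  # simulate the leaf server crashing

new_leaf = Leaf.from_storage("shard-0", store)
print(sorted(new_leaf.sst_files))
```

Because the object store holds the full set of files, removing a leaf loses nothing, and adding a leaf only requires reading from storage.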

Converged Indexing for low query latency

Smart schemas to handle semi-structured data

SQL for query complexity and widespread adoption

Aggregator Leaf Tailer Architecture for efficiency

Thank you

Dhruba Borthakur / dhruba@rockset.com

Start a free trial at rockset.com
