Modern Databases: The Return of SQL with the Scaling Power of NoSQL

Dhruba Borthakur / CTO and Co-founder of Rockset
DeveloperWeek 2020


About Me

● CTO and Co-Founder of Rockset
● Founding eng of RocksDB @ Facebook
● Founding eng of Hadoop FS @ Yahoo
● HBase & Hive developer
● Andrew File System (AFS) developer
● Univ of Wisconsin, Madison student

Two flavors of databases for analytics

SQL databases

● well-defined standard query language
● structured data
● complex analytics
● requires substantial data cleaning and preparation

NoSQL databases

● scalable
● fast
● no upfront schemas
● not tuned for complex analytics
● may need to learn a new query language

The modern database takes the best aspects of SQL and NoSQL

SQL on NoSQL

Modern databases have these characteristics

● Schema with every record rather than every table, to handle semi-structured data
● Declarative way to express complex logic, for query complexity and widespread adoption
● Distributed systems, for scale and efficiency
● Low latency, for fast insights on fresh data

[Diagram labels: SQL databases, NoSQL databases]

There are 2 additional considerations for modern databases

1. Data complexity: Indexing for low query latency

2. Data volume: Decoupling ingest and query compute for low data latency

Enter Rockset

The Real-time Database in the Cloud


This talk will cover the architectural decisions behind Rockset including:

1. Relational document model

2. Smart schemas: A schema with every record

3. Converged indexing: 3-way indexes on your data

4. Aggregator Leaf Tailer Architecture (ALT): Cloud-native scaling for low data latency


Relational document model


Document vs. relational models

[Figure: Document Model vs. Relational Model]

Relational diagram taken from Adam Adamowicz “How to create ER diagram for existing SQL server database with SSMS”


Rockset’s relational document data model


Design characteristics of the relational document data model

● No upfront schema required before adding records to the database
● No data denormalization (natively supports semi-structured formats: JSON, Parquet, XML, CSV, and TSV)
● All data is indexed
● Custom SQL extensions for dealing with nested documents and arrays
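As a rough illustration of the last bullet, the sketch below shows what flattening a nested array into one row per element looks like, which is the kind of operation a SQL extension for arrays typically exposes. The `unnest` helper is a hypothetical name used only for this example, and the sample documents are borrowed from the indexing slides; this is not Rockset's implementation or syntax.

```python
from typing import Any, Dict, Iterator, List


def unnest(docs: List[Dict[str, Any]], field: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of the array stored under `field`.

    Hypothetical helper for illustration only.
    """
    for doc in docs:
        for element in doc.get(field, []):
            # Copy the other fields and replace the array with one element.
            row = {k: v for k, v in doc.items() if k != field}
            row[field] = element
            yield row


docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"]},
    {"name": "Dhruba", "interests": ["cars", "databases"]},
]

rows = list(unnest(docs, "interests"))
for row in rows:
    print(row)
```

Once the array is flattened this way, ordinary relational operators (filters, joins, GROUP BY) can be applied to its elements without denormalizing the stored documents.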

Converged Indexing for low query latency

Traditional Databases: 1 Database = 1 Index

[Diagram labels: Data, ETL, Query]

● Druid: Columnar Index
● Elastic: Inverted Index
● Redis: Row Index

ISSUES
1. ETL Pain
2. Consistency Battles
3. Replication across 3 systems
4. Query across 3 systems
5. Manage indexes

What is converged indexing?

Rockset stores every column of every document in a row-based index, a column-based index, AND an inverted index.

● No need to configure and maintain indexes
● No slow queries because of missing indexes

INDEX ALL FIELDS

Converged indexing under the hood

● Columnar and inverted indexes in the same system
● Built on top of key-value store abstraction
● Each document maps to many key-value pairs

<doc 0> { "name": "Igor" }
<doc 1> { "name": "Dhruba" }

Key                 Value     Index
R.0.name            Igor      Row store
R.1.name            Dhruba    Row store
C.name.0            Igor      Column store
C.name.1            Dhruba    Column store
S.name.Dhruba.1               Inverted index
S.name.Igor.0                 Inverted index
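The mapping from documents to key-value pairs can be sketched in a few lines of Python. This is a minimal illustration that reproduces the key layout shown in the table (`R.` for the row store, `C.` for the column store, `S.` for the inverted index); the `to_key_values` function is a hypothetical name, it assumes flat documents with scalar fields, and it is not Rockset's actual encoding.

```python
from typing import Any, Dict, List, Tuple


def to_key_values(doc_id: int, doc: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Expand one flat document into row, column, and inverted-index entries."""
    pairs: List[Tuple[str, Any]] = []
    for field, value in doc.items():
        pairs.append((f"R.{doc_id}.{field}", value))          # row store
        pairs.append((f"C.{field}.{doc_id}", value))          # column store
        pairs.append((f"S.{field}.{value}.{doc_id}", None))   # inverted index: the key carries everything
    return pairs


docs = [{"name": "Igor"}, {"name": "Dhruba"}]

# A plain dict stands in for the underlying key-value store.
store = dict(kv for doc_id, doc in enumerate(docs) for kv in to_key_values(doc_id, doc))

for key in sorted(store):
    print(key, store[key])
```

Because all three index types share one sorted key space, a prefix scan over `R.`, `C.`, or `S.` keys yields the corresponding access path from the same store.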

Inverted index for point lookups

● For each value, store documents containing that value (posting list)
● Quickly retrieve a list of document IDs that match a predicate

<doc 0> { "name": "Igor", "interests": ["databases", "snowboarding"], "last_active": 2019/3/15 }
<doc 1> { "name": "Dhruba", "interests": ["cars", "databases"], "last_active": 2019/3/22 }

Field            Value           Posting list (doc ID, or doc ID.array position)
"name"           Dhruba          1
"name"           Igor            0
"interests"      databases       0.0; 1.1
"interests"      cars            1.0
"interests"      snowboarding    0.1
"last_active"    2019/3/15       0
"last_active"    2019/3/22       1
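A posting-list structure like the one on this slide can be sketched as follows. This is an illustrative sketch only: `build_inverted_index` is a hypothetical name, dates are kept as plain strings, and array elements are recorded as `doc_id.position` to match the slide's notation.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def build_inverted_index(docs: List[Dict[str, Any]]) -> Dict[Tuple[str, Any], List[str]]:
    """Map each (field, value) pair to the list of places where it occurs."""
    index: Dict[Tuple[str, Any], List[str]] = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            if isinstance(value, list):
                # Array elements are addressed as "doc_id.position".
                for pos, element in enumerate(value):
                    index[(field, element)].append(f"{doc_id}.{pos}")
            else:
                index[(field, value)].append(str(doc_id))
    return index


docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
]

index = build_inverted_index(docs)
databases_postings = index[("interests", "databases")]
igor_postings = index[("name", "Igor")]
print(databases_postings, igor_postings)
```

A predicate such as `name = 'Igor'` then becomes a single dictionary lookup rather than a scan over every document.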

Columnar index for aggregations

● Store each column separately
● Great compression
● Only fetch columns the query needs

<doc 0> { "name": "Igor", "interests": ["databases", "snowboarding"], "last_active": 2019/3/15 }
<doc 1> { "name": "Dhruba", "interests": ["cars", "databases"], "last_active": 2019/3/22 }

Column           Position    Value
"name"           0           Igor
"name"           1           Dhruba
"interests"      0.0         databases
"interests"      0.1         snowboarding
"interests"      1.0         cars
"interests"      1.1         databases
"last_active"    0           2019/3/15
"last_active"    1           2019/3/22
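The columnar layout above can be sketched in Python to show why aggregations only need to touch one column. `build_columns` is a hypothetical helper written for this example, with dates kept as strings; it is not Rockset's storage format.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def build_columns(docs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Store each field separately, keyed by doc ID (or doc ID.array position)."""
    columns: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            if isinstance(value, list):
                for pos, element in enumerate(value):
                    columns[field][f"{doc_id}.{pos}"] = element
            else:
                columns[field][str(doc_id)] = value
    return columns


docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
]

columns = build_columns(docs)

# Counting interests reads only the "interests" column;
# "name" and "last_active" are never fetched.
interest_counts = Counter(columns["interests"].values())
print(interest_counts)
```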

Query optimizer selects the index

● Low latency for both highly selective queries and large scans
● Optimizer picks between columnar store or inverted index

Inverted index (for highly selective queries):

    SELECT * FROM search_logs WHERE keyword = 'hpts' AND locale = 'en'

Columnar store (for large scans):

    SELECT keyword, count(*) FROM search_logs GROUP BY keyword ORDER BY count(*) DESC
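One way to picture the optimizer's choice is a selectivity check: if a predicate is expected to match only a small fraction of rows, use the inverted index, otherwise scan the columnar store. The sketch below is a deliberately simplified, hypothetical heuristic. The `choose_index` function and the 1% threshold are assumptions made for illustration; a real optimizer would rely on a cost model and statistics.

```python
def choose_index(matching_rows: int, total_rows: int, threshold: float = 0.01) -> str:
    """Pick an access path from an estimated selectivity (hypothetical heuristic)."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    # Few matching rows: jump straight to them through posting lists.
    # Many matching rows: a sequential columnar scan is cheaper.
    return "inverted index" if selectivity <= threshold else "columnar store"


# WHERE keyword = 'hpts' AND locale = 'en' is assumed to match very few rows.
point_lookup = choose_index(matching_rows=50, total_rows=1_000_000)

# GROUP BY keyword with no filter has to read every row.
full_group_by = choose_index(matching_rows=1_000_000, total_rows=1_000_000)

print(point_lookup, full_group_by)
```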

Smart schemas for schematizing every record rather than every table

Schemas change with time in the modern world

The modern world needs to:
1) handle schema changes automatically
2) avoid data cleaning with schemas

What are smart schemas?

Automatic generation of a schema based on the exact fields and types present at the time of ingest. There is no need to ingest the data with a predefined schema (i.e., schemaless ingest).

● No need to manage pipelines that cause data latency
● Handles semi-structured data (adopting best practices from NoSQL)
● Makes it easy to read the data (using standard SQL)
● Filters the data at query time (no need for data cleaning)


Smart schemas under the hood

● Type information stored with values not columns

● Strongly typed queries on dynamically typed fields

● Designed for nested semi-structured data
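The idea of storing type information with values rather than columns can be sketched as follows. The sample documents, including the `"age"` field that is an integer in some records and a string in another, are invented for illustration, and `type_tag`, `field_types`, and `typed_values` are hypothetical helpers rather than Rockset functions.

```python
from collections import Counter
from typing import Any, Dict, List


def type_tag(value: Any) -> str:
    """Return the type carried by an individual value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def field_types(docs: List[Dict[str, Any]], field: str) -> Counter:
    """Derive the observed schema of one field from the records themselves."""
    return Counter(type_tag(d[field]) for d in docs if field in d)


def typed_values(docs: List[Dict[str, Any]], field: str, wanted: str) -> List[Any]:
    """Strongly typed read: keep only values whose own type matches."""
    return [d[field] for d in docs if field in d and type_tag(d[field]) == wanted]


docs = [{"age": 35}, {"age": "25"}, {"age": 41}, {"name": "no age field"}]

age_types = field_types(docs, "age")
int_ages = typed_values(docs, "age", "int")
print(age_types, int_ages)
```

Nothing is rejected at ingest; the mismatched string value is simply filtered out at query time when the query asks for integers.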


Challenges with Smart Schemas

1. Requires increased disk space to store types
2. Requires additional CPU usage for processing queries

Field interning reduces the space required to store schemas

Strict Schema (Relational DB)

  Schema:        City: String | name: String | age: Int
  Data storage:  Rancho Santa Margarita | Jonathan  | 35
                 Rancho Santa Margarita | Katherine | 25

Schemaless (JSON DB)

  "City": "Rancho Santa Margarita", "name": "Jonathan", "age": 35
  "City": "Rancho Santa Margarita", "name": "Katherine", "age": 25

Rockset (with field interning), ~30% overhead

  Interned values:  0: S "City"   1: S "Jonathan"   2: S "Katherine"   3: I 35   4: I 25
  Records:          City: 0  name: 1  age: 3
                    City: 0  name: 2  age: 4

[Diagram legend: box size = amount of storage]
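Field interning can be sketched as a small dictionary-encoding step: each distinct typed value is stored once, and records hold only integer references to it. The `intern_records` function below is a hypothetical illustration that handles only strings and integers; it numbers values in order of first appearance, so the indexes differ slightly from the slide, and it is not Rockset's on-disk format.

```python
from typing import Any, Dict, List, Tuple

TypedValue = Tuple[str, Any]  # ("S", "text") or ("I", 42)


def intern_records(records: List[Dict[str, Any]]):
    """Store each distinct typed value once; records reference it by index."""
    table: List[TypedValue] = []           # index -> (type tag, value)
    positions: Dict[TypedValue, int] = {}  # (type tag, value) -> index
    encoded: List[Dict[str, int]] = []
    for record in records:
        row: Dict[str, int] = {}
        for field, value in record.items():
            key = ("S" if isinstance(value, str) else "I", value)
            if key not in positions:
                positions[key] = len(table)
                table.append(key)
            row[field] = positions[key]
        encoded.append(row)
    return table, encoded


records = [
    {"City": "Rancho Santa Margarita", "name": "Jonathan", "age": 35},
    {"City": "Rancho Santa Margarita", "name": "Katherine", "age": 25},
]

table, encoded = intern_records(records)
print(table)
print(encoded)
```

The repeated city string and its type tag are stored a single time, which is where the space saving over a naive schemaless layout comes from.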

Type hoisting reduces CPU required to run queries

Rockset query performance is on par with strict schema systems

Rows, as stored by each system:

Strict schema:                  1 10 7 4 5
                                a b c d e

Schemaless:                     I 1  I 10  I 7  I 4  I 5
                                S a  S b  S c  S d  S e

Rockset (with type hoisting):   I 1 10 7 4 5
                                S a b c d e
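Type hoisting can be pictured as factoring a shared type tag out of a run of values, so that a query checks the type once per column instead of once per value. The sketch below is an assumption-laden illustration: `hoist_types` is a hypothetical name, and the fallback for mixed-type columns (keeping per-value tags) is one plausible choice, not a description of Rockset's behavior.

```python
from typing import Any, List, Optional, Tuple


def hoist_types(tagged_column: List[Tuple[str, Any]]) -> Tuple[Optional[str], list]:
    """Collapse [(tag, value), ...] into (tag, [values]) when all tags agree."""
    tags = {tag for tag, _ in tagged_column}
    if len(tags) == 1:
        # Every value shares one type: keep a single tag for the whole run.
        return tags.pop(), [value for _, value in tagged_column]
    # Mixed (or empty) column: leave the per-value tags in place.
    return None, list(tagged_column)


ints = [("I", 1), ("I", 10), ("I", 7), ("I", 4), ("I", 5)]
strings = [("S", "a"), ("S", "b"), ("S", "c"), ("S", "d"), ("S", "e")]

hoisted_ints = hoist_types(ints)
hoisted_strings = hoist_types(strings)
mixed = hoist_types([("I", 1), ("S", "a")])
print(hoisted_ints, hoisted_strings, mixed)
```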

Aggregator Leaf Tailer Architecture for cloud scaling AND low data latency


The economics of the cloud

Cost of 1 CPU for 100 minutes = Cost of 100 CPU for 1 minute


➔ Without the cloud: statically provision for peak demand

➔ Without the cloud: If query is slow then add more hardware

➔ With the cloud: dynamically provision for current demand

➔ With the cloud: If query is slow then write software

Aggregator Leaf Tailer (ALT) Architecture

https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/

How does ALT auto-scale?

Each tailer, leaf, or aggregator can be independently scaled up and down as needed. Tailers scale when there is more data to ingest, leaves scale when data size grows, and aggregators scale when query volume increases.

● No provisioning
● Pay for usage
● No need to provision for peak capacity
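The independent scaling described on this slide can be sketched as three separate sizing decisions, each driven by its own signal. Everything numeric here is a made-up assumption: the per-node capacities, the `Load` fields, and the `desired_nodes` function are hypothetical and only illustrate that a change in one signal moves one tier.

```python
import math
from dataclasses import dataclass


@dataclass
class Load:
    ingest_mb_per_s: float  # drives the number of tailers
    data_size_gb: float     # drives the number of leaves
    queries_per_s: float    # drives the number of aggregators


# Hypothetical per-node capacities, chosen only for this example.
TAILER_MB_PER_S = 50.0
LEAF_GB = 500.0
AGGREGATOR_QPS = 100.0


def desired_nodes(load: Load) -> dict:
    """Size each tier from its own signal, independently of the others."""
    return {
        "tailers": max(1, math.ceil(load.ingest_mb_per_s / TAILER_MB_PER_S)),
        "leaves": max(1, math.ceil(load.data_size_gb / LEAF_GB)),
        "aggregators": max(1, math.ceil(load.queries_per_s / AGGREGATOR_QPS)),
    }


quiet = desired_nodes(Load(ingest_mb_per_s=10, data_size_gb=400, queries_per_s=20))
query_spike = desired_nodes(Load(ingest_mb_per_s=10, data_size_gb=400, queries_per_s=950))
print(quiet)
print(query_spike)
```

In the query-spike case only the aggregator count changes; ingest and storage capacity stay where they were.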

How leaf nodes scale

Scaling tailers and aggregators is easy because they are stateless. Scaling leaf nodes is tricky because they are stateful.

Scale down leaves: Use durability of cloud storage

[Diagram: Rockset SQL API, two Aggregators, three Leaf nodes (each with RocksDB-Cloud and RocksDB), a Tailer, and Object Storage (AWS S3, GCS, ...) holding the SST files]

Data files are uploaded to the cloud storage

Durability is achieved by keeping data in cloud storage

No data loss even if all leaf servers crash and burn!

Scale up new leaves without peer to peer replication

[Diagram: Rockset SQL API, two Aggregators, a Tailer, the existing Leaf nodes plus a new Leaf (each with RocksDB-Cloud and RocksDB), and Object Storage (AWS S3, GCS, ...) holding the SST files]
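The two scaling slides rest on one idea: a leaf's state lives durably in object storage, so a leaf can disappear or a new one can start without copying data from a peer. The sketch below illustrates that idea with an in-memory stand-in. `ObjectStore`, `Leaf`, and the `sst-000001` naming are all hypothetical and do not reflect the RocksDB-Cloud API.

```python
from typing import Dict


class ObjectStore:
    """In-memory stand-in for a durable object store such as S3 or GCS."""

    def __init__(self) -> None:
        self._files: Dict[str, Dict[str, str]] = {}

    def upload(self, name: str, data: Dict[str, str]) -> None:
        self._files[name] = dict(data)

    def download_all(self) -> Dict[str, Dict[str, str]]:
        return {name: dict(data) for name, data in self._files.items()}


class Leaf:
    """A stateful node whose data files are also kept in the object store."""

    def __init__(self, store: ObjectStore) -> None:
        self.store = store
        self.data: Dict[str, str] = {}
        self._flushed = 0

    def write_and_flush(self, batch: Dict[str, str]) -> None:
        self.data.update(batch)
        self._flushed += 1
        # Every flushed file is uploaded, so local disk is never the only copy.
        self.store.upload(f"sst-{self._flushed:06d}", batch)

    @classmethod
    def from_store(cls, store: ObjectStore) -> "Leaf":
        """Start a new leaf from object storage alone, with no peer involved."""
        leaf = cls(store)
        files = store.download_all()
        for name in sorted(files):  # replay oldest first so newer writes win
            leaf.data.update(files[name])
        leaf._flushed = len(files)
        return leaf


store = ObjectStore()
old_leaf = Leaf(store)
old_leaf.write_and_flush({"R.0.name": "Igor"})
old_leaf.write_and_flush({"R.1.name": "Dhruba"})
expected = dict(old_leaf.data)

del old_leaf                      # the leaf "crashes and burns"
new_leaf = Leaf.from_store(store)
print(new_leaf.data == expected)
```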

Recap

● Converged Indexing for low query latency
● Smart schemas to handle semi-structured data
● SQL for query complexity and widespread adoption
● Aggregator Leaf Tailer Architecture for efficiency

Thank you

Dhruba Borthakur / [email protected]

Free trial at rockset.com