Taming the Big Data Fire Hose

the NewSQL database you’ll never outgrow


John Hugg, Sr. Software Engineer, VoltDB

VoltDB 2

Big Data Defined

Velocity
+ Moves at very high rates (think sensor-driven systems)
+ Valuable in its temporal, high velocity state

Volume
+ Fast-moving data creates massive historical archives
+ Valuable for mining patterns, trends and relationships

Variety
+ Structured (logs, business transactions)
+ Semi-structured and unstructured


Example Big Data Use Cases

Data source, with its high-frequency and lower-frequency operations:

Capital markets
+ High-frequency: write/index all trades, store tick data
+ Lower-frequency: show consolidated risk across traders

Call initiation request
+ High-frequency: real-time authorization
+ Lower-frequency: fraud detection/analysis

Inbound HTTP requests
+ High-frequency: visitor logging, analysis, alerting
+ Lower-frequency: traffic pattern analytics

Online game
+ High-frequency: rank scores (defined intervals, player “bests”)
+ Lower-frequency: leaderboard lookups

Real-time ad trading systems
+ High-frequency: match form factor, placement criteria, bid/ask
+ Lower-frequency: report ad performance from exhaust stream

Mobile device location sensor
+ High-frequency: location updates, QoS, transactions
+ Lower-frequency: analytics on transactions


Big Data and You

Incoming data streams are different from traditional business apps
+ You need to write data quickly and reliably, but …

It’s not just about high speed writes
+ You need to validate in real-time
+ You need to count and aggregate
+ You need to analyze in real-time
+ You need to scale on demand
+ You may need to transact


Big Data Management Infrastructure

Data sources: online gaming, ad serving, sensor data, Internet commerce, SaaS and Web 2.0, mobile platforms, financial trade

High Velocity

NewSQL
+ Structured data
+ ACID guarantees
+ Relational/SQL
+ Real-time analytics

NoSQL
+ Unstructured data
+ Eventual consistency
+ Schemaless KV, document

High Volume
+ Analytic datastore
+ Other OLAP data stores


High Velocity Data Management


High Velocity DBMS Requirements

Ingest at very high speeds and rates
Scale easily to meet growth and demand peaks
Support integrated fault tolerance
Support a wide range of real-time (or “near-time”) analytics
Integrate easily with high volume analytic datastores


High Speed Data Ingestion

Support millions of write operations per second at scale
Read and write latencies below 50 milliseconds
Provide ACID-level consistency guarantees (maybe)
Support one or more well-known application interfaces
+ SQL
+ Key/Value
+ Document


Scale to Meet Growth and Demand

Scale-out on commodity hardware

Built-in database partitioning
+ Manual sharding and/or add-on solutions are brittle, require apps to do “heavy lifting”, and can be an operational nightmare

Database must automatically implement defined partitioning strategy
+ Application should “see” a single database instance

Database should encourage scalability best practices
+ For example, replication of reference data minimizes need for multi-partition operations


A Look Inside Partitioning

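Partition 1
+ orders (customer_id, order_id, product_id): (1, 101, 2), (1, 101, 3), (4, 401, 2)
+ products (product_id, product_name): (1, knife), (2, spoon), (3, fork)

Partition 2
+ orders (customer_id, order_id, product_id): (2, 201, 1), (5, 501, 3), (5, 502, 2)
+ products (product_id, product_name): (1, knife), (2, spoon), (3, fork)

Partition 3
+ orders (customer_id, order_id, product_id): (3, 201, 1), (6, 601, 1), (6, 601, 2)
+ products (product_id, product_name): (1, knife), (2, spoon), (3, fork)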

table orders (partitioned): customer_id (partition key), order_id, product_id

table products (replicated): product_id, product_name

select count(*) from orders where customer_id = 5
+ single-partition

select count(*) from orders where product_id = 3
+ multi-partition

insert into orders (customer_id, order_id, product_id) values (3, 303, 2)
+ single-partition

update products set product_name = 'spork' where product_id = 3
+ multi-partition
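
The single-partition versus multi-partition distinction above can be sketched in a few lines of Python. This is a toy model, not VoltDB's implementation: the class `ToyCluster`, its method names, and the modulo routing in `partition_for` are all hypothetical. The modulo rule was chosen only because it reproduces the row placement shown on this slide. Each operation returns the list of partitions it had to touch.

```python
NUM_PARTITIONS = 3


def partition_for(customer_id: int) -> int:
    """Map a partition key to a partition number in 1..NUM_PARTITIONS.

    Assumption: simple modulo routing, picked to match the slide's layout.
    """
    return (customer_id - 1) % NUM_PARTITIONS + 1


class ToyCluster:
    """orders is partitioned by customer_id; products is replicated everywhere."""

    def __init__(self) -> None:
        partitions = range(1, NUM_PARTITIONS + 1)
        self.orders = {p: [] for p in partitions}
        self.products = {p: {} for p in partitions}

    def insert_order(self, customer_id, order_id, product_id):
        # The partition key is known, so only one partition does any work.
        p = partition_for(customer_id)
        self.orders[p].append((customer_id, order_id, product_id))
        return [p]

    def count_by_customer(self, customer_id):
        # Filtering on the partition key: single-partition read.
        p = partition_for(customer_id)
        count = sum(1 for row in self.orders[p] if row[0] == customer_id)
        return count, [p]

    def count_by_product(self, product_id):
        # product_id is not the partition key: every partition must be scanned.
        touched = sorted(self.orders)
        count = sum(
            1 for p in touched for row in self.orders[p] if row[2] == product_id
        )
        return count, touched

    def set_product_name(self, product_id, name):
        # Writing a replicated table means updating every copy.
        touched = sorted(self.products)
        for p in touched:
            self.products[p][product_id] = name
        return touched


def load_slide_data() -> ToyCluster:
    """Load the sample rows shown on the slide."""
    cluster = ToyCluster()
    rows = [
        (1, 101, 2), (1, 101, 3), (4, 401, 2),
        (2, 201, 1), (5, 501, 3), (5, 502, 2),
        (3, 201, 1), (6, 601, 1), (6, 601, 2),
    ]
    for row in rows:
        cluster.insert_order(*row)
    for product_id, name in [(1, "knife"), (2, "spoon"), (3, "fork")]:
        cluster.set_product_name(product_id, name)
    return cluster
```

In this model, reads of the replicated products table could be served by any single partition, which is why replicating reference data reduces multi-partition work; only writes to it fan out.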


Integrated Fault Tolerance

Database should transparently support built-in “Tandem-style” HA

+ Users should be able to easily increase/decrease fault tolerance levels

Database should be easily and quickly recoverable in the event of severe hardware failures

Database should be able to automatically detect and manage a variety of partition fault conditions

Downed nodes should be “rejoinable” without the need for service windows


Partition Detection & Recovery

(Diagrams: a cluster of Server A, Server B, and Server C)

Network fault protection
+ Detects partition event
+ Determines which side of fault to disable
+ Snapshots and disables orphaned node(s)

Live node rejoin
+ Allows “downed” nodes to rejoin live cluster
+ Automatically re-synchs all node data
+ Coordinates transactions during re-synch
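
The slide does not say how the database determines which side of a fault to disable. As a rough illustration only, the sketch below assumes a common rule: the larger group of mutually reachable servers survives, and a tie goes to the group holding the alphabetically first server name. The function `resolve_partition` and the tie-break are hypothetical and are not taken from VoltDB.

```python
def resolve_partition(sides):
    """Pick which group of servers stays live after a network partition.

    `sides` is a list of non-empty, disjoint sets of server names, one set per
    group of servers that can still reach each other.

    Assumed rule (not from the source): the largest group survives; ties are
    broken in favor of the group containing the alphabetically first name.
    Returns (survivors, orphaned).
    """
    if not sides or any(not side for side in sides):
        raise ValueError("need at least one side, and no side may be empty")
    ranked = sorted(sides, key=lambda side: (-len(side), min(side)))
    survivors = set(ranked[0])
    orphaned = set()
    for side in ranked[1:]:
        # In the flow on the slide, these nodes would be snapshotted
        # and then disabled.
        orphaned |= set(side)
    return survivors, orphaned
```

For the three-server cluster in the diagram, a fault that isolates Server C would give `resolve_partition([{"A", "B"}, {"C"}])`, leaving A and B live and C orphaned until it rejoins.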


Real-time Analytics

Database should support a wide variety of high performance reads

+ High-frequency single-partition
+ Lower-frequency multi-partition

Common analytic queries should be optimized in the database

+ Multi-partition aggregations, limits, etc.

Database should accommodate a flexible range of relational data operations

+ Particularly relevant to structured data
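
A multi-partition aggregation with a limit can be pictured as a scatter-gather: each partition computes a partial result over its own rows, and a coordinator merges the partials before applying the limit. The sketch below is one way to do that under stated assumptions. The function names are hypothetical, and the row layout (customer_id, order_id, product_id) is borrowed from the orders table on the partitioning slide.

```python
from collections import Counter


def local_product_counts(rows):
    """Per-partition step: count orders per product_id (third column)."""
    return Counter(row[2] for row in rows)


def top_products(partitions, limit):
    """Coordinator step: merge partial counts, then apply the limit.

    The limit is applied only after merging. Truncating each partition's
    partial result first could drop a product that is popular overall but
    not in the top `limit` on any single partition.
    """
    merged = Counter()
    for rows in partitions.values():
        merged.update(local_product_counts(rows))
    # Order by count descending, then product_id ascending for stable ties.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Called with the sample order rows from the partitioning slide, grouped by partition, `top_products(partitions, 2)` ranks product 2 first with four orders and product 1 second with three.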


Integration with Analytic Datastores

Database should offer high performance, transactional export

Export should allow a wide variety of common data enrichment operations

+ Normalize and de-normalize
+ De-duplicate
+ Aggregate

Architecture should support loosely-coupled integrations

+ Impedance mismatches
+ Durability


VoltDB Export Data Flow

Loosely-coupled, asynchronous
Queue must be durable
Bi-directional durability

(Diagram: High Velocity Database Cluster)
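
To make "queue must be durable" concrete, here is a minimal file-backed sketch. It is not VoltDB's export mechanism. The class `DurableExportQueue`, the file names, and the acknowledgement protocol are all assumptions for illustration. The producer appends records and forces them to disk. The consumer acknowledges records only after the downstream datastore has accepted them, so a restart on either side resumes from the last acknowledged record.

```python
import json
import os


class DurableExportQueue:
    """Append-only export log plus a persisted count of acknowledged records."""

    def __init__(self, directory):
        os.makedirs(directory, exist_ok=True)
        self.log_path = os.path.join(directory, "export.log")
        self.ack_path = os.path.join(directory, "export.ack")
        # Create the log if it does not exist yet; keep existing contents.
        open(self.log_path, "a", encoding="utf-8").close()

    def append(self, record):
        """Producer side: write one record and force it to stable storage."""
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def acked(self):
        """Number of records the consumer has confirmed so far."""
        try:
            with open(self.ack_path, encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def pending(self):
        """Consumer side: records written but not yet acknowledged."""
        with open(self.log_path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        return [json.loads(line) for line in lines[self.acked():]]

    def ack(self, count):
        """Record that `count` more records reached the target datastore."""
        new_total = self.acked() + count
        tmp_path = self.ack_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(str(new_total))
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename, so a crash never leaves a half-written ack file.
        os.replace(tmp_path, self.ack_path)
```

Because producer and consumer share only the files, neither has to be running for the other to make progress, which is the loose coupling the slide calls for. A consumer that crashes after loading a batch but before calling `ack` will see that batch again, so the target side would need to tolerate or de-duplicate repeats.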


Summary

Big Data infrastructures will usually require more than one engine

+ High velocity engine for “fast” data
+ Analytic engine for “deep” data

Data characteristics will often determine which high velocity engine to use

+ NewSQL is often well-suited to structured data
+ NoSQL is often a good fit for unstructured data

Choose solutions that suit your needs and are designed for interoperability
