O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines Through In-Memory Architectures

Ben Lorica, Chief Data Scientist, O'Reilly Media, @bigdata

Eric Frenkiel, CEO & Co-Founder, MemSQL, @ericfrenkiel

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Going Real-Time is the Next Phase for Big Data

More Sensors

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

• Expensive
• Not scalable
• Batch only
• SAN-burdened

Success will be driven by real-time analytic applications


Serving

Batch

Fast Updates

Unified queries, full SQL

Fast Appends

A Fresh Look at Lambda Architectures

Comprehensive Architecture

Real Time (Speed/Streaming Layer): Fast Updates, Rowstore, Tran, Analytics

Historical (Batch Layer): Fast Appends, Columnstore, Analytics

Execution engine that spans the data spectrum

Simplified Lambda Architectures with MemSQL

Layer | Traditional Lambda | MemSQL Lambda
Batch | Hadoop | MemSQL Column Store
Speed | Storm, Spark | Kafka > Spark > MemSQL
Serving | Cassandra, HBase | MemSQL
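The idea of serving recent and historical data from one SQL engine can be sketched roughly as follows. This is only an illustration: SQLite from the Python standard library stands in for the database, and the table names `recent_events` and `historical_events`, their columns, and all sample values are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in here; table names and sample rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_events (day TEXT, watts INTEGER)")      # speed layer stand-in
conn.execute("CREATE TABLE historical_events (day TEXT, watts INTEGER)")  # batch layer stand-in

conn.executemany(
    "INSERT INTO historical_events VALUES (?, ?)",
    [("2015-07-04", 500), ("2015-07-05", 450)],
)
conn.executemany(
    "INSERT INTO recent_events VALUES (?, ?)",
    [("2015-07-06", 60), ("2015-07-06", 40)],
)

# One query spans both layers, so the application does not merge results itself.
unified = conn.execute(
    """
    SELECT day, SUM(watts) FROM (
        SELECT day, watts FROM historical_events
        UNION ALL
        SELECT day, watts FROM recent_events
    ) GROUP BY day ORDER BY day
    """
).fetchall()
```

In a traditional Lambda setup the merge across batch and speed results would typically happen in application code or in a separate serving store.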

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Apache Kafka

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Apache Spark

In-memory execution engine

High-level operators for procedural and programmatic analytics

Faster than MapReduce

MemSQL

In-memory, distributed database

Full transactions and complete durability

Enables real-time, performant applications

Lambda Applies to Real-Time Data Pipelines

Inputs > Message Queue > Transformation > Database > Application

Kafka, Spark, and MemSQL Make it Simple

Inputs > Kafka > Spark > MemSQL > Application

Put Apache Spark in the fast lane with MemSQL Streamliner

One-click deployment of integrated Apache Spark

Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation

Eliminates batch ETL

Open source on GitHub

Introducing the MemSQL Streamliner

Simple Deployment Process

Diagram labels: Application, Cluster

1. Deploy MemSQL

In-Memory | Distributed | Relational

2. Deploy Spark

Kafka Connects to Each Node

Streamliner Architecture

First of many integrated Apache Spark solutions

Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, Streamliner, Future Solution, Future Machine Learning Solution

Streamliner ETL Detail

Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, Streamliner, Future Solution, Future Machine Learning Solution

Streamliner: Extract > Transform > Load

Extract: Custom, Future Extractor

Transform: Custom, Future Transformer
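The extract, transform, and load stages with pluggable custom components can be sketched as below. This is a minimal illustration and not Streamliner's actual interface: the `Pipeline` class, its `run_batch` method, and the comma-separated message format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Hypothetical three-stage pipeline; each stage is a swappable callable."""
    extract: Callable[[], Iterable[bytes]]   # e.g. a custom extractor
    transform: Callable[[bytes], tuple]      # e.g. a custom transformer
    load: Callable[[tuple], None]            # writes one row to the target

    def run_batch(self) -> int:
        """Run one micro-batch and return the number of rows loaded."""
        count = 0
        for message in self.extract():
            self.load(self.transform(message))
            count += 1
        return count


# A plain list stands in for the target table.
table: List[tuple] = []

pipeline = Pipeline(
    extract=lambda: [b"329280,23,60", b"329280,7,15"],  # made-up messages
    transform=lambda m: tuple(int(x) for x in m.decode().split(",")),
    load=table.append,
)
loaded = pipeline.run_batch()
```

Swapping the extractor or transformer only means passing a different callable, which is the kind of flexibility the Custom and Future boxes on the slide suggest.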

Streamliner: Dynamic Resource Management

Without Streamliner

Pipeline 1: Driver (P1 only), Executor (P1 only) on each Spark Worker

Pipeline 2: Driver (P2 only), Executor (P2 only) on each Spark Worker

With Streamliner

All Pipelines: Streamliner Driver…

Each Spark Worker: Executor (P1 or P2)


One Architecture for Many Applications

Monitoring real-time Xfinity programming and video health

Collect streaming data at scale (hundreds of MemSQL machines)

Proactively diagnose issues

Query ad hoc and in real time with full SQL

From 30 minutes to less than 1 second

Real-time Analytics

Real-Time Trend Analytics

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

Real-time analytics

Real-Time Segmentation

Using Real-Time for Personalization

Ad Servers EC2

Real-time analytics

PostgreSQL, Legacy reports

Monitoring, S3 (replay), HDFS

Data Science

Vertica, Operational Data Store (ODS)

Star Schema, MicroStrategy

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

MemCity

Capturing data from 1.4 million households

Total AWS hardware costs at $2.35 per hour

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
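The publish and subscribe step can be sketched without a broker as follows. This is only an illustration under stated assumptions: a Python dictionary stands in for Kafka, the topic name `memcity` and the JSON encoding are hypothetical, and the field order (time, house_id, device_id, watts) is inferred from the table shown later in the deck.

```python
import json
from collections import defaultdict

# Hypothetical in-memory stand-in for a Kafka broker: topic name -> list of byte messages.
topics = defaultdict(list)


def serialize(event):
    """Encode a (time, house_id, device_id, watts) tuple as UTF-8 JSON bytes."""
    return json.dumps(list(event)).encode("utf-8")


def publish(topic, event):
    """Append the serialized event to the topic and return its offset."""
    topics[topic].append(serialize(event))
    return len(topics[topic]) - 1


def poll(topic, offset):
    """Return all messages at or after `offset`, like a subscriber catching up."""
    return topics[topic][offset:]


event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
offset = publish("memcity", event)
messages = poll("memcity", offset)
```

A real deployment would use a Kafka client and a compact binary encoding; the point here is only that events travel as opaque bytes addressed by topic and offset.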

Enrich and Transform the Data

Spark polling Kafka for new messages

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, 'kitchen_appliance', 60)

Deserialization

Enrichment

0111001010101111101111100000001010111100001110101100000010010010111…
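Deserialization and enrichment might look like the sketch below. It assumes, hypothetically, that messages are JSON-encoded bytes and that the zip code and device type come from lookup tables keyed by house_id and device_id; the lookup values simply mirror the example row on the slide.

```python
import json

# Hypothetical reference data; the values mirror the slide's example row.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message):
    """Decode JSON bytes back into (time, house_id, device_id, watts)."""
    time, house_id, device_id, watts = json.loads(message.decode("utf-8"))
    return time, house_id, device_id, watts


def enrich(record):
    """Add zip and device_type columns by looking up the record's ids."""
    time, house_id, device_id, watts = record
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),      # None if the house is unknown
        device_id,
        DEVICE_TYPE.get(device_id),   # None if the device is unknown
        watts,
    )


raw = b'["2015-07-06T16:43:40.33Z", 329280, 23, 60]'
enriched = enrich(deserialize(raw))
```

In the pipeline described on the slides this step would run inside Spark on each batch of messages polled from Kafka.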

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time | house_id | zip | device_id | device_type | watts
2015-07-06T16:43:40.33 | 329280 | 94110 | 23 | 'kitchen_appliance' | 60
… | … | … | … | … | …
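The persist step amounts to inserting enriched rows into `memcity_table`. The sketch below uses SQLite from the Python standard library as a stand-in database, so it does not show `saveToMemSQL` itself; the column types are assumptions based on the example row.

```python
import sqlite3

# SQLite stands in for the target database; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time TEXT, house_id INTEGER, zip INTEGER,
        device_id INTEGER, device_type TEXT, watts INTEGER
    )
    """
)

rows = [("2015-07-06T16:43:40.33", 329280, 94110, 23, "kitchen_appliance", 60)]

# Parameterized inserts avoid building SQL strings by hand.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```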

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
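A production query over `memcity_table` could resemble the following sketch. The aggregation is a hypothetical example, not the query from the demo; SQLite stands in for the database, and every row except the first is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table (time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-07-06T16:43:40.33", 329280, 94110, 23, "kitchen_appliance", 60),
        ("2015-07-06T16:43:41.10", 329280, 94110, 7, "lighting", 15),            # made up
        ("2015-07-06T16:43:41.50", 100001, 94110, 23, "kitchen_appliance", 40),  # made up
    ],
)

# Hypothetical dashboard query: total power draw per zip code and device type.
query = """
    SELECT zip, device_type, SUM(watts) AS total_watts
    FROM memcity_table
    GROUP BY zip, device_type
    ORDER BY total_watts DESC
"""
totals = conn.execute(query).fetchall()
```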

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications

Streamliner

Input

User Jar

SAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1
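Pairing each sensor with its own predictive model can be sketched as below. The linear models and their coefficients are entirely hypothetical; in the setup described on the slides the models would come from SAS-generated PMML rather than a hard-coded dictionary.

```python
# Hypothetical linear models as (weight, bias) per sensor; not derived from real PMML.
MODELS = {
    "S1": (0.5, 1.0),
    "S2": (2.0, 0.0),
    "S3": (-1.0, 10.0),
}


def score(sensor_id, reading):
    """Apply the model registered for this sensor to a single reading."""
    weight, bias = MODELS[sensor_id]
    return weight * reading + bias


def score_stream(readings):
    """Score each (sensor_id, reading) pair as it arrives."""
    return [(sensor_id, reading, score(sensor_id, reading)) for sensor_id, reading in readings]


scored = score_stream([("S1", 4.0), ("S2", 1.5), ("S3", 3.0)])
```

Keeping the model lookup keyed by sensor means a new sensor only needs a new entry, while the scoring path stays the same.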

About MemSQL

MemSQL at a Glance

• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture

Enterprise Focus

The Real-Time Database for Transactions and Analytics

In-Memory | Distributed | Relational

Data Center | Software | Cloud

MemSQL for the Spectrum of Transactions

Each Transaction Paramount

Guarantee that every individual transaction is persisted

No individual transaction can be lost
• Financial credits and debits
• Inventory movement
• Employee status

Transactional Aggregates Paramount

Capture massive event streams for immediate analysis

Transaction repetition/redundancy at the device level
• Event data and clickstreams
• Sensor data, Internet of Things
• Mobile applications
• Real-time streams

Gartner Magic Quadrant for ODBMS

Leading Relational Database in Visionaries Quadrant

Forrester Wave: In-Memory Database Platforms

“MemSQL Named Strong Performer”

GET YOUR FREE COPY: memsql.com/oreilly
