Building Real-Time Data Pipelines Through In-Memory Architectures
Ben Lorica, Chief Data Scientist, O'Reilly Media, @bigdata
Eric Frenkiel, CEO & Co-Founder, MemSQL, @ericfrenkiel
What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
Going Real-Time is the Next Phase for Big Data
More Sensors
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
Expensive
Not scalable
Batch only
SAN-burdened
1%
Success will be driven by real-time analytic applications
What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
Speed: Fast Updates
Serving: Unified queries, full SQL
Batch: Fast Appends
A Fresh Look at Lambda Architectures
Comprehensive Architecture
Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
Historical: Batch Layer, Fast Appends, Columnstore
Transactions and Analytics
Execution engine that spans the data spectrum
Simplified Lambda Architectures with MemSQL
Layer     Traditional Lambda    MemSQL Lambda
Batch     Hadoop                MemSQL Column Store
Speed     Storm, Spark          Kafka > Spark > MemSQL
Serving   Cassandra, HBase      MemSQL
Designing the Ideal Real-Time Pipeline
Message Queue > Transformation > Speed/Serving Layer
End-to-End Data Pipeline Under One Second
A high-throughput distributed messaging system
Publish and subscribe to Kafka “topics”
Centralized data transport for the organization
Kafka
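The publish-and-subscribe model above can be sketched with a toy in-memory broker. This is only an illustration of topics and per-subscriber offsets; `ToyBroker`, its methods, and the topic name `sensor-events` are hypothetical and are not the Kafka client API.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based message log (not the Kafka client API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of messages
        self._offsets = defaultdict(int)   # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not yet seen and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(subscriber, topic)]
        self._offsets[(subscriber, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("sensor-events", b"event-1")
broker.publish("sensor-events", b"event-2")

first = broker.poll("spark", "sensor-events")       # both messages
second = broker.poll("spark", "sensor-events")      # nothing new
other = broker.poll("dashboard", "sensor-events")   # independent subscriber sees both
```

Because each subscriber tracks its own offset, several consumers can read the same topic independently, which is the property that makes a message log usable as centralized data transport.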
In-memory execution engine
High level operators for procedural and programmatic analytics
Faster than MapReduce
Spark
In-memory, distributed database
Full transactions and complete durability
Enable real-time, performant applications
MemSQL
Lambda Applies to Real-Time Data Pipelines
Message Queue
Batch
Inputs, Database, Transformation, Application
Kafka, Spark, and MemSQL Make it Simple
Batch
Inputs Application
Put Apache Spark in the fast lane with MemSQL Streamliner
One click deployment of integrated Apache Spark
Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
Eliminates batch ETL
Open source on GitHub
Introducing the MemSQL Streamliner
Simple Deployment Process
1. Deploy MemSQL: In-Memory | Distributed | Relational
2. Deploy Spark
Kafka Connects to Each Node
Diagram labels: Application, Cluster
Streamliner Architecture
First of many integrated Apache Spark solutions
Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, STREAMLINER, Future Solution, Future Machine Learning Solution
Streamliner ETL Detail
Extract: Custom, Future Extractor
Transform: JSON, Custom, Future Transformer
Load
Streamliner: Dynamic Resource Management
Without Streamliner
Pipeline 1: Driver (P1 only); Spark Workers with Executors (P1 only)
Pipeline 2: Driver (P2 only); Spark Workers with Executors (P2 only)
With Streamliner
All Pipelines: Streamliner Driver; Spark Workers with Executors (P1 or P2)
What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
One Architecture for Many Applications
Monitoring real-time Xfinity programming and video health
Collect streaming data at scale (hundreds of MemSQL machines)
Proactively diagnose issues
Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics
Real-Time Trend Analytics
Massive Ingest and Concurrent Analytics
Instant accuracy to the latest repin
Build real-time analytic applications
Real-time analytics
Real-Time Segmentation
Using Real-Time for Personalization
Ad Servers EC2
Real-time analytics
PostgreSQL: Legacy reports
Monitoring
S3 (replay), HDFS
Data Science
Vertica: Operational Data Store (ODS)
Star Schema, MicroStrategy
Reach overlap and ad optimization
Over 60,000 queries per second
Millisecond response times
MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010111100001110101100000010010010111…
1110010101000101010001010100010111111010100011110101100011010101000…
0101111000011100101010111110001111011010111100000000101110101100000…
Event added to message queue
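A minimal sketch of the publish step, assuming the reading is encoded as JSON bytes before it goes onto the queue. The slides do not state the actual wire format; the field names come from the table columns shown later in the deck, and `serialize_event` is a hypothetical helper.

```python
import json


def serialize_event(timestamp, house_id, device_id, watts):
    """Encode one reading as UTF-8 JSON bytes, ready to publish to a topic."""
    record = {
        "time": timestamp,
        "house_id": house_id,
        "device_id": device_id,
        "watts": watts,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")


payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)

# The first bytes rendered as bits, similar in spirit to the bit strings on the slide.
bits = "".join(f"{byte:08b}" for byte in payload[:4])
```

Whatever encoding is chosen, the message queue only sees opaque bytes; the consumer has to know the format in order to decode it.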
Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010111100001110101100000010010010111…
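The deserialization and enrichment steps can be sketched as two small functions. This assumes JSON-encoded messages and in-memory lookup tables; the slides do not say where the zip code or device type come from, so `HOUSE_ZIP` and `DEVICE_TYPE` are hypothetical reference data.

```python
import json

# Hypothetical reference data, keyed by house_id and device_id respectively.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(payload):
    """Decode JSON bytes into a (time, house_id, device_id, watts) tuple."""
    record = json.loads(payload.decode("utf-8"))
    return (record["time"], record["house_id"], record["device_id"], record["watts"])


def enrich(event):
    """Add zip and device_type columns; unknown keys yield None."""
    time, house_id, device_id, watts = event
    return (
        time,
        house_id,
        HOUSE_ZIP.get(house_id),
        device_id,
        DEVICE_TYPE.get(device_id),
        watts,
    )


raw = b'{"time": "2015-07-06T16:43:40.33Z", "house_id": 329280, "device_id": 23, "watts": 60}'
enriched = enrich(deserialize(raw))
```

In a Spark job the same two functions would typically be applied to each record with a map operation over the incoming batch.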
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …
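The persist step amounts to inserting enriched tuples into a relational table. The sketch below uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; it is not the MemSQL connector or `saveToMemSQL()`, and the column types are assumptions based on the sample row.

```python
import sqlite3

# In-memory SQLite database standing in for the real target database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         INTEGER,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# The enriched event from the slide.
rows = [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)]

# Parameterized inserts avoid building SQL strings by hand.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```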
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
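One query an application might run against this table is total power draw per zip code and device type. The slide elides the actual statement, so the query below is only an illustration; it again uses `sqlite3` as a stand-in, and every row except the first is made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table (time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        # Illustrative rows, not taken from the slides:
        ("2015-07-06T16:43:41.10Z", 329280, 94110, 7, "lighting", 15),
        ("2015-07-06T16:43:41.52Z", 411002, 94110, 23, "kitchen_appliance", 40),
    ],
)

# Aggregate watts by zip code and device type, largest consumers first.
query = """
    SELECT zip, device_type, SUM(watts) AS total_watts
    FROM memcity_table
    GROUP BY zip, device_type
    ORDER BY total_watts DESC
"""
totals = conn.execute(query).fetchall()
```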
Building Real-Time Data Pipelines and Predictive Applications
Adding Real-Time Scoring to Predictive Applications
Streamliner Input
User Jar
SAS Generated PMML
Industrial Equipment
Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data with Predictive Models
Sensor 1 Predictive Model 1
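The pairing of each sensor with its own predictive model can be sketched as a lookup followed by a scoring call. The deck describes SAS-generated PMML models; the logistic form and the coefficients below are hypothetical placeholders and do not parse or execute PMML.

```python
import math

# Hypothetical (intercept, slope) pairs, one model per sensor: S1 -> P1, S2 -> P2, S3 -> P3.
MODELS = {
    "S1": (-4.0, 0.05),
    "S2": (-2.0, 0.10),
    "S3": (-6.0, 0.02),
}


def score(sensor_id, reading):
    """Return a value in (0, 1) from the sensor's logistic model."""
    intercept, slope = MODELS[sensor_id]
    return 1.0 / (1.0 + math.exp(-(intercept + slope * reading)))


def score_batch(readings):
    """Score an iterable of (sensor_id, reading) pairs as they arrive."""
    return [(sensor_id, score(sensor_id, value)) for sensor_id, value in readings]


scored = score_batch([("S1", 80.0), ("S2", 35.0), ("S3", 120.0)])
```

Keeping the models in a dictionary keyed by sensor means a new model can be swapped in without touching the scoring loop.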
What’s In Store
Why In-Memory for Real Time
Using an In-Memory Database with Spark and Kafka
Real-Time Use Cases and Demonstrations
About MemSQL
MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
The Real-Time Database for Transactions and Analytics
In-Memory | Distributed | Relational
Data Center | Software | Cloud
MemSQL for the Spectrum of Transactions
Each Transaction Paramount
Guarantee that every individual transaction is persisted
No individual transaction can be lost
• Financial credits and debits
• Inventory movement
• Employee status
Transactional Aggregates Paramount
Capture massive event streams for immediate analysis
Transaction repetition/redundancy at the device level
• Event data and clickstreams
• Sensor data, Internet of Things
• Mobile applications
• Real-time streams
Gartner Magic Quadrant for ODBMS
Leading Relational Database in Visionaries Quadrant
Forrester Wave: In-Memory Database Platforms
MemSQL Named “Strong Performer”
GET YOUR FREE COPY: memsql.com/oreilly