A LITTLE BIT OF HISTORY
Everything old is new again. SQL Forever.
The story so far
Why hasn’t SQL died yet? It’s 2016 and we’re still using it?!
Everything old is new again
Existing architecture keeps reappearing
It takes time to figure out what tools are right for what jobs
SQL is still the best tool for business analytics
A long long time ago…
Growing pains
Late 1990s
Database problems
Database outage
Data integrity issues
Data latency
Late 1990s
Master Slave
Late 1990s
Transactions
Late 1990s
Performance
Late 1990s
By the time I graduated, SQL was on its last legs
2009
Cache all the things!
2009
Stop copying Twitter!
2009
SQL golden age ends, NoSQL takes off
2010
NoSQL: Key-Value, Document, Column, Graph
2010
Awesome things about NoSQL
No SQL, normal languages as APIs!
Non-relational!
FAST!
2010
Remember ORMs?
~2000
Active Record
~2000
ORMs 👎
2011
Remember EAV (Entity-Attribute-Value)?
1968
Kind of looks like columns…
1968
Modern EAV
2010
Tedious to query
2010
Voila!
2010
No joins is a feature!
2010
NoSQL has some rough bumps
2010
NoSQL has A LOT of rough bumps…
2011
Throwback Thursday!
2011
Lock the doors
2011
MPP columnar DBs! Wait... SQL is back?!
2015
SQL on Hadoop
2016
A long long time ago…
What’s next?
~2020?
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon
SQL is far past hype
Fin
“If it ain’t broke, don’t fix it”
CUSTOMER STORY
Building an event analytics pipeline using Hadoop and Spark
Why Consider a Big Data Pipeline?
You are rapidly exceeding the limits of your existing database
Everything on your website can be analyzed
Waiting until the next day isn’t for you
Data comes and goes to many places, and you want one process for it
BIG DATA CULTURE
Summary data is not good enough
Company is mandating new technologies
You want to build a data-driven culture
Big SQL is the heart of a data-driven culture
CASE STUDY
A major healthcare provider wants to create a web event pipeline that:
Scales up during periods of healthcare registration and new coverage starts, and can dial back the rest of the year
Handles massive scaling and large data volumes: 10-15M customers’ worth of data
Provides data for analysis in under 1 minute
AND utilizes existing in-house technologies (such as Cloudera Impala)
All events processed: page loads, registrations, logins, errors
Solution: Build an event processing framework
Events → Event Collector → Hadoop?
High Level Process
Events → Event Collector → Message Processing (to be designed) → HDFS → Looker
Why is Hadoop so hard?
Need to write in Java and Scala
We don’t have structure
Not easy to get data out into BI tools
Event Collectors don’t tend to feed to HDFS out of the box
Typically follow a batch processing framework
Ingestion mechanism
Our ideal ingestion mechanism would have three key aspects:
Low latency
In-flight transformation and processing
Ability to populate multiple destinations
Spark vs Storm
Two of the major players in data streaming / processing:

Spark Streaming:
• Own master server
• Runs on HDFS
• Micro-batching
• Exactly-once delivery (eliminates vulnerability)

Storm:
• Not native to Hadoop
• Less developed
• One-at-a-time processing
• ETL in flight
• Sub-second latency
Flume
The Flume Agent manages the flow: Source → Interceptor → Selector → Channel → Sinks
Web servers feed the source; the sink writes to HDFS
No in-flight transformation, so this just needs to meet workload
KAFKA
Producers → Brokers → Consumers (Spark Streaming, others)
Multiple brokers are coordinated by ZooKeeper
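As a hedged illustration of the producer side, here is a minimal Scala sketch using the standard kafka-clients API. The broker address, the "web-events" topic name, and the sample payload are assumptions, not details from the deck:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // assumed broker address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Key by event type so related events hash to the same partition.
    producer.send(new ProducerRecord[String, String]("web-events", "page_load",
      """{"user":"u123","page":"/plans","ts":"2016-01-15T10:01:02Z"}"""))
    producer.close()
  }
}
```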
Flume vs. Kafka
Use both: out of the box with Flafka and native connectors
Without Flafka: Flume source → custom connector → Kafka → custom connector → Spark
With Flafka: Flume → Kafka source → Spark, no custom connectors needed
Storing the output
Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into an HDFS cluster. We chose this because:
Data can be queried via Hive, Impala, or Spark SQL
Cloudera is our Enterprise choice
We can process a subset in-stream with MLlib or other machine learning algorithms
Output summaries to other RDBMS systems
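A minimal sketch of that one-minute Kafka-to-HDFS loop, using Spark Streaming's receiver-based Kafka API (Spark 1.x era, matching the deck). The ZooKeeper quorum, consumer group, topic name, and HDFS path are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object EventBatcher {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("event-batcher")
    // One-minute micro-batches, matching the "under a minute" target.
    val ssc = new StreamingContext(conf, Seconds(60))

    // ZooKeeper quorum, group id, and "web-events" topic are assumptions.
    val events = KafkaUtils.createStream(
      ssc, "zk1:2181", "event-batcher-group", Map("web-events" -> 4))

    // Each micro-batch lands as a timestamped directory under /data/events.
    events.map(_._2).saveAsTextFiles("hdfs:///data/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```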
Final Result
Events → Event Collector → Kafka → Flume → Spark SQL → Cloudera
Plus other storage (RDBMS) and other storage (logs)
Pipeline Summary
Our pipeline provides several points of flexibility as well as meeting our key priorities:
Add data to any point of the pipeline
Kafka, Flume, Impala, Looker without many custom connectors
Pipeline includes additional sources like Teradata, Oracle
Add in-flight predictive model training and execution without significant additional processing time
Priority #1: Scale
Kafka is easy to scale: as more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool
By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks
Priority #2: Flexibility
Different events can be parsed out to different Spark streaming applications with Kafka topics (or another type of consumer), as sketched below
Add more data at any point (Flume, a Kafka producer, or directly to Spark)
Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
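A minimal sketch of that topic-based routing: this application subscribes only to auth-related topics, while a sibling app could subscribe to page-load events independently. The topic names ("logins", "errors") and ZooKeeper quorum are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LoginAuditor {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("login-auditor"), Seconds(60))

    // Consume only the topics this app cares about; other Spark streaming
    // apps subscribe to their own topics without touching this stream.
    val authEvents = KafkaUtils.createStream(
      ssc, "zk1:2181", "login-auditor-group", Map("logins" -> 2, "errors" -> 2))

    authEvents.map(_._2).print() // placeholder for real processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```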
Priority #3: Speed
Analyzing the stream (checks sketched below):
Events per hour: identify missing batches
Volume and timing: right-sizing hardware
Duplicate events and missing information
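A hedged sketch of how those checks might run over the landed data with Spark SQL (Spark 1.x style, matching the deck's era). The HDFS path, table name, and columns (ts, event_id) are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object StreamHealth {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stream-health"))
    val sqlContext = new SQLContext(sc)

    // Assumes the one-minute batches landed as JSON under /data/events.
    sqlContext.read.json("hdfs:///data/events/*")
      .registerTempTable("web_events")

    // Events per hour: gaps here point at missing batches.
    sqlContext.sql(
      """SELECT substr(ts, 1, 13) AS event_hour, COUNT(*) AS events
         FROM web_events
         GROUP BY substr(ts, 1, 13)""").show()

    // Duplicate events: the same event_id landing more than once.
    sqlContext.sql(
      """SELECT event_id, COUNT(*) AS copies
         FROM web_events
         GROUP BY event_id
         HAVING COUNT(*) > 1""").show()
  }
}
```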
Priority #4: In-house Technologies
Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools, and code libraries
Enable users to ask questions and query web data without writing SQL or knowing about the pipeline
Analyzing the stream
Looking for lost data
[Diagram: event counts at two points in the stream, flagged when they are not equal]
Analyzing the stream
By connecting Looker to various points in the stream we can verify complete loads:
• Impala SQL
• Source logs
• Summary reports
We also mask the location of information; one dashboard may show a variety of reliable sources.
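One way to sketch the "counts should match" verification above: compare the raw collector log lines against the rows that landed in HDFS. Both paths are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadVerifier {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-verifier"))

    // Count events at two points in the stream for the same day (assumed paths).
    val collected = sc.textFile("hdfs:///raw/collector-logs/2016-01-15/*").count()
    val landed    = sc.textFile("hdfs:///data/events/2016-01-15/*").count()

    // A mismatch means events were lost (or duplicated) in flight.
    if (collected != landed)
      println(s"Load incomplete: collected=$collected landed=$landed")
  }
}
```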
Other uses and benefits
Match data in flight to find bad user accounts
In-flight alerts for missing data
Analysis without needing to know the location in the stream
SQL-on-Hadoop BI solution doesn’t require a new skillset
THANK YOU!
Sources
http://www.slideshare.net/Dataversity/thu-1200-penchikalasrinicolor
http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
http://mashable.com/2010/10/04/foursquare-downtime/#aPh4mhYxLSq6
http://blogs.adobe.com/security/files/2011/04/NoSQL-But-Even-Less-Security.pdf?file=2011/04/NoSQL-But-Even-Less-Security.pdf
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb
https://www.percona.com
http://techcrunch.com/
http://mashable.com/