A LITTLE BIT OF HISTORY
Everything old is new again. SQL Forever.
The story so far
Why hasn’t SQL died yet? It’s 2016 and we’re still using it?!
Everything old is new again
Existing architecture keeps reappearing
It takes time to figure out what tools are right for what jobs
SQL is still the best tool for business analytics
A long long time ago…
Growing pains
Late 1990s
Database problems
Database outage
Data integrity issues
Data latency
Late 1990s
Master Slave
Late 1990s
Transactions
Late 1990s
Performance
Late 1990s
By the time I graduated, SQL was on its last legs
2009
Cache all the things!
2009
Stop copying Twitter!
2009
SQL golden age ends, NoSQL takes off
2010
NoSQL: Key-Value, Document, Column, Graph
2010
Awesome things about NoSQL
No SQL, normal languages as APIs!
Non-relational!
FAST!
2010
Remember ORMs?
~2000
Active Record
~2000
ORMs 👎
2011
Remember EAV (Entity-Attribute-Value)?
1968
Kind of looks like columns…
1968
Modern EAV
2010
Tedious to query
2010
Voila!
2010
No joins is a feature!
2010
NoSQL has some rough bumps
2010
NoSQL has A LOT of rough bumps…
2011
Throwback Thursday!
2011
Lock the doors
2011
MPP columnar DBs! Wait... SQL is back?!
2015
SQL on Hadoop
2016
A long long time ago…
What’s next?
~2020?
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon
SQL is far past hype
Fin
“If it ain’t broke, don’t fix it”
CUSTOMER STORY
Building an event analytics pipeline using Hadoop and Spark
Why Consider a Big Data Pipeline?
You are rapidly exceeding the limits of your existing database
Everything on your website can be analyzed
Waiting until the next day isn’t for you
Data comes and goes to many places, and you want one process for it
BIG DATA CULTURE
Summary data is not good enough
Company is mandating new technologies
You want to build a data-driven culture
Big SQL is the heart of a data-driven culture
CASE STUDY
A major healthcare provider wants to create a web event pipeline that:
Scales up during periods of healthcare registration and new coverage starts, and can dial back the rest of the year
Handles massive scaling and large data volumes: 10-15M customers’ worth of data
Provides data for analysis in under 1 minute
AND utilizes existing in-house technologies (such as Cloudera Impala)
All events processed: page loads, registrations, logins, errors
Solution: Build an event processing framework
Events → Event Collector → Hadoop?
High Level Process
Events → Event Collector → Message Processing (to be designed) → HDFS → Looker
Why is Hadoop so hard?
Need to write in Java and Scala
We don’t have structure
Not easy to get data out into BI tools
Event Collectors don’t tend to feed to HDFS out of the box
Typically follow a batch processing framework
Ingestion mechanism
Our ideal ingestion mechanism would have three key aspects:
Low latency
In-flight transformation and processing
Ability to populate multiple destinations
Spark vs Storm
Two of the major players in data streaming / processing:

Spark Streaming:
• Own master server
• Runs on HDFS
• Micro-batching
• Exactly-once delivery (eliminates vulnerability)

Storm:
• Not native to Hadoop
• Less developed
• One-at-a-time processing
• ETL in flight
• Sub-second latency
Flume
The Flume Agent manages the flow: Source → Interceptor → Selector → Channel → Sinks
Web servers feed the source; the sink writes to HDFS
No in-flight transformation, so this just needs to meet workload
KAFKA
Producers → Brokers → Consumers (Spark Streaming, others)
Multiple brokers are coordinated by ZooKeeper
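As a hedged illustration of the producer side, here is a minimal Scala sketch using the standard kafka-clients API. The broker address, the "web-events" topic name, and the sample payload are assumptions, not details from the deck:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // assumed broker address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Key by event type so related events hash to the same partition.
    producer.send(new ProducerRecord[String, String]("web-events", "page_load",
      """{"user":"u123","page":"/plans","ts":"2016-01-15T10:01:02Z"}"""))
    producer.close()
  }
}
```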
Flume vs. Kafka
Use both: out of the box with Flafka and native connectors
Without Flafka: Flume source → custom connector → Kafka → custom connector → Spark
With Flafka: Flume → Kafka source → Spark, no custom connectors needed
Storing the output
Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into an HDFS cluster. We chose this because:
Data can be queried via Hive, Impala, or Spark SQL
Cloudera is our Enterprise choice
We can process a subset in-stream with MLlib or other machine learning algorithms
Output summaries to other RDBMS systems
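A minimal sketch of that one-minute Kafka-to-HDFS loop, using Spark Streaming's receiver-based Kafka API (Spark 1.x era, matching the deck). The ZooKeeper quorum, consumer group, topic name, and HDFS path are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object EventBatcher {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("event-batcher")
    // One-minute micro-batches, matching the "under a minute" target.
    val ssc = new StreamingContext(conf, Seconds(60))

    // ZooKeeper quorum, group id, and "web-events" topic are assumptions.
    val events = KafkaUtils.createStream(
      ssc, "zk1:2181", "event-batcher-group", Map("web-events" -> 4))

    // Each micro-batch lands as a timestamped directory under /data/events.
    events.map(_._2).saveAsTextFiles("hdfs:///data/events/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```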
Final Result
Events → Event Collector → Kafka → Flume → Spark SQL → Cloudera
Plus other storage (RDBMS) and other storage (logs)
Pipeline Summary
Our pipeline provides several points of flexibility as well as meeting our key priorities:
Add data to any point of the pipeline
Kafka, Flume, Impala, Looker without many custom connectors
Pipeline includes additional sources like Teradata, Oracle
Add in-flight predictive model training and execution without significant additional processing time
Priority #1: Scale
Kafka is easy to scale: as more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool
By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks
Priority #2: Flexibility
Different events can be parsed out to different Spark streaming applications with Kafka topics (or another type of consumer), as sketched below
Add more data at any point (Flume, a Kafka producer, or directly to Spark)
Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
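A minimal sketch of that topic-based routing: this application subscribes only to auth-related topics, while a sibling app could subscribe to page-load events independently. The topic names ("logins", "errors") and ZooKeeper quorum are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LoginAuditor {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("login-auditor"), Seconds(60))

    // Consume only the topics this app cares about; other Spark streaming
    // apps subscribe to their own topics without touching this stream.
    val authEvents = KafkaUtils.createStream(
      ssc, "zk1:2181", "login-auditor-group", Map("logins" -> 2, "errors" -> 2))

    authEvents.map(_._2).print() // placeholder for real processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```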
Priority #3: Speed
Analyzing the stream (checks sketched below):
Events per hour: identify missing batches
Volume and timing: right-sizing hardware
Duplicate events and missing information
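A hedged sketch of how those checks might run over the landed data with Spark SQL (Spark 1.x style, matching the deck's era). The HDFS path, table name, and columns (ts, event_id) are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object StreamHealth {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stream-health"))
    val sqlContext = new SQLContext(sc)

    // Assumes the one-minute batches landed as JSON under /data/events.
    sqlContext.read.json("hdfs:///data/events/*")
      .registerTempTable("web_events")

    // Events per hour: gaps here point at missing batches.
    sqlContext.sql(
      """SELECT substr(ts, 1, 13) AS event_hour, COUNT(*) AS events
         FROM web_events
         GROUP BY substr(ts, 1, 13)""").show()

    // Duplicate events: the same event_id landing more than once.
    sqlContext.sql(
      """SELECT event_id, COUNT(*) AS copies
         FROM web_events
         GROUP BY event_id
         HAVING COUNT(*) > 1""").show()
  }
}
```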
Priority #4: In-house Technologies
Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools, and code libraries
Enable users to ask questions and query web data without writing SQL or knowing about the pipeline
Analyzing the stream
Looking for lost data
[Diagram: event counts at two points in the stream, flagged when they are not equal]
Analyzing the stream
By connecting Looker to various points in the stream we can verify complete loads:
• Impala SQL
• Source logs
• Summary reports
We also mask the location of information; one dashboard may show a variety of reliable sources.
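One way to sketch the "counts should match" verification above: compare the raw collector log lines against the rows that landed in HDFS. Both paths are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadVerifier {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-verifier"))

    // Count events at two points in the stream for the same day (assumed paths).
    val collected = sc.textFile("hdfs:///raw/collector-logs/2016-01-15/*").count()
    val landed    = sc.textFile("hdfs:///data/events/2016-01-15/*").count()

    // A mismatch means events were lost (or duplicated) in flight.
    if (collected != landed)
      println(s"Load incomplete: collected=$collected landed=$landed")
  }
}
```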
Other uses and benefits
Match data in flight to find bad user accounts
In-flight alerts for missing data
Analysis without needing to know the location in the stream
SQL-on-Hadoop BI solution doesn’t require a new skillset
THANK YOU!
Sources
http://www.slideshare.net/Dataversity/thu-1200-penchikalasrinicolor
http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
http://mashable.com/2010/10/04/foursquare-downtime/#aPh4mhYxLSq6
http://blogs.adobe.com/security/files/2011/04/NoSQL-But-Even-Less-Security.pdf?file=2011/04/NoSQL-But-Even-Less-Security.pdf
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb
https://www.percona.com
http://techcrunch.com/
http://mashable.com/