Splice machine-bloor-webinar-data-lakes

Getting Started with Hadoop: Operational Data Lake

Rich Reimer VP, Product Management

[email protected]


The Big Squeeze: Data growing much faster than IT budgets

Source: 2013 IBM Briefing Book

Source: Gartner, Worldwide IT Spending Forecast, 3Q13 Update

Traditional RDBMS Giants Overwhelmed… Scale-up becoming cost-prohibitive

Splice Machine | Proprietary & Confidential


Scale-Out: The Future of Databases

Dramatic improvement in price/performance

Scale Up (increase server size) vs. Scale Out (more small servers)


What is a Data Lake?

•  Scale-out technology based on Hadoop

•  Data stored in native formats


Schema on Ingest vs. Schema on Read

§  Even “schemaless” MongoDB requires “schema”
   -  10 Things You Should Know About Running MongoDB At Scale
      •  By Asya Kamsky, Principal Solutions Architect at MongoDB
      •  Item #1 – “have a good schema and indexing strategy”

Schema on Ingest

Schema on Read

•  Schema on Read if you only use data a few times a year

•  Structured data should always remain structured

•  Add schema if data used regularly

Data Stream Application
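The difference between the two approaches can be sketched in a few lines. This is a minimal illustration only: the feed, the `orders` table, and its column names are hypothetical, and SQLite stands in for a structured store. With schema on read, types are applied (and bad values handled) on every query; with schema on ingest, they are applied once at load time and rows that do not fit are rejected up front.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed; column names are illustrative only.
RAW = "order_id,amount\n1,19.99\n2,5.00\n3,not-a-number\n"


def total_schema_on_read(raw):
    """Keep data in its native form and apply types each time it is queried."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(raw)):
        try:
            total += float(row["amount"])
        except ValueError:
            # Bad values surface, and must be handled, on every read.
            continue
    return total


def load_schema_on_ingest(raw):
    """Apply types once at load time and set aside rows that do not fit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    rejected = []
    for row in csv.DictReader(io.StringIO(raw)):
        try:
            record = (int(row["order_id"]), float(row["amount"]))
        except ValueError:
            rejected.append(row)
            continue
        conn.execute("INSERT INTO orders VALUES (?, ?)", record)
    conn.commit()
    return conn, rejected


total_read = total_schema_on_read(RAW)
conn, rejected = load_schema_on_ingest(RAW)
total_ingest = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Both paths arrive at the same total here, but only the ingest path records which row was bad, and later queries against `orders` no longer need to parse or validate anything.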


Who Are We?

THE ONLY HADOOP RDBMS

Replace your old RDBMS with a scale-out SQL database

•  Affordable, Scale-Out
•  ACID Transactions
•  No Application Rewrites

10x Better Price/Perf


Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems

OLTP Systems

Ad Hoc Analytics

Operational Data Lake

Executive Business Reports

Operational Reports & Analytics

ERP

CRM

Supply Chain

HR

…

Data Warehouse

Datamart

Stream or Batch Updates

ETL

Real-Time, Event-Driven Apps

Streamlining the Structured Data Pipeline in Hadoop


Source Systems

ERP

…

CRM

Sqoop

Apply Inferred Schema

Stored as flat files

SQL Query Engines BI Tools

Traditional Hadoop Pipeline

vs.

Source Systems

ERP

…

CRM

Existing ETL Tool

Stored in same schema

BI Tools

Streamlined Hadoop Pipeline

Advantages

•  Reduced operational costs with less complexity

•  Reduced processing time and errors with fewer translations

•  Real-time updates for data cleansing

•  Better SQL support


Streamlining and Hardening the ETL Processing Pipeline

Gracefully handle data quality issues and failed queries without full data reloads

Issue: Handle Data Quality Issues (e.g., duplicates)

Hadoop Issues: Hours to correct
✗  Run slow MapReduce job to de-dupe
✗  Reload entire data set (hours)

Splice Machine Solution: Seconds to correct
✓  Insert fails due to constraint violation
✓  Rollback flawed updates if necessary
✓  Reject, replace, or merge duplicates with incremental update (ms to sec)

Issue: Update/Delete Data

Hadoop Issues: Hours to correct
✗  Reload entire data set (hours)
✗  Writers block readers

Splice Machine Solution: Seconds to correct
✓  Correct data and do incremental update (ms to sec)
✓  Consistent view of data even with many concurrent updates
✓  Writers don’t block readers

Issue: ETL Failure

Hadoop Issues: Hours to correct
✗  Reload entire data set (hours)
✗  Miss ETL window, leading to either delayed reports or stale data

Splice Machine Solution: Seconds to correct
✓  Rollback failed step
✓  Retry failed step and continue

Issue: Fast Query Speeds

Hadoop Issues:
✗  Results typically no faster than seconds because data stored in random formats
✗  MapReduce

Splice Machine Solution:
✓  Results possible in milliseconds because data stored in highly optimized format
✓  No MapReduce
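The duplicate-handling and failed-step behavior described above can be sketched with any transactional SQL database. The example below is an assumption-laden illustration, not Splice Machine code: it uses Python's built-in SQLite as a stand-in, a hypothetical `customers` table, and SQLite's `INSERT OR REPLACE` for the merge step (the equivalent statement in another database may differ).

```python
import sqlite3

# SQLite is a stand-in here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.commit()


def load_batch(conn, rows):
    """Load one ETL step atomically: every row commits or none do."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        # The insert failed on the primary-key constraint (a duplicate).
        return False


def merge_batch(conn, rows):
    """Retry the step, replacing rows whose key already exists."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", rows)


batch = [(2, "b@example.com"), (1, "dup@example.com")]

# The duplicate key makes the whole step fail and roll back.
loaded = load_batch(conn, batch)
count_after_failure = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# Only the failed step is retried; nothing else is reloaded.
merge_batch(conn, batch)
final_rows = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

After the failed load the table still holds only the original row, so the retry starts from a clean state rather than from a partially applied batch.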


Complementing Existing Hadoop-Based Data Lakes

Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

OLTP Systems

ERP

CRM

Supply Chain

HR

…

SCHEMA ON INGEST: Streamlined, structured-to-structured integration

Structured Data

Unstructured Data


SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data

HCATALOG

Pig

SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data

Case Study: Operational Data Lake


Overview

•  Computer technology corporation
•  Update database technology for:
   -  ODS layer replacement
   -  ETL processing and analysis of Omniture data
   -  Real-time OLTP for Global Tech Support app

Challenges

•  Oracle and Teradata too expensive to scale
•  Many Oracle queries couldn’t complete
•  Can only hold 7 days’ worth of data in Oracle
•  Missing ETL window with current Hadoop data lake

Solution Diagram

(400TB)

OLTP Systems

ERP

CRM

Supply Chain

Benefits

75% less cost with commodity scale-out

Incremental ETL processing to gracefully handle data quality issues

5x-10x faster at completing queries on which Oracle failed


Reference Architecture: Unified Customer Profile

Improve marketing ROI with deeper customer intelligence and better cross-channel coordination

Unified Customer Profile

(aka DMP)

Operational Reports for Campaign Performance

Social Feeds

Web/eCommerce Clickstreams

Website Datamart

Stream or Batch Updates

BI Tools

Demand Side Platform (DSP)

Ad Exchange

1st Party/CRM Data

3rd Party Data (e.g., Acxiom)

Ad Perf. Data (e.g., DoubleClick)

Email Mktg Data

Call Center Data

POS Data

Email Marketing App

Ad Hoc Audience Segmentation

BI Tools


Campaign Management: Harte-Hanks

Overview

•  Digital marketing services provider
•  Unified Customer Profile
•  Real-time campaign management
•  Complex OLTP and OLAP environment

Challenges

•  Oracle RAC too expensive to scale
•  Queries too slow – even up to ½ hour
•  Getting worse – expect 30-50% data growth
•  Looked for 9 months for a cost-effective solution

Solution Diagram

Initial Results

¼ cost with commodity scale-out

3-7x faster through parallelized queries

10-20x price/perf with no application, BI or ETL rewrites

Cross-Channel Campaigns

Real-Time Personalization

Real-Time Actions


Proven Building Blocks: Hadoop and Derby

APACHE DERBY
§  ANSI SQL-99 RDBMS
§  Java-based
§  ODBC/JDBC Compliant

APACHE HBASE/HDFS
§  Auto-sharding
§  Real-time updates
§  Fault-tolerance
§  Scalability to 100s of PBs
§  Data replication
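To make "auto-sharding" concrete: HBase partitions a table into regions by sorted row-key range and splits a region once it grows too large. The toy class below sketches that general idea only. `ToyTable`, its size threshold, and its midpoint split rule are hypothetical and are not HBase's implementation.

```python
import bisect


class ToyTable:
    """Toy range-partitioned table: each region holds one contiguous key range."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # sorted start key of each region
        self.regions = [{}]     # one dict of rows per region

    def _region_index(self, key):
        # The owning region is the last one whose start key is <= key.
        return bisect.bisect_right(self.start_keys, key) - 1

    def put(self, key, value):
        i = self._region_index(key)
        self.regions[i][key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def get(self, key):
        return self.regions[self._region_index(key)].get(key)

    def _split(self, i):
        # Split an oversized region at its middle key into two regions.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in self.regions[i].items() if k < mid}
        right = {k: v for k, v in self.regions[i].items() if k >= mid}
        self.regions[i] = left
        self.regions.insert(i + 1, right)
        self.start_keys.insert(i + 1, mid)


table = ToyTable(max_rows_per_region=4)
for n in range(10):
    table.put(f"row{n:02d}", n)
```

Because every lookup goes through the sorted list of start keys, readers and writers never need to know how many regions exist; in a real cluster each region could then be served by a different machine.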

Typical Database Workloads


Operational Applications
•  Typical Databases: MySQL, Oracle, MongoDB
•  Use Cases: OLTP (ERP, CRM), Websites
•  Typical Users: Customers, Operational Employees
•  Workload Strengths: High concurrency of small reads/writes; Range queries

Operational Reporting & Analytics
•  Typical Databases: MySQL, Oracle
•  Use Cases: Operational Datastores
•  Typical Users: Operational Employees
•  Workload Strengths: Parameterized reports against real-time data; Range queries

Ad-Hoc Analytics
•  Typical Databases: Greenplum, ParAccel, Netezza
•  Use Cases: Exploratory Analytics, Data Mining
•  Typical Users: Analysts, Data Scientists
•  Workload Strengths: Complex queries requiring full table scans

Enterprise Data Warehouses
•  Typical Databases: Teradata, Oracle, Sybase IQ
•  Use Cases: Enterprise Reporting
•  Typical Users: Managers, Executives
•  Workload Strengths: Parameterized reports against historical data


Use Cases

Internet of Things

Operational Data Lake

Digital Marketing

Personalized Medicine

Fraud Detection


Operational Data Lake: Great On-Ramp to Big Data

§  Clear Business Value Now
   -  Replace obsolete Operational Data Stores (ODSs)
   -  Existing use cases – not just a science project
   -  Hadoop RDBMS – inexpensive to store data

§  Incremental On-Ramp to Big Data
   -  Start with structured data and then expand to unstructured
   -  Add schema when needed
