Getting Started with Hadoop: Operational Data Lake
Rich Reimer VP, Product Management
The Big Squeeze
Data growing much faster than IT budgets
Source: 2013 IBM Briefing Book
Source: Gartner, Worldwide IT Spending Forecast, 3Q13 Update
Traditional RDBMS Giants Overwhelmed…
Scale-up becoming cost-prohibitive
Splice Machine | Proprietary & Confidential
Scale-Out: The Future of Databases
Dramatic improvement in price/performance
Scale Up (increase server size) vs. Scale Out (more small servers)
What is a Data Lake?
• Scale-out technology based on Hadoop
• Data stored in native formats
Schema on Ingest vs. Schema on Read
• Even “schemaless” MongoDB requires a “schema”: 10 Things You Should Know About Running MongoDB At Scale
  • By Asya Kamsky, Principal Solutions Architect at MongoDB
  • Item #1 – “have a good schema and indexing strategy”
Schema on Ingest
Schema on Read
• Use Schema on Read if you only use the data a few times a year
• Structured data should always remain structured
• Add a schema if the data is used regularly
Data Stream Application
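The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Order` record, its two fields, and the sample lines are hypothetical and do not come from the slides. With schema on ingest, bad records are rejected once, at load time; with schema on read, the raw lines are kept as-is and every query has to interpret (and cope with) them again.

```python
import json
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical record layout, used only for this illustration.
    order_id: int
    amount: float


def ingest_with_schema(raw_lines):
    """Schema on ingest: validate and type each record before storing it."""
    accepted, rejected = [], []
    for line in raw_lines:
        try:
            rec = json.loads(line)
            accepted.append(Order(int(rec["order_id"]), float(rec["amount"])))
        except (ValueError, KeyError, TypeError):
            rejected.append(line)  # caught once, at load time
    return accepted, rejected


def total_amount_on_read(raw_lines):
    """Schema on read: raw lines are stored untouched and parsed per query."""
    total = 0.0
    for line in raw_lines:
        try:
            total += float(json.loads(line)["amount"])
        except (ValueError, KeyError, TypeError):
            continue  # every reader must handle malformed records itself
    return total


raw = ['{"order_id": 1, "amount": "19.99"}', '{"order_id": 2}', "not json"]
orders, bad = ingest_with_schema(raw)
total = total_amount_on_read(raw)
```

Here `orders` holds only the one well-formed record and `bad` holds the two lines that failed validation, while the schema-on-read path repeats the parsing and error handling on each call.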
Who Are We?
THE ONLY HADOOP RDBMS
Replace your old RDBMS with a scale-out SQL database
• Affordable, Scale-Out
• ACID Transactions
• No Application Rewrites
• 10x Better Price/Perf
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems
OLTP Systems
Ad Hoc Analytics
Operational Data Lake
Executive Business Reports
Operational Reports & Analytics
ERP
CRM
Supply Chain
HR
…
Data Warehouse
Datamart
Stream or Batch Updates
ETL
Real-Time, Event-Driven Apps
Streamlining the Structured Data Pipeline in Hadoop
Source Systems
ERP
…
CRM
Sqoop
Apply Inferred Schema
Stored as flat files
SQL Query Engines BI Tools
Traditional Hadoop Pipeline
vs.
Source Systems
ERP
…
CRM
Existing ETL Tool
Stored in same schema
BI Tools
Streamlined Hadoop Pipeline
Advantages
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations
• Real-time updates for data cleansing
• Better SQL support
Streamlining and Hardening the ETL Processing Pipeline
Gracefully handle data quality issues and failed queries without full data reloads

Issue: Handle Data Quality Issues (e.g., duplicates)
  Hadoop Issues: Hours to correct
    ✗ Run slow MapReduce job to de-dupe
    ✗ Reload entire data set (hours)
  Splice Machine Solution: Seconds to correct
    ✓ Insert fails due to constraint violation
    ✓ Rollback flawed updates if necessary
    ✓ Reject, replace, or merge duplicates with incremental update (ms to sec)

Issue: Update/Delete Data
  Hadoop Issues: Hours to correct
    ✗ Reload entire data set (hours)
    ✗ Writers block readers
  Splice Machine Solution: Seconds to correct
    ✓ Correct data and do incremental update (ms to sec)
    ✓ Consistent view of data even with many concurrent updates
    ✓ Writers don’t block readers

Issue: ETL Failure
  Hadoop Issues: Hours to correct
    ✗ Reload entire data set (hours)
    ✗ Miss ETL window, leading to either delayed reports or stale data
  Splice Machine Solution: Seconds to correct
    ✓ Rollback failed step
    ✓ Retry failed step and continue

Issue: Fast Query Speeds
  Hadoop Issues:
    ✗ Results typically no faster than seconds because data stored in random formats
    ✗ MapReduce
  Splice Machine Solution:
    ✓ Results possible in milliseconds because data stored in highly optimized format
    ✓ No MapReduce
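The constraint-and-rollback behavior described above is standard transactional SQL, so it can be illustrated with any ACID database. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in (it is not Splice Machine, and the `customers` table and sample rows are hypothetical): a batch containing a duplicate key fails on the constraint and is rolled back as a unit, and a second, incremental pass replaces the duplicate instead of reloading the whole data set.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a transactional RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.commit()


def load_batch(conn, rows):
    """Apply one ETL step atomically: all rows commit, or none do."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the whole step was rolled back


def merge_batch(conn, rows):
    """Incremental fix: replace rows whose key already exists."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", rows)


batch = [(2, "b@example.com"), (1, "dup@example.com")]  # id 1 is a duplicate

ok = load_batch(conn, batch)
count_after_failure = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

merge_batch(conn, batch)
rows = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

After the failed load the table still contains only its original row, so the step can be retried or corrected in place without a full reload.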
Complementing Existing Hadoop-Based Data Lakes
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
OLTP Systems
ERP
CRM
Supply Chain
HR
…
Structured Data
Unstructured Data
SCHEMA ON INGEST: Streamlined, structured-to-structured integration
SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
HCATALOG
Pig
SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data
Case Study: Operational Data Lake
Overview
• Computer technology corporation
• Update database technology for:
  • ODS layer replacement
  • ETL processing and analysis of Omniture data
  • Real-time OLTP for Global Tech Support app
Challenges
• Oracle and Teradata too expensive to scale
• Many Oracle queries couldn’t complete
• Can only hold 7 days’ worth of data in Oracle
• Missing ETL window with current Hadoop data lake
Solution Diagram
(400TB)
OLTP Systems
ERP
CRM
Supply Chain
Benefits
• 75% less cost with commodity scale-out
• Incremental ETL processing gracefully handles data quality issues
• 5x-10x faster, completing queries on which Oracle failed
Reference Architecture: Unified Customer Profile
Improve marketing ROI with deeper customer intelligence and better cross-channel coordination
Unified Customer Profile
(aka DMP)
Operational Reports for Campaign Performance
Social Feeds
Web/eCommerce Clickstreams
Website Datamart
Stream or Batch Updates
BI Tools
Demand Side Platform (DSP)
Ad Exchange
1st Party / CRM Data
3rd Party Data (e.g., Acxiom)
Ad Perf. Data (e.g., DoubleClick)
Email Mktg Data
Call Center Data
POS Data
Email Marketing App
Ad Hoc Audience Segmentation
BI Tools
Campaign Management: Harte-Hanks
Overview
• Digital marketing services provider
• Unified Customer Profile
• Real-time campaign management
• Complex OLTP and OLAP environment
Challenges
• Oracle RAC too expensive to scale
• Queries too slow – even up to ½ hour
• Getting worse – expect 30-50% data growth
• Looked for 9 months for a cost-effective solution
Solution Diagram
Initial Results
• ¼ cost with commodity scale-out
• 3-7x faster through parallelized queries
• 10-20x price/perf with no application, BI or ETL rewrites
Cross-Channel Campaigns
Real-Time Personalization
Real-Time Actions
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC compliant
APACHE HBASE/HDFS
• Auto-sharding
• Real-time updates
• Fault-tolerance
• Scalability to 100s of PBs
• Data replication
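To make the "auto-sharding" bullet concrete: HBase stores a table as row-key ranges (regions) and splits a region when it grows too large. The toy Python class below sketches that idea only. It is not HBase code; the class name, the tiny row limit, and the split-at-median rule are assumptions chosen to keep the example short.

```python
import bisect


class RangeShardedTable:
    """Toy range-partitioned key-value table that splits shards as they fill."""

    def __init__(self, max_rows_per_shard=4):
        self.max_rows = max_rows_per_shard  # hypothetical, unrealistically small
        self.split_keys = []  # sorted lower bounds of shards 1..n
        self.shards = [{}]    # shard i holds keys in [split_keys[i-1], split_keys[i])

    def _shard_index(self, key):
        # Number of split keys <= key is the index of the owning shard.
        return bisect.bisect_right(self.split_keys, key)

    def put(self, key, value):
        i = self._shard_index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_rows:
            self._split(i)

    def get(self, key):
        return self.shards[self._shard_index(key)].get(key)

    def _split(self, i):
        # Split the overfull shard at its median key into two key ranges.
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in self.shards[i].items() if k < mid}
        right = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i:i + 1] = [left, right]
        self.split_keys.insert(i, mid)


table = RangeShardedTable(max_rows_per_shard=4)
for n in range(10):
    table.put(f"user{n:02d}", n)
```

Callers only ever use `put` and `get`; the number of shards grows behind the scenes as data arrives, which is the property that lets a scale-out store add capacity without application changes.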
Typical Database Workloads
Operational Applications
  Typical Databases: MySQL, Oracle, MongoDB
  Use Cases: OLTP (ERP, CRM); Websites
  Typical Users: Customers; Operational Employees
  Workload Strengths: High concurrency of small reads/writes; Range queries

Operational Reporting & Analytics
  Typical Databases: MySQL, Oracle
  Use Cases: Operational Datastores
  Typical Users: Operational Employees
  Workload Strengths: Parameterized reports against real-time data; Range queries

Ad-Hoc Analytics
  Typical Databases: Greenplum, Paraccel, Netezza
  Use Cases: Exploratory Analytics; Data Mining
  Typical Users: Analysts; Data Scientists
  Workload Strengths: Complex queries requiring full table scans

Enterprise Data Warehouses
  Typical Databases: Teradata, Oracle, Sybase IQ
  Use Cases: Enterprise Reporting
  Typical Users: Managers; Executives
  Workload Strengths: Parameterized reports against historical data
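A "parameterized report" with a "range query" is, in SQL terms, a fixed statement whose bounds are supplied at run time and which an index can answer without a full table scan. The sketch below shows the shape of such a query using Python's built-in `sqlite3` as a stand-in database; the `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
# An index on the range column lets the database seek instead of scanning.
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2014-01-05", 100.0),
        (2, "2014-01-20", 50.0),
        (3, "2014-02-02", 75.0),
    ],
)
conn.commit()


def revenue_between(conn, start, end):
    """Parameterized report: same SQL every time, only the date bounds change."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders "
        "WHERE order_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]


january = revenue_between(conn, "2014-01-01", "2014-01-31")
```

The same function serves any reporting period, which is why this workload benefits from fast indexed range access rather than batch-style full scans.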
Use Cases
• Internet of Things
• Operational Data Lake
• Digital Marketing
• Personalized Medicine
• Fraud Detection
Operational Data Lake: Great On-Ramp to Big Data
• Clear Business Value Now
  • Replace obsolete Operational Data Stores (ODSs)
  • Existing use cases – not just a science project
  • Hadoop RDBMS – inexpensive to store data
• Incremental On-Ramp to Big Data
  • Start with structured data and then expand to unstructured
  • Add schema when needed