First in Class: Optimizing the Data Lake for Tighter Integration

Page 1: First in Class: Optimizing the Data Lake for Tighter Integration

Grab some coffee and enjoy the pre-show banter before the top of the hour

Page 2: First in Class: Optimizing the Data Lake for Tighter Integration

The Briefing Room

First in Class: Optimizing the Data Lake for Tighter Integration

Page 3: First in Class: Optimizing the Data Lake for Tighter Integration

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

@eric_kavanagh

Page 4: First in Class: Optimizing the Data Lake for Tighter Integration

Mission

•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today’s innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Page 5: First in Class: Optimizing the Data Lake for Tighter Integration

Topics

October: DATA MANAGEMENT

November: ANALYTICS

December: INNOVATORS

Page 6: First in Class: Optimizing the Data Lake for Tighter Integration

What Goes In, Should Come Out

•  Well Begun = Half Done
•  Smart Architecture > Clever Queries
•  Low Cost for Planning < Optimal
•  Schema on Read ≠ Haphazard Ingestion

Page 7: First in Class: Optimizing the Data Lake for Tighter Integration

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

@robinbloor

Page 8: First in Class: Optimizing the Data Lake for Tighter Integration

Teradata RainStor

•  Teradata RainStor is well known for its data archiving solutions
•  Its capabilities include an archive on Hadoop’s HDFS, which allows for SQL queries over the archive
•  When combined with Hadoop, Teradata RainStor can enable an optimized data lake capable of storing raw data and acting as an enterprise system of record

Page 9: First in Class: Optimizing the Data Lake for Tighter Integration

Guest: Mark Cusack

Mark joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, he led the efforts to integrate RainStor with Hadoop and with Teradata. He was formerly a senior scientist and team lead at QinetiQ, where he researched distributed simulation techniques and developed physics-based models of human behavior to support military training and operations. He also led government and industry projects in the areas of grid and pervasive computing. Before joining QinetiQ, Mark worked in academia, where he combined cluster computing methods with quantum mechanics to predict the properties of semiconductor microstructures. Mark holds a Masters in Computing and a PhD in Physics from Newcastle University.

Page 10: First in Class: Optimizing the Data Lake for Tighter Integration

Teradata RainStor® for the Data Lake

Page 11: First in Class: Optimizing the Data Lake for Tighter Integration

2 © 2015 Teradata

Teradata RainStor® – The Structured Data Lake Foundation
Simple, Efficient, Scalable, Cost-Effective

Teradata RainStor is the most efficient, scalable and accessible way to store structured or semi-structured data in your data lake

•  Cost Savings
   –  Convert CapEx to OpEx
   –  Decrease storage footprint
   –  Future-proof capacity
•  Fast Flexible Access
   –  Standards based
   –  Compression optimizes queries
•  Data Governance
   –  Privacy
   –  Security
   –  Integrity

INGEST: At Network Speed
COMPRESS: 50-80% Cluster Reduction
ANALYZE: 10-100% Performance Boost

[Diagram labels: RainStor Partitions, HDFS Files...]

Page 12: First in Class: Optimizing the Data Lake for Tighter Integration

Data Lake Use Cases Applicable to RainStor

•  System of Record for Structured Data
   –  Provide a trusted source of data with tracking
   –  Meet commercial and regulatory requirements
•  Archive for Structured Data
   –  Offload historical data
   –  Central control of restore capabilities
•  Discovery
   –  Data profiling to discover correlations
•  Analysis
   –  Custom analytics, signal analysis, and event reporting
•  ETL Remix
   –  Staging platform for data cleanup prior to EDW analysis

Page 13: First in Class: Optimizing the Data Lake for Tighter Integration

Archive or a System of Record

Depends on the position of RainStor with respect to the source

[Diagram: two configurations, Archive and System of Record, each placing RainStor relative to a Source and a Warehouse or Database]

Page 14: First in Class: Optimizing the Data Lake for Tighter Integration

Teradata RainStor® Overview

•  LOAD – Billions Records/Day
•  COMPRESS – 10-40X (90%+)
•  QUERY – SQL; BI Tools: Hive, MapReduce
•  SCALE – Any Platform (MPP, Shared Everything)
•  AVAILABILITY – Replication
•  GOVERN – Rules Based
•  SECURE – Enterprise-Grade
•  MOVE

[Diagram labels: EDW/DB, Network, Apps, Tape, Hadoop, NAS, CAS]

Page 15: First in Class: Optimizing the Data Lake for Tighter Integration

Data Load

Challenges
–  Log, clickstream, and sensor data, tax import systems
–  Encryption is expensive
–  Extended wait times for data access
–  Maintaining data integrity

RainStor Solutions
–  MPP (scalable data load)
–  Encryption of compressed data
–  Data immediately available for query
–  Fingerprinted

How does RainStor do this?
–  Separately stages delimited text data
–  De-dupes and builds partitions stored on HDFS
–  Shows multi-node query process view of data across HDFS

“Move your costs by a decimal point!” ~ Architect, Global Financial Services Company

[Diagram labels: Source, Staging Area, Data Collection, Process, Service Manager, Load, Query, Fork, RainStor Node 1 to RainStor Node N, HDFS Data Node]

Page 16: First in Class: Optimizing the Data Lake for Tighter Integration

Compression

Challenges
–  Storage costs outstrip budgets
–  Queries take longer than ever

RainStor Solutions
•  Patented compression techniques
•  In-memory and on disk compression is performance multiplier and storage saver
•  Stored in binary tree format
•  Algorithms that query compressed data
•  Hardware & bandwidth multiplier
•  2-10X more compressed than ORC
•  Cost saving on floor, cooling and personnel
•  Drives efficient query execution framework
•  CPU rather than IO bound

“Now we can keep years of history that wasn’t economically feasible until now.” ~ Architect, Communication Service Provider

How does RainStor do this?

[Figure: Stock Trades Example. Values shown: 0001, 0002, 0003, 0004; 100, 200; $12, $13; AA, BAC]
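
The stock trades figure hints at value-level de-duplication, where each distinct field value is stored once and records refer back to it. The following is a minimal Python sketch of that general idea (plain per-column dictionary encoding). It is not RainStor's patented technique or its binary tree format, which the slides do not describe in detail. The sample rows and the names `dedupe_columns` and `decode` are hypothetical, introduced only for illustration.

```python
# Hypothetical sample rows (trade_id, quantity, price, ticker), loosely
# modeled on the values visible in the slide's stock trades figure.
TRADES = [
    ("0001", 100, "$12", "AA"),
    ("0002", 200, "$12", "BAC"),
    ("0003", 100, "$13", "AA"),
    ("0004", 200, "$13", "BAC"),
]


def dedupe_columns(rows):
    """Store each distinct value once per column; rows become index tuples."""
    n_cols = len(rows[0])
    dictionaries = [[] for _ in range(n_cols)]  # distinct values per column
    lookups = [{} for _ in range(n_cols)]       # value -> index per column
    encoded = []
    for row in rows:
        codes = []
        for col, value in enumerate(row):
            if value not in lookups[col]:
                lookups[col][value] = len(dictionaries[col])
                dictionaries[col].append(value)
            codes.append(lookups[col][value])
        encoded.append(tuple(codes))
    return dictionaries, encoded


def decode(dictionaries, encoded):
    """Rebuild the original rows from the per-column dictionaries."""
    return [
        tuple(dictionaries[col][code] for col, code in enumerate(codes))
        for codes in encoded
    ]


dictionaries, encoded = dedupe_columns(TRADES)
```

In this toy input the quantity, price, and ticker columns each hold only two distinct values for four trades. How much space such a scheme saves in practice depends on how repetitive the data is.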

Page 17: First in Class: Optimizing the Data Lake for Tighter Integration

Query

Challenges
–  Query speed is #1 concern
–  Hive queries aren’t standards based
–  Rewriting queries is a huge task
–  Data transparency

RainStor Solutions
•  SQL access – 2-10x faster than Hive
•  Improved Hive performance
•  Efficient parallel query execution
•  User defined functions supported
•  Teradata connectivity via QueryGrid
•  BI Tool access
•  Query access via HCatalog

“RainStor doesn’t care what hardware it runs on. It’s just as good on Tier-2 or Tier-3 hardware.” ~ Chief Architect, Global Investment Bank

How does RainStor do this?

[Diagram: SQL, Hive, Pig and HCatalog access to partitions (P) on HDFS. Labels: Static, Dynamic, Metadata, Predicates, Bloom Filter, Fields, Stats, Types]
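
The query diagram names a Bloom filter among the metadata used to answer queries. As a hedged illustration of why that helps, here is a minimal Python sketch of a generic Bloom filter used to skip partitions that cannot contain a value. It is not RainStor's implementation. The `BloomFilter` class, the partition names, and the ticker values are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical partitions, each holding a set of ticker symbols.
PARTITIONS = {
    "p1": ["AA", "BAC"],
    "p2": ["IBM", "TDC"],
    "p3": ["AA", "GE"],
}


def build_filters(partitions):
    """Build one filter per partition at load time."""
    filters = {}
    for name, values in partitions.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        filters[name] = bloom
    return filters


def candidate_partitions(filters, value):
    """Return partitions that might hold `value`; the rest can be skipped."""
    return [name for name, bloom in filters.items() if bloom.might_contain(value)]


filters = build_filters(PARTITIONS)
candidates = candidate_partitions(filters, "AA")
```

Because a Bloom filter never yields a false negative, any partition missing from `candidates` can be skipped safely. A partition that is listed still has to be read to confirm the match.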

Page 18: First in Class: Optimizing the Data Lake for Tighter Integration

RainStor Governance

Privacy
•  Data Encryption
•  Data Masking
•  Log Masking
•  View-Based Dynamic Masking

Security
•  Authentication
   –  Kerberos
   –  LDAP/AD
   –  Linux PAM
•  SQL92 Authorization

Integrity
•  Immutable Data Model
•  Record-Level Delete
•  Schema Evolution
•  Data Disposition
•  Replication
•  Audit Trail

Designed to support PCI-DSS, SEC 17a-4, etc.

“How did you guys get it right and others didn’t?” ~ Architect, U.S. Bank
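
View-based masking, listed under Privacy on this slide, can be illustrated with a small sketch: users query a view that exposes a masked form of a sensitive column instead of the base table. The example below uses Python's built-in `sqlite3` module purely as a stand-in. It does not show RainStor's SQL dialect or its masking functions, and the table, view, and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")

# Hypothetical base table holding a sensitive column.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, card_number TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '4111111111111111')")

# The view exposes only the last four digits; a negative start in SQLite's
# substr() counts from the end of the string.
conn.execute(
    "CREATE VIEW customers_masked AS "
    "SELECT id, name, '************' || substr(card_number, -4) AS card_number "
    "FROM customers"
)

masked = conn.execute("SELECT card_number FROM customers_masked").fetchone()[0]
print(masked)
```

In a real deployment the authorization layer would grant most users access to the view only, which SQLite itself cannot demonstrate because it has no user or privilege model.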

Page 19: First in Class: Optimizing the Data Lake for Tighter Integration

RainStor 7 Architecture

[Diagram: Teradata RainStor® architecture. Labels: Apache; Teradata IDW & Tools; Teradata IDW; Teradata BAR; Teradata QueryGrid™; Hive; Pig; Java; HCatalog; MapReduce / YARN; Interactive SQL; Oracle, SQLServer, SybaseIQ, Netezza extensions; ODBC; JDBC; Data Loader; FastConnect™; FastForward™; User-Defined Functions; Management; Alerting; Security; Retention Rules; Replication; Compliance; RainStor Files; HDFS (CDH/HDS); NAS, CAS, SAN, WORM; Vendor specific]

Page 20: First in Class: Optimizing the Data Lake for Tighter Integration

Integration with Teradata QueryGrid

TD QueryGrid Support for RainStor

[Diagram labels: Business users, Data scientists, Teradata Database, Teradata Aster, RainStor on Hadoop, Hadoop, Other Databases]

Page 21: First in Class: Optimizing the Data Lake for Tighter Integration

RainStor Integration with Teradata

[Diagram labels: FastForward™, FastConnect™, QueryGrid™, BAR, PBs of history, PS Engagement, RainStor]

Page 22: First in Class: Optimizing the Data Lake for Tighter Integration

US Telco Case Study #1: Data Lake
Network Performance

Problem
•  Storage & analysis of network events
   –  Performance, faults, changes

Challenges
•  50TB raw data/day
•  Demanding query SLAs

Results
•  Storing 30 days data – up from 3 days
•  8 node Hadoop cluster
•  83% reduction in storage footprint
•  Data lake system of record

“RainStor addresses data growth at the root cause.” ~ Architect, U.S. Bank

[Diagram labels: Network Events, Dual Load, RainStor]
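
The figures on this slide can be combined into a rough storage estimate. The sketch below is back-of-the-envelope arithmetic, not a result reported in the deck. It assumes the 83% reduction applies to the raw 50TB/day volume, and it ignores HDFS replication and any other overhead. The helper name `stored_tb` is hypothetical.

```python
# Figures taken from the slide.
RAW_TB_PER_DAY = 50
RETENTION_DAYS = 30
FOOTPRINT_REDUCTION = 0.83  # assumed to be measured against raw volume


def stored_tb(raw_tb_per_day, days, reduction):
    """Estimate stored volume after a fractional footprint reduction."""
    raw_total = raw_tb_per_day * days
    return raw_total * (1 - reduction)


raw_total_tb = RAW_TB_PER_DAY * RETENTION_DAYS
compressed_tb = stored_tb(RAW_TB_PER_DAY, RETENTION_DAYS, FOOTPRINT_REDUCTION)
equivalent_ratio = 1 / (1 - FOOTPRINT_REDUCTION)

print(f"Raw volume over retention window: {raw_total_tb} TB")
print(f"Estimated stored volume: {compressed_tb:.0f} TB")
print(f"Equivalent compression ratio: {equivalent_ratio:.1f}x")
```

Under these assumptions, 30 days of raw data comes to 1,500TB and the stored footprint to roughly 255TB, an effective ratio of a little under 6x.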

Page 23: First in Class: Optimizing the Data Lake for Tighter Integration

US Telco Case Study #2: Data Lake
Compliant and Secure Analytics

Problem
•  Usage data must be encrypted on Hadoop
•  Avoid any query performance impact

Challenge
•  Deliver cost-effective & secure scalability

Solution
•  RainStor 15x compression vs. ORC 7x
•  Encryption with only 3% query overhead
•  Queries 3X faster than Hive

“We keep finding new stuff we can do with RainStor! We are just getting rolling!” ~ Principal Architect, Global CSP

[Diagram labels: Clickstream/Usage Data, Network, Customer Data Extract/Scrubbing, RainStor; 1.2PB, 62 Nodes Running Hortonworks 2.1]

Page 24: First in Class: Optimizing the Data Lake for Tighter Integration

US Telco Case Study #3: Data Lake
Application Retirement & Access

Problem
•  Hundreds of applications taking up space

Challenges
•  Needed to lower TCO
•  Free up capacity and maintain user access

Results
•  Hundreds of apps retired into RainStor
•  Users access data using BI tool of their choice
•  Administration is minimal on low cost NAS
•  Saving $800K for every 100TB stored in RainStor

“I installed RainStor in less than 5 minutes and was querying the data 30 minutes later.” ~ Principal Architect, U.S. Telco

[Diagram labels: Before, After, RainStor]

Page 25: First in Class: Optimizing the Data Lake for Tighter Integration

Teradata RainStor® – The Structured Data Lake Foundation
Simple, Efficient, Scalable, Cost-Effective

Teradata RainStor is the most efficient, scalable and accessible way to store structured or semi-structured data in your data lake

•  Cost Savings
   –  Convert CapEx to OpEx
   –  Decrease storage footprint
   –  Future-proof capacity
•  Fast Flexible Access
   –  Standards based
   –  Compression optimizes queries
•  Data Governance
   –  Privacy
   –  Security
   –  Integrity

INGEST: At Network Speed
COMPRESS: 50-80% Cluster Reduction
ANALYZE: 10-100% Performance Boost

[Diagram labels: RainStor Partitions, HDFS Files...]

Page 26: First in Class: Optimizing the Data Lake for Tighter Integration

Page 27: First in Class: Optimizing the Data Lake for Tighter Integration

Backup Slides

Page 28: First in Class: Optimizing the Data Lake for Tighter Integration

Teradata Appliance for Hadoop

•  Future-proof capacity
   –  2x to 8x more compressed, including ORC
•  Fast analysis (2x to 100x performance boost)
   –  Mature SQL stack
      -  Multiple parsers – Oracle, SQL Server, Sybase
   –  Fast Hive QL, Pig, MapReduce
   –  Support for BI tool
•  Security and compliance
   –  Encryption
   –  LDAP/AD/PAM/Kerberos/PCI
   –  SQL92 users, tables, views, and data masking
   –  Audit trails & logging
•  Life cycle management
   –  Retention rules & expiry policies
   –  Schema evolution
•  Faster time-to-value

Page 29: First in Class: Optimizing the Data Lake for Tighter Integration

Perceptions & Questions

Analyst: Robin Bloor

Page 30: First in Class: Optimizing the Data Lake for Tighter Integration

The Quality of the Data Lake

Robin Bloor, PhD

Page 31: First in Class: Optimizing the Data Lake for Tighter Integration

The Departure Point

A data lake is a WHOLLY NEW architectural idea

Page 32: First in Class: Optimizing the Data Lake for Tighter Integration

Pouring Data Into the Lake

Page 33: First in Class: Optimizing the Data Lake for Tighter Integration

But Not Much Changed

•  Nothing changed in respect to enterprise operational discipline
•  Nothing changed in respect to service level policy
•  Nothing changed in respect to data governance (although it may have gotten more demanding)
•  Possibly the data got dirtier
•  Security became more onerous
•  Some things became more onerous
•  Data volumes increased

Page 34: First in Class: Optimizing the Data Lake for Tighter Integration

Hadoop: Good, Bad, Ugly

•  GOOD: scalability and parallelism, some components (like Kafka and Presto), costs
•  BAD: security, lack of system management components, some components (like Hive)
•  UGLY: lack of stability, a servant with three masters, skills and experience, cultural issues

Page 35: First in Class: Optimizing the Data Lake for Tighter Integration

The Consequence

You need to make sensible COMPONENT decisions and sensible ARCHITECTURAL decisions

Page 36: First in Class: Optimizing the Data Lake for Tighter Integration

•  Can RainStor simply be used as a SQL-capable query-only database sitting on Hadoop? What are the gating factors?
•  How fast is data ingest? Are there any limits to how this is done?
•  What is the data compression limitation, if any? How much space would be saved over Hive or HBase?
•  Walk me through a data lake implementation.

Page 37: First in Class: Optimizing the Data Lake for Tighter Integration

•  Is there any Hadoop distribution that you prefer, or doesn’t it matter?
•  What if I’m not a Teradata user? Is there any downside to using RainStor?

Page 38: First in Class: Optimizing the Data Lake for Tighter Integration

Page 39: First in Class: Optimizing the Data Lake for Tighter Integration

Upcoming Topics

www.insideanalysis.com

October: DATA MANAGEMENT

November: ANALYTICS

December: INNOVATORS

Page 40: First in Class: Optimizing the Data Lake for Tighter Integration

THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons