Grab some coffee and enjoy the pre-show banter before the top of the hour
The Briefing Room
First in Class: Optimizing the Data Lake for Tighter Integration
Twitter Tag: #briefr The Briefing Room
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
October: DATA MANAGEMENT
November: ANALYTICS
December: INNOVATORS
What Goes In, Should Come Out
• Well Begun = Half Done
• Smart Architecture > Clever Queries
• Low Cost for Planning < Optimal
• Schema on Read ≠ Haphazard Ingestion
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
@robinbloor
Teradata RainStor
Teradata RainStor is well known for its data archiving solutions
Its capabilities include an archive on Hadoop’s HDFS, which allows for SQL queries over the archive
When combined with Hadoop, Teradata RainStor can enable an optimized data lake capable of storing raw data and acting as an enterprise system of record
Guest: Mark Cusack
Mark joined Teradata in 2014 as part of its RainStor acquisition. As a founding developer and Chief Architect at RainStor, he has worked on many different aspects of the product since 2004. Most recently, he led the efforts to integrate RainStor with Hadoop and with Teradata. He was formerly a senior scientist and team lead at QinetiQ, where he researched distributed simulation techniques and developed physics-based models of human behavior to support military training and operations. He also led government and industry projects in the areas of grid and pervasive computing. Before joining QinetiQ, Mark worked in academia, where he combined cluster computing methods with quantum mechanics to predict the properties of semiconductor microstructures. Mark holds a Masters in Computing and a PhD in Physics from Newcastle University.
Teradata RainStor® for the Data Lake
2 © 2015 Teradata
Teradata RainStor® – The Structured Data Lake Foundation
Simple, Efficient, Scalable, Cost-Effective
Teradata RainStor is the most efficient, scalable and accessible way to store structured or semi-structured data in your data lake
• Cost Savings
  – Convert CapEx to OpEx
  – Decrease storage footprint
  – Future-proof capacity
• Fast Flexible Access
  – Standards based
  – Compression optimizes queries
• Data Governance
  – Privacy
  – Security
  – Integrity
INGEST: At Network Speed
COMPRESS: 50-80% Cluster Reduction
ANALYZE: 10-100% Performance Boost
RainStor Partitions HDFS Files...
Data Lake Use Cases Applicable to RainStor
• System of Record for Structured Data
  – Provide a trusted source of data with tracking
  – Meet commercial and regulatory requirements
• Archive for Structured Data
  – Offload historical data
  – Central control of restore capabilities
• Discovery
  – Data profiling to discover correlations
• Analysis
  – Custom analytics, signal analysis, and event reporting
• ETL Remix
  – Staging platform for data cleanup prior to EDW analysis
Archive or a System of Record
Depends on the position of RainStor with respect to the source
[Diagram: two configurations, "Archive" and "System of Record", each showing a Source, RainStor, and a Warehouse or Database]
Teradata RainStor® Overview
• LOAD: Billions Records/Day
• COMPRESS: 10-40X (90%+)
• QUERY: SQL, BI Tools, Hive, MapReduce
• SCALE: Any Platform (MPP, Shared Everything)
• AVAILABILITY: Replication
• GOVERN: Rules Based
• SECURE: Enterprise-Grade
• MOVE
[Diagram labels: EDW/DB, Apps, Network, Tape, Hadoop, NAS, CAS]
Data Load
Challenges
– Log, clickstream, and sensor data, tax import systems
– Encryption is expensive
– Extended wait times for data access
– Maintaining data integrity
RainStor Solutions
– MPP (scalable data load)
– Encryption of compressed data
– Data immediately available for query
– Fingerprinted
How does RainStor do this?
– Separately stages delimited text data
– De-dupes and builds partitions stored on HDFS
– Shows multi-node query process view of data across HDFS
“Move your costs by a decimal point!” ~ Architect, Global Financial Services Company
[Diagram labels: Source, Staging Area, Data Collection Process, Service Manager, RainStor Node 1 to RainStor Node N (Load, Query, Fork), HDFS Data Node]
Compression
Challenges
– Storage costs outstrip budgets
– Queries take longer than ever
RainStor Solutions
• Patented compression techniques
• In-memory and on disk compression is performance multiplier and storage saver
• Stored in binary tree format
• Algorithms that query compressed data
• Hardware & bandwidth multiplier
• 2-10X more compressed than ORC
• Cost saving on floor, cooling and personnel
• Drives efficient query execution framework
• CPU rather than IO bound
“Now we can keep years of history that wasn’t economically feasible until now.” ~ Architect, Communication Service Provider
[Figure: How does RainStor do this? Stock Trades Example with record IDs 0001 to 0004 and field values AA, BAC, $12, $13, 100, 200]
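The stock trades figure suggests storing each repeated field value only once and querying without expanding the data. A minimal Python sketch of that general idea, using plain dictionary encoding, is shown here. It is an illustration under stated assumptions, not RainStor's patented technique: the trade rows and the helper names `dictionary_encode` and `filter_equals` are hypothetical, and only the values AA, BAC, $12, $13, 100, and 200 come from the slide.

```python
from typing import Hashable, List, Sequence, Tuple


def dictionary_encode(column: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Store each distinct value once and replace the column with integer codes."""
    values: List[Hashable] = []   # distinct values, in first-seen order
    index = {}                    # value -> code
    codes: List[int] = []         # one small integer per row
    for value in column:
        if value not in index:
            index[value] = len(values)
            values.append(value)
        codes.append(index[value])
    return values, codes


def filter_equals(values: List[Hashable], codes: List[int], target: Hashable) -> List[int]:
    """Return row positions where the column equals target, comparing codes only."""
    if target not in values:
        return []                 # value never stored, so no row can match
    code = values.index(target)
    return [row for row, c in enumerate(codes) if c == code]


# Hypothetical trade rows: (record id, symbol, quantity, price in dollars).
trades = [
    ("0001", "AA", 100, 12),
    ("0002", "BAC", 200, 13),
    ("0003", "AA", 200, 12),
    ("0004", "BAC", 100, 13),
]

symbol_values, symbol_codes = dictionary_encode([t[1] for t in trades])
bac_rows = filter_equals(symbol_values, symbol_codes, "BAC")
```

In this sketch the four symbol strings collapse to two stored values plus four integer codes, and the equality filter never touches the original strings. That is one simple way a query can run against compressed data.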
Query
Challenges
– Query speed is #1 concern
– Hive queries aren’t standards based
– Rewriting queries is a huge task
– Data transparency
RainStor Solutions
• SQL access – 2-10x faster than Hive
• Improved Hive performance
• Efficient parallel query execution
• User defined functions supported
• Teradata connectivity via QueryGrid
• BI Tool access
• Query access via HCatalog
“RainStor doesn’t care what hardware it runs on. It’s just as good on Tier-2 or Tier-3 hardware.” ~ Chief Architect, Global Investment Bank
[Diagram: How does RainStor do this? Labels: SQL, Hive, Pig, HCatalog; Static, Dynamic; Metadata; Fields, Stats, Types; Predicates; Bloom Filter; partitions (P) on HDFS]
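The query diagram mentions partition statistics, predicates, and a Bloom filter. The Python sketch below shows how those pieces are commonly combined to skip partitions before any data is read. It is a generic illustration, not RainStor's implementation: the `BloomFilter` and `Partition` classes, the `prune` function, and the sample rows are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class Partition:
    """A partition with per-partition metadata built once at load time."""
    name: str
    rows: List[Tuple[str, int]]  # hypothetical (symbol, price) rows
    min_price: int = field(init=False)
    max_price: int = field(init=False)
    symbols: BloomFilter = field(init=False)

    def __post_init__(self) -> None:
        prices = [price for _, price in self.rows]
        self.min_price, self.max_price = min(prices), max(prices)
        self.symbols = BloomFilter()
        for symbol, _ in self.rows:
            self.symbols.add(symbol)


def prune(partitions: List[Partition], symbol: str, min_price: int) -> List[Partition]:
    """Keep partitions that could satisfy: symbol = ? AND price >= ?."""
    return [
        p for p in partitions
        if p.max_price >= min_price and p.symbols.might_contain(symbol)
    ]


partitions = [
    Partition("P1", [("AA", 12), ("BAC", 13)]),
    Partition("P2", [("AA", 50), ("IBM", 60)]),
]
candidates = [p.name for p in prune(partitions, "AA", 40)]
```

Here the min/max statistics rule out P1 for the price predicate, so only P2 needs to be scanned. The Bloom filter can only exclude partitions, so every surviving partition still has to be checked row by row.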
RainStor Governance
Privacy
• Data Encryption
• Data Masking
• Log Masking
• View-Based Dynamic Masking
Security
• Authentication
  – Kerberos
  – LDAP/AD
  – Linux PAM
• SQL92 Authorization
Integrity
• Immutable Data Model
• Record-Level Delete
• Schema Evolution
• Data Disposition
• Replication
• Audit Trail
Designed to support PCI-DSS, SEC 17a-4, etc.
“How did you guys get it right and others didn’t?” ~ Architect, U.S. Bank
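View-based dynamic masking can be illustrated generically with a SQL view that hides most of a sensitive column while the stored data stays unchanged. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in engine. The table `payments`, the view `payments_masked`, and the sample card numbers are hypothetical, and nothing here reflects RainStor's own syntax.

```python
import sqlite3


def build_masked_view(conn: sqlite3.Connection) -> None:
    """Create a base table plus a view that exposes only the last four digits."""
    conn.execute(
        "CREATE TABLE payments ("
        "id INTEGER PRIMARY KEY, card_number TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO payments (card_number, amount) VALUES (?, ?)",
        [("4111111111111111", 12.0), ("5500000000000004", 13.0)],
    )
    # The mask is applied at query time; the base table is never rewritten.
    conn.execute(
        "CREATE VIEW payments_masked AS "
        "SELECT id, '************' || substr(card_number, -4) AS card_number, amount "
        "FROM payments"
    )


conn = sqlite3.connect(":memory:")
build_masked_view(conn)
masked = conn.execute(
    "SELECT card_number FROM payments_masked ORDER BY id"
).fetchall()
```

In a real deployment, users would be granted access to the view and not the base table, which is where SQL authorization comes in.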
RainStor 7 Architecture
[Architecture diagram. Labels: Apache; Teradata RainStor®; Teradata IDW & Tools; Teradata IDW; Teradata BAR; Teradata QueryGrid™; FastConnect™; FastForward™; Data Loader; Interactive SQL; ODBC, JDBC; Oracle, SQLServer, SybaseIQ, Netezza extensions; User-Defined Functions; Hive; Pig; Java; HCatalog; MapReduce / YARN; RainStor Files; HDFS (CDH/HDS); NAS, CAS, SAN, WORM; Vendor specific; Management; Alerting; Security; Retention Rules; Replication; Compliance]
Integration with Teradata QueryGrid
[Diagram: TD QueryGrid Support for RainStor. Labels: Teradata Database, Teradata Aster, RainStor on Hadoop, Hadoop, Other Databases; Business users, Data scientists]
RainStor Integration with Teradata
[Diagram labels: RainStor, FastForward™, FastConnect™, QueryGrid™, BAR, PBs of history, PS Engagement]
US Telco Case Study #1: Data Lake Network Performance
Problem
• Storage & analysis of network events
  – Performance, faults, changes
Challenges
• 50TB raw data/day
• Demanding query SLAs
Results
• Storing 30 days data – up from 3 days
• 8 node Hadoop cluster
• 83% reduction in storage footprint
• Data lake system of record
“RainStor addresses data growth at the root cause.” ~ Architect, U.S. Bank
[Diagram labels: Network Events, Dual Load, RainStor]
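The figures on this slide can be put into rough numbers. The Python sketch below is back-of-the-envelope arithmetic only. It assumes the 83% reduction is measured against the raw 50TB/day volume and ignores HDFS replication and any other overhead, none of which the slide specifies. The function name `stored_tb` is hypothetical.

```python
def stored_tb(raw_tb_per_day: int, days: int, reduction_pct: int) -> float:
    """Footprint in TB after applying a percentage reduction to the raw volume."""
    raw_total = raw_tb_per_day * days
    return raw_total * (100 - reduction_pct) / 100


RAW_TB_PER_DAY = 50   # from the slide
REDUCTION_PCT = 83    # from the slide; assumed to be relative to raw volume

raw_30_days = RAW_TB_PER_DAY * 30                               # raw volume over 30 days
stored_30_days = stored_tb(RAW_TB_PER_DAY, 30, REDUCTION_PCT)   # footprint after reduction
raw_3_days = RAW_TB_PER_DAY * 3                                 # the earlier 3-day window, uncompressed
equivalent_ratio = 100 / (100 - REDUCTION_PCT)                  # reduction expressed as an N-to-1 ratio

print(f"30 days raw: {raw_30_days} TB, stored: {stored_30_days} TB")
print(f"3 days raw: {raw_3_days} TB, ratio: {equivalent_ratio:.1f}x")
```

Under these assumptions, an 83% reduction corresponds to roughly a 6-to-1 ratio, which is how the percentage figures on this slide relate to the "X" multipliers quoted elsewhere in the deck.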
US Telco Case Study #2: Data Lake Compliant and Secure Analytics
Problem
• Usage data must be encrypted on Hadoop
• Avoid any query performance impact
Challenge
• Deliver cost-effective & secure scalability
Solution
• RainStor 15x compression vs. ORC 7x
• Encryption with only 3% query overhead
• Queries 3X faster than Hive
“We keep finding new stuff we can do with RainStor! We are just getting rolling!” ~ Principal Architect, Global CSP
[Diagram labels: Clickstream/Usage Data, Network, Customer Data Extract/Scrubbing, RainStor; 1.2PB, 62 Nodes Running Hortonworks 2.1]
US Telco Case Study #3: Data Lake Application Retirement & Access
Problem
• Hundreds of applications taking up space
Challenges
• Needed to lower TCO
• Free up capacity and maintain user access
Results
• Hundreds of apps retired into RainStor
• Users access data using BI tool of their choice
• Administration is minimal on low cost NAS
• Saving $800K for every 100TB stored in RainStor
“I installed RainStor in less than 5 minutes and was querying the data 30 minutes later.” ~ Principal Architect, U.S. Telco
[Diagram labels: Before, After, RainStor]
Backup Slides
Teradata Appliance for Hadoop
• Future-proof capacity
  – 2x to 8x more compressed, including ORC
• Fast analysis (2x to 100x performance boost)
  – Mature SQL stack with multiple parsers: Oracle, SQL Server, Sybase
  – Fast Hive QL, Pig, MapReduce
  – Support for BI tools
• Security and compliance
  – Encryption
  – LDAP/AD/PAM/Kerberos/PCI
  – SQL92 users, tables, views, and data masking
  – Audit trails & logging
• Life cycle management
  – Retention rules & expiry policies
  – Schema evolution
• Faster time-to-value
Perceptions & Questions
Analyst: Robin Bloor
The Quality of the Data Lake
Robin Bloor, PhD
The Departure Point
A data lake is a WHOLLY NEW architectural idea
Pouring Data Into the Lake
But Not Much Changed
• Nothing changed in respect to enterprise operational discipline
• Nothing changed in respect to service level policy
• Nothing changed in respect to data governance (although it may have gotten more demanding)
• Possibly the data got dirtier
• Security became more onerous
• Some things became more onerous
• Data volumes increased
Hadoop: Good, Bad, Ugly
• GOOD: scalability and parallelism, some components (like Kafka and Presto), costs
• BAD: security, lack of system management components, some components (like Hive)
• UGLY: Lack of stability, a servant with three masters, skills and experience, cultural issues
The Consequence
You need to make sensible COMPONENT decisions and sensible ARCHITECTURAL decisions
• Can RainStor simply be used as a SQL-capable query-only database sitting on Hadoop? What are the gating factors?
• How fast is data ingest? Are there any limits to how this is done?
• What is the data compression limitation, if any? How much space would be saved over Hive or HBase?
• Walk me through a data lake implementation.
• Is there any Hadoop distribution that you prefer, or doesn’t it matter?
• What if I’m not a Teradata user? Is there any downside to using RainStor?
Upcoming Topics
www.insideanalysis.com
October: DATA MANAGEMENT
November: ANALYTICS
December: INNOVATORS
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons