Presented at BigData.SG, October 2013
1©MapR Technologies - Confidential
Data Warehouse Offload (ETL and ELT and Preprocessing, Oh My!)
Introduce Myself
John Berns, Solutions Architect, APAC for MapR
I’ve been involved in Big Data for three years, using Hadoop for two.
(I go waaaaay back!)
I’m also co-founder of BigData.SG and Hadoop.SG http://bigdata.sg http://hadoop.sg
I’m a Hadoop nerd—and proud of it.
Traditional Data Warehouse
Arrival of Big Data impacts DW
BIG DATA
• Volume: Prohibitively expensive storage costs
• Variety: Inability to process unstructured formats
• Velocity: Faster arrival and processing needs
DW needs to accommodate Big Data
Scaling the Data Warehouse-MPP Databases
But There Are Some Problems Scaling
• Cost – Data Warehouse costs $$$,000’s per terabyte
• Works only on relational data; doesn’t like unstructured data
• Fixed schema: you can only query the data in ways that are predefined by the existing schema
Accommodating Big Data
RDBMS
Sensor Data
Web Logs
RDBMS
• Only structured data
• $50K – 100K per TB
• Limited analytics
Hadoop
• Both structured and unstructured data
• 50x-100x cost savings: $1K per TB
• Expanded analytics with MapReduce, NoSQL etc.
FROM
DW: ETL + Long Term Storage, Query + Present
TO
Hadoop: ETL + Long Term Storage
DW: Query + Present
Data Warehouse Meets Big Data
• Use ELT to handle semi-structured (or even unstructured) data
• ELT applies structure after the data is loaded
• Use compute power to do the transformation
• Can be done in parallel: that’s what Hadoop is good for!
• ELT for ETL – process semi-structured data & save structured data
• Connect via ODBC or JDBC and execute queries on the fly
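The ELT idea above can be sketched in a few lines of Python: raw records are kept exactly as they arrived, and structure is applied afterwards, one partition at a time, in parallel. This is only an illustration under stated assumptions. The sample sensor records, the three-field layout, and the names `RAW_PARTITIONS`, `transform_partition` and `elt` are hypothetical, and a local thread pool stands in for the parallelism a Hadoop cluster would provide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw data, already "loaded" untouched and split into partitions.
RAW_PARTITIONS = [
    ["sensor-1,2013-10-01T00:00:00,21.5", "sensor-2,2013-10-01T00:00:00,19.0"],
    ["sensor-1,2013-10-01T00:01:00,21.7", "not a valid record"],
]


def transform_partition(lines):
    """Apply structure to one partition of raw lines, skipping malformed ones."""
    rows = []
    for line in lines:
        parts = line.split(",")
        if len(parts) != 3:
            continue  # the raw line stays in storage; it just yields no row
        sensor, timestamp, value = parts
        try:
            rows.append(
                {"sensor": sensor, "timestamp": timestamp, "value": float(value)}
            )
        except ValueError:
            continue
    return rows


def elt(partitions):
    """Transform every partition in parallel and combine the structured rows."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(transform_partition, partitions))
    return [row for rows in results for row in rows]


structured = elt(RAW_PARTITIONS)
for row in structured:
    print(row)
```

Because the raw lines are never modified, a different structure can be applied to the same data later without reloading it.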
ELT: Applying Schema on Load
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;
Read Semi-Structured Data & Create Structure
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
host 127.0.0.1
identity user-identifier
user frank
time 10/Oct/2000:13:55:36 -0700
request GET /apache_pb.gif HTTP/1.0
status 200
size 2326
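To see how a regular expression turns the log line above into named columns, here is a minimal Python sketch. The pattern is a Python rendering of the one in the Hive table definition, written with the character classes `[^ ]` and `[^\]]` as an assumption about what the slide intended. The function `parse_line` is hypothetical and is not part of Hive. The pattern itself captures the time with its square brackets and the request with its quotes, so the sketch strips them in an extra step to match the values shown above.

```python
import re

# Assumed Python equivalent of the "input.regex" pattern: seven space-separated groups.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# Column names taken from the CREATE TABLE statement.
COLUMNS = ["host", "identity", "user", "time", "request", "status", "size"]


def parse_line(line):
    """Return a dict of column name to value, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    row = dict(zip(COLUMNS, match.groups()))
    # Extra cleanup step, not done by the pattern: drop the brackets and quotes.
    row["time"] = row["time"].strip("[]")
    row["request"] = row["request"].strip('"')
    return row


SAMPLE = (
    '127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)

row = parse_line(SAMPLE)
for name in COLUMNS:
    print(name, row[name])
```

Every column comes back as a string, which mirrors the all-`STRING` schema in the table definition; casting `status` and `size` to numbers would be a later transformation step.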
MapR Strengths for DW Offload
Best ROI
• 2x Performance
• No custom connectors
• Unlimited scale
Easiest Integration
• Works with existing tools
• Streaming ingestion and extraction
Enterprise Grade Platform
• 99.999% HA
• Full data protection
• Disaster recovery
MapR Customer Case Study
Large Telecom Company
• Deployed billing applications using Teradata
• Hundreds of users and applications across the enterprise
OLD: Teradata
• All ETL steps done in Teradata
• Cost prohibitive scaling
• Data warehouse team not able to handle new data formats
NEW: Hadoop + Teradata
• Replaced 5 out of 7 ETL steps
• Only hot data is stored in EDW
• Existing applications not affected
• Extensively leverage NFS to directly ingest data into Teradata
Data Shape: Telecom
• Lots of Data
• Lots of Scans Across Large Sets
• Throughput Important
Original Flow – ELTL
[Diagram: CDR billing records, ETL, Data Warehouse, Billing reports, Customer bills]
With ETL Offload
[Diagram: CDR billing records, ETL, Data Warehouse, Billing reports, Customer billing]
Price Performance
EDW strategy
– 1.5x performance
– $30 million
MapR strategy
– 3x performance
– $3 million
20x cost/performance advantage for MapR strategy
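The 20x figure follows from dividing each strategy's performance multiple by its cost and comparing the two ratios. A short Python sketch of that arithmetic, using only the numbers on this slide (the function name `performance_per_million` is made up for illustration):

```python
def performance_per_million(performance_multiple, cost_millions):
    """Performance gained per million dollars spent."""
    return performance_multiple / cost_millions


# Figures from the slide: EDW is 1.5x for $30 million, MapR is 3x for $3 million.
edw = performance_per_million(1.5, 30)
mapr = performance_per_million(3, 3)

# Ratio of the two price/performance figures.
advantage = mapr / edw
print(f"EDW:  {edw:.2f}x per $1M")
print(f"MapR: {mapr:.2f}x per $1M")
print(f"Advantage: {advantage:.0f}x")
```

Put another way, the MapR strategy delivers twice the performance at one tenth of the cost, and 2 × 10 = 20.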
MapR Customer Case Study continued
Business Impact:
• Saved $30M in 5 year TCO
• Able to store all data and have a scalable architecture for future
• Do not have to maintain any special connectors
• A happy Ops team enhancing services for its internal customers with MapReduce
• Implemented the change without impacting internal users
Wrapping It Up…
My contact info:
[email protected]
http://www.linkedin.com/in/jfxberns
Find the slides at:
http://www.slideshare.net
Whitepaper with more details on Data Warehouse Offload:
http://www.mapr.com/solutions/data-warehouse-offload