Data Warehouse Offload (ETL and ELT and Preprocessing, Oh My!)

Data Warehouse Offload


DESCRIPTION

Presented at BigData.SG, October 2013


Page 1: Data Warehouse Offload

1©MapR Technologies - Confidential

Data Warehouse Offload (ETL and ELT and Preprocessing, Oh My!)

Page 2: Data Warehouse Offload


Introduce Myself

John Berns, Solutions Architect, APAC for MapR

I’ve been involved in Big Data for three years, using Hadoop for two.

(I go waaaaay back!)

I’m also co-founder of BigData.SG and Hadoop.SG http://bigdata.sg http://hadoop.sg

I’m a Hadoop nerd—and proud of it.

Page 3: Data Warehouse Offload


Traditional Data Warehouse

Page 4: Data Warehouse Offload


Arrival of Big Data impacts DW

BIG DATA

Volume

Variety

Velocity

Prohibitively expensive storage costs

Inability to process unstructured formats

Faster arrival and processing needs

DW needs to accommodate Big Data

Page 5: Data Warehouse Offload


Scaling the Data Warehouse: MPP Databases

Page 6: Data Warehouse Offload


But There Are Some Problems Scaling

• Cost: Data Warehouse costs $$$,000’s per terabyte
• Works only on relational data; doesn’t like unstructured data
• Fixed schema: you can only query the data in ways that are predefined by the existing schema

Page 7: Data Warehouse Offload


Accommodating Big Data

Sources: RDBMS, Sensor Data, Web Logs

RDBMS
• Only structured data
• $50K – 100K per TB
• Limited Analytics

Hadoop
• Both structured and unstructured data
• 50x-100x cost savings: $1K per TB
• Expanded analytics with MapReduce, NoSQL etc.

FROM
• DW: ETL + Long Term Storage, Query + Present

TO
• Hadoop: ETL + Long Term Storage
• DW: Query + Present

Page 8: Data Warehouse Offload


Data Warehouse Meets Big Data

• Use ELT to handle semi-structured (or even unstructured) data
• ELT applies structure after the data is loaded
• Use compute power to do the transformation
• Can be done in parallel: that’s what Hadoop is good for!
• ELT for ETL: process semi-structured data & save structured data
• Connect via ODBC or JDBC and execute queries on the fly

Page 9: Data Warehouse Offload


ELT: Applying Schema on Load

-- The contrib RegexSerDe returns each capture group as text, so every column is STRING.
-- The regex has one capture group per column, in column order.
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE;
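A minimal sketch of how the apachelog table might be used once it is defined. The jar location and the HDFS path are hypothetical placeholders, not values from the talk; substitute whatever your installation uses.

```sql
-- Hypothetical jar path: the contrib SerDe classes must be on Hive's classpath.
ADD JAR /path/to/hive-contrib.jar;

-- Hypothetical HDFS path: moves the raw log file into the table's directory.
-- The file is not parsed here; the regex is applied when rows are read.
LOAD DATA INPATH '/data/weblogs/access.log' INTO TABLE apachelog;

-- Count requests per HTTP status code.
SELECT status, COUNT(*) AS hits
FROM apachelog
GROUP BY status;
```

Because the load step only moves the file, changing the regex later does not require reloading the data.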

Page 10: Data Warehouse Offload


Read Semi-Structured Data & Create Structure

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

host       127.0.0.1
identity   user-identifier
user       frank
time       10/Oct/2000:13:55:36 -0700
request    GET /apache_pb.gif HTTP/1.0
status     200
size       2326
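The "process semi-structured data & save structured data" step could be sketched as a CREATE TABLE AS SELECT over the apachelog table. The target table name weblog_structured and the aliases status_code and bytes_sent are hypothetical, chosen only for illustration.

```sql
-- Hypothetical target table: typed columns derived from the all-STRING apachelog table.
CREATE TABLE weblog_structured AS
SELECT
  host,
  request,
  CAST(status AS INT)  AS status_code,  -- a "-" in the log becomes NULL
  CAST(size AS BIGINT) AS bytes_sent    -- a "-" in the log becomes NULL
FROM apachelog;
```

The resulting table holds typed, structured rows that a downstream data warehouse or an ODBC/JDBC client can query without knowing the log format.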

Page 11: Data Warehouse Offload


Accommodating Big Data

Sources: RDBMS, Sensor Data, Web Logs

RDBMS
• Only structured data
• $50K – 100K per TB
• Limited Analytics

Hadoop
• Both structured and unstructured data
• 50x-100x cost savings: $1K per TB
• Expanded analytics with MapReduce, NoSQL etc.

FROM
• DW: ETL + Long Term Storage, Query + Present

TO
• Hadoop: ETL + Long Term Storage
• DW: Query + Present

Page 12: Data Warehouse Offload


MapR Strengths for DW Offload

Best ROI
• 2x Performance
• No custom connectors
• Unlimited scale

Easiest Integration
• Works with existing tools
• Streaming ingestion and extraction

Enterprise Grade Platform
• 99.999% HA
• Full data protection
• Disaster recovery

Page 13: Data Warehouse Offload


MapR Customer Case Study

Large Telecom Company
• Deployed billing applications using Teradata
• Hundreds of users and applications across the enterprise

OLD (Teradata)
• All ETL steps done in Teradata
• Cost-prohibitive scaling
• Data warehouse team not able to handle new data formats

NEW (Teradata + Hadoop)
• Replaced 5 out of 7 ETL steps
• Only hot data is stored in EDW
• Existing applications not affected
• Extensively leverage NFS to directly ingest data into Teradata

Page 14: Data Warehouse Offload


Data Shape: Telecom
• Lots of Data
• Lots of Scans Across Large Sets
• Throughput Important

Page 15: Data Warehouse Offload


Original Flow – ELTL

Diagram labels: CDR billing records, ETL, Data Warehouse, Billing reports, Customer bills

Page 16: Data Warehouse Offload


With ETL Offload

Diagram labels: CDR billing records, ETL, Data Warehouse, Billing reports, Customer billing

Page 17: Data Warehouse Offload


Price Performance

EDW strategy
• 1.5x performance
• $30 million

MapR strategy
• 3x performance
• $3 million

20x cost/performance advantage for MapR strategy
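One way to read the 20x figure, assuming both performance multipliers are measured against the same baseline and that "cost/performance" means performance gained per dollar spent (the slide does not spell this out):

```latex
\[
\frac{\text{perf}_{\text{MapR}} / \text{cost}_{\text{MapR}}}
     {\text{perf}_{\text{EDW}} / \text{cost}_{\text{EDW}}}
= \frac{3 / 3\,\text{M}}{1.5 / 30\,\text{M}}
= \frac{1.0}{0.05}
= 20
\]
```

Put differently, the MapR strategy delivers twice the performance at one tenth of the cost, and 2 × 10 = 20.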

Page 18: Data Warehouse Offload


MapR Customer Case Study continued

Business Impact:
• Saved $30M in 5-year TCO
• Able to store all data and have a scalable architecture for the future
• Do not have to maintain any special connectors
• A happy Ops team enhancing services for its internal customers with MapReduce
• Implemented the change without impacting internal users

Page 19: Data Warehouse Offload


Wrapping It Up…

My contact info:

[email protected]
http://www.linkedin.com/in/jfxberns

Find the slides at:

http://www.slideshare.net

Whitepaper with more details on Data Warehouse Offload:

http://www.mapr.com/solutions/data-warehouse-offload