Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster and Jonathan Seidman Hadoop World 2011 November 8 | 2011

Extending the Data Warehouse with Hadoop - Hadoop world 2011

DESCRIPTION

Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.

Extending the Enterprise Data Warehouse with Hadoop

Robert Lancaster and Jonathan Seidman

Hadoop World 2011

November 8 | 2011

Who We Are

•  Robert Lancaster

– Solutions Architect, Hotel Supply Team

–  [email protected]

– @rob1lancaster

– Co-organizer of Chicago Big Data and Chicago Machine Learning Study Group.

•  Jonathan Seidman

–  Lead Engineer, Business Intelligence/Big Data Team

– Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data

–  [email protected]

– @jseidman

Launched in 2001

Over 160 million bookings

Some History…

In 2009…

•  The Machine Learning team is formed to improve site performance, for example by improving hotel search results.

•  This required access to large volumes of behavioral data for analysis.

–  Fortunately, the required data was collected in session data stored in web analytics logs.

The Problem…

•  The only archive of the required data went back about two weeks.

[Diagram labels: Data Warehouse; transactional data (e.g. bookings) and aggregated non-transactional data; non-transactional data (e.g. searches).]

Hadoop Provided a Solution…

[Diagram labels: Data Warehouse; transactional data (e.g. bookings) and aggregated non-transactional data; Hadoop; detailed non-transactional data (what every user sees, clicks, etc.).]

Deploying Hadoop Enabled Multiple Applications…

[Chart: percentage (0% to 100%) across values 1 to 20; series: Queries and Searches; labeled data points: 2.78%, 34.30%, 31.87%, 71.67%.]

And Useful Analyses…

But Brought New Challenges…

•  Most of these efforts are driven by development teams.

•  The challenge now is unlocking the value of this data for non-technical users.

In Early 2011…

•  The Big Data team is formed under the Business Intelligence team at Orbitz Worldwide.

•  Allows the Big Data team to work more closely with the data warehouse and BI teams.

•  Reflects the importance of big data to the future of the company.

A View Shared Beyond Orbitz…

“We strongly believe that Hadoop is the nucleus of the next-generation cloud EDW…”

“…but that promise is still three to five years from fruition.”*

*James Kobielus, Forrester Research, “Hadoop, Is It Soup Yet?”

Two Primary Ways We Use Hadoop to Complement the EDW

•  Extraction and transformation of data for loading into the data warehouse – “ETL”.

•  Off-loading of analysis from the data warehouse.

ETL Example: Proposed Dimensional Model

[Diagram: Raw logs → Hadoop → Dimensional model]

ETL Example: Click Data Processing

[Diagram: Current Processing in Data Warehouse. Labels: web servers, logs, ETL, DW, data cleansing (stored procedure), DW. Annotations: several hours of processing; ~20% of original data size.]

ETL Example: Click Data Processing

•  Moving to Hadoop will:

– Remove load from the data warehouse.

– Allow additional attributes to be added for processing.

– Allow processing to be run more frequently.

[Diagram: Proposed Processing in Hadoop. Labels: web servers, logs, Hadoop, data cleansing (MapReduce), DW.]
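
A minimal Python sketch of what a MapReduce-style data cleansing step could look like. It is not the job described in the talk: the tab-separated log layout, the bot filter, and the names `mapper` and `reducer` are all assumptions made for illustration, and the in-memory sort only stands in for Hadoop's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tab-separated log layout (not from the talk):
# timestamp, session_id, url, user_agent
EXPECTED_FIELDS = 4
BOT_MARKERS = ("bot", "crawler", "spider")  # illustrative filter only


def mapper(lines):
    """Map step: drop malformed and bot records, emit (session_id, url)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != EXPECTED_FIELDS:
            continue  # malformed record
        timestamp, session_id, url, user_agent = fields
        if not session_id or any(m in user_agent.lower() for m in BOT_MARKERS):
            continue  # missing key or automated traffic
        yield session_id, url


def reducer(pairs):
    """Reduce step: count cleaned clicks per session.

    Sorting here stands in for Hadoop's shuffle/sort phase, which
    delivers all values for one key to the reducer together.
    """
    for session_id, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield session_id, sum(1 for _ in group)


sample = [
    "2011-11-08T10:00:00\ts1\t/hotels\tMozilla/5.0",
    "2011-11-08T10:00:05\ts1\t/hotels/123\tMozilla/5.0",
    "2011-11-08T10:00:07\ts2\t/flights\tGooglebot/2.1",
    "truncated line",
]
cleaned_counts = dict(reducer(mapper(sample)))
```

In this sample the bot request and the truncated line are discarded, so only session `s1` survives. Because each log line is cleaned independently, the map step can run in parallel across the cluster, outside the data warehouse.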

Analysis Example: Geo-Targeting Ads

•  Facilitated analysis that allows for more personalized ad content.

•  Allowed the marketing team to analyze over a year's worth of search data.

•  Provided analysis that was difficult to perform in the data warehouse.

BI Vendors Are Working on Hadoop Integration

Both big (relatively)…

And small…

Example Processing Pipeline for Web Analytics Data

Example Use Case: Selection Errors

Use Case – Selection Errors: Introduction

•  Multiple points of entry.

•  Multiple paths through site.

•  Goal: tie events together to form a picture of customer behavior.
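
One way to picture "tying events together" is to group events by session and order them in time. The sketch below is a hedged illustration only: the `(session_id, timestamp, event_name)` record shape, the event names, and the function `build_session_paths` are hypothetical and do not come from the talk.

```python
from collections import defaultdict


def build_session_paths(events):
    """Group events by session and order each group by timestamp.

    `events` is an iterable of (session_id, timestamp, event_name)
    tuples; this record shape is assumed for illustration.
    """
    by_session = defaultdict(list)
    for session_id, timestamp, event_name in events:
        by_session[session_id].append((timestamp, event_name))
    return {
        session_id: [name for _, name in sorted(items)]
        for session_id, items in by_session.items()
    }


events = [
    ("s1", 3, "select_hotel"),
    ("s2", 1, "landing_page"),
    ("s1", 1, "search"),
    ("s1", 2, "view_results"),
    ("s2", 2, "search"),
]
paths = build_session_paths(events)
```

Each resulting path is the ordered list of steps one session took through the site, regardless of entry point.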

Use Case – Selection Errors: Processing

Use Case – Selection Errors: Visualization

Example Use Case: Beta Data

Use Case – Beta Data: Introduction

•  Hotel Sort Optimization

•  Compare A vs. B

•  Web Analytics Data

–  What user saw.

–  How user behaved.

•  Server Log Data

–  Sorting behavior used.
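
The A vs. B comparison above depends on combining the two sources. As a rough sketch under assumed data shapes (a session-to-variant mapping from server logs and a session-to-booking flag from web analytics, neither taken from the talk), the join and a per-variant booking rate might look like this in Python; `conversion_by_variant` is a hypothetical helper.

```python
from collections import defaultdict


def conversion_by_variant(server_log, web_analytics):
    """Join the two sources on session id and compute booking rate per variant.

    server_log:    {session_id: variant}, e.g. which sort was applied
    web_analytics: {session_id: booked}, True if the session ended in a booking
    Both shapes are assumptions made for this sketch.
    """
    totals = defaultdict(int)
    bookings = defaultdict(int)
    for session_id, variant in server_log.items():
        if session_id not in web_analytics:
            continue  # inner join: skip sessions missing from either source
        totals[variant] += 1
        bookings[variant] += int(web_analytics[session_id])
    return {variant: bookings[variant] / totals[variant] for variant in totals}


server_log = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "B"}
web_analytics = {"s1": True, "s2": False, "s3": True, "s4": True}
rates = conversion_by_variant(server_log, web_analytics)
```

Session `s5` appears only in the server log, so the inner join drops it before the rates are computed.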

Use Case – Beta Data: Processing

Use Case – Beta Data: Visualization

Example Use Case: RCDC

Use Case – RCDC: Introduction

•  Understand and improve cache behavior.

•  Improve “coverage”

–  Traditionally search 1 page of hotels at a time.

– Get “just enough” information to present to consumers.

–  Increase amount of availability information we have when consumer performs a search.

•  Data needed to support needs beyond reporting.

Use Case – RCDC: Processing

Use Case – RCDC: Visualization

Conclusions

•  Hadoop market is still immature, but growing quickly. Better tools are on the way.

–  Look beyond the usual (enterprise) suspects. Many of the most interesting companies in the big data space are small startups.

•  Hadoop will not replace your data warehouse, but any organization with a large data warehouse should at least be exploring Hadoop as a complement to their BI infrastructure.

Conclusions

•  Work closely with your existing data management teams.

– Your idea of what constitutes “big data” might quickly diverge from theirs.

•  The flip-side to this is that Hadoop can be an excellent tool to off-load resource-consuming jobs from your data warehouse.
