Best Practices for Using Hadoop as an Enterprise Data Hub

Mike Ferguson – Intelligent Business Strategies
Steve Wooledge – MapR
June 18, 2014

© 2014 MapR Technologies

MapR Enterprise Data Hub Webinar w/ Mike Ferguson


DESCRIPTION

Data volumes have grown explosively in recent years, and that data is generated by increasingly complex and varied sources. Harnessing and refining value from this data requires a new approach, as data extraction, transformation, and loading (ETL) is becoming more costly and difficult to scale. Organizations are looking to use Hadoop as an enterprise data hub (also called a “data lake” or “data reservoir”) as a key component of their data architecture, augmenting their data warehouse, ETL and analytical systems in order to maximize existing investments, reduce costs, and unlock new business value from their data.

In this webinar, you will learn:

• Real-world examples that illustrate why Hadoop is the best low-cost data hub, data lake, or data landing zone (staging area) option for ETL processing
• Proof points that demonstrate advantages of Hadoop and its ability to scale to manage increasing data volumes and support exploratory big data analytics
• Proven best practices for a cost-effective, reliable way to implement a data management platform for your entire big data analytical ecosystem
• Hidden issues to be aware of in deploying your data hub/data lake


Page 2: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


About Mike Ferguson

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 32 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.

www.intelligentbusiness.biz

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700

Page 3: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

The Hadoop Data Refinery and Enterprise Data Hub

Mike Ferguson Managing Director Intelligent Business Strategies June 2014

Page 4: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Topics

• Data warehousing and the evolution of ETL processing
• New data and new analytical workloads
• Big data use cases driving business agendas
• The unprecedented demand for customer insight
• Challenges with new big data sources
• Beyond the data warehouse – new platforms for new analytical workloads
• The role of Hadoop in the modern analytical ecosystem
• Introducing the Hadoop enterprise data hub and data refinery
• Simplifying access to new big data insight using SQL on Hadoop
• Integrating Hadoop into your analytical ecosystem

Page 5: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

For Many Years The Traditional Data Warehouse and BI Environment Has Been Used For Analysis & Reporting

[Diagram: operational systems → data integration / DQ → data warehouse & data marts (DW) → BI tools platform → reports & analytics, delivered via web and portal to employees, partners and customers]

Page 6: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

The Evolution of Data Integration in Data Warehousing – From Hand Coded to ETL to ELT

[Diagram: evolution of data warehousing]
1. Hand coded ETL programs load the DW
2. ETL servers load the DW
3. ELT processing: generated SQL runs inside the DW on MPP RDBMS systems

Page 7: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

New Data Sources Have Emerged Inside And Outside The Enterprise That Business Now Wants To Analyse

[Diagram labels: Suppliers; Supply Chain; Back Office; Operations; Front Office; Customers; Sales; Marketing; Service; Credit Verification; HR; Finance; Planning; Procurement; Product/service lines 1 to n. New sources: e.g. RFID tag, sensor networks, weather data. Pressures: data volume, data variety, data velocity, number of sources]

Page 8: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Popular Big Data Analytic Applications – Web Data

• Clickstream analytics
  • Site navigation behaviour (session) analysis
    – Paths to buy, paths to abandonment, what else they looked at
    – Improve customer experience and conversion
    – Associate clicks with customers & prospects
• Social network influencer analysis
  • Graph analytics for influencer behavioural impact analysis
  • ‘Target the influencer’ marketing campaign effectiveness
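The session analysis mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the 30-minute inactivity timeout, the click tuple layout and the `/checkout/complete` page name are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout, not from the slides


def sessionize(clicks):
    """Group (user_id, timestamp, page) clicks into per-user sessions.

    A new session starts when the gap since the user's previous click
    exceeds SESSION_GAP. Returns {user_id: [[page, ...], ...]}.
    """
    by_user = defaultdict(list)
    for user_id, ts, page in clicks:
        by_user[user_id].append((ts, page))

    sessions = {}
    for user_id, events in by_user.items():
        events.sort(key=lambda event: event[0])
        user_sessions = []
        last_ts = None
        for ts, page in events:
            if last_ts is None or ts - last_ts > SESSION_GAP:
                user_sessions.append([])  # start a new navigation path
            user_sessions[-1].append(page)
            last_ts = ts
        sessions[user_id] = user_sessions
    return sessions


def ended_in_purchase(path, buy_page="/checkout/complete"):
    """True if a navigation path ends on the (hypothetical) purchase page."""
    return bool(path) and path[-1] == buy_page


clicks = [
    ("u1", datetime(2014, 6, 18, 9, 0), "/home"),
    ("u1", datetime(2014, 6, 18, 9, 5), "/product/42"),
    ("u1", datetime(2014, 6, 18, 9, 7), "/checkout/complete"),
    ("u1", datetime(2014, 6, 18, 14, 0), "/home"),
    ("u2", datetime(2014, 6, 18, 9, 1), "/home"),
]
sessions = sessionize(clicks)
```

Paths that end in a purchase are "paths to buy"; the rest are candidates for abandonment analysis. At real clickstream volumes the same grouping would run as a distributed job rather than in memory.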

Page 9: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Popular Big Data Analytic Applications – Sensor Data For Improving Process Efficiency and Optimisation

• Sustainability analytics e.g. energy optimisation
• Supply/distribution chain optimisation
• Asset management and field service optimisation
• Manufacturing production line optimisation
• Location based advertising (mobile phones)
• Grid health monitoring
  • Electricity, water, mobile phone cell network…
• Smart metering (collect data every 15 minutes)
• Fraud
• Healthcare – ITC vital signs, fit bits,…
• Traffic optimisation

→ WHAT ARE YOU PREPARED TO INSTRUMENT?

[Image: e.g. RFID tag]

Page 10: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Popular Big Data Analytic Applications – Unstructured Data

• Case management
• Fault management and field service optimisation
• “Voice of the customer”
• Sentiment analytics
• Competitor analysis
• Media coverage analysis
• Improve pharma drug trials

→ Unstructured content is hard to analyse

How much is TEXT worth to your business?

Page 11: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Big Data Analytics - Industry Use Case Examples

Industry: Use case examples
Financial Services: Improved risk decisions, KYC customer insight, auto programmatic trading, 360 view of financial crime, pre-trade decision support, real-time trade & corp action tagging for compliance and RT P&L, grow security services outsourcing, Reference Data Exchange
Utilities: Smart meter data analysis, pricing elasticity analysis, customer loyalty, sustainability, asset management
Telecommunications: Customer churn, network optimization analysis from device, sensor and GPS inputs, monetization of GPS and data
Manufacturing: Sensor data for next generation ‘smart’ products, production line optimisation, improved customer service and improved field service, distribution chain optimization, asset management
Insurance: “How you drive” insurance (sensors to reduce risk), broker document analysis (risk assessment)
Government: Smart cities (e.g. transportation optimisation), anti-terrorism, law enforcement
Logistics: Distribution optimisation, route optimisation

Page 12: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

More Data Is Required To Get A Deeper Understanding of Customers

• We now need
  • Transaction data
  • Data from touch points you own
  • Data from the touch points you don’t own
  • Interaction data
    – Need to look at inbound interactions vs outbound interactions
    – Social interactions
  • Master data
  • Professional data e.g. profiles on LinkedIn
  • Internal and external event data
  • Competition data…
• Then use analytics to understand and predict desire and propensity e.g. propensity to churn

Page 13: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Top Priorities - Improving Customer Experience Via Time Series Analysis of All Customer Interactions

OMNI channel – analyse all customer interactions across all channels

[Diagram: Customer “DNA” = identity data + behavioural data + social data]

Page 14: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Customer Experience Management - Understanding Customer On-Line Behaviour is Mission Critical to Retention and Growth

[Diagram: Customer “DNA” = identity data + behavioural data + social data]

• Important new data sources for analysis for customer ‘DNA’
  • Clickstream data from web logs
  • Sentiment and social network influencer data

[Diagram: on the web the customer is king: new competitors; more choice; voice of the customer; on the move; easy to find]

Page 15: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Today Both Structured And Multi-Structured Data Are Needed For Deeper Insight

Multi-structured data (often un-modelled and may not be well understood):
• Click stream web log data
• Customer interaction data
• Social interaction data
• Sensor data
• Rich media data (video, audio)
• External content
• Documents
• Internal web content
• Seismic data (oil & gas)

Structured data (often a schema is defined and data is well understood):
• OLTP system data
• Data warehouse data
• Personal data stores e.g. Excel, Access

Data characteristics are changing - Companies must deal with volume, variety and velocity

Page 16: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Big Data Analytics Challenges Include The Analysis of Unstructured, Semi-structured and Structured Data

JSON data:

{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home", "number": "0161-123-1234" },
    { "type": "mobile", "number": "07779-123234" }
  ]
}

Text data

Image Data

Makes analysis more complex with new analytics and visualisations needed
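To make the extra complexity concrete, here is a minimal sketch of one common preparation step for semi-structured data: flattening a nested JSON record into a single row of dotted column names so that tabular tools can consume it. The `flatten` helper and the dotted naming scheme are illustrative assumptions, not something the slides prescribe.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot added by the parent level.
        flat[prefix.rstrip(".")] = obj
    return flat


record = json.loads("""
{
  "firstName": "Wayne",
  "address": {"city": "Manchester", "country": "England"},
  "phoneNumbers": [
    {"type": "home", "number": "0161-123-1234"},
    {"type": "mobile", "number": "07779-123234"}
  ]
}
""")
row = flatten(record)
```

After flattening, `row` has keys such as `address.city` and `phoneNumbers.1.type`. Records with different shapes produce different column sets, which is exactly why schema handling is harder here than with structured data.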

Page 17: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Increased Data and Analytical Complexity Has Created A Need For A New Role – The Data Scientist

Image source: Wikipedia

Data Science is the process of investigative / exploratory analysis of multi-structured data to discover and produce new business insights

Image source: www.computing.co.uk

Page 18: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

People In Different Roles In The Analytical Landscape Need To Work Together To Deliver Business Value

Data Scientist (works in a sandbox):
• Exploratory analysis
• Predictive / statistical model producer

Business Analyst:
• Model consumer
• Data visualisation
• Information producer
  • Build reports
  • Build and publish dashboards

Business Manager / Operations worker / Customer:
• Information consumer
• Decision maker
• Action taker

[Diagram: Business Strategy – strategic objectives and targets including sustainability targets. A table with one row per objective (1 to 4) and the columns: Strategic Business Objective; Priority KPI; Current KPI Value; What is +1% worth?; KPI Target; Executive Accountable; Business Initiatives (projects); Budget Allocation (£ x Million); Action Plan]

Page 19: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Data Science Produces New Insights For Business Analysts Who Produce Actionable BI For Front Office Decision Makers

[Diagram: Data Scientist (big data analytics) → new insights → Business Analyst (traditional DW/BI) → actionable BI → Marketing Manager / Marketing, Sales and Service workers]

[Diagram labels: Data Quality; Forecasting; Segmentation; Models; Customer Lifetime Value; Social Network; Strategy Creation; Performance & Effectiveness Reporting; Direct Mail; Understand Customer Behavior & Navigation; Marketing Performance & Reporting; Campaign Planning; Financial Planning; Creative Materials; Marketing Attribution; Operations Management; Channel Efficiency; Sentiment & Influence; Dynamic Content; Re-marketing; Workflow & Approvals. Channels: Web; Call Center; Live Event; Broadcast Media; Mobile/SMS; Social; Email; Industry Specific]

Page 20: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Big Data Analytics Has Taken Us Beyond The Traditional DW – New Big Data Analytical Workloads

1.  Analysis of data in motion

2.  Complex analysis of structured data

3.  Exploratory analysis of un-modeled multi-structured data

4.  Graph analysis e.g. social networks

5.  Accelerating ETL and analytical processing of un-modeled data to enrich data in a data warehouse or analytical appliance

6.  The storage and re-processing of archived data

Page 21: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads

Big Data workloads result in multiple platforms now being needed for analytical processing

Platform: Workload
Streaming data: Real-time stream processing & decision m’gmt
Hadoop data store: Advanced analytics (multi-structured data); investigative analysis, data refinery
Data Warehouse RDBMS (EDW, DW & marts): Traditional query, reporting & analysis
Analytical RDBMS (DW appliance, mart): Advanced analytics (structured data); data mining, model development
NoSQL DBMS e.g. graph DB: Graph analysis

Page 22: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Hadoop Is A Key Platform In Big Data Analytics – Data Can Be Accessed Via Multiple APIs

[Diagram: files stored in HDFS behind the HDFS API. Access paths shown:
• Java MapReduce APIs to HDFS, HBase, Cascading
• webHDFS (an HTTP interface to HDFS that has REST APIs)
• Pig Latin scripts
• MapReduce applications
• Storm applications
• BI tools & applications issuing SQL to a vendor SQL on Hadoop engine (with indexes and index partitions)
Frameworks shown on YARN: MapReduce, Tez or Spark; HBase]
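As a small illustration of the webHDFS access path named on this slide, the sketch below builds WebHDFS REST URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The host name, port and file paths are hypothetical placeholders; the block only constructs URLs and does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an HDFS path and operation.

    `params` holds optional extra query parameters such as user.name.
    """
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and paths, for illustration only.
open_url = webhdfs_url("namenode.example.com", 50070,
                       "/logs/web/2014-06-18.log", "OPEN")
list_url = webhdfs_url("namenode.example.com", 50070,
                       "/logs/web", "LISTSTATUS", {"user.name": "analyst"})
```

An HTTP GET on `open_url` would read the file and one on `list_url` would list the directory, so any tool that can speak HTTP can reach HDFS data without a Java client.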

Page 23: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Defacto Standard APIs Allow Hadoop Components To Be Replaced e.g. Faster, More Secure File System Than HDFS

[Diagram: the same stack as the previous slide, with files stored in a Vendor Specific File System (e.g. ) behind the HDFS API. Access paths shown:
• Java MapReduce APIs to HDFS, HBase, Cascading
• webHDFS (an HTTP interface to HDFS that has REST APIs)
• Pig Latin scripts
• MapReduce applications
• Storm applications
• BI tools & applications issuing SQL to a vendor SQL on Hadoop engine (with indexes and index partitions)
Frameworks shown on YARN: MapReduce, Tez or Spark; HBase]

Page 24: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Apache Hadoop Components

Component: Description
Hadoop HDFS: A distributed file system that partitions files across multiple machines for high-throughput access to application data. The HDFS API allows vendors to replace HDFS with an alternative
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A programming framework for distributed batch processing of large data sets distributed across multiple servers
Avro: A serialization system that creates & reads files in a format containing both JSON data definitions & the data itself for dynamic interpretation of the data by applications
Hive: A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs
HBase: An open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
Pig: A high-level data-flow language for expressing Map/Reduce programs for processing and analysing large HDFS distributed data sets
Mahout: A scalable machine learning and data mining library
Oozie: A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig, Hive, and Sqoop jobs)
Spark: A general purpose engine for large scale data processing in-memory. It supports analytical applications that wish to make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data
Zookeeper: A high-performance coordination service for distributed applications
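The MapReduce programming model described in the table can be illustrated with a single-process sketch. It is not the Hadoop API: the function names are made up for this example, and on a real cluster the map and reduce phases run in parallel across many servers while the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data hub", "data refinery", "data hub"])
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread over as many machines as the data requires, which is what makes the model suitable for batch processing of very large data sets.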

Page 25: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

The Role of Hadoop - Data Is Arriving Faster Than We Can Consume It – How Good Is Your Filter?

[Diagram: DATA passes through a FILTER into the enterprise and its enterprise systems]

Page 26: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

New Requirement – The Managed Hadoop Enterprise Data Hub

[Diagram: ELT workflow in the enterprise data hub]
1. Load data into Hadoop: XML, JSON, web logs and other data land in the data reservoir (raw data)
2. Data refinery:
   • Parse & prepare data in Hadoop (MapReduce)
   • Transform & cleanse data in Hadoop (MapReduce)
   • Discover data in Hadoop (sandboxes)
3. New high value insights are published (pub/sub) to the EDW, graph DBMS and DW appliance

[Label: contains clean, high value data]

Page 27: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

What’s In An Enterprise Data Hub?

• A managed data reservoir (raw data)
  • Organised capture of multi-structured data
  • Includes real-time data capture
  • May include operational reporting
• A governed data refinery
  • Data integration and cleansing at scale
  • Analytical sandboxes to discover high value data
• Published, protected and secure high value insights
• Long-term storage of archived data from data warehouses

Page 28: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Real-time Data Capture – E.g. MapR Allows Web Log Data To Be Directly Streamed/Stored in Hadoop

MapR Direct Access NFS allows web log files to be stored directly on the Hadoop file system so that click stream is captured in real-time

[Diagram: web servers write web log files through Direct Access NFS into the MapR Distribution for Hadoop (HDFS)]

# mount localhost:/mapr /mapr
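Because the cluster is exposed as a mounted file system, a web server can capture click stream with ordinary file I/O. The sketch below shows that idea under stated assumptions: a temporary directory stands in for the `/mapr` mount point, and the `weblogs/access.log` path and log line format are hypothetical.

```python
import os
import tempfile


def append_log_line(mount_root, relative_path, line):
    """Append one log line to a file under the mounted file system."""
    target = os.path.join(mount_root, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(line.rstrip("\n") + "\n")
    return target


# A temporary directory stands in for the /mapr mount point in this sketch.
mount_root = tempfile.mkdtemp()
path = append_log_line(mount_root, "weblogs/access.log",
                       '127.0.0.1 - - "GET /home" 200')
append_log_line(mount_root, "weblogs/access.log",
                '127.0.0.1 - - "GET /cart" 200')

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

No Hadoop client library is involved; the application only needs to know the mount point, which is what makes this capture route attractive for existing web servers.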

Page 29: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

High Volume Data Capture - Column Family Databases

• Suitable for fast capture of large amounts of sparse, volatile data
  • Very fast capture and can hold vast amounts of data
  • Billions of rows containing thousands or millions of columns
• Provide column-centric storage; wide de-normalised big tables can also help simplify operational reporting if used with SQL-on-Hadoop e.g. SQL access to HBase
• Allow you to
  • Group together related columns into column families
  • Design column families to optimize the most common queries
  • Retrieve columnar data for multiple entities by iterating through a column family
  • Shard rows in a column family and distribute across many servers
  • Create indexes and secondary indexes
  • Support schema variance - columns in a column family can vary for every row

Page 30: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

NoSQL Column Family Databases - HBase

Row 1 → Column A = value, Column B = value, Column C = value
Row 2 → Column X = value, Column Y = value, Column Z = value

HBase Storage Architecture

• HMaster and several HRegionServers
• Regions (partitions) created automatically as tables grow
• HBase allows applications to directly read and write data

Page 31: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Column Families Can Be Stored In Different Files And Queries Will Only Retrieve The Column Family Needed

Source: Data Access for Highly-Scalable Solutions : Using SQL, NoSQL, and Polyglot Persistence, McMurtry, Oakley, Sharp, Subramanian, Zhang

Portfolio.* means all columns in the Portfolio column family

Data about a customer and their stock purchases are partitioned vertically by column family

Column family data can also be compressed
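The vertical partitioning by column family can be sketched with a small in-memory model. This is not the HBase API: the `ColumnFamilyTable` class, the row keys and the column values are all hypothetical, and the `Portfolio.*` selector simply mirrors the notation used on this slide.

```python
class ColumnFamilyTable:
    """Toy column-family store: row key -> family -> {column: value}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Rows are sparse: a row only stores the columns it actually has.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, selector):
        """Read 'Family.*' or 'Family.column' for one row, touching only that family."""
        family, _, column = selector.partition(".")
        cells = self.rows.get(row_key, {}).get(family, {})
        if column == "*":
            return dict(cells)
        return {column: cells[column]} if column in cells else {}


table = ColumnFamilyTable(["Customer", "Portfolio"])
table.put("cust-001", "Customer", "name", "A. Trader")
table.put("cust-001", "Portfolio", "ACME", 100)
table.put("cust-001", "Portfolio", "GLOBEX", 250)
table.put("cust-002", "Portfolio", "INITECH", 75)

portfolio = table.get("cust-001", "Portfolio.*")
```

The query for `Portfolio.*` never reads the `Customer` family, and the two rows hold different columns within the same family, which illustrates both selective retrieval and schema variance.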

Page 32: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Fast Data Capture – MapR-DB Is A High Speed Version of HBase Built Into The MapR Data Platform

HBase API

Source: MapR

Page 33: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Enterprise Data Hub – We Need A Data Refinery To Process And Clean Complex Data

Image source: http://www.hollyfrontier.com/navajo/

Page 34: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Evolution of Big Data Integration Is Following The Same Cycle as it Did in Data Warehousing

[Diagram: evolution of big data integration]
1. Hand coded ETL programs load Hadoop
2. ETL servers load Hadoop
3. ELT processing: generated MapReduce runs inside Hadoop

Page 35: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Scaling ETL In A Data Refinery By Generating Pig, Hive or 3GL MapReduce Code for In-Hadoop ELT Processing

Data Cleansing and Integration Tool: Extract → Parse → Clean → Transform → Analyse → Load → Insights

Option 1: ETL tool generates HQL (HiveQL), or converts generated SQL to HQL

Option 2: ETL tool generates Pig Latin (compiler converts every transform to a MapReduce job)

Option 3: ETL tool generates 3GL MapReduce code

Note:
• Generating native MapReduce code instead of HiveQL or Pig Latin would likely perform faster because there is no need to translate into MapReduce
• HiveQL is a subset of SQL, so check how ETL tools that generate HiveQL handle complex transformations. HiveQL on its own may not be enough, e.g. are Hive UDFs needed?
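To give a feel for what "generating" ELT code means, here is a toy generator that turns a column mapping into a HiveQL `INSERT OVERWRITE TABLE ... SELECT` statement. It is a sketch only: the `generate_hiveql` function, the table names and the column mapping are hypothetical, and commercial ETL tools produce far richer output.

```python
def generate_hiveql(source, target, mappings, where=None):
    """Generate one HiveQL statement from (target_column, expression) pairs."""
    select_list = ",\n  ".join(f"{expr} AS {column}" for column, expr in mappings)
    statement = (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source}"
    )
    if where:
        statement += f"\nWHERE {where}"
    return statement


# Hypothetical tables and cleansing rules, for illustration only.
hql = generate_hiveql(
    source="raw_customers",
    target="clean_customers",
    mappings=[
        ("customer_id", "id"),
        ("email", "lower(trim(email))"),
    ],
    where="id IS NOT NULL",
)
```

The transformation logic lives in a declarative mapping, and the generated statement runs where the data is stored, which is the essence of in-Hadoop ELT. A cleansing rule that cannot be expressed with built-in functions is the point at which a UDF would be needed.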

Page 36: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Need to Parse & Extract From Multi-Structured Data While Integrating Data In A Big Data Environment

E-mail (semi-structured)

Text (unstructured)

Extract Parse Transform Load …
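The parse step for semi-structured e-mail can be sketched with Python's standard `email` package: the headers give structured fields, while the body remains free text for later text analysis. The sample message, addresses and the `extract_fields` helper are invented for this illustration.

```python
from email import message_from_string

# Hypothetical message used only for this illustration.
RAW = """\
From: customer@example.com
To: support@example.com
Subject: Faulty smart meter
Date: Wed, 18 Jun 2014 09:30:00 +0000

The meter display has been blank since Monday.
"""


def extract_fields(raw_message):
    """Parse a plain-text e-mail into structured fields plus its text body."""
    msg = message_from_string(raw_message)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "sent": msg["Date"],
        "body": msg.get_payload().strip(),
    }


fields = extract_fields(RAW)
```

Only the header fields map cleanly to columns. The body would still need text analytics (for example sentiment or entity extraction) before it can be integrated with structured data.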

Page 37: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Sandboxes In The Data Refinery - Data Science Teams Need To Conduct Exploratory Analysis on Multi-Structured Data

[Diagram: data scientists carry out investigative / exploratory analysis in a sandbox to produce new business insights. Labels shown:
• Multi-structured data: click stream web log data; customer interaction data; social interaction data (e.g. Twitter, Facebook); sensor data; rich media data (video, audio); external web content; documents; internal web content; seismic data (oil & gas)
• Historical data: archived DW data
• Master data: asset, customer, product (CRUD) in the MDM system
• EDW and mart]

Page 38: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

In-Hadoop Analytics In A Data Refinery – Example Technologies

• Hadoop MapReduce, Tez or Spark analytic applications with custom analytics
  • Pig, Java, Python, Scala, Cascading…
• Hadoop MapReduce, Tez or Spark analytic applications using pre-built Hadoop analytics e.g. Mahout, Spark MLlib
  • Several analytical algorithms for use in analysis
• Revolution Analytics RevoScaleR
• SAS Analytics and In-Memory Statistics for Hadoop
• … many more

[Diagram labels: analytical tools; data management tools]

Page 39: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

In-Hadoop Analytics: - Mahout Supports A Number Of Analytic Techniques

• Collaborative Filtering
• User and Item based recommenders
• K-Means and Fuzzy K-Means clustering
• Mean Shift clustering
• Dirichlet process clustering
• Latent Dirichlet Allocation
• Singular value decomposition
• Parallel Frequent Pattern mining
• Complementary Naive Bayes classifier
• Random forest decision tree based classifier

https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Now runs on Spark as well as MapReduce
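As an illustration of one technique from this list, the sketch below implements plain K-Means (Lloyd's algorithm) on a handful of 2-D points. It is a single-machine teaching example and not Mahout code: the `kmeans` function, the sample points and the starting centroids are all assumptions made for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids.

    Assumes iterations >= 1. Returns (centroids, clusters).
    """
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

On a cluster, the assignment step maps naturally onto parallel tasks and the update step onto an aggregation, which is why algorithms of this kind are commonly offered as pre-built analytics on Hadoop.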

Page 40: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Expediting The Data Refinery Process On Hadoop With Automated Analysis – From ETL to Analytical Workflows

[Diagram: ELT workflow with automated invocation of custom built & pre-built analytics on Hadoop]
1. Load data into Hadoop: raw data and other data
2. Data refinery:
   • Parse & prepare data in Hadoop (MapReduce)
   • Transform & cleanse data in Hadoop (MapReduce)
   • Discover data in Hadoop
3. New high value insights are published (pub/sub) to the EDW, graph DBMS and DW appliance

[Label: contains clean, high value data]

Page 41: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

High Value Insights Produced In A Hadoop Data Hub Can Be Brought Into A DW to Enrich What We Already Know

[Diagram labels: Cloud Data; HDFS; Extract; Transform; Map/Reduce data transformation and analytics applications, e.g. Pig, IBM JAQL; new insights; DI; DW; Operational systems]

Cloud data e.g. deriving insight from huge volumes of social web content on sites like Twitter, Facebook, Digg, Myspace, TripAdvisor, LinkedIn… for sentiment analytics

Hundreds of terabytes up to petabytes

Page 42: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Making New Insights Available To Business Analysts Via SQL Access To Big Data - Options

• SQL access to big data in Hadoop
• SQL access to big data via data virtualisation (data virtualisation server, DW)
• SQL access to big data in an analytical RDBMS
• SQL access to streaming data in motion

Page 43: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

SQL Access to Hadoop Is Needed To Allow Hadoop Data To Be Accessed By Users With Self-Service BI Tools

[Diagram: Self-Service BI. A business analyst or ‘budding’ data scientist uses BI tool(s), e.g. visual discovery tools, against personal & office data, predictive models, transaction systems, the DW, and HDFS / HBase / Hive (e.g. Hive interface). Results are published and shared with a community that collaborates to consume, enhance and re-publish them]

Page 44: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Key Questions That May Influence If SQL Access to Big Data Is A Good Choice or What SQL Option to Take

SQL access to Big Data?

Questions about the data and the analytical workload:
• What kind of analysis? Text analysis, graph analysis, machine learning, reporting
• What kind of data type(s) do you need to analyse? Structured, unstructured, semi-structured
• What kind of data volumes do you want to analyse?
• Is the data at rest or is it real-time streaming data in motion?
• What analytical functions can you invoke on big data from SQL?
• Join with other data in another data store?
• How many concurrent users?
• Performance and scalability of complex queries and analytical functions (need parallelism)
• Is the requirement for interactive, exploratory, or real-time analysis?

Page 45: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

SQL On Hadoop Initiatives

Key Questions
• What analytic functions are provided?
• How can analytic functions be extended?
• Can you join to data outside of Hadoop?
• Are these SQL on Hadoop options suitable for reporting and analysis, interactive discovery, exploratory analysis or all of these?

Vendor: SQL on Hadoop Initiative
AMPlab (UC Berkeley): Shark (Forked Hive at V0.9) or SparkSQL
Apache Hadoop: Hive
Actian: Vortex (Actian Vector on Hadoop data nodes)
CitusDB: CitusDB (uses external tables)
Cloudera: Impala / Parquet
Concurrent: Lingual (SQL on Cascading)
Hadapt: Schemaless SQL
Hortonworks: Stinger / ORC (Hive 13)
HP: Vertica on Hadoop
IBM: BigSQL (SQL on HDFS & HBase)
InfiniDB: InfiniDB on Apache Hadoop
Jethro Data: JethroData
MapR: Apache Drill
Microsoft: Hive 13
Pivotal: HawQ (uses external tables via PXF)
Teradata: SQL-H
Splice Machine: Splice Machine (SQL Engine on HBase)
Salesforce.com: Phoenix (SQL engine on HBase)
Attivio: Active Intelligence Engine (SQL access to search indexes on Hadoop data)

Page 46: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

SQL on Hadoop – Apache Drill Can Access HDFS And HBase Data

[Diagram: a business analyst or data scientist using BI tool(s), e.g. visual discovery tools, and a data scientist using an analytic application both send SQL to Drill. Drill queries HDFS and HBase in the MapR Distribution for Hadoop, as well as MongoDB/Cassandra. Sensor, XML and JSON data are shown entering HBase]

Apache Drill does not use MapReduce

Page 47: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Apache Drill Distributed Query Processing – A Storage Independent Drillbit MPP Architecture

• Each drillbit is capable of receiving queries from applications and BI tools - there is no master in this architecture
• Multiple drillbits are involved in parallel query processing on distributed data

Supports Apache HDFS, Apache HBase, MapR-FS, MapR-DB, Amazon S3

Page 48: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


SQL on Hadoop Example – Apache Drill Supports Query of Self-Describing Data Without a Schema

JSON

Source: MapR
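The idea of querying self-describing data without declaring a schema first can be sketched in plain Python. This is not Drill itself: the `select` and `get_path` helpers and the sample records are hypothetical, and they only mimic the behaviour of resolving column names against each record at read time.

```python
import json

# Self-describing records: each line carries its own field names,
# and the records do not all share the same fields.
LINES = [
    '{"name": "Wayne", "address": {"city": "Manchester"}}',
    '{"name": "Ana", "age": 31}',
    '{"name": "Li", "address": {"city": "Leeds"}, "age": 27}',
]


def get_path(record, dotted_path):
    """Resolve a dotted path such as 'address.city'; missing fields yield None."""
    current = record
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def select(lines, columns, where=None):
    """Project columns from JSON lines, discovering structure as each line is read."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if where is None or where(record):
            rows.append(tuple(get_path(record, column) for column in columns))
    return rows


with_address = select(LINES, ["name", "address.city"],
                      where=lambda record: "address" in record)
all_ages = select(LINES, ["name", "age"])
```

No table definition was created before the query ran, and records that lack a requested field simply return a null value, which is the practical meaning of schema-on-read.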

Page 49: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

SQL on Hadoop – What Should The Schema Look Like?

• Star schema?
• Snowflake schema?
• De-normalised schema?
• Other?

Page 50: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Hadoop Storage Is Independent of Any SQL Engine Accessing HDFS - Multiple SQL Engines Can Coexist On The Same Data

[Diagram: several SQL engines access the same files in HDFS. Frameworks shown on YARN: Batch (MapReduce); Interactive (Tez); On-line (HBase); Streaming (Storm,..); Graph (Giraph); In-memory (Spark); HPC MPI (OpenMPI); Other (Search,.)]

Storage is independent of any SQL engine

• Key points about Hadoop
  • It is possible to have MULTIPLE SQL engines on the same data
  • Different SQL engines run on different Hadoop frameworks (M/R, Tez, Spark) or on no framework at all i.e. directly access HDFS or HBase data

Page 51: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Relational DBMS / Hadoop Integration – Several Vendors Have Integrated RDBMS with Hadoop to Run Analytics

[Diagram: SQL and XQuery requests go to the relational DBMS, which uses external polymorphic table function(s) to reach HDFS / HBase / Hive]

RDBMS optimizer handles transparent access to external analytical platforms on behalf of the user

RDBMS and Hadoop could be deployed on the same hardware cluster (preferred) or on different hardware clusters

Allows join across data in a single RDBMS and Hadoop

Page 52: MapR Enterprise Data Hub Webinar w/ Mike Ferguson


Relational DBMS / Hadoop Integration Example - HP Vertica and MapR

Source: MapR

Page 53: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Self-Service Access To Big Data Via Data Virtualization

[Diagram: Self-Service BI. A business analyst uses self-service data discovery & visualisation or a dashboard server. A data virtualization and optimization layer sits in front of personal & office data, predictive models, transaction systems and the DW. Data management tools (ETL, DQ, etc.) are also shown]

BUT what about optimization? Can the data virtualisation server push down analytics to underlying platforms to make them do the work?

Page 54: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

New Insights Can Be Added Into A Data Warehouse To Enrich What You Already Know

[Diagram: social, web log, web and cloud data are analysed by data scientists in a sandbox to produce new insights (e.g. deriving insight from social web sites for sentiment analytics). ELT moves the new insights into the DW, which is also fed from operational systems via DI]

Page 55: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Alternatively New Insights In Hadoop Can Be Integrated With A DW Using Data Virtualization To Provide Enriched Information

[Diagram: social, web log, web and cloud data are analysed by data scientists in a sandbox to produce new insights (e.g. deriving insight from social web sites for sentiment analytics). Data virtualisation combines the new insights, reached via SQL on Hadoop, with the DW, which is fed from OLTP systems via DI]

Page 56: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Using Hadoop As A Data Archive Means Data Can Be Kept On-line, Analysed And Still Integrated With Data In The DW

[Diagram: OLTP systems feed the DW via DI. Unused data, or data older than n years, is archived from the DW to Hadoop. Data virtualisation combines the DW with the archived data, reached via SQL on Hadoop]
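The "data older than n years" archiving rule can be expressed as a simple split. The sketch below is illustrative only: the row layout, the `split_for_archive` helper and the three-year threshold are assumptions, and a real implementation would move the archived rows into Hadoop rather than into a Python list.

```python
from datetime import date


def years_before(day, years):
    """Return the date `years` years before `day`."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return day.replace(year=day.year - years, day=28)


def split_for_archive(rows, today, max_age_years):
    """Split (key, event_date) rows into those to keep in the DW and those to archive."""
    cutoff = years_before(today, max_age_years)
    keep = [row for row in rows if row[1] >= cutoff]
    archive = [row for row in rows if row[1] < cutoff]
    return keep, archive


# Hypothetical warehouse rows, for illustration only.
rows = [
    ("order-1", date(2008, 3, 1)),
    ("order-2", date(2013, 11, 20)),
    ("order-3", date(2011, 6, 18)),
]
keep, archive = split_for_archive(rows, today=date(2014, 6, 18), max_age_years=3)
```

Because the archived rows stay queryable through SQL on Hadoop, a data virtualisation layer can still present the kept and archived rows as one logical table.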

Page 57: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Big Data Governance – Data Sources, Sandboxes, People, Data Access Security, Results Lineage….

[Diagram: governance is shown at every point in the ecosystem: data sources (RDBMS, files, clickstream, web logs, social graph data, unstructured / semi-structured content), the DW, the MPP analytical RDBMS, the graph DBMS and SQL on Hadoop]

Page 58: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Issues: Siloed Analytics - Different Tools to Manage and Integrate Data For Each Type of Analytical Data Store

[Diagram: each analytical data store is a silo with its own analytical tools and its own data management tools. Silos shown:
• EDW, DW & marts: structured data from CRM, ERP, SCM
• Streaming data (markets, sensors): analytical models
• Multi-structured data: analytical tools/apps
• DW appliance, advanced analytics (structured data): structured data from CRM, ERP, SCM
• NoSQL DB e.g. graph DB: multi-structured & structured data]

Page 59: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Need to Manage The Supply of Consistent Data Across The Entire Analytical Ecosystem

[Diagram labels: data sources (feeds, sensors, XML, JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services); Common Enterprise Information Management Tool Suite; stream processing; actions; MDM System (CRUD; Prod, Asset, Cust); EDW; DW & marts; mart; DW Appliance, Advanced Analytics (structured data); Advanced Analytics (multi-structured data); NoSQL DB e.g. graph DB. The label "New" is repeated across the diagram.]

New data types need to be supported by EIM tool suites

Page 60: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

A New Architecture for Analytics - The Intelligent Business Strategies Extended Analytical Ecosystem

[Diagram labels: data sources (feeds, sensors, XML, JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services); Enterprise Information Management Tool Suite; event processing; filtered data; actions; MDM System (CRUD; Prod, Asset, Cust); EDW; DW & marts; mart; DW Appliance, Advanced Analytics (structured data); Advanced Analytics (multi-structured data); NoSQL DB e.g. graph DB; Data Virtualisation and optimization. Tools: BI tools platform & data visualisation tools; search based BI tools; custom MapReduce applications; Map Reduce BI tools; graph analytics tools.]

Page 61: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Conclusions

•  Business demand for new, more complex, high-volume data is driving the need for new analytical workloads beyond the data warehouse

•  Hadoop is a low-cost analytical platform capable of supporting new analytical workloads on multi-structured data

•  A key role for Hadoop is as a data hub and data refinery

•  The data refinery process requires data integration and cleansing to scale to handle the volume, variety and velocity of complex multi-structured data

•  Data scientists analyse big data as part of the data refining process to produce new insights that can be added to what you already know

•  Hadoop is part of an extended analytical ecosystem with data management tools supplying consistent data across all data stores

•  Data scientists, business analysts and information consumers need to work together to deliver new insight for competitive advantage

Page 62: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Best Practices for Production Success

Page 63: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

HQ

MapR:

WORLDWIDE HADOOP TECHNOLOGY LEADER
UNIQUELY ADDRESSES BOTH ANALYTIC AND OPERATIONAL USE CASES
500+ PAYING CUSTOMERS

Page 64: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

MapR: Best Product for Customer Success

Top Ranked Exponential Growth 500+ Customers

3X bookings Q1 ’13 – Q1 ’14

80% of accounts expand 3X

90% software licenses

< 1% lifetime churn

> $1B in incremental revenue generated by 1 customer

Page 65: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

FOUNDATION

Architecture Matters for Success

Page 66: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Architecture Matters for Success

FOUNDATION

•  High Availability & Data Protection
•  High performance
•  Multi-tenancy
•  Operational & analytical workloads
•  Open standards for integration

NEW APPLICATIONS, SLAs, TRUSTED INFORMATION, LOWER TCO

Page 67: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

The Power of the Open Source Community

[Diagram labels: Management; Security; MapR Data Platform; APACHE HADOOP AND OSS ECOSYSTEM. Groupings: EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS. Categories: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Data Integration & Access; Workflow & Data Governance; Provisioning & coordination. Components: YARN, Pig, Cascading, Spark, MapReduce v1 & v2, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue, Juju, Savannah*.]

* Certification/support planned for 2014

Page 68: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

MapR Distribution for Hadoop

[Diagram: repeats the Apache Hadoop and OSS ecosystem component diagram from the previous slide (Management, Security, MapR Data Platform, execution engines, data governance and operations).]

* Certification/support planned for 2014

Enterprise-grade
•  High availability
•  Data protection
•  Disaster recovery

Interoperability
•  Standard file access
•  Standard database access
•  Pluggable services
•  Broad developer support

Security
•  Enterprise security authorization
•  Wire-level authentication
•  Data governance

Operational
•  Ability to support predictive analytics, real-time database operations, and support high arrival rate data

Multi-tenancy
•  Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators

Performance
•  2X to 7X higher performance
•  Consistent, low latency

Page 69: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

FORTUNE 500 TELCO

Hadoop + Data Warehouse Architecture
Improve data services to customers without increasing enterprise architecture costs

OBJECTIVES
•  Provide cloud, security, managed services, data center, & comms
•  Report on customer usage, profiles, billing, and sales metrics
•  Improve service: Measure service quality and repair metrics

CHALLENGES
•  Reduce customer churn – identify and address IP network hotspots
•  Cost of ETL & DW storage for growing IP and clickstream data; >3 months
•  Reliability & cost of Hadoop alternatives limited ETL & storage offload

SOLUTION
•  MapR for data staging, ETL, and storage at 1/10th the cost
•  MapR provided smallest datacenter footprint with best DR solution
•  Enterprise-grade: NFS file management, consistent snapshots & mirroring
•  Data warehouse for mission-critical reporting and analysis

Business Impact
Hadoop + Data Warehouse = New, Deeper Insights for the Business
•  Increased scale to handle network IP and clickstream data
•  Freed up processing on DW to maintain reporting SLAs to business
•  Unlocked new insights into network usage and customer preferences

Page 70: MapR Enterprise Data Hub Webinar w/ Mike Ferguson

Q & A

Engage with us!

@mikeferguson1 – Intelligent Business Strategies
@swooledge – MapR Technologies

•  Learn more about Hadoop in your architecture: www.mapr.com/EDH

•  Upcoming Webinar series - www.mapr.com/resources/webinars
   –  6/26 Talend – ETL in/for Hadoop
   –  7/09 Syncsort – comScore & mainframe optimization
   –  7/17 Rick van der Lans – SQL-on-Hadoop
   –  7/23 Skytree – machine learning & analytics
   –  7/30 Appfluent – DW usage monitoring & optimization
   –  8/14 Tableau – data exploration & analysis on Hadoop

•  Contact / follow us