43
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Building A Modern Data Architecture (MDA) Using Enterprise Hadoop Slim Baltagi, Systems Architect Hortonworks Inc. Open-BDA Hadoop Summit 2014 November 18 th , 2014

Building a Modern Data Architecture with Enterprise Hadoop

Embed Size (px)

Citation preview

Page 1: Building a Modern Data Architecture with Enterprise Hadoop

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Building A Modern Data Architecture (MDA) Using Enterprise Hadoop

Slim Baltagi, Systems Architect Hortonworks Inc.

Open-BDA Hadoop Summit 2014 November 18th, 2014

Page 2: Building a Modern Data Architecture with Enterprise Hadoop

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Your Presenter

Slim Baltagi

•  Currently a Systems Architect in the Professional Services Organization of Hortonworks in the central region (US and Canada).

•  Over 4 years of Hadoop experience working on 9 Big Data projects.

•  Slim has over 16 years of IT experience working in various architecture, design, development and consulting roles.

•  Slim Baltagi holds a master’s degree in Mathematics and is an ABD in computer science from Université Laval, Québec, Canada.

•  Twitter: @SlimBaltagi

Page 3: Building a Modern Data Architecture with Enterprise Hadoop

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 3

Page 4: Building a Modern Data Architecture with Enterprise Hadoop

Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Traditional Data Architecture Under Pressure AP

PLICAT

IONS  

DATA

   SYSTEM  

SOURC

ES  

Business    Analy:cs  

Custom  Applica:ons  

Packaged  Applica:ons  

Exis:ng  Sources    (CRM,  ERP,  Clickstream,  Logs)  

SILO  SILO  

RDBMS  

SILO   SILO  SILO   SILO  

EDW   MPP  

Data  growth:  New  Data  Types  

OLTP,  ERP,  CRM  Systems  

Unstructured  docs,  emails  

Clickstream  

Server  logs  

Social/Web  Data  

Sensor.  Machine  Data  

Geoloca:on  

85% Source: IDC

??

"   Can’t manage new data paradigm

"   Constrains data to specific schema

"   Siloed data

"   Limited scalability

"   Economically unfeasible

"   Limited analytics

Page 5: Building a Modern Data Architecture with Enterprise Hadoop

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

A Modern Data Architecture for New Data AP

PLICAT

IONS  

DATA

   SYSTEM  

REPOSITORIES  

SOURC

ES  

Exis:ng  Sources    (CRM,  ERP,  Clickstream,  Logs)  

RDBMS   EDW   MPP  

Business  Analy:cs  

Custom  Applica:ons  Packaged  Applica:ons  

OLTP,  ERP,  CRM  Systems  

Unstructured  documents,  emails  

Clickstream  

Server  logs  

Sen>ment,  Web  Data  

Sensor.  Machine  Data  

Geoloca>on  

New Data Requirements:

•  Scale •  Economics •  Flexibility

Traditional Data Architecture

Page 6: Building a Modern Data Architecture with Enterprise Hadoop

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Enterprise Goals for the Modern Data Architecture

ü  Centrally manage new and existing data

ü  Provide single view of the customer, product, supply chain

ü  Run batch, interactive & real time analytic applications on shared datasets

ü  Assure enterprise-grade security, operations and governance

ü  Leverage new and existing data center infrastructure investments

ü  Scalable and affordable; low cost per TB

ü  Deployment flexibility

APP

LIC

ATIO

NS

DAT

A S

YSTE

M

Business Analytics

Custom Applications

Packaged Applications

RDBMS

EDW

MPP

YARN: Data Operating System

1 ° ° ° ° ° ° ° ° °

° ° ° ° ° ° ° ° ° N

Interactive Real-Time Batch CRM

ERP

Other 1 ° ° °

° ° ° °

HDFS (Hadoop Distributed File System)

SOU

RC

ES

EXISTING  Systems  

Clickstream   Web  &  Social  

Geoloca:on   Sensor  &  Machine  

Server    Logs  

Unstructured  

Page 7: Building a Modern Data Architecture with Enterprise Hadoop

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

1. Drivers for a Modern Data Architecture (MDA)

•  Semi-Structured and Unstructured – NEW DATA Unstructured documents, emails, Sentiment, Web Data, Sensor, Machine Data, Geolocation, ...

•  Enterprise Data Warehouse Optimization – REDUCED COSTS

Low-value computing tasks such as ETL consume significant EDW resources. When offloaded to Hadoop, these ETL processes can be performed much more efficiently, freeing up your data warehouse to perform high-value functions like analytics and operations.

Page 8: Building a Modern Data Architecture with Enterprise Hadoop

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

1. Drivers for a Modern Data Architecture (Continued) •  Advanced Analytics – NEW ANALYTICS APPS

Unlike schema-on-write, which transforms data into specified schema upon load, Hadoop empowers you to store data in any format, and then create schema at that moment when you choose to analyze your data. This unprecedented flexibility opens up new possibilities for iterative analytics and delivers new business value.

•  Single Cluster, Multiple Workloads – ANY WORKLOAD

With Apache Hadoop YARN supporting multiple access methods (such as batch, interactive, streaming and real-time) on a common data set, Hadoop enables you to transform and view data in multiple ways simultaneously, dramatically reducing time to insight.

Page 9: Building a Modern Data Architecture with Enterprise Hadoop

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 9

Page 10: Building a Modern Data Architecture with Enterprise Hadoop

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

2. What’s a Modern Data Architecture (MDA)? •  Apache Hadoop is a core component of a Modern Data Architecture,

allowing organizations to collect, store, analyze and manipulate massive quantities of data on their own terms—regardless of the source of that data, how old it is, where it is stored, or under what format.

•  The Hortonworks Data Platform (HDP) delivers Enterprise Apache Hadoop, deeply integrated with existing systems to create a highly efficient, highly scalable way to manage all your enterprise data.

•  Integrate new & existing data sets, with existing tools & skills.

•  Make all data available for shared access and processing in multitenant infrastructure

•  Batch, interactive & real-time use cases

Page 11: Building a Modern Data Architecture with Enterprise Hadoop

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4. Hadoop’s role in an MDA

Page 12: Building a Modern Data Architecture with Enterprise Hadoop

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

3. What’s a Modern Data Architecture (MDA)?

Page 13: Building a Modern Data Architecture with Enterprise Hadoop

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 14: Building a Modern Data Architecture with Enterprise Hadoop

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 15: Building a Modern Data Architecture with Enterprise Hadoop

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 15

Page 16: Building a Modern Data Architecture with Enterprise Hadoop

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Key Drivers of Hadoop

OPERATIONS  TOOLS  

Provision, Manage & Monitor

DEV  &  DATA  TOOLS  

Build & Test

DATA

   SYSTEM  

REPOSITORIES  

SOURC

ES  

RDBMS   EDW   MPP  

APPLICAT

IONS  

Business    Analy:cs  

Custom  Applica:ons  

Packaged  Applica:ons  

Unlock  New  Approach  to  Analy:cs  •  Agile  analy>cs  via  “Schema  on  Read”  with  ability  to  store  all  data  in  na>ve  format  

•  Create  new  apps  from  new  types  of  data  A

Op:mize  Investments,  Cut  Costs  •  Focus  EDW  on  high  value  workloads  •  Use  commodity  servers  &  storage  to  enable  all  data  (original  and  historical)  to  be  accessible  for  ongoing  explora>on  

B Enable  a  Modern  Data  Architecture  •  Integrate  new  &  exis>ng  data  sets  •  Make  all  data  available  for  shared  access  and  processing  in  mul>tenant  infrastructure  

•  Batch,  interac>ve  &  real-­‐>me  use  cases  •  Integrated  with  exis>ng  tools  &  skills  

C EXISTING  Systems  

Clickstream   Web  &  Social  

Geoloca:on   Sensor  &  Machine  

Server    Logs  

Unstructured  

YARN: Data Operating System

° ° ° ° ° ° ° ° °

Interactive Real-Time Batch

HDFS: Hadoop Distributed File System

Page 17: Building a Modern Data Architecture with Enterprise Hadoop

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop: It’s About Scale & Structure

Hadoop

schema

governance

best fit use

processing

Required on write Required on read

Standards and structured Multiple Structures

Limited, no data processing Processing coupled with data

data types Structured Multi and unstructured

Complex ACID Transactions Operational Data Store

Data Discovery Processing unstructured data

Interactive Analytics

Traditional RDBMS SCALE

(storage & processing)

transactions Optimized, reliable Optimized for analytics

Page 18: Building a Modern Data Architecture with Enterprise Hadoop

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

YARN and HDP Enables the Modern Data Architecture YARN is the architectural center of Hadoop and HDP •  YARN enables a common data set

across all applications

•  Batch, interactive & real-time workloads

•  Support multi-tenant access & processing

HDP enables Apache Hadoop to become Enterprise Viable Data Platform with centralized services •  Security

•  Governance

•  Operations

•  Productization

Enabled broad ecosystem adoption

Hortonworks drove this innovation of Hadoop through YARN

Hortonworks Data Platform 2.2

YARN: Data Operating System (Cluster Resource Management)

1 ° ° ° ° ° ° °

° ° ° ° ° ° ° °

Script

Pig

SQL

Hive

Tez Tez

Java Scala

Cascading

Tez

° °

° °

° ° ° ° °

° ° ° ° °

HDFS (Hadoop Distributed File System)

Stream

Storm

Search

Solr

NoSQL

HBase Accumulo

Slider Slider

SECURITY GOVERNANCE OPERATIONS BATCH, INTERACTIVE & REAL-TIME DATA ACCESS

In-Memory

Spark

Provision, Manage & Monitor

Ambari

Zookeeper

Scheduling

Oozie

Data Workflow, Lifecycle & Governance

Falcon Sqoop Flume Kafka NFS

WebHDFS

Authentication Authorization

Audit Data Protection

Storage: HDFS

Resources: YARN Access: Hive

Pipeline: Falcon Cluster: Ranger Cluster: Knox

Deployment Choice Linux Windows Cloud

Others

ISV Engines

On-Premises

Page 19: Building a Modern Data Architecture with Enterprise Hadoop

Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 20: Building a Modern Data Architecture with Enterprise Hadoop

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 20

Page 21: Building a Modern Data Architecture with Enterprise Hadoop

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

…to real-time personalization From static branding

…to repair before break From break then fix

…to designer medicine From mass treatment

…to automated algorithms From educated investing

…to 1x1 targeting From mass branding

A shift in Advertising

A shift in Financial Services

A shift in Healthcare

A shift in Retail

A shift in Manufacturing

Hadoop enables organizations to cost effectively store and use all of the data available in a way that shifts the business from…

Reactive

Proactive

Shift to Data-driven Means Treating Data like Capital

Page 22: Building a Modern Data Architecture with Enterprise Hadoop

Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Create New Applications from New Types of Data INDUSTRY USE CASE Sentiment

& Web Clickstream & Behavior

Machine & Sensor Geographic Server Logs Structured &

Unstructured

Financial Services New Account Risk Screens ✔ ✔ Trading Risk ✔ ✔ Insurance Underwriting ✔ ✔ ✔

Telecom Call Detail Records (CDR) ✔ ✔ Infrastructure Investment ✔ ✔ Real-time Bandwidth Allocation ✔ ✔ ✔ ✔ ✔

Retail 360° View of the Customer ✔ ✔ ✔ Localized, Personalized Promotions ✔ Website Optimization ✔

Manufacturing Supply Chain and Logistics ✔ Assembly Line Quality Assurance ✔ Crowd-sourced Quality Assurance ✔

Healthcare Use Genomic Data in Medical Trials ✔ ✔ Monitor Patient Vitals in Real-Time ✔ ✔

Pharmaceuticals Recruit and Retain Patients for Drug Trials ✔ ✔ Improve Prescription Adherence ✔ ✔ ✔ ✔

Oil & Gas Unify Exploration & Production Data ✔ ✔ ✔ ✔ Monitor Rig Safety in Real-Time ✔ ✔ ✔

Government ETL Offload/Federal Budgetary Pressures ✔ ✔

Sentiment Analysis for Government Programs ✔

Page 23: Building a Modern Data Architecture with Enterprise Hadoop

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.1 Advertising

•  Mine Grocery & Drug Store POS Data to Identify High-Value Shoppers

•  Target Ads to Customers in Specific Cultural or Linguistic Segments

•  Syndicate Videos According to Behavior, Demographics & Channel

•  ETL Toy Market Research Data for Longer Retention & Deeper Insight

•  Optimize Online Ad Placement for Retail Websites

Page 24: Building a Modern Data Architecture with Enterprise Hadoop

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

5. Use Cases related to an MDA (Continued)

Page 25: Building a Modern Data Architecture with Enterprise Hadoop

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.2 Financial Services

•  Screen New Account Applications for Risk of Default

•  Monetize Anonymous Banking Data in Secondary Markets

•  Improve Underwriting Efficiency for Usage-Based Auto Insurance

•  Analyze Insurance Claims with a Shared Data Lake

•  Maintain Sub-Second SLAs with a Hadoop “Ticker Plant”

•  Surveillance of Trading Logs for Anti-Laundering Analysis

Page 26: Building a Modern Data Architecture with Enterprise Hadoop

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 27: Building a Modern Data Architecture with Enterprise Hadoop

Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.3 Healthcare

•  Access Genomic Data for Medical Trials

•  Monitor Patient Vitals in Real-Time

•  Reduce Cardiac Re-Admittance Rates

•  Machine Learning to Screen for Autism with In-Home Testing

•  Store Medical Research Data Forever

•  Recruit Research Cohorts for Pharmaceutical Trials

•  Track Equipment and Medicines with RFID Data

•  Improve Prescription Adherence

Page 28: Building a Modern Data Architecture with Enterprise Hadoop

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 29: Building a Modern Data Architecture with Enterprise Hadoop

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.4 Manufacturing

•  Assure Just-In-Time Delivery of Raw Materials

•  Control Quality with Real-Time & Historical Assembly Line Data

•  Avoid Stoppages with Proactive Equipment Maintenance

•  Increase Yields in Drug Manufacturing

•  Crowdsource Quality Assurance

Page 30: Building a Modern Data Architecture with Enterprise Hadoop

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 31: Building a Modern Data Architecture with Enterprise Hadoop

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.5 Oil & Gas •  Slow Decline Curves with Production Parameter Optimization

•  Define Operational Set Points for Each Well & Receive Alerts on Deviations

•  Optimize Lease Bidding with Reliable Yield Predictions

•  Report on Compliance with Environmental , Health and Safety Regulations

•  Repair Equipment Preventatively with Targeted Maintenance

•  Integrate Exploration with Seismic Image Processing

Page 32: Building a Modern Data Architecture with Enterprise Hadoop

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 33: Building a Modern Data Architecture with Enterprise Hadoop

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.6 Public Sector

•  Understand Public Sentiment About Government Performance

•  Protect Critical Networks from Threats (Both Internal and External)

•  Prevent Fraud and Waste

•  Analyze Social Media to Identify Terrorist Threats

•  Decrease Budget Pressures by Offloading Expensive SQL Workloads

•  Crowdsource Reporting for Repairs to Roads and Public Infrastructure

•  Fulfill “Open Records” and Freedom of Information Requests

Page 34: Building a Modern Data Architecture with Enterprise Hadoop

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.7 Retail

•  Build a 360 degrees View of the Customer

•  Analyze Brand Sentiment

•  Localize & Personalize Promotions

•  Optimize Websites

•  Optimize Store Layouts

Page 35: Building a Modern Data Architecture with Enterprise Hadoop

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 36: Building a Modern Data Architecture with Enterprise Hadoop

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

4.8 Telecom

•  Analyze Call Detail Records (CDRs)

•  Service Equipment Proactively

•  Rationalize Infrastructure Investments

•  Recommend Next Product to Buy (NPTB)

•  Allocate Bandwidth in Real-time

•  Develop New Products

Page 37: Building a Modern Data Architecture with Enterprise Hadoop

Page 37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Page 38: Building a Modern Data Architecture with Enterprise Hadoop

Page 38 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

What is a Data Lake?

• An architectural pattern in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale

§ But What is it? – It is a PLATFORM for your data. (It is not a database) – Multipurpose open PLATFORM to land all data in a single place and interact with it many

ways. § A platform that allows for the ecosystem to provide higher level services (SAS, SAP,

Microsoft, Streaming, MPP, In-memory, etc..) § Provides first class APIs and frameworks to enable this integration § Provides first class data management capabilities (metadata management, security,

transformation pipelines, replication, retention, etc..)

Page 38

Page 39: Building a Modern Data Architecture with Enterprise Hadoop

Page 39 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Knox – Perimeter Level Security

compute &

storage . . .

. . .

. . compute

& storage

.

.

YARN

Data Lake HDP Grid

AMBARI

HDP Data Lake Reference Architecture

Page 39

HCATALOG (table & user-defined metadata)

Step 2: Model/Apply Metadata

Use Case Type 1: Materialize & Exchange

Opens up Hadoop to many new use cases

Stream Processing, Real-time Search,

MPI

YARN Apps

INTERACTIVE

Hive Server (Tez/Stinger)

Query/ Analytics/Reporting

Tools

Tableau/Excel

Datameer/Platfora/SAP Use Case Type 2: Explore/Visualize

FALCON (data pipeline & flow management)

Manage Steps 1-4: Data Management with Falcon

Oozie (Batch scheduler)

(data processing)

HIVE PIG Mahout Exchange

HBase Client

Sqoop/Hive

Downstream Data Sources

OLTP HBase

EDW (Teradata)

Storm SAS

SOLR

TEZ

Step 3: Transform, Aggregate & Materialize

MR2

Step 4: Schedule and Orchestrate

Ingestion

SQOOP

FLUME

Web HDFS

NFS

SOURCE DATA

ClickStream Data

Sales Transaction/Data

Product Data

Marketing/Inventory

Social Data

EDW

File

JMS

REST

HTTP

Streaming

STORM

Step 1:Extract & Load

Page 40: Building a Modern Data Architecture with Enterprise Hadoop

Page 40 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 40

Page 41: Building a Modern Data Architecture with Enterprise Hadoop

Page 41 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

5. Learn More …

Page 41

Resource Location

MDA White Paper http://info.hortonworks.com/data-lake-hadoop-whitepaper.html Learn more about Modern Data Architecture (MDA)

MDA Web Page http://hortonworks.com/hadoop-modern-data-architecture/ Explore Use Cases by Industry

Hortonworks Sandbox

http://hortonworks.com/products/hortonworks-sandbox/ Get Started on Hadoop with Hortonworks Sandbox

Hadoop Tutorials http://info.hortonworks.com/On-demand-Tutorials_Sign-Up-Page.html On-Demand Hadoop Tutorials Delivered to Your Inbox

Enterprise Data Lake

http://hortonworks.com/blog/enterprise-hadoop-journey-data-lake/ Enterprise Hadoop and the journey to Data Lake

Page 42: Building a Modern Data Architecture with Enterprise Hadoop

Page 42 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2013

Outline

1. Drivers for an MDA

2. What’s an MDA

3. Hadoop’s role in an

MDA

4. Use Cases

related to an MDA

5. Learn More 6. Q&A

Page 42

Page 43: Building a Modern Data Architecture with Enterprise Hadoop

Page 43 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

6. Q&A…

Thank you!