Managing Data Analytics in a Hybrid Cloud Karan Singh Sr. Solution Architect Storage & Hyper-Converged Business Unit Daniel Gilfix Technical Marketing Manager Storage & Hyper-Converged Business Unit


Managing Data Analytics in a Hybrid Cloud

Karan Singh, Sr. Solution Architect, Storage & Hyper-Converged Business Unit

Daniel Gilfix, Technical Marketing Manager, Storage & Hyper-Converged Business Unit


AGENDA


● CUSTOMER PAIN

● COMMON APPROACHES

● SHARED DATA LAKES

● HOW IT WORKS AND WHERE

● SUMMARY AND NEXT STEPS


CUSTOMER PAIN


CUSTOMER PAIN POINTS

EXPLOSIVE GROWTH in data analytics teams and analytic tools

MULTIPLE TEAMS COMPETING for use of the same big data resources.

CONGESTION in busy analytic clusters causing frustration and missed SLAs.

Tools in use: HADOOP, SPARK, SPARKSQL, HIVE, MAPREDUCE, PRESTO, IMPALA, KAFKA, NIFI, ETC.


RESULTING IN CUSTOMER CHOICES

#1 Get a bigger cluster for many teams to share.

#2 Give each team own dedicated cluster, each with copy of PBs of data.

#3 Give teams ability to spin-up/spin-down clusters which can share common data store.


#3 ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE

HIT SERVICE-LEVEL AGREEMENTS: Give teams their own compute clusters.

ELIMINATE IDLE RESOURCES: By right-sizing de-coupled compute and storage.

BUY 10s OF PBs INSTEAD OF 100s: Share data sets across clusters instead of duplicating them.

INCREASE AGILITY: With spin-up/spin-down clusters.
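
The "10s of PBs instead of 100s" point on the slide above is simple arithmetic, sketched below. This is only an illustration: the function name and the 10 PB / 10 cluster figures are hypothetical, and the sketch counts logical capacity only, ignoring replication or erasure-coding overhead on either side.

```python
def logical_capacity_pb(dataset_pb: float, clusters: int, shared: bool) -> float:
    """Logical capacity needed so that `clusters` analytic clusters can all read one data set.

    Assumption: replication and erasure-coding overhead are ignored, so the
    result is logical petabytes, not raw disk.
    """
    if dataset_pb < 0 or clusters < 1:
        raise ValueError("dataset_pb must be >= 0 and clusters >= 1")
    # A shared data lake stores the data set once;
    # dedicated clusters each hold their own full copy.
    return dataset_pb if shared else dataset_pb * clusters


if __name__ == "__main__":
    # Hypothetical figures: a 10 PB data set used by 10 team clusters.
    dedicated = logical_capacity_pb(10, clusters=10, shared=False)
    lake = logical_capacity_pb(10, clusters=10, shared=True)
    print(f"dedicated copies: {dedicated} PB, shared data lake: {lake} PB")
```

The gap grows linearly with the number of clusters, which is why the duplication cost matters most for shops with many teams.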


Red Hat data analytics infrastructure solution
Multi-tenant workload isolation with shared data context

Workloads: BATCH JOBS (SLOW), STREAMING ANALYTICS, INTERACTIVE ANALYTICS, OTHER ANALYTICS, BATCH JOBS (FAST)

DYNAMIC compute resources and clusters able to meet different SLAs

UNIFIED single object storage solution feeding analytics jobs

ELASTIC provisioning and release of compute resources required by various analytics jobs


BENEFITS - AGILITY AND $$$

● Faster answers through elastic provisioning via Red Hat OpenStack Platform (OSP) on shared data sets
● Fewer roadblocks for empowered users in self-service data labs / clusters
● Private/public cloud versatility with S3A interface
● Reduced cost and risk from not duplicating and maintaining data sets
● CapEx relief by scaling storage independently of compute


HOW IT WORKS


GENERATION I ANALYTICS: MONOLITHIC HADOOP STACKS

Analytics vendors provide analytics software

Analytics vendors provide single-purpose infrastructure

[Figure: three separate ANALYTICS + INFRASTRUCTURE stacks]


GENERATION II ANALYTICS: ELASTIC COMPUTE AND SHARED STORAGE CLOUDS

Analytics vendors provide analytics software

Red Hat provides cloud infrastructure software

Provisioned Compute Pool via OpenStack and OpenShift platforms

Shared Datasets on Red Hat Ceph Storage


MULTIPLE ANALYTIC CLUSTERS SHARING DATA

Workloads: INGEST, ETL, INTERACTIVE QUERY, BATCH QUERY & JOINS

ELASTIC COMPUTE RESOURCE POOL: Kafka compute instances, Hive/MapReduce compute instances, Presto compute instances, Spark compute instances

Service levels: Platinum SLA, Gold SLA, Silver SLA, Bronze SLA

SHARED DATA LAKE
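
The spin-up/spin-down behaviour of an elastic compute resource pool can be sketched with a toy model. This is a minimal illustration and not how OpenStack or OpenShift schedule resources: the `ElasticPool` class, its methods, and the cluster names are all hypothetical. The point it shows is that releasing a cluster only frees compute, since the data sets live in the shared data lake and not in the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ElasticPool:
    """Toy model of a fixed pool of compute instances shared by on-demand clusters."""

    total_instances: int
    clusters: dict = field(default_factory=dict)  # cluster name -> instances held

    @property
    def free_instances(self) -> int:
        """Instances not currently reserved by any cluster."""
        return self.total_instances - sum(self.clusters.values())

    def spin_up(self, name: str, instances: int) -> bool:
        """Reserve instances for a new cluster; return False if the request cannot be met."""
        if instances <= 0 or name in self.clusters or instances > self.free_instances:
            return False
        self.clusters[name] = instances
        return True

    def spin_down(self, name: str) -> int:
        """Release a cluster's instances back to the pool; return how many were freed."""
        return self.clusters.pop(name, 0)


if __name__ == "__main__":
    pool = ElasticPool(total_instances=20)      # hypothetical pool size
    pool.spin_up("presto-interactive", 8)       # hypothetical cluster names
    pool.spin_up("spark-batch", 10)
    print("free after spin-up:", pool.free_instances)
    pool.spin_down("spark-batch")               # batch job finished
    print("free after spin-down:", pool.free_instances)
```

A real deployment would add priorities or quotas per service level; that policy is left out here because the slide does not specify one.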


ANALYTIC WORKLOADS JOINING THE INFRA

[Figure labels: bare metal silo with storage silo; virtualization infra with shared storage (SAN), hosting the rest of an enterprise’s apps on VMs; Red Hat private cloud infra with Red Hat private cloud object store, hosting the rest of an enterprise’s apps on VMs today -> containers tomorrow]


MULTI-TENANT WORKLOAD ISOLATION
With Shared Data Context

[Figure: HADOOP CLUSTER 1, 2, and 3 run HADOOP, SPARK, and SPARK/PRESTO workers on BARE-METAL RHEL, OPENSTACK VMs, and OPENSHIFT CONTAINERs. Each cluster has its own COMPUTE and STORAGE, with HDFS used for temporary data (HDFS TMP), and reaches shared data sets in RED HAT CEPH STORAGE through S3A/S3.]
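
One way to read the split in the figure above is that temporary data stays on each cluster's own HDFS while shared data sets are addressed through `s3a://` URIs. The sketch below checks a job's paths against that reading. It is only an illustration: the helper names, bucket names, and paths are hypothetical, and the rule that durable output goes back to the shared store is an assumption the slide does not state.

```python
from urllib.parse import urlparse

SHARED_SCHEMES = {"s3a", "s3"}  # object store reached through the S3A/S3 interface
LOCAL_SCHEMES = {"hdfs"}        # per-cluster HDFS used for temporary data


def storage_role(uri: str) -> str:
    """Classify a URI as 'shared' (data lake) or 'cluster-local' (scratch)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in SHARED_SCHEMES:
        return "shared"
    if scheme in LOCAL_SCHEMES:
        return "cluster-local"
    raise ValueError(f"unrecognised storage scheme in {uri!r}")


def check_job(input_uri: str, scratch_uri: str, output_uri: str) -> list:
    """Return a list of layout problems; an empty list means the paths fit the model."""
    problems = []
    if storage_role(input_uri) != "shared":
        problems.append("input should be read from the shared data lake")
    if storage_role(scratch_uri) != "cluster-local":
        problems.append("scratch data should stay on the cluster's own HDFS")
    # Assumption: results other teams need are written back to the shared store.
    if storage_role(output_uri) != "shared":
        problems.append("durable output should be written to the shared data lake")
    return problems


if __name__ == "__main__":
    # Hypothetical bucket and path names.
    print(check_job(
        "s3a://datalake/tpcds/store_sales",
        "hdfs://cluster2/tmp/job-1",
        "s3a://datalake/results/job-1",
    ))
```

Keeping scratch data local means a cluster can be decommissioned without migrating anything, because nothing durable lives on it.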


COMMON ARCHITECTURAL MODEL - PUBLIC OR PRIVATE CLOUD

PUBLIC CLOUD (AWS): AWS EC2 PROVISIONING; AWS S3 SHARED DATASETS; Hadoop, Presto, Spark

PRIVATE CLOUD (RHT): RED HAT® OPENSTACK PLATFORM PROVISIONING; RED HAT® CEPH S3 SHARED DATASETS; Hadoop, Presto, Spark
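
Because both sides of this model expose an S3-compatible store, an analytics job can in principle stay the same and only its S3A settings change. The sketch below builds Spark configuration entries for either case. The `fs.s3a.*` property names come from the Hadoop S3A connector, and the `spark.hadoop.` prefix is Spark's way of passing Hadoop properties. Everything else is an assumption for illustration: the helper names, the endpoint URL, the placeholder credentials, and the choice to enable path-style access for the private endpoint.

```python
from typing import Optional


def s3a_spark_conf(access_key: str, secret_key: str,
                   endpoint: Optional[str] = None) -> dict:
    """Build Spark conf entries for the S3A connector.

    With endpoint=None the connector's default endpoint (AWS S3) is used.
    With an endpoint set, requests go to that S3-compatible service instead,
    for example a Ceph object gateway (hypothetical URL in the demo below).
    """
    conf = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    if endpoint is not None:
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
        # Assumption: the private endpoint has no per-bucket DNS names,
        # so buckets are addressed by path instead of by hostname.
        conf["spark.hadoop.fs.s3a.path.style.access"] = "true"
    return conf


def to_submit_args(conf: dict) -> list:
    """Render a conf dict as spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Placeholder credentials; real keys belong in a credential store, not on a command line.
    public = s3a_spark_conf("AKIA_PLACEHOLDER", "SECRET_PLACEHOLDER")
    private = s3a_spark_conf("CEPH_PLACEHOLDER", "SECRET_PLACEHOLDER",
                             endpoint="http://rgw.example.internal:8080")
    print(to_submit_args(public))
    print(to_submit_args(private))
```

The job code would read the same `s3a://bucket/path` URIs in both environments; only the configuration differs.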


FEATURES AND BENEFITS

MULTIPLE ANALYTIC CLUSTERS
• Enable teams to meet their individual SLAs without competing for resources.

SHARED DATA SETS
• Eliminate duplicate storage costs for multiple HDFS cluster silos.
• Eliminate OpEx costs and complexity for maintaining multiple copies of datasets for multiple HDFS cluster silos.

FAST PROVISIONING OF ANALYTIC CLUSTERS
• Unlocks Agility
• Enables Speed to Capability


ADVANCED ANALYTICS ON CEPH


MODERN BIG DATA ANALYTICS PIPELINE
Simplified Example

Pipeline stages (as labeled in the figure): DATA GENERATION; INGEST; DATA SCIENCE; MACHINE LEARNING; STREAM PROCESSING; TRANSFORM, MERGE, JOIN; DATA ANALYTICS


MODERN BIG DATA ANALYTICS PIPELINE
KEY TERMINOLOGY

Pipeline stages (as labeled in the figure): DATA GENERATION; INGEST; DATA SCIENCE; MACHINE LEARNING; STREAM PROCESSING; TRANSFORM, MERGE, JOIN; DATA ANALYTICS

• Sensors • Click-stream • Transactions • Call-detail records

• NiFi • Kafka • Presto

• Impala • SparkSQL

• TensorFlow

• Kafka • Hadoop • Spark

• Spark • Hadoop


TESTED WITH CEPH OBJECT STORE

Pipeline stages (as labeled in the figure): DATA GENERATION; INGEST; DATA SCIENCE; MACHINE LEARNING; STREAM PROCESSING; TRANSFORM, MERGE, JOIN; DATA ANALYTICS

• TPC-DS data sets (structured) • logsynth (semi-structured)

• bulk load • MapReduce • Impala

• Presto • (not tested)

• SparkSQL • Hive/MapReduce

• SparkSQL • Hive/MapReduce

• (not tested)


TYPICAL SHARED DATA LAKE PROJECT STAGES

IDENTIFY
• Potential fit?

QUALIFY
• 1-2 day workshop
• ID questions needing evidence
• Prioritize questions by value
• Design POC architecture

POC OR PILOT
• Answer questions
• Empirical results
• RHT Solution Engineering
• RHT Consulting

DEPLOYMENT
• Phased roll-out
• Red Hat Consulting


SUMMARY AND NEXT STEPS


KEY TAKEAWAYS

PROBLEMS

MISSED SLAs: Large Spark/Hadoop shops suffering from missed SLAs due to cluster congestion.

EXCESSIVE CAPEX AND OPEX due to multi-cluster solutions without shared data.

HOW YOU KNOW IT’S YOU

Do you do big data analytics on-premises?

Do you have multi-PB data sets?

Do you have multiple Spark/Hadoop clusters?

Do these Spark/Hadoop clusters need to share data sets?

Do you also have non Spark/Hadoop tools that need access to these data sets?


ONE CUSTOMER’S UNSOLICITED TESTIMONY

“We managed to deliver tremendous value to our organization”:

● Releasing lock on data: moving the HDFS to an open access object store and opening the data process to more processes and analysis.

● Releasing lock on compute: now we’re able to spin up and decommission compute power according to customer needs and utilize cloud benefits (including GPU incorporation in zero time and effort), without worrying about the data.

● Releasing lock on innovation: we can now allow anyone to try and build something new without the fear of messing things up (data or cluster wise). We’ve built an environment that can tolerate mistakes at all levels (process and data), and by doing so, our developers can be much more daring.


CUSTOMER SATISFACTION

“I’m delighted to announce that it’s been a few weeks since we’ve launched our Cloudoop* offering to our customers, and it’s a huge success. The responses from our customers are very, very positive, and I’m quoting ‘Big big like!!!’

This shift from the traditional approach is revolutionizing the way we consume and process our data.”

---- Head of Cloud Infrastructure, government agency

(*Cloudoop is their Spark-as-a-service offering with an S3 backend, Spark by Cloudera and an S3 by Ceph)


RESOURCES

Summary-level blogs:

● Breaking down data silos with Red Hat infrastructure
● Why would companies do this?
● Will mainstream analytics jobs run directly against a Ceph object store?
● How much slower will they run than natively on HDFS?

Architect-level blogs:

● What about locality?
● Anatomy of the S3A filesystem client
● To the cloud!
● Storing tables in Ceph object storage
● Comparing with HDFS—TestDFSIO
● Comparing with remote HDFS—Hive Testbench (SparkSQL)
● Comparing with local HDFS—Hive Testbench (SparkSQL)
● Comparing with remote HDFS—Hive Testbench (Impala)
● AI and machine learning workloads


THANK YOU