Tame that Beast


© Copyright 2016 EMC Corporation. All rights reserved.

TAME THAT BEAST
Stefan Radtke
CTO, EMEA
EMC Emerging Technology Division

EMC CONFIDENTIAL—INTERNAL USE ONLY

Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division

- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon

Phone: +49-176-34434460
E-Mail: Stefan.Radtke@emc.com
LinkedIn: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com


System Availability

Uptime                    Downtime (per year)
99.999% (AKA 5 nines)     5.26 minutes
99.99% (AKA 4 nines)      52.6 minutes
99.5%                     1.83 days
99% (AKA 2 nines)         3.65 days
95%                       18.25 days
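The downtime column follows directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and not part of the original slides:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the downtime budget per year, in minutes, for an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for each availability level listed on the slide.
    for pct in (99.999, 99.99, 99.5, 99.0, 95.0):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime: {minutes:.2f} minutes ({minutes / 1440:.2f} days)")
```

Dividing the result by 1,440 converts minutes to days, which is the unit used for the lower availability levels.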

What is your Data Warehouse’s uptime SLA?
What is your Hadoop uptime SLA?

Why are they different?


Hadoop takes on DB-like Features
• Newly Added Features in Hadoop 3.0
– Erasure Coding (HDFS-EC / HDFS-7285) is being introduced to Hadoop
– Additional Standby NameNodes for increased resiliency (HDFS-6440)
• Future Features
– Random read support from Indexed Name Node (HDFS-8555)
– Disaster Recovery (HDFS-5442)


So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database Features
AND
• Successful Outcomes are becoming more prolific...

Why do Hadoop operations and uptime SLAs seem like such an afterthought on most clusters?


KPIs
• Why do companies that have VERY successful Data Warehouses, ETL processes, and KPI Dashboards have so few of THOSE for their Hadoop instance, which is now generating all their Machine Learning and Data & Analytics?


An Intervention
• Why does the concept of 99.99% seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when did a backup copy of your primary analytics data become unnecessary?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?


Why wouldn’t you want:

• Two clusters fully addressable with data replication located in separate geographies

• Data Re-silvering when additional capacity is added

• Complete fault tolerance in the environment and not just Data / Node redundancy to allow 4 Nines availability

• Operational scale that allows 24 x 7 support

[Figure: storage nodes labeled EMPTY, FULL, and BALANCED]


What is my Idea - 1
• Separation of compute and storage.
– Why do you think cloud Hadoop is able to offer better SLAs than on-premises Hadoop? It isn’t because of a ton of single-point-of-failure compute boxes. They separate compute and storage.
• Look at Infrastructure / Big Data as a service centralization
– Instead of trying to staff 25 Hadoop clusters for 24 x 7, centralize the team and provide QoS back to the applications


Data Gravity
• Data sets get bigger over time, and moving them becomes increasingly difficult
– This leads to switching costs & lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central – build around it
– Applications tend to migrate toward the data
– Apply advanced analytics to the data “in-place”


Traditional IT
[Figure: applications (Finance, Marketing, Operations, Sales; ERP, CRM, SCM) each run on their own servers and storage, forming storage silos; data is copied into multiple Hadoop silos, each with its own servers and storage, for analytics]


THE PROBLEM OF DATA MOVEMENT

• To get statistically relevant results, a typical minimal required data set is about 100 TB.

• That’s also the recommended minimal Hadoop cluster size

• To copy 100 TB over a dedicated 10 GbE link takes about 24 hours.

You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics)
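The copy-time estimate can be checked with a short calculation. This is a rough sketch that assumes decimal terabytes (10^12 bytes) and a link running at a given fraction of line rate; the `efficiency` parameter is a hypothetical knob for protocol overhead, not a measured value:

```python
def transfer_hours(data_terabytes: float, link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move data over a link at the given fraction of line rate."""
    bits_to_move = data_terabytes * 1e12 * 8               # decimal TB to bits
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency  # effective throughput
    return bits_to_move / usable_bits_per_s / 3600


if __name__ == "__main__":
    # 100 TB over 10 GbE at full line rate, then with 10% assumed overhead.
    print(f"{transfer_hours(100, 10):.1f} h at line rate")
    print(f"{transfer_hours(100, 10, efficiency=0.9):.1f} h at 90% efficiency")
```

Even at full line rate the copy takes most of a day, so any realistic overhead puts it in the range of the 24 hours quoted above.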


EMC DATA LAKE
[Figure: the same applications (Finance, Marketing, Operations, Sales; ERP, CRM, SCM) plus Analytics and Mobile Applications run on separate servers that all share a single Isilon Data Lake]


WHAT IS A DATA LAKE?
A Data Lake is scale-out storage for data consolidation. It allows for Big Data accessibility via traditional and next-generation access methods to enable in-place analytics.


Isilon Data Lake Architecture
[Figure: clients on the LAN connect over Gigabit/10-Gigabit Ethernet to Isilon nodes with internal SAS disks; the nodes are interconnected via InfiniBand, and the cluster scales out by adding nodes]

Scale-out Data Lake
• OneFS integrates RAID, Volume Manager and Filesystem.
• Uses internal disks and spans a single filesystem across disks
• Development started in the 2000s
• Extremely mature, based on FreeBSD
• Supports many access protocols


HDFS Implementation as a Protocol
• Multi-threaded daemon runs on all nodes
– Services both NN and DN protocols
– Translates HDFS RPCs to POSIX system calls
– Stateless, underlying FS handles coherency

[Figure: inside a OneFS node, a request is picked up by an isi_hdfs_d thread, passed through the VFS to OneFS as a syscall, and the response is returned]
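The idea of translating HDFS RPCs into POSIX calls can be illustrated with a toy dispatcher. This is only a sketch of the pattern, not how isi_hdfs_d is implemented: the method names are borrowed from HDFS's client protocol and used as plain strings, while `handle_rpc` and the `root` parameter are hypothetical.

```python
import os
import tempfile


def handle_rpc(root: str, method: str, path: str, **args):
    """Map one HDFS-style call onto POSIX-style filesystem calls.

    The handler keeps no state of its own; everything lives in the
    underlying filesystem rooted at `root`.
    """
    local = os.path.join(root, path.lstrip("/"))
    if method == "mkdirs":
        os.makedirs(local, exist_ok=True)
        return True
    if method == "getFileInfo":
        st = os.stat(local)
        return {"length": st.st_size, "isdir": os.path.isdir(local)}
    if method == "getListing":
        return sorted(os.listdir(local))
    if method == "rename":
        os.rename(local, os.path.join(root, args["dst"].lstrip("/")))
        return True
    if method == "delete":
        os.remove(local)
        return True
    raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    # Exercise the handler against a throwaway directory.
    demo_root = tempfile.mkdtemp()
    handle_rpc(demo_root, "mkdirs", "/data/logs")
    print(handle_rpc(demo_root, "getListing", "/data"))
```

Because the handler holds no state between calls, any node running it could answer any request, which is the property the slide attributes to the underlying filesystem handling coherency.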


HDFS IMPLEMENTED LIKE A NAS PROTOCOL

OneFS runs a daemon that speaks NameNode and DataNode natively

[Figure: a DFSClient on a Hadoop node talks to a OneFS clustered file system in which every OneFS node acts as both NameNode and DataNode]
1) Request("/file")
2) Response (block locations)
3) GetBlock(block)


ISILON - FOR ALL TYPES OF UNSTRUCTURED DATA
[Figure: workloads consolidated on the Isilon Data Lake: Archive & Backup Target, File Shares, Home Directories, BLOBs, Design, Test & Manufacture, Retail & Monetization, Transaction, Hadoop & Analytics, Sync ‘n Share, Application Test, Content, Social & Next-Gen, Surveillance]


SUPPORT FOR MULTIPLE ANALYTICS APPLICATIONS
[Figure: a single cluster serves SMB, NFS, HTTP, FTP, HDFS 1.x and HDFS 2.x clients; its nodes act as name node and data node for several MapReduce clusters, alongside NFS and SMB clients]


EXPAND DATA LAKE TO THE CLOUD
[Figure: SmartPools policy example. Within the data center, data younger than 30 days sits on S210, data between 30 days and 1 year on NL410, and data between 1 and 2 years on HD400; data older than 2 years is moved by CloudPools to a cloud provider]
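An age-based tiering policy of this kind boils down to an ordered list of thresholds. The sketch below is one reading of the slide's labels (S210 under 30 days, NL410 up to 1 year, HD400 up to 2 years, cloud beyond that); the boundaries are an assumption, and the code is plain Python for illustration, not SmartPools or CloudPools policy syntax.

```python
from datetime import timedelta

# (upper age bound, tier) pairs, checked in order; None means "no upper bound".
# Boundaries are assumed from the slide labels, with 1 year taken as 365 days.
POLICY = [
    (timedelta(days=30), "S210"),
    (timedelta(days=365), "NL410"),
    (timedelta(days=730), "HD400"),
    (None, "Cloud"),
]


def tier_for_age(age: timedelta) -> str:
    """Return the first tier whose upper age bound has not been reached."""
    for limit, tier in POLICY:
        if limit is None or age < limit:
            return tier
    raise AssertionError("policy must end with an unbounded tier")


if __name__ == "__main__":
    for days in (10, 100, 400, 1000):
        print(days, "days ->", tier_for_age(timedelta(days=days)))
```

Keeping the policy as data rather than nested conditions makes it easy to add or move a tier without touching the lookup logic.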


CLOUD ENABLED DATA LAKE
[Figure: apps and users access the data center; CloudPools moves data between the data center and a cloud provider based on access time]


• Parallel Replication
– Designed ground-up for scale-out storage
– Aggregate throughput scales with capacity
– Maintain consistent RPO over growing data sets
• Underlying FS knowledge
– Snapshot integration
– Block-level deltas
– Rich meta-data transfer
• Automated Data Failover/Failback


Storage Considerations

STANDARD HADOOP CLUSTER                     HADOOP USING EMC ISILON DATA LAKE
100 Nodes Compute + DAS, 24 TB per Node     20 Nodes Compute + 800 TB Isilon
/3 for Hadoop Copies                        Single Copy with Erasure Coding
800 TB Usable, but rarely achieved          800 TB Usable
5+ Cabinets                                 1 Cabinet
Spill space for ingestion and extraction    It is NAS
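The capacity figures in this comparison come from simple arithmetic. A minimal sketch, assuming HDFS's usual three-way replication for the standard cluster; the 80% utilization used for the erasure-coded case is taken from the "over 80% storage utilization" claim later in the deck and serves only as an illustrative input:

```python
def usable_tb_replicated(nodes: int, tb_per_node: float, replicas: int = 3) -> float:
    """Usable capacity when every block is stored `replicas` times."""
    return nodes * tb_per_node / replicas


def raw_tb_for_usable(usable_tb: float, utilization: float) -> float:
    """Raw capacity needed to deliver `usable_tb` at a given storage utilization."""
    return usable_tb / utilization


if __name__ == "__main__":
    # Standard cluster: 100 nodes x 24 TB, divided by 3 copies.
    print(usable_tb_replicated(100, 24), "TB usable from 2400 TB raw")
    # Erasure-coded single copy: raw capacity needed for the same 800 TB.
    print(raw_tb_for_usable(800, 0.8), "TB raw at an assumed 80% utilization")
```

Under these assumptions the replicated cluster needs 2,400 TB of raw disk for 800 TB usable, while the erasure-coded layout needs roughly 1,000 TB.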


What is my Idea - 2
• Build a fully functioning cost model that includes all the items you think are “free” but whose costs stop when you change the architecture.
– Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (e.g., 24 x 7 support for each cluster, all-in costs that appear free but are sunk)


What is my Idea - 3
• Think about “build all yourself” vs. “buy”
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed and tested solutions reduces complexity and increases reliability.


EMC BIG DATA PORTFOLIO

• Data Lake
• Data Lake Extensions
• Cloud Enabled

• Vblock
• VxRack
• VxRail

• Federation Business Data Lake


HIGH PERFORMANCE, PREDICTABLE, LOW LATENCY
[Figure: three Hadoop I/O paths compared across the Hadoop, kernel and motherboard layers.
HDD: HDFS -> Filesystem -> Buffer Cache -> Device Driver -> SATA Controller -> Disk, about 10 ms.
Traditional PCIe SSD: HDFS -> Filesystem -> Buffer Cache -> Device Driver -> PCIe SSD, 1000-2000 µs.
DSSD: HDFS -> PCIe, < 100 µs.]

• DSSD Hadoop Plugin accesses flash directly
• 10X Throughput
• 1/13th Latency
• No Application Changes Required


EMC Business Data Lake
[Figure: Business Data Lake stack. Foundation: EMC Data Lake (Isilon + ECS), VCE Vblock, XtremIO, Data Domain. Above it: VMware vCloud Suite and Pivotal Big Data Suite (Greenplum Database, HAWQ, Spring XD, Pivotal HD, Spark, Redis, RabbitMQ, GemFire, BDS on Pivotal Cloud Foundry), covering Hadoop, data processing, advanced analytics applications at scale, a data and analytics catalog, and an open analytics toolbox. Management components: Platform Manager, Data Governor, Data Manager, Ingest Manager, Analytics Manager]

See the demos at http://www.fbdldemo.com/


Thursday, April 14th, 15:00 UTC
Watch out for:
• Hadoop Everywhere: Geo-Distributed Storage for Big Data

Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC


A Remark on data locality
• U.C. Berkeley’s AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC.
• Gartner has declared Hadoop dead due to its limits
• Hadoop will only grow, and dependency on it will increase going forward.
• A catalyst may be the next time I see you and uptime for Hadoop is your main concern.


Isilon Scale-Out NAS

• Simple to manage: Single file system, single volume, global namespace
• Massively scalable: Scales from 16 TB to over 50 PB in a single cluster; 200 GB/s throughput, 3.75M IOPS
• Unmatched efficiency: Over 80% storage utilization, automated tiering and SmartDedupe
• Enterprise data protection: Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
• Robust security and compliance options: RBAC, Access Zones, WORM data security, File System Auditing; Data At Rest Encryption with SEDs, STIG hardening; CAC/PIV Smartcard authentication, FIPS OpenSSL support
• Operational flexibility: Multi-protocol support including NFS, SMB, HTTP, FTP and HDFS; Object and Cloud computing including OpenStack Swift


Elastic Cloud Storage (ECS)

• Geo-Scale: Geo-replicated and distributed to multiple locations
• Massively scalable: Scales to billions of objects in a single namespace
• Support for all file sizes: Support for individual files of any size.
• Multi-Tenant: Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
• HDFS Compatible: Hortonworks Certified HDFS Compatible File System
• Swift Compatible: Natively supports OpenStack storage
• Native Cloud Interface: Natively works with existing cloud protocols like S3 and Azure.