© Copyright 2016 EMC Corporation. All rights reserved.
TAME THAT BEAST
Stefan Radtke, CTO, EMEA, EMC Emerging Technology Division

Tame that Beast


Page 1: Tame that Beast

TAME THAT BEAST
Stefan Radtke
CTO, EMEA
EMC Emerging Technology Division

Page 2: Tame that Beast

2EMC CONFIDENTIAL—INTERNAL USE ONLYEMC CONFIDENTIAL—INTERNAL USE ONLY

Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division

- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon

Phone: +49-176-34434460
E-Mail: [email protected]
LinkedIn: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com

Page 3: Tame that Beast


System Availability

Uptime                    Downtime (per year)
99.999% (AKA 5 nines)     5.26 minutes
99.99% (AKA 4 nines)      52.6 minutes
99.5%                     1.83 days
99% (AKA 2 nines)         3.65 days
95%                       18.25 days

What is your Data Warehouse’s uptime SLA?
What is your Hadoop uptime SLA?

Why are they different?
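The downtime figures for each availability level follow from simple arithmetic. The sketch below shows the calculation, assuming a 365-day year; the function name is illustrative and not part of the original slides.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.5, 99.0, 95.0):
        minutes = downtime_minutes_per_year(pct)
        # 1440 minutes per day
        print(f"{pct}%: {minutes:.2f} minutes/year = {minutes / 1440:.2f} days/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the gap between a 99% and a 99.99% SLA is the difference between days and minutes.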

Page 5: Tame that Beast


Hadoop Takes on DB-like Features
• Newly Added Features in Hadoop 3.0
– Erasure Coding (HDFS-EC / HDFS-7285) is being introduced to Hadoop
– Additional standby NameNodes for increased resiliency (HDFS-6440)
• Future Features
– Random read support from Indexed NameNode (HDFS-8555)
– Disaster Recovery (HDFS-5442)

Page 6: Tame that Beast


So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database Features
AND
• Successful Outcomes are becoming more prolific...

Why do Hadoop operations and uptime / SLAs seem like such an afterthought on most clusters?

Page 7: Tame that Beast


KPIs
• Why do companies that have VERY successful Data Warehouses, ETL processes, and KPI Dashboards have so little of THOSE for their Hadoop instance, which is now generating all their Machine Learning and Data & Analytics?

Page 9: Tame that Beast


An Intervention
• Why does the concept of 99.99% seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when did a backup copy of your primary analytics data become unnecessary?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?

Page 10: Tame that Beast


Why wouldn’t you want:

• Two clusters fully addressable with data replication located in separate geographies

• Data Re-silvering when additional capacity is added

• Complete fault tolerance in the environment and not just Data / Node redundancy to allow 4 Nines availability

• Operational scale that allows 24 x 7 support

[Figure: storage nodes shown as EMPTY, FULL, and BALANCED]

Page 11: Tame that Beast


What is my Idea - 1
• Separation of compute and storage.
– Why do you think cloud Hadoop is able to offer better SLAs than on-premises Hadoop? It isn’t because of a ton of single-point-of-failure compute boxes. They separate compute and storage.
• Look at Infrastructure / Big Data as a Service centralization
– Instead of trying to staff 25 Hadoop clusters for 24 x 7, centralize the team and provide QoS back to the applications

Page 12: Tame that Beast


Data Gravity
• Data sets get bigger over time, and moving them becomes increasingly difficult
– This leads to switching costs & lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central: build around it
– Applications tend to migrate toward the data
– Apply advanced analytics to the data “in-place”

Page 13: Tame that Beast


Traditional IT

[Figure: Finance, Marketing, Operations, and Sales applications (ERP, CRM, SCM) each run on their own servers and storage, forming storage silos. Data is copied into multiple Hadoop silos, each with its own servers and storage, for analytics.]

Page 14: Tame that Beast


THE PROBLEM OF DATA MOVEMENT

• To get statistically relevant results, a typical minimum required data set is about 100 TB.

• That’s also the recommended minimum Hadoop cluster size.

• Copying 100 TB over a dedicated 10 GbE link takes about 24 hours.

You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics)
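The copy-time estimate can be checked with a few lines of arithmetic. This is a rough sketch assuming decimal units (1 TB = 10^12 bytes); the `efficiency` parameter is a hypothetical knob for protocol overhead, not a figure from the slides.

```python
def transfer_hours(data_terabytes: float, link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to push a data set through a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbit/s = 10**9 bit/s).
    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    print(f"line rate:      {transfer_hours(100, 10):.1f} h")
    print(f"90% efficiency: {transfer_hours(100, 10, 0.9):.1f} h")
```

At full line rate the copy takes a little over 22 hours, so "about 24 hours" is already an optimistic figure once any overhead is included.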

Page 15: Tame that Beast


EMC DATA LAKE

[Figure: the Finance, Marketing, Operations, and Sales applications (ERP, CRM, SCM) and the servers for analytics and mobile applications all share a single Isilon data lake instead of separate storage silos.]

Page 16: Tame that Beast


WHAT IS A DATA LAKE?

A Data Lake is scale-out storage for data consolidation. It allows for Big Data accessibility via traditional and next-generation access methods to enable in-place analytics.

Page 17: Tame that Beast


Isilon Data Lake Architecture

• Scale-out Data Lake
• OneFS integrates RAID, volume manager and file system
• Uses internal disks and spans a single file system across them
• Development started in the 2000s
• Extremely mature, based on FreeBSD
• Supports many access protocols

[Figure: clients connect over a Gb/10Gb Ethernet LAN to Isilon nodes with internal SAS disks; the nodes are interconnected via InfiniBand, and the cluster scales out by adding nodes.]

Page 18: Tame that Beast


HDFS Implementation as a Protocol

• Multi-threaded daemon runs on all nodes
– Services both NN and DN protocols
– Translates HDFS RPCs to POSIX system calls
– Stateless, underlying FS handles coherency

[Figure: within a OneFS node, a request is picked up by an isi_hdfs_d thread, passed through the VFS as a OneFS syscall, and the response is returned to the client.]
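The idea of translating HDFS RPCs into POSIX calls without keeping state can be illustrated with a toy dispatcher. This is a hypothetical sketch, not how isi_hdfs_d is implemented: the operation names are modelled on methods of Hadoop's ClientProtocol, while `handle_request`, its arguments, and the directory layout are invented for illustration.

```python
import os
import stat
import tempfile


def handle_request(root, op, path, **args):
    """Translate one HDFS-style metadata request into POSIX calls under `root`.

    No state is kept between calls; the underlying file system is the
    single source of truth (hypothetical illustration only).
    """
    local = os.path.join(root, path.lstrip("/"))
    if op == "mkdirs":
        os.makedirs(local, exist_ok=True)
        return True
    if op == "getFileInfo":
        info = os.stat(local)
        return {"length": info.st_size, "isdir": stat.S_ISDIR(info.st_mode)}
    if op == "getListing":
        return sorted(os.listdir(local))
    if op == "rename":
        os.rename(local, os.path.join(root, args["dst"].lstrip("/")))
        return True
    if op == "delete":
        if os.path.isdir(local):
            os.rmdir(local)      # only removes empty directories
        else:
            os.remove(local)
        return True
    raise ValueError(f"unsupported operation: {op}")


if __name__ == "__main__":
    export_root = tempfile.mkdtemp()  # stands in for the shared file system
    handle_request(export_root, "mkdirs", "/data/in")
    print(handle_request(export_root, "getListing", "/data"))
```

Because the handler holds no state of its own, any number of such handlers on any node could serve the same request, which is the property the slide attributes to the daemon.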

Page 19: Tame that Beast


HDFS IMPLEMENTED LIKE A NAS PROTOCOL

OneFS runs a daemon that speaks NameNode and DataNode natively

[Figure: a OneFS clustered file system in which every OneFS node acts as both NameNode and DataNode. The DFSClient on a Hadoop node talks to the cluster in three steps:
1) Request(“/file”)
2) Response (block locations)
3) GetBlock(block)]
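The three-step read flow can be mimicked with a small in-memory model in which every node answers both NameNode-style and DataNode-style requests. This is a toy sketch under stated assumptions: the class, the method names, the tiny block size, and the round-robin block placement are all hypothetical and do not describe OneFS internals.

```python
class ToyCluster:
    """In-memory stand-in for a cluster where every node plays both roles."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.files = {}  # path -> list of blocks (bytes), visible to all nodes

    def put(self, path, data, block_size=4):
        self.files[path] = [data[i:i + block_size]
                            for i in range(0, len(data), block_size)]

    def _check(self, node):
        if node not in self.node_names:
            raise KeyError(f"unknown node: {node}")

    def get_block_locations(self, asked_node, path):
        """NameNode role: any node can answer (steps 1 and 2)."""
        self._check(asked_node)
        count = len(self.node_names)
        # Assumed placement: hand out serving nodes round-robin.
        return [(idx, self.node_names[idx % count])
                for idx in range(len(self.files[path]))]

    def get_block(self, serving_node, path, idx):
        """DataNode role: any node can serve any block (step 3)."""
        self._check(serving_node)
        return self.files[path][idx]


def read_file(cluster, entry_node, path):
    locations = cluster.get_block_locations(entry_node, path)
    return b"".join(cluster.get_block(node, path, idx)
                    for idx, node in locations)


if __name__ == "__main__":
    cluster = ToyCluster(["node1", "node2", "node3"])
    cluster.put("/file", b"0123456789")
    print(read_file(cluster, "node2", "/file"))
```

The client code is the same whichever node it contacts first, so there is no single NameNode whose failure stops reads.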

Page 20: Tame that Beast


ISILON - FOR ALL TYPES OF UNSTRUCTURED DATA

[Figure: workloads feeding the Isilon Data Lake]
• Archive & Backup Target
• File Shares
• Home Directories
• BLOBs
• Design, Test & Manufacture
• Retail & Monetization
• Transaction
• Hadoop & Analytics
• Sync ‘n Share
• Application Test
• Content
• Social & Next-Gen
• Surveillance


Page 21: Tame that Beast


SUPPORT FOR MULTIPLE ANALYTICS APPLICATIONS

[Figure: one Isilon cluster serves SMB, NFS, HTTP, FTP, HDFS 1.x, and HDFS 2.x clients at the same time. Every node acts as name node and data node; several MapReduce clusters and NFS/SMB clients share the same data.]

Page 22: Tame that Beast


EXPAND DATA LAKE TO THE CLOUD

SmartPools Policy Example

Data age             Tier        Location
< 30 days            S210        Data center
30 days - 1 year     NL410       Data center
1 year - 2 years     HD400       Data center
> 2 years            Cloud       Cloud provider (via CloudPools)
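Reading the policy example above as four age bands, the tiering decision reduces to a simple threshold lookup. The sketch below is an interpretation of the slide, not SmartPools or CloudPools configuration syntax; the function name and the exact boundary handling (365 and 730 days) are assumptions.

```python
def tier_for_age(age_days: float) -> str:
    """Pick a storage tier from the age of the data (assumed boundaries)."""
    if age_days < 30:
        return "S210"        # newest, most active data
    if age_days < 365:
        return "NL410"       # 30 days to 1 year
    if age_days < 730:
        return "HD400"       # 1 to 2 years
    return "Cloud"           # older than 2 years, tiered to a cloud provider


if __name__ == "__main__":
    for age in (3, 90, 500, 1000):
        print(f"{age:>5} days -> {tier_for_age(age)}")
```

A policy like this keeps the namespace unchanged for applications while older data moves to cheaper tiers in the background.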

Page 23: Tame that Beast


CLOUD ENABLED DATA LAKE

[Figure: apps and users access the data lake in the data center; CloudPools tiers data to a cloud provider based on access time.]

Page 24: Tame that Beast


Parallel Replication
• Designed ground-up for scale-out storage
• Aggregate throughput scales with capacity
• Maintain consistent RPO over growing data sets
• Underlying FS knowledge
– Snapshot integration
– Block-level deltas
– Rich meta-data transfer
• Automated Data Failover/Failback

Page 25: Tame that Beast


Storage Considerations

STANDARD HADOOP CLUSTER                      HADOOP USING EMC ISILON DATA LAKE
100 Nodes Compute + DAS, 24 TB per Node      20 Nodes Compute + 800TB Isilon
/3 for Hadoop Copies                         Single Copy with Erasure Coding
800TB Usable, but rarely achieved            800TB Usable
5+ Cabinets                                  1 Cabinet
Spill space for ingestion and extraction     It is NAS
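The capacity comparison is straightforward arithmetic. The sketch below reproduces the 800 TB figure for three-way replication and, as a contrast, estimates the raw capacity an erasure-coded layout would need; the 8+2 layout is a hypothetical example, not a figure from the slides, and the function names are illustrative.

```python
def usable_replicated(nodes: int, tb_per_node: float, copies: int = 3) -> float:
    """Usable capacity when every block is stored `copies` times."""
    return nodes * tb_per_node / copies


def raw_needed_erasure_coded(usable_tb: float, data_units: int,
                             parity_units: int) -> float:
    """Raw capacity needed to hold `usable_tb` with a data+parity layout."""
    return usable_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    usable = usable_replicated(100, 24)  # 100 nodes x 24 TB, 3 copies
    print(f"replicated: {100 * 24} TB raw -> {usable:.0f} TB usable")
    # Assumed 8 data + 2 parity layout, purely for illustration.
    raw = raw_needed_erasure_coded(usable, 8, 2)
    print(f"erasure coded: {raw:.0f} TB raw for {usable:.0f} TB usable")
```

Under these assumptions replication needs three times the usable capacity in raw disk, while the illustrative erasure-coded layout needs only 1.25 times.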

Page 26: Tame that Beast


What is my Idea - 2
• Build a fully functioning cost model that includes all the items you think are “free” but whose costs go away when you change the architecture.
– Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (e.g., 24 x 7 support for each cluster, all-in costs that appear free but are sunk)

Page 27: Tame that Beast


What is my Idea - 3
• Think about “build all yourself” vs. “buy”
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed and tested solutions reduces complexity and increases reliability.

Page 28: Tame that Beast


EMC BIG DATA PORTFOLIO

• Data Lake
• Data Lake Extensions
• Cloud Enabled

• Vblock
• VxRack
• VxRail

• Federation Business Data Lake

Page 29: Tame that Beast


HIGH PERFORMANCE
PREDICTABLE, LOW LATENCY

[Figure: three Hadoop I/O paths compared.
HDD: HDFS, filesystem, buffer cache, device driver, SATA controller, disk; about 10 ms.
Traditional PCIe SSD: HDFS, filesystem, buffer cache, device driver, PCIe SSD; 1000-2000 µs.
DSSD: HDFS directly over PCIe; < 100 µs.]

DSSD Hadoop Plugin accesses flash directly
• 10X Throughput
• 1/13th Latency
• No Application Changes Required

Page 30: Tame that Beast


EMC Business Data Lake

[Figure: layered stack, top to bottom]
• Advanced analytics applications at scale
• Data and analytics catalog
• Open analytics toolbox
• Pivotal Big Data Suite
– Data processing: Greenplum Database, HAWQ, Spring XD, Pivotal HD, Spark, Redis, RabbitMQ, GemFire
– BDS on Pivotal Cloud Foundry
– Hadoop
• VMware vCloud Suite
• EMC Data Lake Foundation: Isilon + ECS; VCE Vblock | XtremIO | Data Domain

Alongside the stack: Platform Manager, Data Governor, Data Manager, Ingest Manager, Analytics Manager

See the demos at http://www.fbdldemo.com/

Page 31: Tame that Beast


Thursday, April 14th, 15:00 UTC
Watch out for:
• Hadoop Everywhere: Geo-Distributed Storage for Big Data

Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC

Page 32: Tame that Beast
Page 33: Tame that Beast


A Remark on data locality
• U. C. Berkeley’s AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC.
• Gartner has declared Hadoop dead due to its limits
• Hadoop will only grow, and dependency on it will increase going forward.
• A catalyst may be the next time I see you and uptime for Hadoop is your main concern.

Page 34: Tame that Beast


Isilon Scale-Out NAS

Simple to manage
Single file system, single volume, global namespace

Massively scalable
Scales from 16 TB to over 50 PB in a single cluster
200 GB/s throughput, 3.75M IOPS

Unmatched efficiency
Over 80% storage utilization, automated tiering and SmartDedupe

Enterprise data protection
Efficient backup and disaster recovery, and N+1 thru N+4 redundancy

Robust security and compliance options
RBAC, Access Zones, WORM data security, File System Auditing
Data At Rest Encryption with SEDs, STIG hardening
CAC/PIV Smartcard authentication, FIPS OpenSSL support

Operational flexibility
Multi-protocol support including NFS, SMB, HTTP, FTP and HDFS
Object and Cloud computing including OpenStack Swift

Page 35: Tame that Beast


Elastic Cloud Storage (ECS)

Geo-Scale
Geo-replicated and distributed to multiple locations

Massively scalable
Scales to billions of objects in a single namespace

Support for all file sizes
Support for individual files of any size.

Multi-Tenant
Efficient backup and disaster recovery, and N+1 thru N+4 redundancy

HDFS Compatible
Hortonworks Certified HDFS Compatible File System

Swift Compatible
Natively supports OpenStack storage

Native Cloud Interface
Natively works with existing cloud protocols like S3 and Azure.